The present invention relates to statistical natural language processing, including information retrieval, information extraction, and literature-based discovery. More specifically, the present invention relates to obtaining a collection of documents that is classified or categorized by a taxonomic system identified ex ante by the user, and to using such a classified collection to obtain useful information.
Access to the right information is invaluable in developing new ideas and business opportunities, in supporting research and investigation on virtually any subject, and generally in making good decisions and alerting decision makers to conditions that require a decision.
A large portion of the important information that is needed for decision making and that is stored electronically (and is thus suitable for processing by digital computers) is in the form of text in documents. It is generally recognized that the current state of the art in statistical natural language processing (the discipline that covers accessing information in text documents) does not support fully adequate access to information in texts.
Categorized document collections, in which individual documents (or texts) in a collection of documents are assigned to categories in taxonomic systems, are widely recognized and widely used as a means of improving access to the information in collections of texts. Among other benefits, categorization serves to focus a decision maker's or investigator's attention on smaller subsets of larger collections, thereby facilitating search and retrieval. Also, the distribution of documents across categories in a taxonomic system may itself be useful information for decision making and investigation.
Examples of categorized document bases include library catalogs based on classification schemes such as the Library of Congress classification and the Dewey Decimal classification, and subject classifications such as the United States Patent Classification and the International Patent Classification. Such classification schemes conventionally require a human being to examine a book or other document, and make a decision as to what class or classes to assign to the document.
It has been proposed to classify documents, for example, documents gathered from the Internet, automatically by searching the text of each document for terms found in the classification codes of an existing document classification and for terms found in an existing thesaurus to that classification. However, because the promoters of these proposals are librarians and library scientists, the proposals are typically confined to generating a library-style classified index in which each document is assigned to one or a few classification codes, and can be retrieved by searching the index under that code or one of those codes. Library indexes, even current computerized library indexes, are typically limited to a search in a single index, or a search for the Boolean intersection of two or more unrelated indexes (for example, classification AND author), returning a single list of “hits.”
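The term-matching classification and the Boolean index intersection described above can be illustrated with a minimal sketch. It is not taken from any particular proposal: the class codes, term lists, sample documents, and the function names `tokenize`, `classify`, and `build_index` are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical mini-taxonomy: each class code maps to terms drawn from the
# class title and from a thesaurus to the classification (assumed data).
CLASS_TERMS = {
    "A01": {"plough", "harvest", "soil"},
    "G06": {"computer", "database", "retrieval"},
}


def tokenize(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify(text, class_terms=CLASS_TERMS, min_hits=1):
    """Assign every class code whose terms occur at least min_hits times."""
    tokens = tokenize(text)
    return sorted(
        code for code, terms in class_terms.items()
        if len(tokens & terms) >= min_hits
    )


def build_index(docs, key):
    """Build an index from each value returned by key(doc) to document ids."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for value in key(doc):
            index[value].add(doc_id)
    return index


# Hypothetical sample collection.
DOCS = {
    1: {"author": "Smith", "text": "A database for document retrieval."},
    2: {"author": "Jones", "text": "Soil preparation before harvest."},
    3: {"author": "Smith", "text": "Computer control of a harvest machine."},
}

class_index = build_index(DOCS, lambda d: classify(d["text"]))
author_index = build_index(DOCS, lambda d: [d["author"]])

# Boolean intersection of two unrelated indexes: classification AND author.
hits = sorted(class_index["G06"] & author_index["Smith"])
print(hits)  # [1, 3]
```

The query returns only a flat list of document identifiers. Nothing in the result indicates, for example, that document 3 was also assigned to class A01, which is the kind of additional information a single list of hits does not convey.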
There is therefore a continuing need for methods and systems that can provide more information about documents than is obtained by merely assigning each document to a class within a taxonomy.