Regardless of the technology being used, most systems for the analysis and indexing of documents for search and information retrieval follow the same basic procedure. First, the data are separated into individual documents, and each document is divided into text tokens. These tokens are then combined into meaningful phrases and fragments that are indexed for retrieval. An index contains the data used for search and document analysis to process queries and identify relevant objects. After the index is constructed, queries may be submitted to the search system. A query represents the information desired by the user and is expressed using a query language and syntax defined by the search system. The search system processes the query using the index data for the database and a suitable similarity ranking algorithm. From this, the system returns a list of topically relevant objects, often referred to as a “hit-list”. The user may then select relevant objects from the hit-list for viewing and processing.
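The procedure described above (tokenize, index, query, rank, return a hit-list) can be illustrated with a minimal sketch. It is not taken from any particular search system: the function names, the sample documents, and the simple term-frequency times inverse-document-frequency score are all assumptions chosen for illustration, and the sketch indexes single tokens rather than combined phrases.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build an inverted index: token -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for token, count in Counter(tokenize(text)).items():
            index[token][doc_id] = count
    return index


def search(index, num_docs, query):
    """Return a hit-list of (doc_id, score) pairs, best match first."""
    scores = defaultdict(float)
    for token in tokenize(query):
        postings = index.get(token, {})
        if not postings:
            continue  # token does not occur in any indexed document
        # Illustrative weighting: rarer tokens contribute more to the score.
        idf = math.log(1 + num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Sort by descending score, breaking ties by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample collection.
documents = {
    "doc1": "Chloroacetic acid is a useful building block in organic synthesis.",
    "doc2": "The search server reads the index to process each query.",
    "doc3": "An acid catalyst speeds up the synthesis.",
}
index = build_index(documents)
hit_list = search(index, len(documents), "acid synthesis")
for doc_id, score in hit_list:
    print(doc_id, round(score, 3))
```

Only the index is consulted at query time; the documents themselves are needed again only when the user opens an entry from the hit-list.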
In a network environment, the components of a text search system may be distributed across multiple computers. A network environment contains two or more computers connected by a local or wide area network (e.g., Ethernet, Token Ring, the telephone network, or the Internet). A user accesses a hypermedia object database using a client application on the user's computer. The client application communicates with a search server (e.g., a hypermedia object database search system) on either the same computer (e.g., the client) or another computer (e.g., one or more servers) on the network. To process queries, the search server needs access only to the database index, which may be located on the same computer as the search server or on another computer on the network. The actual objects in the database may be located on any computer on the network.
A Web environment, such as the World Wide Web on the Internet, is a network environment where Web servers and browsers are used. Once all of the documents available in the collection have been gathered and indexed, the index can be used, as described above, to search for documents in the collection. Again, the index may be located independently of the objects, the client, and even the search server. A hit-list, generated as the result of searching the index, will typically identify the locations and titles of the relevant documents in the collection, and the user then retrieves those documents directly using the user's Web browser.
Text mining of documents can also be performed as part of document indexing. Text mining involves the recognition of document parts, such as paragraphs and sentences, and then the analysis of each recognized document part (e.g., each sentence). Sentence analysis involves the tagging of each word with its part of speech and then the parsing of each sentence into its component parts. The result of sentence parsing is a parse tree of the parts and sub-parts of that sentence. This information is typically stored in tables for retrieval. Frequently these tables are database tables with database indexes associated with them.
Such parsing and data storage can then be used to deduce the overall meaning of the document and the relations between parts of the document.
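As a rough illustration of the text mining steps just described, the sketch below splits text into sentences, tags each word with a part of speech, and stores the result in an indexed database table. It is a simplified sketch under stated assumptions: the tiny lexicon, the `UNK` tag for unknown words, and the table layout are hypothetical, and full parsing into a parse tree is omitted.

```python
import re
import sqlite3

# Hypothetical toy lexicon; a real tagger would use a trained model.
LEXICON = {
    "the": "DET", "a": "DET", "was": "VERB", "added": "VERB",
    "to": "PREP", "mixture": "NOUN", "solution": "NOUN", "stirred": "VERB",
}


def split_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tag_words(sentence):
    """Tag each whitespace-separated word; unknown words are tagged 'UNK'."""
    tagged = []
    for raw in sentence.split():
        word = raw.strip(".,;:!?")
        if word:
            tagged.append((word, LEXICON.get(word.lower(), "UNK")))
    return tagged


def store(text):
    """Store tagged words in an in-memory table with an index on the tag."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE words "
        "(sentence_id INTEGER, position INTEGER, word TEXT, tag TEXT)"
    )
    conn.execute("CREATE INDEX idx_words_tag ON words (tag)")
    for sentence_id, sentence in enumerate(split_sentences(text)):
        for position, (word, tag) in enumerate(tag_words(sentence)):
            conn.execute(
                "INSERT INTO words VALUES (?, ?, ?, ?)",
                (sentence_id, position, word, tag),
            )
    conn.commit()
    return conn


text = (
    "The 4-nitrobenzyl chloroformate was added to the solution. "
    "The mixture was stirred."
)
conn = store(text)
unknown = [
    row[0]
    for row in conn.execute(
        "SELECT word FROM words WHERE tag = 'UNK' "
        "ORDER BY sentence_id, position"
    )
]
print(unknown)
```

In this toy setup the two words of the chemical name are the only ones left untagged, so a later parsing step would see them as separate unknown words rather than as one noun phrase.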
Of particular concern to this invention is the above-described sentence parsing operation, in the context of documents that contain the names of organic chemicals. Organic chemical names can be made up of very long strings of words, punctuation and spaces which need to be grouped so that they can be recognized as single noun phrases, rather than as a series of unknown words.
Organic chemical terms can be lengthy and complex, and may consist of several words separated by spaces. Ideally, an organic chemical term should be recognized as a single noun phrase for the parsing of sentences in technical documents to proceed effectively. For example, terms such as chloroacetic acid, 4-allyl-2,6-dimethylphenol, 5-aminoalkyl-pyrazolo [4,3-D]-pyrimidine and 4-nitrobenzyl chloroformate each present specific term recognition challenges. A prior art approach to solving this recognition problem would be to provide, maintain and reference a very large chemical dictionary to identify the presence of organic chemical terms appearing as part of a document text.
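The dictionary-based prior art approach can be sketched as a greedy longest-match lookup over the words of a sentence. This is only an illustration: the three-entry dictionary stands in for the very large dictionary such an approach would require, and the function name and span format are assumptions.

```python
# Hypothetical, tiny stand-in for a very large chemical dictionary.
# Each term is stored as a tuple of lowercased words.
CHEMICAL_DICTIONARY = {
    ("chloroacetic", "acid"),
    ("4-nitrobenzyl", "chloroformate"),
    ("4-allyl-2,6-dimethylphenol",),
}
MAX_TERM_WORDS = max(len(term) for term in CHEMICAL_DICTIONARY)


def find_chemical_terms(sentence):
    """Return (start, end, term) word spans found by longest-match lookup."""
    # Normalize each word: drop surrounding punctuation and lowercase it.
    keys = [word.strip(".,;:").lower() for word in sentence.split()]
    spans = []
    i = 0
    while i < len(keys):
        # Try the longest candidate first so multi-word terms win.
        for length in range(min(MAX_TERM_WORDS, len(keys) - i), 0, -1):
            candidate = tuple(keys[i:i + length])
            if candidate in CHEMICAL_DICTIONARY:
                spans.append((i, i + length, " ".join(candidate)))
                i += length
                break
        else:
            i += 1  # no dictionary term starts at this word
    return spans


sentence = (
    "The residue was treated with 4-nitrobenzyl chloroformate "
    "and chloroacetic acid."
)
spans = find_chemical_terms(sentence)
print(spans)
```

The lookup succeeds only for names that appear in the dictionary exactly as written. A name absent from the dictionary, or one split by a stray space (for example "chloro acetic acid"), is not found, which is why the dictionary must be very large and continually maintained.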
Further, while there exist specific rules for the spelling, spacing and punctuation of such chemical terms, these rules are not always rigorously followed, especially in the patent literature. Examples abound of chemical names broken up by incorrect spaces or hyphens which must be recombined for the overall term to be recognized successfully.
Wilbur et al. (W. J. Wilbur, G. F. Hazard, G. Divita, J. G. Mork, A. R. Aronson and A. C. Browne, “Analysis of biomedical text for chemical names: a comparison of three methods,” in Proc. AMIA Symp. 1999, Washington, 1999) described three algorithms for the discovery of chemical names in biomedical text. The first analyzes the structure of chemical names into a set of chemical morphemes and then combines these morphemes into chemical names. The other two methods are variations on a Bayesian classifier based on overlapping n-grams. These methods were tested, however, only on well-edited text, and it is thus not clear how they would perform on text containing errors. In addition, Wilbur et al. specifically note that they recognized only chemical names found in the MeSH ontology, and that names containing punctuation characters would not work well with their algorithms.
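To make the idea of a Bayesian classifier over overlapping n-grams concrete, the following sketch trains a naive Bayes model on character trigrams. It is not the implementation of Wilbur et al.: the class name, the add-one smoothing, the trigram length, and the small training list are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of a padded, lowercased word."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}   # label -> Counter of n-gram counts
        self.totals = {}   # label -> total number of n-grams seen
        self.priors = {}   # label -> fraction of training words
        self.vocab = set()

    def fit(self, labeled_words):
        label_counts = Counter(label for _, label in labeled_words)
        for word, label in labeled_words:
            grams = char_ngrams(word, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        for label, count in label_counts.items():
            self.totals[label] = sum(self.counts[label].values())
            self.priors[label] = count / len(labeled_words)

    def log_score(self, word, label):
        """Log prior plus the sum of smoothed log n-gram likelihoods."""
        score = math.log(self.priors[label])
        vocab_size = len(self.vocab)
        for gram in char_ngrams(word, self.n):
            numerator = self.counts[label][gram] + 1
            score += math.log(numerator / (self.totals[label] + vocab_size))
        return score

    def predict(self, word):
        return max(self.priors, key=lambda label: self.log_score(word, label))


# Hypothetical training data, far too small for real use.
training = [
    ("chloroacetic", "chemical"), ("chloroformate", "chemical"),
    ("nitrobenzyl", "chemical"), ("dimethylphenol", "chemical"),
    ("pyrimidine", "chemical"), ("aminoalkyl", "chemical"),
    ("document", "other"), ("search", "other"), ("sentence", "other"),
    ("retrieval", "other"), ("network", "other"), ("computer", "other"),
]
model = NgramNaiveBayes(n=3)
model.fit(training)
print(model.predict("nitrophenol"), model.predict("searching"))
```

Because the model scores fragments of a word rather than whole dictionary entries, it can label a name it has never seen, but it still classifies one word at a time and does nothing to regroup a name that has been split by spaces or punctuation.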
M. Narayanaswamy, E. Ravikumar and K. Vijay-Shanker (“A Biological Named Entity Recognizer,” Proceedings of the Pacific Symposium on Biocomputing, January, 2003) disclosed a system for recognizing a small set of chemical phrases that may be part of common biological abbreviations, but did not extend their procedure to the general case of interest to this invention.
Prior to this invention, there existed no satisfactory document search and text mining apparatus or methods for dealing with documents containing chemical names, such as the names of organic chemicals.