Regardless of the technology being used, most systems for the analysis and indexing of documents for search and information retrieval follow the same basic procedure. First, the data are separated into individual documents and each document is divided into text tokens. These tokens are then combined into meaningful phrases and fragments that are indexed for retrieval. An index contains the data that search and document analysis use to process queries and identify relevant objects.
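The basic procedure described above can be illustrated with a minimal sketch. It assumes a simple lowercase alphanumeric tokenizer and an inverted index that maps each token to the documents containing it; the function names `tokenize` and `build_index` and the sample documents are hypothetical, and the step of combining tokens into phrases is omitted for brevity.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


# Hypothetical two-document collection used only for illustration.
documents = {
    "doc1": "Diazepam is a benzodiazepine.",
    "doc2": "Potash is an old name for potassium carbonate.",
}
index = build_index(documents)
```

Looking up `index["potash"]` in this sketch yields the set of documents that mention the token, which is the kind of data a search system consults when processing a query.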
After the index is constructed, queries may be submitted to the search system. The query represents information that is desired by the user, and is expressed using a query language and syntax defined by the search system. The search system processes the query using the index data for the database and a suitable similarity ranking algorithm. From this, the system returns a list of topically relevant objects, often referred to as a “hit-list”. The user may then select relevant objects from the hit-list for viewing and processing.
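Query processing against an index can be sketched as follows. The scoring used here (the sum of the occurrence counts of the query terms in each document) is only a stand-in for "a suitable similarity ranking algorithm", chosen for brevity; the function names and sample documents are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to a per-document count of its occurrences."""
    index = defaultdict(Counter)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token][doc_id] += 1
    return index


def search(query, index):
    """Return a hit-list of (doc_id, score), highest score first."""
    scores = Counter()
    for term in set(tokenize(query)):
        # index.get avoids creating empty entries for unknown terms.
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Sort by descending score, then by document id for stable output.
    return sorted(scores.items(), key=lambda hit: (-hit[1], hit[0]))


# Hypothetical collection used only for illustration.
documents = {
    "doc1": "potassium carbonate and potassium chloride",
    "doc2": "sodium carbonate",
    "doc3": "benzene derivatives",
}
hit_list = search("potassium carbonate", build_index(documents))
```

In this sketch the hit-list ranks `doc1` ahead of `doc2`, and `doc3` does not appear because it shares no terms with the query.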
In a network environment, the components of a text search system may be distributed across multiple computers. A network environment contains two or more computers connected by a local or a wide area network (e.g., Ethernet, Token Ring, the telephone network, and the Internet). A user accesses a hypermedia object database using a client application on the user's computer. The client application communicates with a search server (e.g., a hypermedia object database search system) on either the same computer (e.g., the client) or another computer (e.g., one or more servers) on the network. To process queries, the search server needs access only to the database index, which may be located on the same computer as the search server or on another computer on the network. The actual objects in the database may be located on any computer on the network.
A Web environment, such as the World Wide Web on the Internet, is a network environment where Web servers and browsers are used. Once all of the documents available in the collection have been gathered and indexed, the index can be used, as described above, to search for documents in the collection. Again, the index may be located independently of the objects, the client, and even the search server. A hit-list, generated as the result of searching the index, will typically identify the locations and titles of the relevant documents in the collection, and the user then retrieves those documents directly using the user's Web browser.
Text mining of documents can also be performed as part of document indexing. Text mining involves the recognition of document parts, such as paragraphs and sentences, and then the analysis of each recognized document part (e.g., each sentence). Sentence analysis involves the tagging of each word with its part of speech and then the parsing of each sentence into its component parts. The result of sentence parsing is a parse tree of the parts and sub-parts of that sentence. This information is typically stored in tables for retrieval. Frequently these tables are database tables with database indexes associated with them.
Such parsing and data storage can then be used to deduce the overall meaning of the document and the relations between parts of the document.
Searching patent and patent-related literature for information related to chemical entities is particularly challenging. The nomenclature associated with chemical substances is difficult to understand, and inconsistent chemical terms are often used to express the same or similar chemical entities. Despite attempts by international standards committees such as the International Union of Pure and Applied Chemistry (IUPAC) to standardize chemical nomenclature, these rules unfortunately have not been consistently applied to chemical substances over time, particularly with respect to the patent literature.
Historically, chemical entities were often referred to by “common names” and/or by inconsistently applied IUPAC rules. Often, terms that were acceptable in earlier years (for example ‘potash’) later gave way to other standards (potassium carbonate). Little or no effort has been made to “normalize” the chemical nomenclature of the intellectual property (IP) databases retroactively over the decades.
The problem of inconsistent naming is exemplified by considering the chemical names that have been applied to the drug VALIUM® (Valium is a registered trademark of Roche Products Inc.), the chemical structure of which is shown in FIG. 1. A list of some of the correct and incorrect names for VALIUM® that are found in the chemical and patent literature is shown in Table 1.
Table 1 - Some of the Chemical Names Used for Valium® in Different Databases
    7-chloro-1-methyl-5-phenyl-2H-1,4-benzodiazepin-2-one
    7-chloro-1-methyl-5-phenyl-3H-1,4-benzodiazepin-2(1H)-one
    7-chloro-1-methyl-5-phenyl-1,3-dihydro-2H-1,4-benzodiazepin-2-one
    7-chloro-1-methyl-2-oxo-5-phenyl-3H-1,4-benzodiazepine
    1-methyl-5-phenyl-7-chloro-1,3-diydro-2H-1,4-benzodiazepin-2-one
    7-chloro-1,4-dihydro-1-methyl-5-phenyl-2H-1,4-benzodiazepin-2-one
    7-chloro-1-methyl-5-3H-1,4-benzodiazepin-2(1H)-one
Additionally, in the case of pharmaceuticals, the names of compounds of interest often change over time as compounds become commercialized. This has led to the frequent use of trade names or generic names in the scientific literature or in medical databases, and these names are not reflected retrospectively in the various IP databases. As a result, it is difficult to perform text searching for certain pharmaceuticals in the patent literature using commonly accepted phrases or definitions. For example, one cannot simply type the search term “aspirin” or “VALIUM®” into any of the IP databases and find the pertinent patents for those chemical substances. The problem is further exacerbated by the fact that different brand names are often used in different countries to address language considerations of the different geographical areas. In fact, as many as 149 different names have been employed in the literature for the drug VALIUM®, a number of which are illustrated in Table 2.
Table 2 - Some of the trade names used to refer to VALIUM®
ALBORAL, ALISEUM, ALUPRAM, AMIPROL, ANSIOLIN, ANSIOLISINA, APAURIN, APOZEPAM, ASSIVAL, ATENSINE, ATILEN, BIALZEPAM, CALMOCITENE, CALMPOSE, CERCINE, CEREGULART, CONDITION, DAP, DIACEPAN, DIAPAM, DIAZEMULS, DIAZEPAM, DIAZETARD, DIENPAX, DIPAM, DIPEZONA, DOMALIUM, DUKSEN, DUXEN, E-PAM, ERIDAN, EVACALM, FAUSTAN, FREUDAL, FRUSTAN, GIHITAN, HORIZON, KIATRIUM, LA-III, LEMBROL, LEVIUM, LIBERETAS, METHYL DIAZEPINONE, MOROSAN, NEUROLYTRIL, NOAN, NSC-77518, PACITRAN, PARANTEN, PAXATE, PAXEL, PLIDAN, QUETINIL, QUIATRIL, QUIEVITA, RELAMINAL, RELANIUM, RELAX, RENBORIN, RO 5-2807, S.A.R.L., SAROMET, SEDAPAM, SEDIPAM, SEDUKSEN, SEDUXEN, SERENACK, SERENAMIN, SERENZIN, SETONIL, SIBAZON, SONACON, STESOLID, STESOLIN, TENSOPAM, TRANIMUL, TRANQDYN, TRANQUASE, TRANQUIRIT, TRANQUO-TABLINEN, UMBRIUM, UNISEDIL, USEMPAX AP, VALEO, VALITRAN, VALRELEASE, VATRAN, VELIUM, VFVAL, VIVOL, WY-3467
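The search failure described above can be illustrated with a short sketch. It assumes a plain case-insensitive substring search, a hypothetical set of three patent snippets, and a hand-built synonym table containing a few of the names from Table 2; none of these reflect any particular IP database or the method of any particular system.

```python
def exact_match(query, documents):
    """Return ids of documents whose text contains the query string."""
    needle = query.lower()
    return sorted(
        doc_id for doc_id, text in documents.items() if needle in text.lower()
    )


def synonym_match(query, documents, synonyms):
    """Search for the query and every name listed as its synonym."""
    terms = synonyms.get(query.upper(), {query.upper()})
    found = set()
    for term in terms:
        found.update(exact_match(term, documents))
    return sorted(found)


# Hypothetical synonym table built from a few Table 2 entries.
SYNONYMS = {"VALIUM": {"VALIUM", "DIAZEPAM", "APAURIN", "STESOLID"}}

# Hypothetical patent snippets used only for illustration.
documents = {
    "patent_a": "A tablet comprising diazepam and a binder.",
    "patent_b": "Use of Apaurin for the treatment of anxiety.",
    "patent_c": "A process for preparing potassium carbonate.",
}

plain_hits = exact_match("VALIUM", documents)
expanded_hits = synonym_match("VALIUM", documents, SYNONYMS)
```

Here the plain search for "VALIUM" finds nothing, while the search expanded with the assumed synonym table finds both documents that refer to the drug under other names.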
Additionally, many chemical and drug patents make use of Markush structure references. These structures are generalized references to chemical structures where some substituent groups are specified in general terms, and a list of possible substituents is enumerated. Thus, rather than a specific chemical compound being named, the Markush convention allows claimants to describe an entire series of compounds even if they have not specifically been synthesized or tested.
For example, and referring to FIG. 2, rather than representing toluene (methylbenzene) as C6H5—CH3, the Markush formulation allows one to represent an entire series of substituted benzenes as C6H5—R, where R is, by convention, any of a large number of carbon chains of various sizes. This convention further increases the difficulty of locating a chemical compound by normal searching techniques.
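To make the difficulty concrete, the following sketch enumerates a few members of the series that a single Markush formula can cover. It assumes, purely for illustration, that R is limited to simple alkyl groups of formula CnH(2n+1) with up to three carbons; the function names are hypothetical, and an ASCII hyphen stands in for the bond.

```python
def alkyl_group(n):
    """Return the condensed formula of an alkyl group CnH(2n+1)."""
    carbons = "C" if n == 1 else f"C{n}"
    return f"{carbons}H{2 * n + 1}"


def expand_markush(core, max_carbons):
    """List core-R for every alkyl group R with 1..max_carbons carbons."""
    return [f"{core}-{alkyl_group(n)}" for n in range(1, max_carbons + 1)]


# Illustrative expansion of C6H5-R under the assumption stated above.
members = expand_markush("C6H5", 3)
```

The first member, `C6H5-CH3`, is toluene. None of the expanded strings contains the literal text `C6H5-R`, so a plain text search for a specific compound does not match a document that discloses it only through the generic formula.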
In U.S. Pat. No. 6,304,869, Moore et al. describe a system to assign sub-structures to fragments given a complete structure connectivity description of a molecule, as well as a relational database system for storing this information. However, that system has no concept of finding structures or substructures from names.