The ability to effectively retrieve information on generic chemical structures, i.e., so-called Markush structures, has been a problem of varying magnitude and complexity since the inception of the use of the Markush claim by the Patent Office in the 1920's. Many manual and mechanized information retrieval systems have been developed to meet the challenge of this problem but the known techniques for such retrieval are imprecise and often place a premium on the knowledge, intuition, and cognitive skills of the searcher.
The basic system for dealing with Markush structures is a manual system in which individual documents containing the Markush structure are classified according to a highly refined classification system and physically grouped according to the classification scheme into a search file. In making a search, the searcher classifies the document (query) in hand, goes to the appropriately classified physical group of documents in the search file, and manually searches those documents for relevant retrievals. Such a system places a high premium on the correct initial classification of search file documents, correct classification of the query, physical search-file integrity, and highly developed cognitive skills of the searcher. Moreover, because the Markush structure may represent thousands or even millions of compounds, it often is impossible to place copies of the document into all of the search file classifications represented by the Markush formulation. A weakness in any of the aforementioned areas is likely to produce unsatisfactory search results. (U.S. Department of Commerce, "Development and Use of Patent Classification Systems", U.S. Government Printing Office, Washington, D.C., 1966.)
Another technique used in both manual and mechanized systems for the handling of Markush structures involves the use of a system of fragmentation codes that are in effect generic or real-atom "group" representations of portions of a particular Markush formulation. For example, a portion of the formulation containing chains of carbon atoms might be generically encoded as alkyl, an OH group as an alcohol or hydroxide, and F, Cl, Br, or I as a halide. Real-atom groups, such as methyl for CH.sub.3 --, ethyl for CH.sub.3 CH.sub.2 --, and phenyl for C.sub.6 H.sub.5 --, are also typically used. (Balent, M. Z.; Emberger, J. M. "A Unique Chemical Fragmentation System for Indexing Patent Literature" J. Chem. Inf. Comput. Sci. 1975, 15, 100-104. Kaback, S. M. "Chemical Structure Searching in Derwent's World Patents Index" J. Chem. Inf. Comput. Sci. 1980, 20, 1-6. Rossler, S.; Kolb, A. "The GREMAS System, an Integral Part of the IDC System for Chemical Documentation" J. Chem. Doc. 1970, 10, 128-134. Rowlett, R. J. "Gleaning Patents with Chemical Abstracts" Chemtech 1979, June, 348-349. Silk, J. A. "Present and Future Prospects for Structural Searching of the Journal and Patent Literature" J. Chem. Inf. Comput. Sci. 1979, 19, 195-198.) However, the inter-relationships among these groups in a Markush formulation are typically not encoded. As a result, such systems tend to have good recall, i.e., most of the relevant search file answers are retrieved, but, because the inter-relationships among the groups cannot be specified and because of the reliance on generic terminology, they have a pronounced tendency to lack precision, i.e., many of the answers retrieved are irrelevant to the query. Precision has been improved by incorporating a higher degree of specificity into the fragmentation codes, but only at the price of higher complexity and difficulty in file encoding and search profile formulation, and a resulting higher potential for error.
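The recall and precision behavior described above can be illustrated with a minimal sketch. The index contents, document names, and the function `fragment_search` are hypothetical and are not drawn from any of the cited systems; the sketch assumes only that each document is reduced to an unordered set of group codes.

```python
# Hypothetical fragment-code index: each document is reduced to an unordered
# set of group codes, so the way the groups are connected is not recorded.
FILE_INDEX = {
    "doc_A": {"phenyl", "halide", "alkyl"},  # assume the halogen is on the ring
    "doc_B": {"phenyl", "halide", "alkyl"},  # assume the halogen is on the chain
    "doc_C": {"phenyl", "alcohol"},
}


def fragment_search(query_codes, index):
    """Return the documents whose code set contains every query code."""
    return sorted(doc for doc, codes in index.items() if query_codes <= codes)


# A searcher interested only in ring-halogenated compounds can ask for no more
# than "phenyl AND halide", so doc_B is retrieved together with doc_A.
hits = fragment_search({"phenyl", "halide"}, FILE_INDEX)
print(hits)
```

In this sketch both doc_A and doc_B are retrieved because their code sets are identical; nothing in the index distinguishes where the halogen is attached, which is the loss of precision noted above.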
Mechanized specific atom-by-atom structure matching of query and file structural representations is a well-known commercial technique that has been available since the 1960s and has demonstrated high recall and precision as a search and retrieval technique. (Wigington, R. L. "Machine Methods for Accessing Chemical Abstracts Service Information" in Proceedings of the IBM Symposium on Computers and Chemistry; IBM Data Processing Division: White Plains, NY, 1969. Eakin, D. R. "The ICI CROSSBOW System," in Ash, J. E.; Hyde, E., Eds. Chemical Information Systems, Chichester, Horwood, 1975. Dubois, J. E. "DARC System in Chemistry", in Computer Representation and Manipulation of Chemical Information, Wipke, W. T.; Heller, S.; Feldman, R.; Hyde, E., Eds., Wiley, New York, 1974. Schenk, H. R.; Wegmuller, F. "Substructure Search by Means of the Chemical Abstracts Service Chemical Registry II System" J. Chem. Inf. Comput. Sci. 1976, 16, 153-161. Feldman, R. J. "Interactive Graphic Chemical Substructure Searching" in Computer Representation and Manipulation of Chemical Information, Wipke, W. T.; Heller, S.; Feldman, R.; Hyde, E., Eds., Wiley, New York, 1974.) Because atom-by-atom structure matching is a relatively slow process, screening techniques have been developed to eliminate a high percentage of irrelevant file representations. Typically, screening involves capturing key features of the file representations, such as atom environments and atom sequences, and then matching similar key features of the query representation to give a set of candidate answers that are then subjected to atom-by-atom structure matching. (Dittmar, P. G.; Farmer, N. A.; Fisanick, W.; Haines, R. C.; Mockus, J. "The CAS ONLINE Search System. 1. General System Design and Selection, Generation, and Use of Search Screens" J. Chem. Inf. Comput. Sci. 1983, 23, 93-102. Attias, R. "DARC Substructure Search System: A New Approach to Chemical Information" J. Chem. Inf. Comput. Sci. 1983, 23, 102-108.)
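The two-stage process of screening followed by atom-by-atom matching may be sketched as follows. This is an illustrative simplification and not the method of any cited system: structures are assumed to be hydrogen-suppressed graphs with bond orders ignored, the only screen is the set of bonded element pairs (a two-atom "sequence"), and the names `pair_screen`, `is_substructure`, `search`, and `FILE` are hypothetical.

```python
# A structure is (atoms, bonds): element symbols plus index pairs.
FILE = {
    "ethanol": (["C", "C", "O"], [(0, 1), (1, 2)]),
    "chloroethane": (["C", "C", "Cl"], [(0, 1), (1, 2)]),
    "chain_CCNCO": (["C", "C", "N", "C", "O"], [(0, 1), (1, 2), (2, 3), (3, 4)]),
}


def pair_screen(mol):
    """Screen key: the set of bonded element pairs present in the structure."""
    atoms, bonds = mol
    return {tuple(sorted((atoms[i], atoms[j]))) for i, j in bonds}


def is_substructure(query, target):
    """Atom-by-atom backtracking test: is the query embedded in the target?"""
    q_atoms, q_bonds = query
    t_atoms, t_bonds = target
    t_adj = {frozenset(b) for b in t_bonds}

    def extend(mapping):
        k = len(mapping)  # index of the next query atom to place
        if k == len(q_atoms):
            return True
        for t in range(len(t_atoms)):
            if t in mapping or t_atoms[t] != q_atoms[k]:
                continue
            # Every query bond from atom k to an already placed atom
            # must also be present between the mapped target atoms.
            ok = all(
                frozenset((mapping[a if b == k else b], t)) in t_adj
                for a, b in q_bonds
                if k in (a, b) and max(a, b) == k
            )
            if ok and extend(mapping + [t]):
                return True
        return False

    return extend([])


def search(query, file):
    """Apply the cheap screen first, then match only the surviving candidates."""
    q_screen = pair_screen(query)
    candidates = [n for n, mol in file.items() if q_screen <= pair_screen(mol)]
    answers = [n for n in candidates if is_substructure(query, file[n])]
    return candidates, answers


query = (["C", "C", "O"], [(0, 1), (1, 2)])  # C-C-O fragment
candidates, answers = search(query, FILE)
print(candidates, answers)
```

In this sketch the screen discards chloroethane without any atom-by-atom work, while the C-C-N-C-O chain passes the screen (it contains both a C-C and a C-O pair) and is rejected only by the slower matching step. A screen of this kind is a necessary condition for a match, not a sufficient one.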
Unfortunately, structure matching techniques tend to be limited to files containing representations of unique individual compounds, and queries have been limited to specific structural representations that must exactly match the structural representation of the file compound (full-structure search) or be embedded within it (substructure search). Structure matching techniques have been applied to Markush formulations which represent a relatively small number of specific compounds using queries that contain only real atoms. (Meyer, E. "Topological Search for Classes of Compounds in Large Files--even of Markush Formulas--at Reasonable Machine Cost" in Computer Representation and Manipulation of Chemical Information, Wipke, W. T.; Heller, S.; Feldman, R.; Hyde, E., Eds., Wiley, New York, 1974.) However, in attempting to apply structure matching techniques to query and file structures represented by Markush formulations of the type often found in broad patent claims, one is immediately faced with the problem that a single Markush formulation may literally represent millions of specific compounds. When one considers that the file size of the current large commercial structure matching systems is a little less than seven million specific compounds, an appreciation is gained for the difficulty in using structure matching techniques to search Markush structures effectively. Although proposals have been made to apply structure matching techniques to broad Markush formulations, no viable system for searching such Markush formulations that gives a high degree of recall and precision has yet been achieved. (Lynch, M. F.; Barnard, J. M.; Welford, S. M. "Computer Storage and Retrieval of Generic Chemical Structures in Patents. 1. Introduction and General Strategy" J. Chem. Inf. Comput. Sci. 1981, 21, 148-150. Barnard, J. M.; Lynch, M. F.; Welford, S. M. "Computer Storage and Retrieval of Generic Chemical Structures in Patents. 2. GENSAL, a Formal Language for the Description of Generic Chemical Structures" J. Chem. Inf. Comput. Sci. 1981, 21, 151-161. Welford, S. M.; Lynch, M. F.; Barnard, J. M. "Computer Storage and Retrieval of Generic Chemical Structures in Patents. 3. Chemical Grammars and their Role in the Manipulation of Chemical Structures" J. Chem. Inf. Comput. Sci. 1981, 21, 161-168. Barnard, J. M.; Lynch, M. F.; Welford, S. M. "Computer Storage and Retrieval of Generic Chemical Structures in Patents. 4. An Extended Connection Table Representation (ECTR) for Generic Structures" J. Chem. Inf. Comput. Sci. 1982, 22, 160-164. Nakayama, T.; Fujiwara, Y. "Computer Representation of Generic Chemical Structures by an Extended Block-Cutpoint Tree" J. Chem. Inf. Comput. Sci. 1983, 23, 80-87. Kudo, Y.; Chihara, H. "Chemical Substance Retrieval System for Searching Generic Representations. 1. A Prototype System for the Gazetted List of Existing Chemical Substances of Japan" J. Chem. Inf. Comput. Sci. 1983, 23, 109-117.)
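The scale of the problem can be seen from a short calculation. The variable positions and the number of alternatives at each position below are hypothetical, chosen only for illustration; the sketch assumes the alternatives at each position can be combined independently, so the number of specific compounds is the product of the counts.

```python
from math import prod

# Hypothetical Markush formulation: alternatives permitted at each variable
# position (illustrative numbers, not taken from any actual claim).
alternatives = {"R1": 20, "R2": 50, "R3": 100, "X": 10}

# With independent positions, every combination is a distinct specific compound.
compound_count = prod(alternatives.values())
print(f"{compound_count:,} specific compounds")
```

Under these assumptions, four variable positions already yield one million specific compounds from a single formulation, which is of the same order as the entire file of just under seven million compounds mentioned above.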