1. Field of the Invention
This invention relates to a method of storing chemical structure data in a storage device and searching said chemical structure data using a query chemical structure by examining the match or analogy between the query structure and the stored structure data.
2. Description of the Prior Art
Recently, various kinds of information, including patent information, have increasingly been handled by computers. Textual data, which consist of alphanumeric characters, such as patent claim information or technical information, are now stored in databases, and specific pieces of information are easily retrieved by searching the database. In a textual database, keywords are extracted from each piece of information, generally called a record, and those keywords are sorted alphabetically in the database as an inverted file. A search is conducted by combining the record lists of the individual keywords using Boolean operations with the AND, OR, and NOT logical operators. The basic idea of this method was introduced in the early 1960's. The first computer system using this method was introduced in the 1970's in the United States. Most current online information retrieval systems use this type of textual data retrieval method.
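The inverted-file search described above can be illustrated with a minimal Python sketch. The record numbers, the keywords, and the function name build_inverted_file are hypothetical and are chosen only for illustration; they are not taken from any particular retrieval system.

```python
from collections import defaultdict


def build_inverted_file(records):
    """Map each keyword to the set of record numbers that contain it."""
    index = defaultdict(set)
    for record_id, keywords in records.items():
        for keyword in keywords:
            index[keyword].add(record_id)
    return index


# Hypothetical records: record number -> keywords extracted from the record.
records = {
    1: {"polymer", "catalyst"},
    2: {"polymer", "coating"},
    3: {"catalyst", "zeolite"},
}

index = build_inverted_file(records)

# The keywords are kept in alphabetical order, as in an inverted file.
sorted_keywords = sorted(index)

# Boolean operators correspond to set operations on the record lists.
polymer_and_catalyst = index["polymer"] & index["catalyst"]  # AND
polymer_or_zeolite = index["polymer"] | index["zeolite"]     # OR
polymer_not_coating = index["polymer"] - index["coating"]    # NOT
```

Each keyword lookup returns a ready-made record list, so a query is answered by combining lists rather than by reading every record.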
On the other hand, storage and retrieval of chemical structure information, which is graphic in nature, has not been as easy to achieve as that of textual data. Handling of chemical substance data is discussed in a book, "Chemical Information System", edited by J. E. Ash and E. Hyde, Ellis Horwood Ltd., 1975. A related patent is U.S. Pat. No. 4,085,443 by Araki. It was only in the early 1980's that chemical structure storage and retrieval systems became commercially available. An inverted file of the kind used to handle textual information is not applicable to graphic data such as chemical structure data. Rather, it is necessary to compare the atoms and bonds of a query chemical structure with those of each chemical structure stored in a database to find a match between the structures. To make this comparison, it is necessary to create and keep so-called connection tables for all query and file structures. Since this comparison requires tracking atom connections one by one, it is usually called an iterative search. The iterative search consumes much computer time and considerably affects the overall search time, so it is necessary to minimize the number of candidate file structures on which iterative searches are conducted by screening out most of the "unwanted" structures. The screening is achieved by checking for the presence or absence of particular chemical characteristics, called screens, required by the query structure. For example, if the query structure contains a nitrogen atom, any file structure which does not have a nitrogen atom will be screened out. In a current commercial system, screens are created automatically by a computer when a query structure is created through an interactive session on a remote graphic terminal.
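The two-stage search described above, screening followed by an iterative atom-by-atom comparison of connection tables, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the method of any commercial system: the connection tables omit hydrogen atoms and bond orders, the only screen is the set of elements present, and the names element_screen, substructure_match, and search are hypothetical.

```python
def element_screen(query_atoms, file_atoms):
    """Screen: every element in the query must occur in the file structure."""
    return set(query_atoms.values()) <= set(file_atoms.values())


def substructure_match(query, target):
    """Iterative (backtracking) comparison of two connection tables.

    Each structure is (atoms, bonds): atoms maps an atom number to an
    element symbol, and bonds is a set of frozenset({a, b}) atom pairs.
    """
    q_atoms, q_bonds = query
    t_atoms, t_bonds = target
    q_ids = list(q_atoms)

    def extend(mapping):
        if len(mapping) == len(q_ids):
            return True  # every query atom has been placed
        q = q_ids[len(mapping)]
        for t in t_atoms:
            if t in mapping.values() or q_atoms[q] != t_atoms[t]:
                continue
            # Each query bond from q to an already mapped atom must
            # also be present between the corresponding file atoms.
            bonds_ok = all(
                frozenset({t, mapping[p]}) in t_bonds
                for p in mapping
                if frozenset({q, p}) in q_bonds
            )
            if bonds_ok:
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]  # backtrack and try another file atom
        return False

    return extend({})


def bond(a, b):
    return frozenset({a, b})


# Hypothetical file structures (hydrogen atoms omitted).
database = {
    "ethanol": ({1: "C", 2: "C", 3: "O"}, {bond(1, 2), bond(2, 3)}),
    "ethylamine": ({1: "C", 2: "C", 3: "N"}, {bond(1, 2), bond(2, 3)}),
    "methoxyamine": ({1: "N", 2: "O", 3: "C"}, {bond(1, 2), bond(2, 3)}),
}

# Query: a carbon atom bonded to a nitrogen atom.
query = ({1: "C", 2: "N"}, {bond(1, 2)})


def search(query, database):
    candidates = [name for name, (atoms, _) in database.items()
                  if element_screen(query[0], atoms)]
    hits = [name for name in candidates
            if substructure_match(query, database[name])]
    return candidates, hits


candidates, hits = search(query, database)
```

In this sketch the screen removes ethanol, which has no nitrogen atom, without any atom-by-atom work. Methoxyamine passes the screen but fails the iterative search, because its carbon and nitrogen atoms are not bonded to each other.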
Thus, the current chemical structure search systems can handle specifically defined structures, which are usually found in technical journals. On the other hand, a generic expression of a chemical structure is widely used in patent claims to widen the coverage of those claims. Specific examples of such generic expressions are:
Alkyl groups with C1-C5 chain.
Aromatic rings (i.e., benzene or naphthalene).
Heterocyclic (i.e., rings containing one or more non-carbon atoms) groups with a ring size of 5 or 6.
The generic expression often covers thousands or millions of specific chemical structures, and allows one to expand the scope of a claim without specifically identifying each structure. Since chemical substances themselves are patentable in most countries, it is very important to store and search the generic chemical structures. The current status of the handling of generic chemical structures is discussed thoroughly in the following references.
(1) "Computer Storage and Retrieval of Generic Chemical Structures in Patents. 1. Introduction and General Strategy" by M. F. Lynch, J. M. Barnard, and S. M. Welford, J. Chem. Inf. Comput. Sci., 1981, (21), 148-150.
(2) "Computer Storage and Retrieval of Generic Chemical Structures in Patents. 2. GENSAL, a Formal Language for the Description of Generic Chemical Structures" by J. M. Barnard, M. F. Lynch, and S. M. Welford, J. Chem. Inf. Comput. Sci., 1981, (21), 151-161.
(3) "Computer Storage and Retrieval of Generic Chemical Structures in Patents. 3. Chemical Grammars and their Role in the Manipulation of Chemical Structures" by S. M. Welford, M. F. Lynch, and J. M. Barnard, J. Chem. Inf. Comput. Sci., 1981, (21), 157-163.
(4) "Computer Storage and Retrieval of Generic Structures in Chemical Patents. 4. An Extended Connection Table Representation for Generic Structures" by J. M. Barnard, M. F. Lynch, and S. M. Welford, J. Chem. Inf. Comput. Sci., 1982, (22), 160-164.
(5) "Chemical Substance Retrieval System for Searching Generic Representations. 1. A Prototype System for the Gazetted List of Existing Chemical Substances of Japan" by Y. Kudo and H. Chihara, J. Chem. Inf. Comput. Sci., 1983, (23), 109-117.
(6) "Computer Storage and Retrieval of Generic Chemical Structures in Patents. 5. Algorithmic Generation of Fragment Descriptors for Generic Structure Screening" by S. M. Welford, M. F. Lynch, and J. M. Barnard, J. Chem. Inf. Comput. Sci., 1984, (24), 57-66.
(7) "Computer Storage and Retrieval of Generic Chemical Structures in Patents. 6. An Interpreter Program for the Generic Structure Description Language GENSAL" by J. M. Barnard, M. F. Lynch, and S. M. Welford, J. Chem. Inf. Comput. Sci., 1984, (24), 66-71.
(8) "A Relaxation Algorithm for Generic Chemical Structure Screening" by A. Von Scholley, J. Chem. Inf. Comput. Sci., 1984, (24), 235-241.
Because of this complexity, no system has so far been able to handle generic chemical structures successfully, although two approaches have been made to solve the problem partially.
(APPROACH A)
One approach is to store the specific structures expressed by the generic structures. In practice, a database containing structure information of substances specifically identified in patent examples is widely used. One example is the Registry File of CAS ONLINE. But patent examples usually describe only a portion of the generic structures in a claim, so the chemical structures in the examples, taken together, usually do not cover the claimed generic expression. Nor is it practical to expand generic structures into their component specific structures, since the number of specific structures derived from one generic structure easily explodes into the millions. For example, the expression "C4-C5 alkyl group" represents 12 specific alkyl radicals. If a generic structure contains three of these expressions, the combinations yield 12 × 12 × 12, or 1728, specific structures.
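The arithmetic of this expansion can be checked with a short Python sketch. The count of 12 is taken here as 4 butyl plus 8 pentyl constitutional isomers, ignoring stereoisomers; the function name expansion_size and the six-position case are illustrative assumptions.

```python
from math import prod


def expansion_size(choices_per_position):
    """Number of specific structures obtained when every variable
    position of a generic structure is expanded independently."""
    return prod(choices_per_position)


# "C4-C5 alkyl group": 4 butyl + 8 pentyl radicals (stereoisomers ignored).
c4_c5_alkyl = 4 + 8

three_positions = expansion_size([c4_c5_alkyl] * 3)  # 12 x 12 x 12
six_positions = expansion_size([c4_c5_alkyl] * 6)    # hypothetical larger case

print(three_positions)  # 1728
print(six_positions)    # 2985984
```

Because the count is a product over the variable positions, doubling the number of positions squares the count, which is how a single generic structure reaches millions of specific structures.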
(APPROACH B)
The other approach is to define codes for various chemically significant units, such as rings, chains, and functional groups, and to search the structures via those codes in the same way as keywords in textual databases. Examples are the World Patent Index of Derwent and the Comprehensive Database of IFI. In this approach, the expression "C4-C5 alkyl group" may be coded into two keywords, C4 and C5. Thus even a very complex generic structure can be coded fairly simply. One shortcoming of this approach is that a searcher has to know the coding rules and use the necessary codes explicitly. For example, in searching for a propyl group, one has to specify both of the keywords PROPYL and C3 ALKYL. A bigger problem is that the coding system cannot adequately express the connections between the chemical units. This results in a large number of irrelevant answers, which are usually called noise. Often more than 90% of the answer structures are noise. Another disadvantage is that, since the file has no connection tables, that is, no exact representation of chemical structures, it cannot be searched by structure, as is possible in a system based on specific connection tables. Thus the searcher needs to learn how to use the code system in order to code a query structure effectively. This evidently prevents the system from being used widely.
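The origin of the noise can be illustrated with a small Python sketch. The three file structures, the code names other than C3 ALKYL, and the helper names are hypothetical, and the representation of a structure as a set of connected unit pairs is a simplification made only for this illustration.

```python
def link(a, b):
    """An unordered connection between two chemical units."""
    return frozenset({a, b})


# Hypothetical file structures, each given as connections between units.
file_structures = {
    "A": {link("BENZENE", "CL"), link("BENZENE", "C3 ALKYL")},   # Cl on the ring
    "B": {link("BENZENE", "C3 ALKYL"), link("C3 ALKYL", "CL")},  # Cl on the chain
    "C": {link("BENZENE", "CL"), link("BENZENE", "C2 ALKYL")},   # no propyl group
}


def codes(connections):
    """A code-based file keeps only the units, not how they are connected."""
    return {unit for pair in connections for unit in pair}


# Query: a benzene ring carrying a chlorine atom, with a C3 alkyl group present.
query_codes = {"BENZENE", "C3 ALKYL", "CL"}
query_connection = link("BENZENE", "CL")

# Code search: every structure containing all of the query codes is an answer.
code_hits = sorted(name for name, conns in file_structures.items()
                   if query_codes <= codes(conns))

# Answers that also satisfy the required connection.
true_hits = [name for name in code_hits
             if query_connection in file_structures[name]]

noise = [name for name in code_hits if name not in true_hits]
```

Structures A and B receive identical codes, so the code search returns both, although only A has the chlorine atom on the ring. Structure B is noise that the code system cannot exclude, because the connection information was discarded when the codes were assigned.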