Modern database management systems have been used since the early 1970's. Commercial database systems mostly concentrate on finding exact matches. Searches are performed either to find a specific entry, or to find multiple entries having the same characteristics. Attributes of the data often become fields. An exact search can be made to find a specific person by looking up his name or social security number. A search can be performed to find multiple individuals having the same occupation or place of birth. Alternatively, one may locate all people born before a particular date. Whether a single entry or multiple entries are found, this type of query constitutes an exact search. Exact searches try to exactly or relationally match one or more fields in different data records.
Similarity searching of databases has been around for several years. A similarity search compares two or more entries in their entirety to determine how closely they match one another. Consider the following simple database containing entries of various animals that fly:

        a house-fly
        a bat
        a hummingbird
        a dragonfly
        a flying fish
        a hawk

The question: “Which are most similar?” is not meaningful without additional input. A proper answer requires input of the key dimension. If “feathers” represent the key dimension, then the hawk and the hummingbird are most similar. If “the ability to fly stationary” is the key dimension, then the dragonfly and the hummingbird are most similar. Other possible key dimensions could be metabolism, life span, body temperature, etc. Therefore, the answer to the question: “Which are most similar?” is subjective depending upon the preferences of the person asking the question.
For a more complicated residential real estate database, a potential buyer would be looking to buy a home by expressing preferences that become the parameters for a similarity search. Such parameters might include number of bedrooms, type of house, asking price, neighborhood, quality of the local school system, property taxes, age restrictions on residents, home-owners' associations, etc. Currently, a real estate agent would first screen for homes having a specific most desirable characteristic (e.g., neighborhood or number of bedrooms). Then, the agent would look for the next desirable characteristic. The process would be repeated for each parameter, each search yielding a number of homes for consideration by the buyer. Where a particular home appears in the search results multiple times, it is more likely that the agent can make a sale. However, a feature vector may be created using these and other parameters, and a similarity search can be performed to match a potential buyer's preferences. This search would generate a list of homes approximating these preferences. A vector could indicate whether or not the buyer is interested in a particular feature. The homes can then be compared in their entirety by computing the mathematical difference between their feature vectors. The smaller the difference between the feature vectors of an ideal home and an available home, the more similar they are.
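The comparison of an ideal home against available homes can be sketched as follows. This is a minimal illustration, assuming yes/no feature vectors and a simple sum of absolute differences; the feature names, the homes, and the function name `difference` are hypothetical and do not come from any particular system.

```python
# Hypothetical yes/no features, listed only to document what each position means.
FEATURES = ["four_plus_bedrooms", "single_family", "good_schools", "low_taxes", "no_hoa"]

def difference(a, b):
    """Sum of absolute element-wise differences between two feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))

# 1 = the buyer wants (or the home has) the feature, 0 = otherwise.
ideal_home = [1, 1, 1, 0, 1]
available_homes = {
    "home_a": [1, 1, 0, 0, 1],
    "home_b": [0, 0, 1, 1, 0],
    "home_c": [1, 1, 1, 0, 1],
}

# Smaller difference means more similar, so sort in ascending order.
ranked = sorted(available_homes,
                key=lambda name: difference(ideal_home, available_homes[name]))
print(ranked)
```

Each home is compared in its entirety in a single pass, rather than by repeating a separate screen for each parameter.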
Representations of organic chemical compounds can be structurally decomposed into recognizable fragments, such as functional groups, carbocyclic rings, heterocyclic rings, aliphatic chains, and carbon-heteroatom chains. Many databases store molecular information according to their chemical structure fragments. The data processing systems maintain a fragment dictionary, and all compounds input into the database are parsed so as to generate a descriptor vector of each compound. Each element in the descriptor vector corresponds to a fragment in the fragment dictionary. The dictionary is instituted with fragments well known to those skilled in the art.
In other chemical database systems an algorithm replaces the chemical fragment dictionary. Such algorithms, in effect, generate chemical fragments on the fly and the fragments are then used to parse chemical structures and create a descriptor vector. The elements of the vector may directly correspond to chemical fragments or an algorithm, such as a hashing algorithm, may be used to generate an alternative relationship between the compounds and the vector elements.
One possible representation of a molecule would be to parse it into a binary fragment vector. Each bit represents the presence or absence of a particular fragment in the fragment dictionary or fragment generating algorithm. The binary vector may be represented logically as a string of bits or bytes or may have any convenient representation. The binary vector forms a fingerprint for the molecule. Each bit or fragment in the fingerprint is a dimension representing one row in the vector.
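A binary fingerprint of this kind might be built as in the sketch below. It assumes the compound has already been parsed into a set of named fragments; the five-entry fragment dictionary and the function names are hypothetical, and a real dictionary would be far larger.

```python
# Hypothetical fragment dictionary; the position in the list is the bit position.
FRAGMENT_DICTIONARY = [
    "carboxylic_acid",
    "sulfonamide",
    "benzene_ring",
    "pyridine_ring",
    "hydroxyl",
]

def fingerprint(fragments_present, dictionary=FRAGMENT_DICTIONARY):
    """Return one bit per dictionary fragment: 1 if present, 0 if absent."""
    present = set(fragments_present)
    return [1 if fragment in present else 0 for fragment in dictionary]

def to_bit_string(bits):
    """Render the fingerprint as a string of '0' and '1' characters."""
    return "".join(str(bit) for bit in bits)

# A compound assumed to contain a benzene ring and a carboxylic acid group.
fp = fingerprint({"benzene_ring", "carboxylic_acid"})
print(to_bit_string(fp))
```

The list-of-integers form is chosen only for readability; a packed bit or byte representation would serve equally well.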
A chemical structure similarity search may be performed by calculating the similarity between the vector for a query compound and the vectors of the compounds stored in a chemical database. Standard measures of similarity such as Euclidean Distance, Tanimoto coefficient, Hamming Distance, Soergel Distance, Dice Coefficient, Cosine Coefficient or other similarity algorithms can be used for the similarity calculation. The results can then be returned in order of decreasing similarity.
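Three of the measures named above can be sketched for binary fingerprints as follows. The definitions used are the standard ones (Tanimoto as shared bits divided by bits set in either vector, Dice as twice the shared bits divided by the total bits set, Hamming as the count of differing positions). The fingerprints and compound names are hypothetical, and returning 1.0 for two all-zero vectors is a convention assumed here, not something the text specifies.

```python
def tanimoto(a, b):
    """Shared set bits divided by bits set in either vector."""
    common = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b) - common
    return common / total if total else 1.0  # assumed convention for empty vectors

def dice(a, b):
    """Twice the shared set bits divided by the total set bits in both vectors."""
    common = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * common / total if total else 1.0  # assumed convention

def hamming_distance(a, b):
    """Number of positions at which the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))

query = [1, 0, 1, 0, 0]
database = {
    "compound_1": [1, 0, 1, 0, 0],
    "compound_2": [1, 0, 1, 1, 0],
    "compound_3": [0, 1, 0, 1, 1],
}

# Return the hits in order of decreasing Tanimoto similarity.
hits = sorted(database, key=lambda name: tanimoto(query, database[name]), reverse=True)
for name in hits:
    print(name, round(tanimoto(query, database[name]), 3))
```

Note that Hamming distance is a dissimilarity, so a search ranked by it would sort in ascending rather than descending order.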
In another application, chemical compounds, natural products, fermentation broths, and other substances are often tested for biological activity, or pharmacological activity. The results of these tests are often stored in electronic databases. One method that can be used to examine biological screening results and property data is similarity searching. A biological activity profile can be created by graphing the test results of a specific compound in multiple biological tests. Compounds possessing a similar mode of action often possess similar biological profiles. Biologists and chemists are thus often interested in searching a database of biological screening results for substances with an activity profile similar to a given biological activity profile. For example, in the development of an antibiotic a scientist might be interested in substances showing good activity against gram-positive bacteria and one species of gram-negative bacteria. The profile of such a substance would have strong activity values for the several gram-positive and one gram-negative bacteria under consideration and weak activity values for the rest of the gram-negative species tested. In addition to biological activity descriptors, chemical descriptors may be usefully included in the description of a compound. Chemical structure descriptors, as discussed above, or properties such as pKa (acidity/basicity), LogP (lipophilicity measurement), MR (molar refractivity), IR (index of refraction), hydrogen bond donor count, hydrogen bond acceptor count, heteroatom count, molecular weight, Rule of 5 value (Lipinski value), retention value (measured via gas chromatography, liquid chromatography, or thin layer chromatography), or spectroscopic peak measurements (infrared, nuclear magnetic resonance, ultraviolet) may be used to characterize the compounds. These descriptors can be included in compound fingerprints and used in both chemical and biological similarity calculations.
In this case, it would be desirable to create a vector where a specific element would refer to a particular feature or test, and the vector would contain numeric values. The difference between vectors may be measured, and difference values would represent the degree of similarity between entries in the database.
Variables may be standardized prior to performing the similarity calculations to ensure that all variables are treated equally. For example, variables may be mean-centered and scaled to unit variance according to procedures that are well known in statistical art. (For example, see Weisberg, S., “Applied Linear Regression,” Second Edition, John Wiley and Sons, 1985, pp. 185–186.) For a vector X with j elements:
$$\frac{X_j - \bar{x}_j}{SD_j}$$

where:
        $\bar{x}_j$ = mean of the elements of X, and
        $SD_j$ = standard deviation of the elements of X.

Other standardization techniques well known in the statistical art are also used.
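As a small worked illustration of mean-centering and scaling to unit variance, the sketch below standardizes the values of one variable. It assumes the mean and standard deviation are taken over the values that variable has across the database entries, and it uses the sample standard deviation (n - 1 denominator); the text does not say which estimator is intended, and the numbers are made up.

```python
from statistics import mean, stdev

def standardize(values):
    """Mean-center the values and scale them to unit variance."""
    m = mean(values)
    sd = stdev(values)  # sample standard deviation; an assumption of this sketch
    if sd == 0:
        raise ValueError("cannot scale a variable whose values are all equal")
    return [(v - m) / sd for v in values]

# Hypothetical values of a single variable across five database entries.
raw = [2, 3, 4, 5, 6]
scaled = standardize(raw)
print([round(v, 3) for v in scaled])
```

After this step the variable has a mean of zero and a variance of one, so it enters the similarity calculation on the same footing as every other standardized variable.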
Similarity searching using vectors is prior art. The similarity between vectors may be measured, and magnitude of the similarity value would represent the degree of similarity between entries in the database. In current systems the results of the similarity calculations are, to a large extent, predetermined by the algorithm. However, a user of a similarity searching system may have different ideas about the importance of various descriptors that are used in the similarity calculation. For example, when chemists search for similarity in chemical compounds, some parts of the molecule are more important to them than other parts. A chemist may be searching for compounds whose biological activity is similar, or they may be searching for compounds whose syntheses are similar. The importance of fragment descriptors will vary depending upon the type of similarity sought. Chemists therefore, when performing a search, would be interested in establishing a higher search priority for the fragment descriptors they consider important and a lower search priority for the other fragment descriptors. Currently, search priorities are defined by the search algorithm designer. If priorities of descriptors can be dynamically assigned by the user, then the results of the search will reflect what the user desires. The units of assigned priority or weights can be arbitrary, and only their ratio to each other is important.
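One way user-assigned priorities could enter the calculation is sketched below, using a weighted Euclidean distance. The text does not specify a weighting formula, so this choice is an assumption made for illustration, and the vectors, entry names, and weights are hypothetical. The last lines check the remark that only the ratio of the weights matters: multiplying every weight by the same constant leaves the ranking unchanged.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance with each squared difference scaled by its weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def rank(query, database, weights):
    """Entry names ordered from most to least similar (smallest distance first)."""
    return sorted(database,
                  key=lambda name: weighted_distance(query, database[name], weights))

query = [1, 1, 0, 1]
database = {
    "entry_a": [1, 0, 0, 1],  # differs from the query in descriptor 1
    "entry_b": [0, 1, 0, 1],  # differs from the query in descriptor 0
    "entry_c": [1, 1, 1, 0],  # differs from the query in descriptors 2 and 3
}

equal_weights = [1, 1, 1, 1]
user_weights = [5, 1, 1, 1]                  # the user treats descriptor 0 as most important
scaled_weights = [10 * w for w in user_weights]

print(rank(query, database, equal_weights))
print(rank(query, database, user_weights))
print(rank(query, database, scaled_weights))  # same order as user_weights
```

With equal weights, entry_b ranks ahead of entry_c; once descriptor 0 is given a higher priority, the mismatch in that descriptor pushes entry_b to the bottom of the list.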
Another issue for scientists using chemical structure similarity searching is the need to perform multiple searches in order to find all relevant answers. Certain chemical functional groups, while structurally different, are known in the art to be similar to one another. For example, the syntheses of common heterocycles are often similar. A scientist searching for information related to the synthesis of one heterocycle might also be interested in the syntheses of related similar heterocycles. In addition, it is well known in the art that certain functional groups (bio-isosteres) often possess similar biological effects, even if they differ at the atomic level. For example, sulfonamide can often be substituted for carboxylic acid in biologically active compounds. Current similarity calculations based upon molecular structure do not take synthetic similarity or bio-isosteric similarity into account. If a scientist wishes to search for all of the compounds that are of interest, multiple queries must be entered and searched. Automation of secondary query generation and searching would simplify the task of the user.
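Automated generation of secondary queries might look like the following sketch. It assumes queries are expressed as sets of named fragments and that a table of interchangeable groups is available. Only the carboxylic acid and sulfonamide pairing comes from the text; the table layout, the fragment names, and the function name `secondary_queries` are hypothetical.

```python
# Hypothetical table of groups treated as interchangeable for search purposes.
EQUIVALENTS = {
    "carboxylic_acid": ["sulfonamide"],
    "sulfonamide": ["carboxylic_acid"],
}

def secondary_queries(primary, equivalents=EQUIVALENTS):
    """Build queries by swapping one fragment at a time for a known equivalent."""
    queries = []
    for fragment in sorted(primary):
        for substitute in equivalents.get(fragment, []):
            candidate = (primary - {fragment}) | {substitute}
            # Skip duplicates and anything identical to the primary query.
            if candidate != primary and candidate not in queries:
                queries.append(candidate)
    return queries

primary = frozenset({"benzene_ring", "carboxylic_acid"})
all_queries = [primary] + secondary_queries(primary)
for q in all_queries:
    print(sorted(q))
```

Each query in `all_queries` would then be run through the ordinary similarity search and the results merged, so the user enters only the primary query.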
In the previously mentioned residential real estate database, the similarity search revealed homes having all of the features that interested a potential buyer. Yet, for some potential buyers, certain items are more important than others. For example, for a family with four children, purchase of a house with five bedrooms and the quality of the school system might be more important than asking price and property taxes. Yet these latter features could also serve as influencing factors. In such a case, being able to assign higher priorities to certain features and lower priorities to other features would result in a more meaningful search.
It is important to note that where various fields in a data record are not numerically related to each other, the variations in their magnitude may be large thereby improperly biasing the search. In the example of the residential real estate database, the price of a house is represented by a very much larger number than the number of bedrooms in the house. If the data record is not standardized as mentioned earlier, the price will dominate the similarity calculation even where the number of rooms has a higher priority. By mean-centering the data fields and scaling them to unit variance, a more meaningful search can be performed.
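The effect described above can be seen in a small numeric sketch. The prices and bedroom counts are made up, Euclidean distance is assumed as the measure, and the mean and sample standard deviation of each field are computed over the available homes only.

```python
import math
from statistics import mean, stdev

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def column_stats(rows):
    """Mean and sample standard deviation of each field across all rows."""
    return [(mean(col), stdev(col)) for col in zip(*rows)]

def standardize(row, stats):
    """Mean-center each field and scale it to unit variance."""
    return [(v - m) / sd for v, (m, sd) in zip(row, stats)]

# Hypothetical records: [asking price in dollars, number of bedrooms].
homes = {
    "home_a": [300_000, 5],
    "home_b": [310_000, 2],
    "home_c": [340_000, 5],
}
ideal = [305_000, 5]

# Raw fields: price differences swamp the bedroom differences.
raw_ranked = sorted(homes, key=lambda n: euclidean(ideal, homes[n]))

# Standardized fields: both variables contribute on a comparable scale.
stats = column_stats(list(homes.values()))
std_ideal = standardize(ideal, stats)
std_ranked = sorted(homes,
                    key=lambda n: euclidean(std_ideal, standardize(homes[n], stats)))

print(raw_ranked)
print(std_ranked)
```

On the raw data the two-bedroom home_b ranks ahead of the five-bedroom home_c purely because its price is closer to the ideal. After standardization the three-bedroom shortfall counts for more, and home_c moves ahead of home_b.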
The underlying mathematics for searching is very broadly applicable. It can be used inter alia in biology and medical databases, in physiology databases, in anthropology databases, in photography databases, and in taxonomy databases. It is practical where a characterization vector can be applied to the description of the data.
It is an object of the invention described herein to create a computerized system that will perform similarity searches in an electronic database where the entries have a set of one or more descriptive properties capable of being expressed in numeric form and wherein the user can assign weights or priorities to the descriptive properties so as to influence the similarity searches. It is a further object of the invention to provide a means for the generation of secondary queries from a primary query to ensure that other relevant similar entries are retrieved from the database.