1. Field of the Invention
The present invention generally relates to computer-based and/or computer-assisted calculation of the chemical and/or textual similarity of chemical structures, compounds, and/or molecules and, more particularly, to ranking the similarity of chemical structures, compounds, and/or molecules with regard to the chemical and/or textual description of, for example, a user's probe, proposed, and/or lead compound(s).
2. Background Description
In recent years, pharmaceutical companies have developed large collections of chemical structures, compounds, or molecules. Typically, one or more employees of such a company will find that a particular structure in the collection has an interesting chemical and/or biological activity (e.g., a property that could lead to a new drug, or a new understanding of a biological phenomenon).
Similarity searches are a standard tool for drug discovery. A large portion of the effort expended in the early stages of a drug discovery project is dedicated to finding "lead" compounds (i.e., compounds which can lead the project to an eventual drug). Lead compounds are often identified by a process of screening chemical databases for compounds "similar" to a probe compound of known activity against the biological target of interest. Computational approaches to chemical database screening have become a foundation of the drug industry because the size of most commercial and proprietary collections has grown dramatically over the last decade.
Chemical similarity algorithms operate over representations of chemical structure based on various types of features called descriptors. Descriptors include the class of two dimensional representations and the class of three dimensional representations. As will be recognized by those skilled in the art, two dimensional representations include, for example, standard atom pair descriptors, standard topological torsion descriptors, standard charge pair descriptors, standard hydrophobic pair descriptors, and standard inherent descriptors of properties of the atoms themselves. By way of illustration, regarding the atom pair descriptors, for every pair of atoms in the chemical structure, a descriptor is established or built from the type of atom, some of its chemical properties, and its distance from the other atom in the pair.
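By way of a hedged illustration only, the atom pair idea described above can be sketched in Python. The sketch assumes a molecule is given as a list of element symbols plus a list of bonded index pairs, treats an atom's "type" as its element symbol followed by its number of heavy-atom neighbors, and measures distance as the number of bonds on the shortest path. These are simplifications chosen for the example, and the function names are hypothetical rather than taken from the cited literature.

```python
from collections import Counter, deque


def bond_distances(adjacency, start):
    """Breadth-first search: shortest bond count from `start` to each reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def atom_pair_descriptors(elements, bonds):
    """Count (type, distance, type) descriptors over every pair of atoms.

    elements: element symbols indexed by atom number (heavy atoms only).
    bonds:    (i, j) index pairs for bonded atoms.
    Assumption: atom type = element symbol + number of bonded neighbors.
    """
    adjacency = {i: [] for i in range(len(elements))}
    for i, j in bonds:
        adjacency[i].append(j)
        adjacency[j].append(i)
    types = [f"{el}{len(adjacency[i])}" for i, el in enumerate(elements)]

    counts = Counter()
    for i in range(len(elements)):
        dist = bond_distances(adjacency, i)
        for j in range(i + 1, len(elements)):
            if j not in dist:
                continue  # atoms in disconnected fragments form no pair
            a, b = sorted((types[i], types[j]))  # order-independent key
            counts[(a, dist[j], b)] += 1
    return counts


# Heavy atoms of ethanol: C-C-O
ethanol = atom_pair_descriptors(["C", "C", "O"], [(0, 1), (1, 2)])
```

Each key in the returned counter is one descriptor, and its value is the number of atom pairs in the structure that produce it.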
Three dimensional representations include, for example, standard descriptors accounting for the geometry of the chemical structure of interest, as mentioned above. Geometry descriptors may take into account, for example, the fact that a first atom is a short distance away in three dimensions from a second atom, although the first atom may be twenty bonds away from the second atom. Topological similarity searches, especially those based on comparing lists of pre-computed descriptors, are computationally very inexpensive.
The vector space model of chemical similarity involves the representation of chemical compounds as feature vectors. As will be recognized by those skilled in the art, exemplary features include substructure descriptors such as atom pairs (see Carhart, R. E.; Smith, D. H.; Venkataraghavan, R., "Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Applications", J. Chem. Inf. Comp. Sci. 1985, 25:64-73) and/or topological torsions (see Nilakantan, R.; Bauman, N.; Dixon, J. S.; Venkataraghavan, R., "Topological Torsions: A New Molecular Descriptor for SAR Applications", J. Chem. Inf. Comp. Sci. 1987, 27:82-85), all incorporated herein by reference.
As seen, many strategies for representing molecules in the collection and computing similarity between them have been devised. We have recognized, however, that these searches are often more involved when the goal is to select compounds that have similar activity or properties, but not obviously similar structure. That is, we have identified a need to ascertain, from a large collection of chemical structures, compounds, or molecules, a set of diverse chemical structures, for example, that may look dissimilar from the original probe compound, but exhibit similar chemical or biological activity. We have also recognized that although algorithms using, for example, Dice-type and/or Tanimoto-type coefficients, each known to those skilled in the art, by design, yield compounds that are most similar to the probe compound, such algorithms may fail to provide compounds or chemical structures characterized by diversity relative to the probe compound.
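To make the foregoing observation concrete, the following minimal Python sketch computes Tanimoto-type and Dice-type coefficients over sets of descriptors. It assumes the simplest set-based form of each coefficient (presence or absence of a descriptor, ignoring counts), and the descriptor strings shown are made up for the example.

```python
def tanimoto(a, b):
    """Tanimoto-type coefficient: shared descriptors / descriptors in either set."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a, b):
    """Dice-type coefficient: twice the shared descriptors / total descriptors."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Hypothetical descriptor sets for a probe and a candidate compound.
probe = {"C1-1-C2", "C1-2-O1", "C2-1-O1"}
candidate = {"C1-1-C2", "C2-1-N1", "C1-2-N1"}

t = tanimoto(probe, candidate)  # 1 shared / 5 distinct
d = dice(probe, candidate)      # 2 * 1 shared / 6 total
```

Because both coefficients depend only on the descriptors the two structures share, a compound with similar activity but few shared descriptors necessarily scores low, which is the limitation noted above.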
With respect to a chemical example, if a particular compound were found to be an HIV inhibitor, we have recognized that it would be desirable to search a database of chemical compounds or compositions and identify HIV inhibitors that have the same or similar pharmacological effect as the original HIV inhibitor, but that may be structurally dissimilar to the original HIV inhibitor probe. The capability of being able to find one or more dissimilar HIV inhibitors quickly and effectively can potentially be worth billions of dollars in revenue.
We have also recognized that utilizing a probe and providing a database that includes a textual description in addition to a chemical description reveals correlations and relationships therebetween that cannot be obtained by utilizing either textual or chemical descriptors alone.
Latent Semantic Indexing and Latent Semantic Structure Indexing
The present invention, called Text Influenced Molecular Indexing (TIMI), expands upon the Latent Semantic Indexing (LSI) methodology described in U.S. Pat. No. 4,839,853 to Deerwester et al., incorporated herein by reference.
Deerwester discloses a methodology for retrieving textual data objects, in response to a user's query, principally by representing a collection of text documents as a term-document matrix X for the purpose of retrieving documents from a corpus. Deerwester postulates that there is an underlying latent semantic structure in word usage data that is partially hidden or obscured by the variability of word choice. A statistical approach is utilized to estimate this latent semantic structure and uncover the latent meaning. Deerwester shows that given the partial Singular Value Decomposition (SVD) of the matrix X, it is possible to compute similarities between language terms, between documents, and between a term and a document. The SVD technique is well-known in the mathematical and computational arts and has been used in many scientific and engineering applications including signal and spectral analysis. Furthermore, Deerwester computes the similarity of ad hoc queries (column vectors which do not exist in X) to both the terms and the documents in the database.
Specifically, and referring to FIG. 1, the method disclosed in Deerwester comprises the following steps. The first processing activity, as illustrated by processing block 100, is that of text processing. All the combined text is preprocessed to identify terms and possible compound noun phrases. First, phrases are found by identifying all words between (1) a precompiled list of stop words; or (2) punctuation marks; or (3) parenthetical remarks.
To obtain more stable estimates of word frequencies, all inflectional suffixes (past tense, plurals, adverbials, progressive tense, and so forth) are removed from the words. Inflectional suffixes, in contrast to derivational suffixes, are those that do not usually change the meaning of the base word. (For example, removing the "s" from "boys" does not change the meaning of the base word whereas stripping "ation" from "information" does change the meaning). Since no single set of pattern-action rules can correctly describe the English language, the suffix stripper sub-program may contain an exception list.
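A suffix stripper of the kind just described might be sketched as follows. This is an illustration under stated assumptions rather than the sub-program used in Deerwester: the pattern-action rules, the exception list, and the minimum word length are all hypothetical and deliberately small.

```python
import re

# Hypothetical exception list: words the rules below would otherwise mishandle.
EXCEPTIONS = {"children": "child", "news": "news", "this": "this"}

# Hypothetical pattern-action rules, tried in order; only inflectional
# suffixes are listed, so derivational endings such as "ation" are kept.
RULES = [
    (re.compile(r"ies$"), "y"),        # studies -> study
    (re.compile(r"(?<=[^s])s$"), ""),  # boys -> boy, but glass stays glass
    (re.compile(r"ing$"), ""),         # binding -> bind
    (re.compile(r"ed$"), ""),          # tested -> test
    (re.compile(r"ly$"), ""),          # quickly -> quick
]


def strip_inflection(word):
    """Return `word` lowercased with at most one inflectional suffix removed."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if len(word) <= 3:
        return word  # too short to strip safely
    for pattern, replacement in RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word
```

The exception list is consulted before any rule fires, which mirrors the role the text assigns to it: covering the cases that no fixed set of rules handles correctly.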
The next step in the processing is represented by block 110. Based upon the earlier text preprocessing, a system lexicon is created. The lexicon includes both single words and noun phrases. The noun phrases provide for a richer semantic space. For example, the "information" in "information retrieval" and "information theory" have different meanings. Treating these as separate terms places each of the compounds at different places in the k-dimensional space. (For a word in radically different semantic environments, treating it as a single word tends to place the word in a meaningless place in k-dimensional space, whereas treating each of its different semantic environments separately using separate compounds yields spatial differentiation).
Compound noun phrases may be extracted using a simplified, automatic procedure. First, phrases are found using the "pseudo" parsing technique described with respect to step 100. Then all left and right branching subphrases are found. Any phrase or subphrase that occurs in more than one document is a potential compound phrase. Compound phrases may range from two to many words (e.g., "semi-insulating Fe-doped InP current blocking layer"). From these potential compound phrases, all longest-matching phrases as well as single words making up the compounds are entered into the lexicon base to obtain spatial separation.
In the illustrative embodiment, all inflectionally stripped single words occurring in more than one document and that are not on the list of most frequently used words in English (such as "the", "and") are also included in the system lexicon. Typically, the exclusion list comprises about 150 common words.
From the list of lexicon terms, the Term-by-Document matrix is created, as depicted by processing block 120. In one exemplary situation, the matrix contained 7100 terms and 728 documents representing 480 groups.
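The construction of a Term-by-Document matrix can be sketched as below. The example assumes whitespace tokenization, handles single words only (no compound phrases), and keeps only words that occur in more than one document and are not on a stop list. The documents and the stop list are made up for the illustration.

```python
from collections import Counter


def term_document_matrix(documents, stop_words=frozenset()):
    """Return (terms, matrix), where matrix[i][j] counts term i in document j.

    Assumptions: whitespace tokenization, single-word terms only, and a term
    is kept only if it appears in more than one document.
    """
    counts = [
        Counter(w for w in doc.lower().split() if w not in stop_words)
        for doc in documents
    ]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for c in counts for term in c)
    terms = sorted(t for t, df in doc_freq.items() if df > 1)
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


docs = [
    "the kinase inhibitor binds the kinase",
    "a protease inhibitor",
    "kinase assay and protease assay",
]
terms, X = term_document_matrix(docs, stop_words={"the", "a", "and"})
# terms -> ['inhibitor', 'kinase', 'protease']
# X     -> [[1, 1, 0], [2, 0, 1], [0, 1, 1]]
```

Rows correspond to lexicon terms and columns to documents, so "binds" and "assay" are dropped here because each appears in only one document.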
The next step is to perform the singular value decomposition on the Term-by-Document matrix, as depicted by processing block 130. This analysis is only effected once (or each time there is a significant update in the storage files).
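As a minimal sketch of this step, a partial (rank-k) singular value decomposition can be obtained with NumPy's `numpy.linalg.svd`. The helper name `truncated_svd`, the small matrix, and the choice k=2 are assumptions made for the example. A production system handling thousands of terms would more likely use a sparse solver.

```python
import numpy as np


def truncated_svd(X, k):
    """Return the k largest singular triplets (U_k, s_k, Vt_k) of X."""
    U, s, Vt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    # numpy returns singular values in descending order, so slicing keeps
    # the k most significant dimensions.
    return U[:, :k], s[:k], Vt[:k, :]


# Small hypothetical Term-by-Document matrix (3 terms x 3 documents).
X = [[1, 1, 0], [2, 0, 1], [0, 1, 1]]
U_k, s_k, Vt_k = truncated_svd(X, k=2)

# Rank-k approximation of X, the reduced "latent" representation.
X_k = U_k @ np.diag(s_k) @ Vt_k
```

Because the decomposition depends only on the stored matrix, it can be computed once and reused for every subsequent query, as the text notes.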
The last step in processing the documents prior to a user query is depicted by block 140. In order to relate a selected document to the group responsible for that document, an organizational database is constructed. This latter database may contain, for instance, the group manager's name and the manager's mail address.
The user query processing activity is depicted in FIG. 2. The first step, as represented by processing block 200, is to preprocess the query in the same way as the original documents.
As then depicted by block 210, the longest matching compound phrases as well as single words not part of compound phrases are extracted from the query. For each query term also contained in the system lexicon, the k-dimensional vector is located. The query vector is the weighted vector average of the k-dimensional vectors. Processing block 220 depicts the generation step for the query vector.
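The generation of the query vector can be illustrated with the short sketch below. It assumes the lexicon is available as a mapping from each term to its k-dimensional vector and that a term without an explicit weight is given a weight of 1.0. The weighting scheme itself is not specified here, so the weights shown are hypothetical.

```python
def query_vector(query_terms, term_vectors, weights=None):
    """Weighted average of the k-dimensional vectors of the query terms.

    Terms missing from `term_vectors` (i.e., not in the lexicon) are skipped.
    Returns None when no query term is found in the lexicon.
    """
    weights = weights or {}
    accumulated = None
    total_weight = 0.0
    for term in query_terms:
        vector = term_vectors.get(term)
        if vector is None:
            continue
        w = weights.get(term, 1.0)  # assumption: default weight of 1.0
        if accumulated is None:
            accumulated = [0.0] * len(vector)
        for i, value in enumerate(vector):
            accumulated[i] += w * value
        total_weight += w
    if accumulated is None or total_weight == 0:
        return None
    return [value / total_weight for value in accumulated]


# Hypothetical 2-dimensional term vectors.
lexicon = {"kinase": [1.0, 0.0], "inhibitor": [0.0, 1.0]}
q = query_vector(["kinase", "inhibitor", "unknown"], lexicon)
```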
The next step in the query processing is depicted by processing block 230. In order that the best matching document is located, the query vector is compared to all documents in the space. The similarity metric used is the cosine between the query vector and the document vectors. A cosine of 1.0 would indicate that the query vector and the document vector were on top of one another in the space. The cosine metric is similar to a dot product measure except that it ignores the magnitude of the vectors and simply uses the angle between the vectors being compared.
The cosines are sorted, as depicted by processing block 240, and for each of the best N matching documents (typically N=8), the value of the cosine along with organizational information corresponding to the document's group is displayed to the user, as depicted by processing block 250.
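The comparison and sorting steps of blocks 230 through 250 can be sketched as follows. The document names and vectors are made up for the illustration, and the default of N=8 simply follows the typical value mentioned above.

```python
import math


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_matches(query, doc_vectors, n=8):
    """Return the n best (cosine, document name) pairs, highest cosine first."""
    scored = [(cosine(query, vec), name) for name, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


# Hypothetical document vectors in a 2-dimensional space.
documents = {"doc_a": [1.0, 0.0], "doc_b": [1.0, 1.0], "doc_c": [0.0, 2.0]}
ranked = top_matches([2.0, 0.0], documents, n=2)
```

Note that `doc_a` scores a cosine of 1.0 against the query even though the two vectors differ in length, which illustrates that the metric depends only on the angle between the vectors.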
Thus, in Deerwester, words, the text objects, and the user queries are processed to extract this underlying meaning and the new, latent semantic structure domain is then used to represent and retrieve information. However, Deerwester fails to suggest any relevance to chemical structures, as neither a recognition of the instant need, nor a recognition of a solution thereto is addressed. Further, LSI uses, for example, singular values to scale the singular vectors for calculation of object similarities.
A need exists, therefore, for a chemical search system and method that combine the utility of both text-based and composition-based search techniques, and additionally/optionally provide synergistic effects therebetween. The present invention fulfills this need by providing such a system and method.
It is therefore a feature and advantage of the present invention to provide a method and/or system that utilizes a collection of chemical structures, compounds or molecules, and associated textual descriptions thereof, to determine the chemical and textual similarity between the collection of chemical structures and a probe or other proposed chemical structure.
It is a further feature and advantage of the present invention to provide a methodology for calculating the similarity of chemical compounds to chemical and text based probes or other proposed chemical structure.
It is another feature and advantage of the present invention to provide a method and/or system for selecting, based on chemical and text based probes or other proposed chemical structure, chemical compounds that have similar biological or chemical activities or properties, but not necessarily obviously similar structures.
It is another feature and advantage of the present invention to provide a computer readable medium including instructions being executable by a computer, the instructions instructing the computer to generate a searchable representation of chemical structures, given chemical and text based probes or other proposed chemical structure.
The present invention combines both the textual and chemical descriptors of chemical compositions, mixtures, and/or compounds to determine the textual and chemical similarity of those chemical compositions, mixtures, and/or compounds to either an existing descriptor or a user provided descriptor. By providing textual descriptors in addition to the chemical descriptors representing each compound, the present invention advantageously provides an integrated system and method that uncovers relationships between the textual and chemical descriptors that cannot be uncovered using either method alone. Specifically, as described in detail below, the present invention reveals associations between the text and chemical descriptors that could not be found by combining separate text and chemical analyses. The following disclosure describes how this merging is done, and provides several retrieval and data mining scenarios using Medline abstracts by way of example.
The method of the present invention, in various embodiments described herein, calculates the similarity between a first chemical or textual descriptor and at least one other chemical and/or textual descriptor in a matrix comprising a plurality of chemical and textual descriptors, and includes the sequential, non-sequential and/or sequence independent steps of creating at least one chemical descriptor and at least one text descriptor for each compound in a collection of compounds, and preparing a descriptor matrix X. In a preferred embodiment, each column of the descriptor matrix represents a document containing textual and chemical descriptions, and each row represents a descriptor associated with at least one document. Each entry in a row equals the number of occurrences of that row's descriptor within the corresponding document. It will also be obvious to those skilled in the art that the rows and columns of the descriptor matrix X can be transposed, and that, in such a case, the operations performed on the descriptor matrix X described hereinbelow can be modified accordingly such that the results of the operations performed on the transposed matrix are identical to the results obtained with the descriptor matrix X. Then, in a preferred embodiment, a singular value decomposition (SVD) of the descriptor matrix is performed, producing resultant matrices that are used to compute the similarity between a first descriptor and at least one other descriptor. As previously noted, however, other suitable decomposition techniques, such as principal component analysis, can also be utilized. Finally, at least a subset of the at least one other descriptor ranked in order of similarity to the first descriptor is provided as output.
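The overall flow outlined above (prepare a combined descriptor matrix X, decompose it, and rank by similarity to a probe) can be illustrated with the following sketch. It is offered only as one possible reading under stated assumptions, not as the preferred embodiment. It assumes that the rows of X mix text and chemical descriptors, that the probe is a column of descriptor counts, that documents and probe are projected onto the top k left singular vectors, and that cosine similarity is used for ranking. The function name `rank_by_similarity`, the descriptor labels, and the numbers are hypothetical.

```python
import numpy as np


def rank_by_similarity(X, probe_column, k):
    """Rank the columns of X by cosine similarity to a probe in a k-dim SVD space.

    Returns a list of (column index, cosine) pairs, most similar first.
    Assumption: both documents and probe are projected with U_k^T.
    """
    X = np.asarray(X, dtype=float)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k = U[:, :k]
    docs = U_k.T @ X                                  # shape (k, n_documents)
    probe = U_k.T @ np.asarray(probe_column, dtype=float)
    norms = np.linalg.norm(docs, axis=0) * np.linalg.norm(probe)
    # Guard against division by zero for all-zero projections.
    scores = (docs.T @ probe) / np.where(norms > 0, norms, 1.0)
    order = np.argsort(-scores)
    return [(int(j), float(scores[j])) for j in order]


# Rows: two text descriptors followed by two chemical descriptors (hypothetical).
# Columns: three documents, each describing one compound.
row_labels = ["text:kinase", "text:protease", "chem:C-N-2", "chem:C-O-3"]
X = [
    [2, 1, 0],
    [0, 0, 3],
    [1, 2, 0],
    [0, 1, 2],
]
# The probe here reuses the descriptor counts of the first document.
ranking = rank_by_similarity(X, [2, 0, 1, 0], k=2)
```

Because text and chemical descriptors share one matrix, a single decomposition places both kinds of descriptor in the same reduced space, so a probe expressed through either kind can be compared against every document.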
There has thus been outlined, rather broadly, the more important features of the invention in order that the detailed description thereof that follows may be better understood, and in order that the present contribution to the art may be better appreciated. There are, of course, additional features of the invention that will be described hereinafter and which will form the subject matter of the claims appended hereto.
In this respect, before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.
As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, methods and systems for carrying out the several purposes of the present invention. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the present invention.
Further, the purpose of the foregoing abstract is to enable the U.S. Patent and Trademark Office and the public generally, and especially the scientists, engineers and practitioners in the art who are not familiar with patent or legal terms or phraseology, to determine quickly from a cursory inspection the nature and essence of the technical disclosure of the application. The abstract is neither intended to define the invention of the application, which is measured by the claims, nor is it intended to be limiting as to the scope of the invention in any way.
These together with other objects of the invention, along with the various features of novelty which characterize the invention, are pointed out with particularity in the claims annexed to and forming a part of this disclosure. For a better understanding of the invention, its operating advantages and the specific objects attained by its uses, reference should be had to the accompanying drawings and descriptive matter in which there are illustrated preferred embodiments of the invention.