Chemical structures are typically represented in documents using graphical notations to provide a reader with a more complete understanding of relevant chemical information. For example, a chemical structure may be drawn using a representation such as a Lewis structure, skeletal formula, Newman projection, sawhorse projection, or Fischer projection, among others. A chemical structure may also be represented by a condensed formula that omits certain commonly understood constituent elements (e.g., bonds or terminal hydrogens) to simplify the overall representation of the structure. Graphical representations of chemical structures may be presented in documents in various contexts, for example, to illustrate the roles of corresponding chemicals in a chemical reaction, to describe a reaction product, or to provide a comparison between structurally similar but chemically distinct entities. Frequently, the graphical representation of a chemical structure is the key information in a document that identifies the chemical as relevant to a user for a particular desired purpose.
In order to reproduce chemical structures in a document, a range of standard formats is used to efficiently store the chemical structure data. One type of format uses connection tables, adjacency matrices, or similar data structures to relate atoms and bonds as nodes and edges, respectively, of a graph. Another type of format uses linear string notations based on depth-first or breadth-first traversal of that graph. The use of standardized data formats for storing chemical structure data enables algorithmic searching of the data. Furthermore, chemical structure data in standard formats can be indexed with a document in a database.
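As a minimal sketch of the two kinds of format described above, the following Python example stores a small molecule as a connection table, derives an adjacency matrix from it, and produces a toy linear string by depth-first traversal. The function names and the table layout are hypothetical and chosen only for illustration; the traversal handles acyclic structures only, ignores bond orders when writing the string, and is not a complete implementation of any standard notation.

```python
from typing import List, Optional, Tuple

# Hypothetical connection table for ethanol (heavy atoms only).
atoms: List[str] = ["C", "C", "O"]
# Each bond is (index of atom i, index of atom j, bond order).
bonds: List[Tuple[int, int, int]] = [(0, 1, 1), (1, 2, 1)]


def adjacency_matrix(n_atoms: int, bonds: List[Tuple[int, int, int]]) -> List[List[int]]:
    """Build a symmetric matrix whose entry [i][j] is the bond order (0 = no bond)."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j, order in bonds:
        matrix[i][j] = order
        matrix[j][i] = order
    return matrix


def depth_first_string(atoms: List[str], bonds: List[Tuple[int, int, int]], start: int = 0) -> str:
    """Toy linear notation: depth-first walk, with side branches in parentheses.

    Assumes an acyclic structure; rings would need closure labels.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, _order in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def visit(i: int, parent: Optional[int]) -> str:
        out = atoms[i]
        children = [n for n in neighbors[i] if n != parent]
        for k, child in enumerate(children):
            sub = visit(child, i)
            is_last = k == len(children) - 1
            # The last branch continues the main chain; earlier ones are bracketed.
            out += sub if is_last else f"({sub})"
        return out

    return visit(start, None)


ethanol_matrix = adjacency_matrix(len(atoms), bonds)
ethanol_string = depth_first_string(atoms, bonds)
```

The same graph is stored three ways here: as a list of bonds, as a matrix, and as a string. Each form trades compactness against ease of comparison.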
A user will commonly perform a search of a database of documents to identify documents that refer to one or more relevant chemical structures. The user must enter an input that can be compared to the chemical structures stored in the database of documents. The user may enter a query by providing chemical structure data or a characteristic name, such as one following the International Union of Pure and Applied Chemistry (IUPAC) conventions. The user-provided input is converted to the standard format used to store chemical structure data in the database and is then compared against the chemical structure data of the indexed documents using a variety of techniques.
Generally, documents in a database responsive to a user's search are identified by determining similarity between chemical structures in the documents and the user-provided input using graph-theory-based algorithmic approaches. Frequently, similarity is established by determining whether fragments (e.g., constituent elements) of the user-provided input structure are present in chemical structures in the documents. This may be done, for example, using a binary fingerprint of the chemical structure, in which each bit records the presence or absence of a fragment. If a sufficient number or proportion of fragments identified in the user's input are present in a chemical structure in a document, then similarity is established. The similarity may be used to screen out unrelated documents before the chemical structures in the remaining documents are searched using an atom-by-atom comparison to establish the search results provided to the user. Alternatively, all documents containing chemical structures whose similarity to the search input exceeds a threshold may be provided as search results to the user.
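The fragment-based screening described above can be sketched as follows. This is an illustrative example only: the fragment dictionary, the function names, and the example fragment sets are hypothetical, and the Tanimoto coefficient is used as one common choice of similarity measure, not one specified by the text above.

```python
from typing import Iterable

# Hypothetical fixed dictionary of fragments; one bit per entry.
FRAGMENT_KEYS = ["C-C", "C-O", "C=O", "C-N", "aromatic ring", "O-H"]


def fingerprint(fragments: Iterable[str]) -> int:
    """Binary fingerprint stored as an integer: bit k is set if fragment k is present."""
    present = set(fragments)
    bits = 0
    for position, key in enumerate(FRAGMENT_KEYS):
        if key in present:
            bits |= 1 << position
    return bits


def passes_screen(query_fp: int, candidate_fp: int) -> bool:
    """True if every fragment bit of the query is also set in the candidate."""
    return (query_fp & candidate_fp) == query_fp


def tanimoto(fp_a: int, fp_b: int) -> float:
    """Shared bits divided by bits set in either fingerprint (assumed measure)."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 0.0
    return bin(fp_a & fp_b).count("1") / union


query = fingerprint({"C-C", "C-O", "O-H"})
candidate_a = fingerprint({"C-C", "C-O", "O-H", "C=O"})
candidate_b = fingerprint({"C-N", "aromatic ring"})

# Candidates failing the screen can be dropped before any atom-by-atom comparison.
survivors = [fp for fp in (candidate_a, candidate_b) if passes_screen(query, fp)]
```

In this sketch only candidate_a survives the screen, so only it would proceed to the more expensive atom-by-atom comparison. A threshold on tanimoto would implement the alternative of returning every sufficiently similar structure.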
Various algorithms have modified this basic approach of establishing similarity in order to accelerate searching, such as the class of algorithms using hashed fingerprints. Accelerated search methods are necessary for efficiently searching for large molecules and/or searching large datasets. When a database contains a very large number of documents comprising chemical structures, searching for relevant documents is cumbersome, as each chemical structure in every document must be compared with the input structure for similarity. Such searches are slow and resource-intensive.
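A hashed fingerprint can be sketched in a few lines. Instead of a fixed fragment dictionary, each fragment string is hashed to a bit position in a fixed-length bit set. The function name, the 64-bit length, and the use of SHA-256 are assumptions made for this illustration and do not describe any particular published algorithm.

```python
import hashlib
from typing import Iterable


def hashed_fingerprint(fragments: Iterable[str], n_bits: int = 64) -> int:
    """Map each fragment string to one bit of an n_bits-wide integer bit set."""
    bits = 0
    for fragment in fragments:
        # hashlib gives the same digest on every run, unlike the built-in hash().
        digest = hashlib.sha256(fragment.encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % n_bits
        bits |= 1 << position
    return bits


query_fp = hashed_fingerprint({"C-O", "O-H"})
document_fp = hashed_fingerprint({"C-C", "C-O", "O-H"})

# If the query's fragments are a subset of the document's, its bits are too.
may_contain_query = (query_fp & document_fp) == query_fp
```

Because different fragments can collide on the same bit, a passing check means only that the structure may contain the query, so a confirming comparison is still needed. A failing check rules the structure out.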
A user may be interested in only a subset of all possibly relevant documents based on some criteria other than the chemical structure alone. For example, the user may be interested in chemical structures related to input structures that have certain desirable properties, that may be synthesized with certain yields, or that exhibit certain reactivities. These additional search limitations are most conveniently provided by the user as text that may be used to search any text data in documents of a particular database. In order to search for both the user's input chemical structure and any additionally provided text, one search for the chemical structure and one search for the text must be run sequentially.
The use of sequential searching does not significantly accelerate the searching of very large databases. A standard chemical structure search may be performed first to establish a set of potentially relevant documents based on the chemical structure input, followed by a search of that set for documents containing the text search terms. However, this approach may be no faster than a search that does not contain any additional text terms, because every chemical structure in the database must still be compared with the input structure. Performing a search for documents containing user input text terms first will quickly eliminate some documents from the set of potentially relevant documents. However, many search terms a user may input will not reduce the number of potentially relevant documents enough to significantly accelerate the search. For example, if a user is searching for documents with related structures where reaction yields are over 90%, the set of documents where reaction yields are over 90% will still include a very large number of documents with unrelated chemical structures.
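The two sequential orderings discussed above can be compared with a small sketch that counts how many structure comparisons each one performs. The Document class, the fragment-set matching, and the four example documents are all hypothetical and serve only to illustrate the argument; real structure matching would be far more expensive than a set comparison.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    fragments: FrozenSet[str]  # stand-in for indexed chemical structure data


def structure_match(query: FrozenSet[str], doc: Document) -> bool:
    return query <= doc.fragments  # placeholder for the expensive comparison


def text_match(term: str, doc: Document) -> bool:
    return term.lower() in doc.text.lower()


def structure_then_text(docs: List[Document], query: FrozenSet[str], term: str) -> Tuple[List[str], int]:
    """Structure search first; returns (matching ids, structure comparisons made)."""
    hits = [d for d in docs if structure_match(query, d)]
    return [d.doc_id for d in hits if text_match(term, d)], len(docs)


def text_then_structure(docs: List[Document], query: FrozenSet[str], term: str) -> Tuple[List[str], int]:
    """Text search first; structure comparisons are made only on the survivors."""
    survivors = [d for d in docs if text_match(term, d)]
    return [d.doc_id for d in survivors if structure_match(query, d)], len(survivors)


docs = [
    Document("d1", "Yield over 90% for ester hydrolysis", frozenset({"C=O", "C-O"})),
    Document("d2", "Yield over 90% in amide coupling", frozenset({"C-N", "C=O"})),
    Document("d3", "Yield over 90% for alcohol oxidation", frozenset({"C-C", "C-O", "O-H"})),
    Document("d4", "Low yield reported", frozenset({"C-C", "C-O", "O-H"})),
]
query = frozenset({"C-O", "O-H"})

structure_first = structure_then_text(docs, query, "over 90%")
text_first = text_then_structure(docs, query, "over 90%")
```

Both orderings return the same single document in this example. Structure-first compares all four structures, while text-first still compares three, because a broad text term removes few documents.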
There is a need for systems and methods to more efficiently search large databases of documents referring to chemicals based on user-provided input. Additionally, there is a need for systems and methods to index a database of documents referring to chemicals for more efficient searching.