It is known in the art to use statistical techniques to evaluate libraries of documents to extract usable information. Furthermore, it is known in the art to convert and manipulate chemical structures using computer analyses and algorithms. These techniques fall short of providing an environment in which new chemical entities can be identified, let alone one in which the new chemical entities identified relate to a particular biological target or particular subject matter.
Currently, in machine learning and statistics, one way to assess the similarity between, say, chemical entities represented by chemical identifiers such as chemical structure formulas is to convert each chemical structure formula into a coded representation. It is also known to use analytic procedures to convert a symbolic representation (e.g., chemical identifier) of a molecule (e.g., chemical entity) into a useful number or value for the purpose of comparing, as one example, one chemical entity to another. For example, a variety of descriptors are known and can be used in lieu of keybit binary representations in order to generate values that are useful in implementing certain embodiments of the invention. As non-limiting examples, known descriptors include 0D (i.e., constitutional descriptors), 1D (i.e., lists of structural fragments), 2D (i.e., graph invariants), 3D (e.g., quantum-chemical descriptors), and/or 4D (e.g., GRID) descriptors.
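By way of non-limiting illustration, the comparison of coded representations described above can be sketched as follows. The sketch assumes a small, hypothetical key of five structural fragments (FRAGMENT_KEYS) and uses the Tanimoto coefficient as one commonly used similarity measure for keybit representations; neither the key nor the choice of measure is taken from the techniques referenced above, and a practical key would contain many more bits.

```python
# Hypothetical fragment key: each position is one "keybit".
FRAGMENT_KEYS = ("hydroxyl", "carbonyl", "amine", "aromatic_ring", "halogen")


def encode(fragments):
    """Return a keybit tuple: 1 where the fragment is present, else 0."""
    return tuple(1 if key in fragments else 0 for key in FRAGMENT_KEYS)


def tanimoto(x, y):
    """Tanimoto coefficient: bits set in both / bits set in either."""
    both = sum(a & b for a, b in zip(x, y))
    either = sum(a | b for a, b in zip(x, y))
    # Convention assumed here: two all-zero representations score 0.0.
    return both / either if either else 0.0


# Two illustrative (hypothetical) chemical entities described by fragments.
entity_a = encode({"hydroxyl", "aromatic_ring"})
entity_b = encode({"hydroxyl", "amine", "aromatic_ring"})
similarity = tanimoto(entity_a, entity_b)
```

In this sketch the two entities share two of the three fragments present in either, so the coefficient is two thirds; a value of 1.0 would indicate identical keybit representations.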
When a dataset contains a large number of variables, such as the multivariable datasets defined by the keysets mentioned above, dimensionality reduction techniques can be used to evaluate the dataset. These techniques can be used to reduce datasets to a few principal variables in order to more easily visualize the relationship between datasets. Node or diffusion mapping algorithms, for instance, can be used to embed high-dimensional data sets into, say, a Euclidean space. Using this technique, the coordinates of each data point in the Euclidean space are computed from the eigenvectors and eigenvalues of a matrix derived from the data (an eigenvector being a non-zero vector that, when multiplied by the matrix, yields a scalar multiple of itself, and the corresponding eigenvalue being that scalar). Such mapping techniques are computationally inexpensive and are useful in reducing and displaying visually complex multivariable datasets such as product reviews, internet traffic, and e-commerce reports.
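By way of non-limiting illustration, one common formulation of a diffusion mapping can be sketched as follows using NumPy. The Gaussian kernel, the kernel width epsilon, the function name diffusion_map, and the sample data are assumptions made for this sketch only and are not drawn from any particular known system.

```python
import numpy as np


def diffusion_map(X, epsilon, n_components=2, t=1):
    """Embed the rows of X into n_components diffusion coordinates."""
    # Pairwise squared Euclidean distances between data points.
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian kernel (an assumed choice) and its row sums.
    K = np.exp(-sq_dist / epsilon)
    d = K.sum(axis=1)
    # Symmetric matrix sharing eigenvalues with the Markov matrix P = D^-1 K.
    A = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(vals)[::-1]          # sort eigenvalues, largest first
    vals, vecs = vals[order], vecs[:, order]
    # Convert to right eigenvectors of P; the first one is constant.
    psi = vecs / np.sqrt(d)[:, None]
    # Drop the trivial first eigenvector and scale by eigenvalue^t.
    return psi[:, 1:n_components + 1] * vals[1:n_components + 1] ** t


# Five hypothetical keybit vectors; rows 0 and 1 are deliberately identical.
X = np.array([[1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
coords = diffusion_map(X, epsilon=2.0)
```

In this sketch each row of coords gives the two-dimensional coordinates of one data point, and data points with identical keybit vectors are mapped to the same location in the reduced space.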
The techniques discussed above are all appropriate for mapping chemical structures that are represented by respective datasets. Turning to the question of new chemical entity discovery, however, while there exist chemical compound discovery techniques that are useful in identifying novel chemical compounds, current systems are not able to generate additional compounds in the low-dimensional space.
One technique for compound discovery that is used in identifying therapeutic compounds is scaffold hopping. Scaffold hopping is used to identify isofunctional molecular structures with significantly different molecular backbones. Some types of scaffold hopping include, but are not limited to, heterocycle replacements, ring opening or closure, peptidomimetics, and topology-based hopping techniques. Other bioisosteric replacement techniques are also useful in predicting and evaluating new chemical compounds.
In short, current analysis systems are configured to process large multivariable data sets and present lower-dimensional (e.g., two- or three-dimensional) visualizations to a user. Yet these systems are not configured to generate additional data relating to a chemical entity that is missing from, or might be added to, the data set, and are entirely unable to identify absent chemical structures that conform to a reduced-dimensional space.
Therefore, what is needed in the art is a system and a method which can construct an artificial environment, such as a virtual manifold or a virtual array of nodes, which is trained around a particular biological target or subject matter, and from which common chemical features can be identified, transformed into new coded forms, and inserted into the artificial environment to determine whether their placement within the artificial environment satisfies at least one prescribed criterion. What is further needed in the art is a system and method for predicting and generating chemical identifiers that describe new chemical entities not currently found within the source documents used to generate the artificial environment, yet which fill gaps in the artificial environment. The present invention addresses these and other needs.