A computer program listing appendix is part of the disclosure and is incorporated herein by reference. The computer program listing appendix contained on compact disks contains the following files: Identification Information (1KB) and NLMJER.C (20KB). The disks were created on Sep. 15, 2003.
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
1. Field Of The Invention
This invention relates to the field of computational molecular structural analysis of large data sets of molecular structures and more specifically to graphical displays that present an accurate qualitative representation of the distribution of molecular structures in the high dimensional space of molecular descriptors.
2. Background Of The Art
With the advent of high throughput screening (HTS), combinatorial synthesis, and analysis and selection of compounds from computer generated virtual libraries, research scientists, and pharmaceutical scientists in particular, are faced with an expanding problem of separating compounds of most significance to their work from a clutter of possibilities. In recent years an appreciation has developed that: 1) it is useful to think about how molecular structures populate a "diversity space" of all possible structures; 2) structures generated from different synthetic routes may populate the same or different volumes of diversity space; and 3) broad-based screening programs should utilize compounds from across diversity space and avoid overscreening with compounds that densely occupy the same volume of diversity space.
Scientists in drug discovery research make decisions each day that affect the course of their projects. A decade ago, decisions were based on infrequent new biological data, and resulted in making small numbers of compounds per year. Today, high throughput screening laboratories generate a constant stream of new biological data and call for larger numbers of new compounds to be made ever faster by combinatorial chemistry laboratories.
Decisions about which compounds to acquire or synthesize to test next are based in part on the output of computations utilizing advanced molecular structural descriptors. The simplest drug discovery principle is that compounds similar in enough properties are usually similar in biological activity. Similarity often involves measures in high-dimensional spaces, such as molecular fingerprints or shape descriptors, which typically utilize around one thousand dimensions. Uses of similarity in drug discovery research may apply these high-dimensional descriptors to millions of compounds, whether from virtual libraries of potentially synthesizable compounds or from libraries of compounds that have already been synthesized.
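By way of illustration only, a similarity measure over fingerprint descriptors can be sketched as follows. The Tanimoto coefficient is used here because it is a common choice for binary molecular fingerprints; this disclosure does not state that it is the measure employed. The function name `tanimoto` and the three example fingerprints are hypothetical, with each fingerprint represented as the set of bit positions that are switched on in a vector of roughly one thousand bits.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices.

    Returns shared bits divided by the number of bits set in either
    fingerprint: 1.0 for identical fingerprints, 0.0 for no bits in common.
    """
    if not a and not b:
        # Two empty fingerprints are treated as identical (an assumption).
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


# Hypothetical fingerprints: indices of the bits set in a ~1000-bit vector.
fp_a = {3, 17, 256, 511, 900}
fp_b = {3, 17, 256, 640}
fp_c = {8, 42, 777}

sim_ab = tanimoto(fp_a, fp_b)  # 3 shared bits out of 6 distinct bits
sim_ac = tanimoto(fp_a, fp_c)  # no shared bits
```

A corresponding distance in diversity space may be taken as one minus the similarity, so that compounds with identical fingerprints lie at distance zero.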
The method of this invention enables scientists to examine relationships among the vast numbers of compounds in high-dimensional diversity space in a familiar two-dimensional visual map context. The method for visualization of high-dimensional diversity spaces relies on the implementation of horizons, which are distances beyond which the distance matrix between compounds need not be resolved, and on efficient subsampling methods. When combined with genetic algorithms, the method also enables the selection of optimal descriptors for clustering compounds for predictive use. Optimal descriptors help not only in visualizing important features of diversity space, but also in deciding which compounds to make and test next during early analoging of active substances.
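The notion of a horizon can be illustrated with a minimal sketch, offered under stated assumptions rather than as the implementation of the appended program listing. It assumes Euclidean distance between descriptor vectors and assumes that any pair lying at or beyond the horizon is simply recorded at the horizon value. The names `distance_within_horizon` and `capped_distance_matrix`, and the example points, are hypothetical.

```python
import math


def distance_within_horizon(x, y, horizon):
    """Euclidean distance between x and y, left unresolved beyond the horizon.

    The running sum of squared differences is abandoned as soon as it
    reaches horizon**2, so distant pairs cost less to evaluate.
    """
    limit = horizon * horizon
    total = 0.0
    for xi, yi in zip(x, y):
        total += (xi - yi) ** 2
        if total >= limit:
            return horizon  # beyond the horizon: exact value not computed
    return math.sqrt(total)


def capped_distance_matrix(points, horizon):
    """Symmetric distance matrix with every entry capped at the horizon."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance_within_horizon(points[i], points[j], horizon)
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical two-dimensional descriptors, chosen for easy arithmetic.
points = [(0.0, 0.0), (3.0, 4.0), (30.0, 40.0)]
matrix = capped_distance_matrix(points, horizon=10.0)
```

In this example the first two points are resolved at their true separation of 5.0, while both pairs involving the third point are recorded at the horizon value of 10.0. The early exit matters most when descriptors have on the order of one thousand dimensions and most pairs of compounds are far apart.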