Computer-Based Segmenting Algorithms
The use of computer-based segmenting algorithms to divide a group of sequential data into like parts (similar subgroups) is a known technique.I Such segmenting algorithms collect data values into similar subgroups, wherein each subgroup corresponds to a segment. These algorithms essentially “segment” the data so that the data values within each segment are essentially homogeneous (see FIG. 5 in the Appendix as an example). A measure of the homogeneity of the data within each segment is frequently calculated, as is an overall measure of homogeneity for all the segments combined. An important advantage of these segmenting algorithms is for correlation purposes. (Ref 1 endnotes, page 390) I Hawkins D M, Merriam D F, Optimal Zonation of Digitized Sequential Data. Mathematical Geology, vol. 5, no. 4, 1973, pp. 389-394.
Data or data points in such a segmented form are often easier to work with and easier to understand. For this reason computer-based processes that “segment” such data, as well as data in segmented form, have great utility. Applications of such data segmenting processes, and of data in segmented form, occur in a multitude of fields. In the field of geology alone there are many such applications to geological data, including mechanical logs of bore holes, x-ray data, seismic traces, magnetic profiles, and land-resource observations made along transects (see Reference 1 endnotes, p. 390).
A dynamic programming (DP) segmenting algorithm was developed by Hawkins. This Hawkins DP algorithm finds one or more essentially optimal data segmentations or “coverings” by essentially calculating an overall measure of segment homogeneity for each possible segmentation (or covering).II One or more coverings with the optimal value of overall homogeneity are then selected by the algorithm. This DP algorithm was an improvement, in terms of running time, over non-DP approaches. (see Reference 1, pp. 390-391 and the Description section for more details) II In this patent application, the terms “segmentation”, “covering” and “split” are equivalent or essentially equivalent.
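The dynamic-programming idea described above can be illustrated with a minimal sketch. This is not the Hawkins algorithm itself; the function names and the use of within-segment sum of squared deviations as the homogeneity score are illustrative assumptions:

```python
# Illustrative sketch: split a sequence into k contiguous segments so
# that the total within-segment sum of squared deviations (a common
# homogeneity score) is minimized, via dynamic programming.

def sse(data, i, j):
    """Within-segment sum of squared deviations for data[i:j]."""
    seg = data[i:j]
    mean = sum(seg) / len(seg)
    return sum((x - mean) ** 2 for x in seg)

def dp_segment(data, k):
    """Return (optimal score, segment boundaries) for k segments."""
    n = len(data)
    INF = float("inf")
    # cost[m][j]: best score covering data[:j] with m segments
    cost = [[INF] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                if cost[m - 1][i] == INF:
                    continue
                c = cost[m - 1][i] + sse(data, i, j)
                if c < cost[m][j]:
                    cost[m][j], back[m][j] = c, i
    # Recover segment boundaries by walking the back pointers
    bounds, j = [], n
    for m in range(k, 0, -1):
        bounds.append(j)
        j = back[m][j]
    return cost[k][n], sorted(bounds)
```

For example, `dp_segment([1.0, 1.0, 2.0, 9.0, 10.0, 11.0], 2)` places the single boundary between the low and high clusters, yielding boundaries `[3, 6]`. The triple loop examines every possible covering, which is why such exhaustive approaches slow down badly as the data grow.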
Recursive Segmenting, Methods of Recursive Partitioning
Segmenting techniques have continued to evolve. For example, one or more segmenting algorithms have frequently been used to segment data recursively (or repeatedly). Such recursive techniques result in a recursive partitioning (RP) of data into subgroups. One known computer-based scheme that uses a combination of segmenting algorithms and RP techniques is FIRM. FIRM stands for Formal Inference-based Recursive Modeling. FIRM was developed by Professor Hawkins and is publicly available (see Description section for more details).
Conventional Segmenting Techniques Limited by Long Computer Running Times
Despite this continued evolution, segmenting techniques still have a major limitation: conventional segmenting algorithms frequently work slowly with large amounts of data or large numbers of data points. The Hawkins DP algorithm also has this limitation.
The long running times of conventional segmenting algorithms are a significant problem in many fields where segmenting techniques could be applied. The problem is acute in computational chemistry, high-throughput screening of pharmaceuticals, and genetics analysis, where the amount of data to be segmented is enormous.
The Great Need for Better High Throughput Screening of Pharmaceuticals
A veritable explosion in the number of compounds available as potential pharmaceuticals has recently taken place. Large numbers of different types of compounds are being physically tested for biological, medical and pharmaceutical properties. And a vast amount of information or data on both tested and untested compounds is being accumulated. Such data is being stored in large chemical libraries. Such libraries have both general and specific (focused) data on chemical compounds that are potential pharmaceuticals.
In addition, the number of potential pharmaceuticals will be greatly increased by the Human Genome Project. This project will identify numerous new “drug targets”. These targets are places at the molecular level for a drug to act or exert its effect. Such an increase in drug targets will also greatly increase the number of potential pharmaceutical compounds.
Research and development to find new and useful pharmaceuticals has usually required sifting through large numbers of candidate compounds in order to find promising candidates. One method of screening candidate compounds is to physically test them. In its simplest form, screening by physical testing is essentially “trial and error” and requires testing essentially every candidate. Even more sophisticated physical testing procedures require a great deal of effort, time and expense.
Current methods of screening large numbers of candidates are known as high throughput screening (HTS). Significant advances in the technology for the testing of compounds for desirable pharmaceutical properties have occurred, yet HTS still has great deficiencies.
Current HTS techniques simply cannot screen the number of newly available potential candidate pharmaceuticals. Limitations in current HTS methods cause delays in bringing drugs to market, resulting in great losses in potential profits. And many large-scale high throughput screening attempts still fail to identify a good lead compound (prototype drug molecule) to stimulate further research.
Computer-Based Methods of Screening Pharmaceutical Candidates have the Potential to Save Expense, Time and Work in High Throughput Screening.
Computer-based methods of screening molecules (or compounds) are methods of reducing the workload, time and expense of screening by physical testing. Such computational approaches attempt to identify promising candidate compounds (or molecules) with desirable pharmaceutical properties.
For example, a certain group of compounds may be known to possess a desirable pharmaceutical property. A computer or human judgment then identifies molecular or chemical characteristics of the compounds in this group. A computer-based identification of other compounds that have the same (or similar) molecular characteristics is then done to form a new group of promising candidate pharmaceutical compounds. The candidate compounds (or molecules) in this new group have an increased probability of possessing the desired property, despite not having been actually physically tested.
Thus, a promising new group of candidate pharmaceuticals has been identified without the actual physical testing of the compounds in the group. And much work, time and expense have been saved. The compounds in the group can then be subjected to further investigation.
Computational HTS Using QSAR
Most important computational screening approaches are based on the idea that a particular pharmaceutical property of a compound is due to the compound's molecular structure. In effect these approaches assume that the property is due to the compound's shape at the molecular level. Such “quantitative structure-activity relationship” or QSAR approaches attempt to characterize the parts of a molecule's shape that contribute to the pharmaceutical property or “activity”. Such important molecular parts (pieces of a molecule) are sometimes referred to as pharmacophores. Just as keys fit into a lock, molecular parts such as pharmacophores of the right shape cause their effects by fitting into other “target molecules” in the human body. (These target molecules are sometimes called receptors.) In effect, QSAR approaches are similar to looking for “molecular puzzle pieces”—pharmacophores or molecular parts having about the same molecular shape or characteristics.
Most Computational HTS Methods Using QSAR Approaches are too Idealized to Handle Real-World Situations
Most computational QSAR approaches use idealized mathematical and statistical models. However, these idealized models are too simplistic to accommodate the complexities of real-world molecular structure and of the structure-activity relationship between a drug and its target. Real-world molecular structures (and QSARs) exhibit complexities that idealized models do not capture. Therefore there is a great need for more realistic methods of computational high throughput screening using QSAR approaches.
Methods of Recursive Partitioning are Realistic and Can Deal with Realities of Computational HTS
Methods of recursive partitioning (RP) can deal with realities of computational HTS, including those of computation HTS methods that use QSAR approaches. Methods of RP are able, for example, to handle realities such as interaction effects, threshold effects and nonlinearities. This realization has spawned the development of new methods of RP in high throughput screening.
Some Recent Methods of RP in Computational HTS
One such recent method uses RP techniques to separate drug candidates into subgroups (or nodes of a tree), wherein the drugs in each node are similar in terms of number of specific molecular fragments and potency.III A second RP method generates binary trees, wherein each node is split into two daughter nodes. In this method drugs are grouped into nodes, wherein the drugs in each node are similar in terms of biological activity and only one of the two categories of (1) presence or (2) absence of specific chemical descriptors.IV III Hawkins et al., Analysis of a Large Structure-Activity Data Set Using Recursive Partitioning. Quant. Struct.-Act. Relat. 16, 296-302 (1997). IV Published PCT patent application PCT/US98/07899, publication date Oct. 22, 1998.
Even New RP Methods of HTS (Including those that Use QSAR Approaches) are often Essentially Limited to Binary Splitting or Small Data Sets.
A third RP method uses chemical or molecular descriptors generated from 2D topological representations of molecular structures. Such descriptors include atom pairs separated by minimal topological distance, topological torsions, and atom triples employing shortest path lengths between the atoms in a triple. This third method, while using distance and topological descriptors, also generates only binary trees. Thus the method is also essentially limited to a presence-or-absence type of categorization (or splitting). This reference indicates that segmenting into more than two daughter nodes using techniques such as FIRM is essentially limited to working with small amounts of data, because of increases in computer run time.V The reference essentially indicates that viable general RP packages for HTS are limited to small data sets. See also related U.S. Pat. No. 6,434,542. V Rusinko et al., Analysis of a Large Structure/Biological Activity Data Set Using Recursive Partitioning. J. Chem. Inf. Comput. Sci. 1999, 39, 1017-1026. “In contrast to data partitioning via continuous variables, binary classification trees can be computed very quickly and efficiently since there are far fewer and much simpler calculations involved. For example, FIRM develops rules for splitting based on “binning” of continuous variables and amalgamating contiguous groups. These procedures add considerably to execution time and hence limit the interactive nature of most general recursive partitioning packages to data sets much smaller than those under consideration. With binary data, on the other hand, a parent node can only be split into two and only two daughter nodes.” (p. 1019)
There is a Great, Unmet need for Faster Computational HTS-QSAR, RP Techniques Employing Multi-Way Splitting Using Geometry-Based Molecular Descriptors.
Binary splitting is essentially a two-category approach: (1) presence or (2) absence. Such binary splitting cannot take full advantage of the dimensional measurement information present in continuous variables or descriptors, such as distance-type descriptors.
By contrast, multi-way splitting (or categorization) is generally more versatile than mere binary splitting. Like an ordinary ruler, multi-way splitting divides quantities such as distances into gradated segments based on number measurement. If such multi-way splitting could be done using geometry-based molecular descriptors (such as molecular descriptors based on distances between parts of a molecule), there would be a fuller and more natural use of the actual dimensional measurement information present in geometry-based molecular descriptors. Molecules could then be sorted into segments wherein the molecules in each segment have about the same actual geometric measurements of like molecular parts.
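The ruler analogy above can be made concrete with a small sketch. The distances and cut points below are hypothetical, chosen only to show the contrast between presence/absence categorization and gradated multi-way segmenting:

```python
# Contrast between a binary presence/absence split and a multi-way
# split on a continuous geometry-based descriptor (here, hypothetical
# atom-pair distances in angstroms).
from bisect import bisect_right

def binary_split(values, present):
    """Presence/absence categorization: only two daughter nodes."""
    return ([v for v, p in zip(values, present) if p],
            [v for v, p in zip(values, present) if not p])

def multiway_split(values, cuts):
    """Sort continuous descriptor values into len(cuts)+1 gradated
    segments, like ruler marks along the measurement axis."""
    nodes = [[] for _ in range(len(cuts) + 1)]
    for v in values:
        nodes[bisect_right(cuts, v)].append(v)
    return nodes

# Hypothetical atom-pair distances (angstroms) for six molecules:
dists = [2.1, 3.8, 4.0, 5.5, 6.2, 3.9]
print(multiway_split(dists, [3.0, 5.0]))
# → [[2.1], [3.8, 4.0, 3.9], [5.5, 6.2]]  (short, medium, long)
```

The multi-way split preserves the measurement information: molecules with about the same inter-part distance land in the same segment, whereas a binary split can only record whether a descriptor is present.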
However, this great need of multi-way segmenting using geometry-based descriptors has remained unfulfilled. This is because conventional HTS-QSAR, RP techniques with distance type descriptors are essentially only viable with binary splitting. These conventional techniques, which use conventional segmenting algorithms, are too slow to do multi-way splitting.
Fast Segmenting Algorithms make Possible Computational HTS-QSAR Approaches that Employ Multi-Way Splitting RP Techniques with Geometry-Based Molecular Descriptors.
The inventor's novel Fast Segmenting Algorithms make multi-way splitting using geometry-based molecular descriptors a reality by greatly increasing speed and decreasing computer run times. These Fast Segmenting Algorithms (FSAs) lead to inventions that fulfill the great unmet need.
Versions of the Invention Fulfill the Great Need of True Segmenting Using Geometry-Based Descriptors in Computational HTS
Versions of the present invention are computer-based methods that perform multi-way segmenting on molecules (such as drug candidates) using geometry-based molecular descriptors. These computer-based methods use, or have the potential to use, one or more fast segmenting algorithms to perform their segmenting. Versions of the invention are viable RP software packages for multi-way segmenting of large data sets of drug candidates and the candidates' geometry-based molecular descriptors. These software packages are fast enough to allow a researcher to interact meaningfully with a package program during operation. Thus versions of the invention fulfill the great need for a computational RP segmenting method in pharmaceutical HTS that makes full and natural use of the dimensional measurement information present in geometry-based molecular descriptors.
Versions of the Invention Sort Candidate Molecules into Subgroups. The Molecules in Each Subgroup have Molecular Parts with About the Same Geometric Measurements. Pharmacophores Sought by HTS Methods are Important Examples of Such Molecular Parts.
Fast Segmenting Algorithms (FSAs) using geometry-based descriptors sort a group of candidate drug molecules into segments (or subgroups). The molecules in each segment have molecular parts with about the same geometric measurements. When segmenting using geometry-based descriptors is done repeatedly (or recursively), group molecules are sorted into segments on the basis of multiple geometric measurements. Such recursive segmenting or partitioning of a group of molecules generates a nodal tree (similar to the tree in FIG. 2). Group molecules are sorted into nodes (or subgroups) so that the molecules in each node have similar molecular parts; these parts have about the same actual geometric measurements. The nodal tree thus sorts the molecules so that the molecules in some nodes have a molecular part or parts that are pharmacophores with about the same geometric measurements. This fuller, more natural use of geometric information makes for more powerful methods of finding the molecules sought by computational HTS-QSAR procedures. In effect, HTS-QSAR approaches that employ RP techniques and multi-way splitting with geometry-based descriptors can find (and predict) more exact and better fitting “molecular puzzle pieces” and molecules. These candidate drug molecules with such molecular parts or pharmacophores are the “better fitting molecular puzzle pieces” that are the ultimate pursuit of computational HTS-QSAR procedures.
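The recursive sorting described above can be sketched schematically. The descriptor names (e.g. `d_NO`, standing in for some assumed inter-atomic distance) and fixed cut points are hypothetical; a real segmenting algorithm would choose cut points by optimizing a homogeneity score rather than taking them as given:

```python
# Schematic of recursive partitioning on geometry-based descriptors:
# each level segments the molecules on one descriptor, and each
# resulting node is segmented again on the next descriptor, yielding
# a nodal tree whose leaves hold molecules with similar measurements.
from bisect import bisect_right

def segment(mols, key, cuts):
    """Multi-way split of molecules on one continuous descriptor."""
    nodes = [[] for _ in range(len(cuts) + 1)]
    for m in mols:
        nodes[bisect_right(cuts, m[key])].append(m)
    return nodes

def build_tree(mols, plan):
    """plan: list of (descriptor, cut points) applied recursively."""
    if not plan or not mols:
        return mols  # leaf node: molecules with similar measurements
    (key, cuts), rest = plan[0], plan[1:]
    return [build_tree(node, rest) for node in segment(mols, key, cuts)]

# Three hypothetical molecules, each described by two distances:
mols = [{"d_NO": 2.4, "d_CC": 1.5},
        {"d_NO": 4.1, "d_CC": 1.4},
        {"d_NO": 2.6, "d_CC": 3.0}]
tree = build_tree(mols, [("d_NO", [3.0]), ("d_CC", [2.0])])
```

Each leaf of `tree` then holds molecules whose `d_NO` and `d_CC` measurements both fall in the same gradated ranges, mirroring how a nodal tree groups molecules with like molecular parts.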
Some Details of the Operation of Versions of Fast Segmenting Algorithms
Conventional segmenting algorithms essentially compute an overall measure of segment homogeneity (sometimes referred to as a score) for all possible segmentations or splits of a data set. Versions of Fast Segmenting Algorithms (FSAs) achieve their increased speed by computing an overall measure of segment homogeneity (or score value) for only some of the possible splits of a data set. In particular, some versions of FSAs compute a score value only for select splits that have a high probability of being a (or the) split with an optimal score value. FSAs also make use of dynamic programming techniques such as running sums and updating. Thus versions of FSAs are fast DP algorithms that find one or more splits of a data set, wherein the splits are probable optimal splits.
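The running-sums idea mentioned above can be illustrated with a minimal sketch (not the inventor's actual algorithm): by maintaining prefix sums of the values and their squares, the within-segment sum of squared deviations of any candidate segment can be scored in O(1), using the identity SSE = Σx² − (Σx)²/n, instead of re-summing the segment for every candidate split:

```python
# Sketch of scoring candidate splits with running (prefix) sums, so
# each candidate boundary costs O(1) rather than a fresh pass over
# the segment. Illustrative only.

def prefix_sums(data):
    """Running sums of x and x^2 over the data sequence."""
    s, s2 = [0.0], [0.0]
    for x in data:
        s.append(s[-1] + x)
        s2.append(s2[-1] + x * x)
    return s, s2

def segment_sse(s, s2, i, j):
    """SSE of data[i:j] in O(1): sum(x^2) - (sum x)^2 / n."""
    n = j - i
    tot = s[j] - s[i]
    return (s2[j] - s2[i]) - tot * tot / n

data = [1.0, 1.0, 2.0, 9.0, 10.0, 11.0]
s, s2 = prefix_sums(data)
# Score every single-boundary (two-segment) split, O(1) work each:
scores = [segment_sse(s, s2, 0, b) + segment_sse(s, s2, b, len(data))
          for b in range(1, len(data))]
best = min(range(len(scores)), key=scores.__getitem__) + 1
# best boundary falls between the low and high clusters
```

The same updating trick lets a DP algorithm evaluate many candidate segment boundaries cheaply, which is one route to the shorter run times the text attributes to FSAs.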
There is a Multitude of Potential Applied Uses for Fast Segmenting Algorithms and Special Score Functions.
Just as there is a great need for fast segmenting techniques and FSAs in pharmaceutical high throughput screening, these techniques and algorithms have great potential in general chemistry and general computational chemistry. Potential uses of fast segmenting techniques and algorithms are also present in a multitude of other fields. A few examples of fields in which real-world data in segmented form has great utility include clinical trials analysis (relating physiological and environmental factors to clinical outcomes), genetics (relating genetic descriptions of organisms to other organism characteristics), geology (finding minerals and oil), modeling nosocomial infections in hospitals, market research (market segmentation), industrial quality improvement (wherein data are frequently “messy” or nonidealized), and demographic studies. Professor Hawkins has also invented novel measures of segment (or intra-segment) data homogeneity, the special score functions (see below). (No reference, technique or invention is admitted to be prior art with respect to the present invention by its mention in this background or summary.)