Traditionally, the design of novel molecular species (e.g. drugs) has essentially been a trial-and-error process, despite the tremendous efforts devoted to it by pharmaceutical and academic research groups. In an attempt to counter the rapidly increasing costs associated with the discovery of new medicines, computer-based approaches have been developed. Modern approaches to computer-aided molecular design fall into two general categories. The first comprises structure-based methods, which utilise the three-dimensional structure of a ligand-bound receptor. The second comprises ligand-based methods, in which the physicochemical or structural properties of ligand molecular species are characterized. A classic example of the latter is a quantitative structure-activity relationship (QSAR) model. Quantitative structure-activity relationships are mathematical relationships that quantitatively link chemical structures, represented in the form of molecular descriptors, to pharmacological activity for a series of molecular species.
Virtual screening is the computational process whereby libraries of existing or virtual molecular species are searched for molecular species that meet well-defined criteria. In general, virtual screening is applied to search for molecular species that might be active against certain disease-related proteins, where the activity is derived from the calculated interaction between the protein and the molecular species. The molecular species are scored using well-defined mathematical functions with the aim of prioritizing them for further analysis. Typically, two major virtual screening tendencies can be distinguished.
The first tendency consists of the protein structure-based approach, whereby the potential binding pocket of a protein is used as the reference function. The selection of potential binding pockets is still a major challenge within the pharmaceutical industry. Once the reference function is known, one can start screening for molecular species having the desired properties with respect to binding to the target protein. A number of suitable methods have been described:
- Docking of molecular species within the target protein.
- Pharmacophore representation of the binding pocket of the target protein.
The second tendency consists of a ligand-based approach, whereby molecular species with known affinity for a target protein or disease model are used as the reference function. A model is derived from these reference molecular species and can be used to annotate other molecular species with respect to their potential binding capabilities. A number of suitable ligand-based approaches have been described, some two-dimensional and some three-dimensional. The 2D methods have the advantage that they can be applied very efficiently to search molecular databases. Their disadvantage is that they are rather unspecific, which is not the case for the somewhat slower 3D methods. These forms of virtual screening can be integrated with available high-throughput screening (HTS) results. The following is an overview of a typical ligand-based virtual screening application combined with high-throughput screening:
1. Select a training set of molecular species from the HTS results;
2. Train a model based on common characteristics of the selected molecular species;
3. Use the model to score the other molecular species within the database;
4. Validate the prioritized molecular species using the HTS results or by means of new biochemical assay data;
5. Repeat the procedure until convergence of the model has been reached.
Such modern high-throughput screening platforms require the implementation and integration of efficient and robust virtual screening protocols and algorithms.
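By way of illustration only, the five-step ligand-based procedure described above can be sketched as a small loop. This is a toy sketch under stated assumptions, not a description of any particular screening platform: fingerprints are represented as sets of on-bit positions, the "model" is simply the set of training fingerprints, scoring uses the highest Tanimoto similarity to any training member, and convergence is taken to mean that a round confirms no new active. The function names, the identifiers `m1` to `m5`, and the data are all hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Shared on-bits divided by the bits set in either fingerprint."""
    common = len(bits_a & bits_b)
    union = len(bits_a) + len(bits_b) - common
    return common / union if union else 0.0


def iterative_screen(library, hts_active, seed_ids, top_n=1, max_rounds=10):
    """library: id -> set of on-bits; hts_active: id -> bool (assay result)."""
    training = set(seed_ids)  # step 1: training set taken from HTS hits
    for _ in range(max_rounds):
        # Step 2: the toy "model" is just the training fingerprints.
        model = [library[i] for i in training]
        # Step 3: score every molecule that is not yet in the training set.
        scores = {
            i: max((tanimoto(fp, ref) for ref in model), default=0.0)
            for i, fp in library.items()
            if i not in training
        }
        ranked = sorted(scores, key=lambda i: (-scores[i], i))
        # Step 4: validate the top-ranked molecules against the HTS results.
        confirmed = {i for i in ranked[:top_n] if hts_active[i]}
        # Step 5: stop when a round adds nothing new (assumed convergence).
        if not confirmed:
            break
        training |= confirmed
    return training


# Hypothetical toy data: five molecules with made-up fingerprints.
library = {
    "m1": {1, 2, 3, 4},
    "m2": {1, 2, 3, 5},
    "m3": {1, 2, 5, 6},
    "m4": {7, 8, 9},
    "m5": {2, 5, 6, 7},
}
hts_active = {"m1": True, "m2": True, "m3": True, "m4": False, "m5": False}

hits = iterative_screen(library, hts_active, seed_ids=["m1"])
print(sorted(hits))  # ['m1', 'm2', 'm3']
```

In this toy run the training set grows by one confirmed active per round and stops when the best-scoring remaining molecule turns out to be inactive. A real protocol would replace the similarity lookup with a trained model and a proper convergence criterion.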
To be usable within a computational context, molecular information must be translated into a suitable form, generally called a descriptor. Molecular descriptors can vary greatly in their complexity. A simple example is a structural key descriptor, which takes the form of a binary indicator variable that encodes the presence of certain substructural or functional features. Other descriptors, such as HOMO (Highest Occupied Molecular Orbital) and LUMO (Lowest Unoccupied Molecular Orbital) energies, require semi-empirical or quantum mechanical calculations and are therefore more time-consuming to compute. Molecular descriptors are often categorised according to their dimensionality, which refers to the structural representation from which the descriptor values are derived. In general, one can classify the current molecular descriptors as one-dimensional (1D), two-dimensional (2D), or three-dimensional (3D).
One-dimensional descriptors reflect the ‘bulk’ properties of molecular species, such as the molecular weight, the number of atoms, or the partitioning of the molecular species between hydrophilic and lipophilic phases. One-dimensional descriptors are generally fast to calculate and can be calculated from the molecular composition alone. Nevertheless, one-dimensional descriptors lack any information about the connectivity between the atoms, and are therefore of rather limited accuracy when applied to drug discovery and virtual screening problems.
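As a minimal illustration of descriptors computed from composition alone, the following sketch derives an atom count and a molecular weight from a flat molecular formula. The parser handles only simple formulas without parentheses, the table of atomic weights is deliberately short and approximate, and the function names are hypothetical.

```python
import re

# Approximate standard atomic weights (g/mol) for a few common elements.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007,
                 "O": 15.999, "S": 32.06, "Cl": 35.45}


def parse_formula(formula):
    """Parse a flat formula such as 'C9H8O4' into element counts."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def one_dimensional_descriptors(formula):
    """Return bulk descriptors that need no connectivity information."""
    counts = parse_formula(formula)
    return {
        "atom_count": sum(counts.values()),
        "molecular_weight": sum(ATOMIC_WEIGHT[el] * n
                                for el, n in counts.items()),
    }


aspirin = one_dimensional_descriptors("C9H8O4")
print(aspirin["atom_count"])  # 21

# Ethanol and dimethyl ether have the same composition (C2H6O) but
# different connectivity, so their 1D descriptors are indistinguishable.
ethanol = one_dimensional_descriptors("C2H5OH")
dimethyl_ether = one_dimensional_descriptors("CH3OCH3")
print(ethanol == dimethyl_ether)
```

The last comparison shows the limitation noted above: two constitutional isomers receive identical one-dimensional descriptors because nothing in the calculation depends on how the atoms are connected.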
The calculation of two-dimensional descriptors requires knowledge of the molecular topology, and comprises information on the presence or absence of well-defined functional moieties, topological distances between well-defined atoms, and information regarding side chains and ring systems. Two-dimensional descriptors have found their use in chemical similarity analyses and structure-activity relationships, and are useful in complementing three-dimensional descriptors. The most widely used two-dimensional descriptors are molecular fingerprints, ‘E-state’ indices, and hologram QSAR descriptors.
Molecular fingerprints are essentially bitmaps consisting of on and off bits, where each position along the bitmap is assigned to a specific and well-defined molecular fragment. If that particular fragment exists in the molecular species under consideration, then the corresponding bit is set to on; otherwise it is left as off. There are two general methods of 2D fingerprint generation. The first, known as the ‘hashed’ method, uses a set of rules for generating the fragments for fingerprinting. The second method, known as the ‘keyed’ method, requires a priori substructural definitions for all fragments that should be searched for during the fingerprint generation process. Similarity assessments between molecular species based on two-dimensional fingerprints can be done in a number of ways, although the most commonly used metrics are based on Tanimoto coefficients. The Tanimoto coefficient for a pair of structures is the number of on bits that their fingerprints have in common, divided by the number of bits that are on in at least one of the two fingerprints.
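The two fingerprinting styles and the Tanimoto comparison can be sketched as follows. This is an illustrative sketch only: molecules are assumed to be already decomposed into named fragments, the key list `KEYS` and the fragment names are hypothetical, and the hashed variant uses `zlib.crc32` merely as a convenient stable hash rather than any scheme prescribed by a fingerprinting package. Fingerprints are stored as sets of on-bit positions.

```python
import zlib

# Hypothetical a priori key definitions for the 'keyed' method.
KEYS = ["carboxylic_acid", "ester", "benzene_ring",
        "hydroxyl", "amine", "amide"]


def keyed_fingerprint(fragments):
    """Bit i is on if the predefined fragment KEYS[i] is present."""
    return {i for i, key in enumerate(KEYS) if key in fragments}


def hashed_fingerprint(fragments, n_bits=64):
    """Each generated fragment is hashed to a bit position.

    Different fragments may collide on the same bit.
    """
    return {zlib.crc32(f.encode("utf-8")) % n_bits for f in fragments}


def tanimoto(bits_a, bits_b):
    """Bits in common divided by bits on in at least one fingerprint."""
    common = len(bits_a & bits_b)
    union = len(bits_a) + len(bits_b) - common
    # Two empty fingerprints are given similarity 0.0 by convention here.
    return common / union if union else 0.0


# Hypothetical fragment decompositions of two related molecules.
aspirin = {"carboxylic_acid", "ester", "benzene_ring"}
salicylic_acid = {"carboxylic_acid", "hydroxyl", "benzene_ring"}

similarity = tanimoto(keyed_fingerprint(aspirin),
                      keyed_fingerprint(salicylic_acid))
print(similarity)  # 0.5: two shared bits out of four bits set in either
```

The keyed fingerprint can only report fragments that appear in its key list, whereas the hashed fingerprint accepts any fragment name at the cost of possible bit collisions.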
Electrotopological state (E-state) indices capture both molecular connectivity and the electronic character of a molecular species. The method makes use of the hydrogen-suppressed graph to represent the molecular structure. The focus of the method is on the individual atoms and hydride groups of the molecular skeleton. Intrinsic valence and sigma electron descriptors are assigned to each atom depending on the counts of valence and sigma electrons of the corresponding atoms. From these atom descriptors molecular connectivity indices may be calculated by multiplying the sigma and valence values for each atom in a fragment within a molecular species. This product is then converted to the reciprocal square root and called the connectivity subgraph term.
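The reciprocal-square-root construction can be illustrated for the simplest case, in which each fragment is a single bond of the hydrogen-suppressed graph. The sketch below assumes that only the sigma (simple) value is used for each atom, taken as its number of non-hydrogen neighbours; valence values, which require electron counts, are not modelled. The function names and the atom numbering are hypothetical.

```python
import math


def simple_deltas(n_atoms, bonds):
    """Number of non-hydrogen neighbours of each atom in the graph."""
    delta = [0] * n_atoms
    for i, j in bonds:
        delta[i] += 1
        delta[j] += 1
    return delta


def first_order_connectivity(n_atoms, bonds):
    """Sum over bonds of 1 / sqrt(delta_i * delta_j)."""
    delta = simple_deltas(n_atoms, bonds)
    return sum(1.0 / math.sqrt(delta[i] * delta[j]) for i, j in bonds)


# Hydrogen-suppressed graphs of two C4H10 isomers (carbon atoms only).
n_butane = [(0, 1), (1, 2), (2, 3)]    # linear chain
isobutane = [(0, 1), (0, 2), (0, 3)]   # atom 0 is the branching carbon

print(round(first_order_connectivity(4, n_butane), 4))   # 1.9142
print(round(first_order_connectivity(4, isobutane), 4))  # 1.7321
```

Unlike a descriptor based on composition alone, the index separates the two isomers, because branching changes the neighbour counts that enter each bond term.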
Hologram QSAR (HQSAR) is another two-dimensional descriptor approach, in which the number of times each fragment is encountered in a molecular species is counted, rather than merely using bitmaps to represent the absence or presence of particular fragments. The resulting integer strings are subsequently hashed to reduce the string length and used as input for partial least squares analysis to correlate with biological data.
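The difference between a fragment-count hologram and a presence/absence bitmap can be sketched as follows. The sketch assumes that fragments have already been enumerated as strings, uses `zlib.crc32` as a stand-in hash, and picks an arbitrary hologram length of 16; none of these choices is taken from the HQSAR method itself, and the subsequent partial least squares step is not shown.

```python
import zlib


def molecular_hologram(fragments, length=16):
    """Hash each fragment occurrence into a fixed-length array of counts."""
    hologram = [0] * length
    for fragment in fragments:
        position = zlib.crc32(fragment.encode("utf-8")) % length
        hologram[position] += 1  # count occurrences, not mere presence
    return hologram


# Hypothetical fragment list with repeats, one entry per occurrence.
fragments = ["C-C", "C-C", "C-O", "C=O", "C-C"]

hologram = molecular_hologram(fragments)
bitmap = [1 if count else 0 for count in hologram]

print(hologram)
print(bitmap)
```

The bitmap records only that a bin was hit, while the hologram retains how many fragment occurrences fell into it, which is the additional information the counting approach is meant to preserve.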
Three-dimensional descriptors reflect the molecular shape and the spatial arrangement of the functional moieties that are thought to be important for the interaction between ligand and receptor. As implied by the name, three-dimensional descriptors are generated from a three-dimensional representation of molecular species. With very few exceptions, the descriptor values are computed from a static conformation, which is either a standard conformation with ideal geometries generated by programs such as CORINA (Sadowski et al., 1993, Chem. Rev. 7, 2567-2581) or Omega (Boström et al., 2003, J. Mol. Graph. Mod. 21, 449-462), or a conformation that is fitted against a target X-ray structure or a pharmacophore.
An example of a three-dimensional descriptor is described in U.S. Pat. No. 5,025,388, which relates to the CoMFA methodology. CoMFA, an acronym for Comparative Molecular Field Analysis, is a 3D quantitative structure-activity relationship technique which ultimately allows one to design molecular species and predict their activities. The molecular species with known properties, which form the training set, are suitably aligned in 3D space according to various methodologies. Charges are then calculated for each molecular species at a level of theory deemed appropriate. Steric and electrostatic fields are subsequently calculated for each molecular species by interaction with a probe atom at a series of grid points surrounding the aligned database in three-dimensional space. Finally, these field energy terms are correlated with a property of interest by means of partial least squares with cross-validation, giving a measure of the predictive power of the model.
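The field calculation step can be sketched for a single, already aligned molecular species. The sketch is illustrative only and rests on several assumptions that are not taken from the cited patent: the steric term is a 12-6 Lennard-Jones potential, the electrostatic term is a Coulomb potential, and the probe charge, well depth, contact distance, energy cutoff, grid spacing and atom coordinates are all placeholder values. Alignment, charge calculation and the partial least squares step are not shown.

```python
import math
from itertools import product

COULOMB_KCAL = 332.06  # converts e^2/angstrom to kcal/mol


def probe_fields(atoms, grid_points, probe_charge=1.0,
                 well_depth=0.1, r_min=3.4, cutoff=30.0):
    """Return (steric, electrostatic) probe energies at each grid point.

    atoms: list of (x, y, z, partial_charge) tuples, in angstroms.
    """
    fields = []
    for point in grid_points:
        steric = 0.0
        electrostatic = 0.0
        for x, y, z, charge in atoms:
            # Clamp the distance to avoid division by zero on an atom.
            r = max(math.dist(point, (x, y, z)), 1e-3)
            ratio6 = (r_min / r) ** 6
            steric += well_depth * (ratio6 * ratio6 - 2.0 * ratio6)
            electrostatic += COULOMB_KCAL * probe_charge * charge / r
        # Truncate large energies close to the atoms.
        fields.append((min(steric, cutoff),
                       max(-cutoff, min(electrostatic, cutoff))))
    return fields


# Hypothetical two-atom "molecule" and a 5 x 5 x 5 grid with 2 angstrom spacing.
atoms = [(0.0, 0.0, 0.0, 0.5), (1.5, 0.0, 0.0, -0.5)]
grid = list(product(range(-4, 5, 2), repeat=3))

fields = probe_fields(atoms, grid)
print(len(fields))  # 125 grid points, two field values each
```

Each molecular species in the training set would contribute one such vector of field values, and it is these vectors that are then correlated with the property of interest.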
The CoMFA method has the drawback that it requires the molecules under investigation to be aligned in the same reference frame, which makes CoMFA difficult to apply to molecular systems of different structural classes. It also does not permit discrimination between stereoisomers. Additionally, the descriptors obtained by this method only translate the electronic properties of the molecular species. There is therefore a need in the art for an improved, stereospecific and fast method of generating descriptors from three-dimensional objects by translating a wider range of their properties. There is also a need in the art for such a method that does not require alignment of the molecules under investigation.