1. Field of the Invention
The present invention relates generally to data mining, and, more specifically, to software tools that analyze similarity amongst descriptors of items contained in a database.
2. Description of the Related Art
In a large number of molecular biology tasks descriptors are used to characterize and quantify properties for the item or items of interest. For example in the simplest form of gene expression analysis, a gene""s behavior is observed over time and its level of expression tracked as environmental conditions change; many and sometimes all of the genes of a given organism are tracked over time. In the more traditional drug discovery, descriptors (values that characterize structure, charge or other property) for atomic complexes, such as molecules or portions of molecules, form the basis for the selection of potential drug candidates.
Although these are only two examples but they highlight the nature of the problem. Each item of interest (e.g. gene, gene fragment, protein fragment, DNA fragment, atomic complex etc.) is characterized by a set of observable properties (e.g. expression level as a function of a changing environment, shape, mass, inertia, the steric and electrostatics fields it generates, interaction affinity etc.) that are inherently related to its identity or composition. Over the years, various representation schemes have been proposed that attempt to capture and represent some of these observable properties by implicit or explicit calculations carried out on one or more components of the item of interest. As a rule of thumb, there exist no universal representation schemes. Moreover, a given representation scheme typically provides an approximate description of the observable property (-ies) it attempts to capture. For example, and in the general case, the electrostatic field outside of an atomic complex can be described completely with the help of an infinite number of moments (i.e., point charge, dipole, quadrupole, octapole, hexadecapole, and so on); however, low order approximations are typically used that provide a less accurate approximation, but easy to compute.
Given a collection of items of interest (e.g. genes of an organism, gene fragments, protein fragments, DNA fragments, atomic complexes that comprise molecules or molecule fragments) and a representation scheme, the discovery process entails an educated selection among the descriptor set of the items. This selection invariably involves comparisons of and is dependent and based upon the representations that have been chosen for the items under consideration. In the discussion that follows, we will be drawing our examples from the specific problems of gene expression analysis and rational drug design but the described method and system are applicable to any situation where one is given a database of items to be studied with each of the items being represented by a set of descriptors and what is sought is the discovery of collections of items that have values which agree across one or more descriptors.
Software tools have been developed to aid in the discovery process. These tools generally provide the user with the ability to specify a particular descriptor, and then identify the number of items in the set being studied that have the same quantized value for a particular descriptor. In addition, such tools typically provide the user with the ability to specify two or more descriptors, and then identify the number of items in the set being studied that have the same quantized value for the specified descriptors. Alternatively, one can think of the input that such tools process as a two dimensional array of values (the number of rows is equal to the number of items in the database (e.g. genes, gene fragments, protein fragments, DNA fragments, atomic complexes, etc.) while the number of columns is equal to the number of descriptors used by the representation scheme): after one or more columns have been identified, the tool will determine and report to the user those of the database items (e.g. genes, atomic complexes etc.) whose values agree across the entire set of identified columns. Occasionally, the user is also required to specify a minimum support, i.e. the minimum number of atomic complexes that must share the same collection of values across the specified columns. An example of such a tool in the context of rational drug design is DIVA, distributed by Oxford Molecular, additional information regarding this tool may be found at www.oxmol.com. The two-dimensional table relating database items with their attributes is a specific manifestation of what has been called the binary relationship in the context of Galois Concept lattices. A binary relationship consists of explicitly stating the attributes(descriptors) of a group of objects (genes, molecules, etc.). The present work is a unique tool in providing in a most efficient manner the entire hierarchy of descriptor correspondences and the support for each such correspondence. Namely, how many objects (genes, gene fragments, protein fragments, DNA fragments, atomic complexes, etc.) agree with respect to every conceivable subset of descriptor correspondences.
To stress the generality of the problem as we have just defined it, we wish to note that the following database comprising 5 items that are represented using a set of 6 descriptors
can be equivalently represented
where {e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12, e13, e14, e15, e16, e17, e18, e19} are user-defined symbols, without any loss of information. Correspondences discovered on input of this type are typically thought of as associations and the problem of discovering these corresponsdences is also referred to as association discovery.
The problem described above is computationally very demanding and although it is tractable for small collections of database items, its computational requirements become prohibitive very quickly independent on whether the user drives the discovery process or the computer performs a brute force exhaustive enumeration. Let us first examine the case where the user drives the discovery. Every time the user identifies a collection of descriptors of interest the user has made a hypothesis about a potential relationship (i.e. association) among the descriptors. The computational tool will then process the input in order to verify whether the user""s hypothesis is supported by the input set. Each hypothesis that is corroborated by the computational tool gives rise to a collection V of atomic complexes that is then examined in order to determine its quality: this can be achieved easily by examining the observable properties of each of the V reported complexes to determine an agreement; if successful, a potential causal relationship is formed that leads from a specific choice of values for the identified set of descriptors to the intersection of observable properties of the respective set of reported database items; the descriptors in the specified set that led to a verified hypothesis are then thought of as being xe2x80x9cassociatedxe2x80x9d or xe2x80x9ccorrelated.xe2x80x9d These formed causal relationships are further analyzed for validity using other means of investigation.
If a hypothesis proves to lead to no such potentially promising results, the user identifies a new collection of descriptors and lets the computational tool repeat the processing. In the absence of any task-specific knowledge, the user would have to repeatedly identify sets of descriptors that the tool will need to process.
As an alternative to the above described process, other currently available methods proceed by a brute force enumeration: collections comprising increasingly more descriptors (one descriptor, two descriptors, three descriptors, etc.) are hypothesized by the computational tool this time (instead of the user) and as above checked to determine whether they are supported by the input set. In other words, instead of the user making a potentially educated selection among the available set of descriptors, the computer enumerates over all possible such selections.
Obviously, and independent of whether it is the user or the computational tool that is driving the descriptor selection process, up to 2D many descriptor choices may have to be made (i.e. choices involving exactly two, three, four, five, . . . , Dxe2x88x921, and D descriptors) by the user before any promising result can be reported. Typically, only promising such choices may be made effectively pruning the number of possibilities and speeding up the process. But in the case of a general input these approaches are computationally very demanding.
Because of the evident computational complexity, some of the currently available tools leave the decision process to the user who has to explicitly select the descriptors that are of interest instead of automatically attempting to process the descriptors for the complexes under consideration. It should be noted that the user""s selections are typically independent of whether the processed data set supports agreements amongst the selected descriptors.
But even when a collection of C descriptors has been identified that the computational tool indicated as being supported by a subset of the input comprising S database items, it is still not clear whether these C descriptors are the only descriptors that are shared by the S items . In other words, it is unclear whether any of these collections is maximal with respect to the available set of descriptors. It should be stressed that it is important that any reported descriptor collections be maximal both in terms of composition and of length (i.e. no additional descriptors can be found which are shared by the S items under consideration) since only then will they be most specific. The reason for this last requirement is related to the obvious relevance that the most specific collection of descriptors for these S items has in determining a potential causal relationship thatleads from a choice of values for some descriptors to shared observable properties.
It should be evident at this point that there remains a need in the art for a mechanism that efficiently generates an extensive and exhaustive list of correspondences (agreements or associations) amongst the descriptors of a set of database items.
The problems stated above and the related problems of the prior art are solved with the principles of the present invention, method and apparatus for exhaustive enumeration of similarity in a set of descriptors. The present invention utilizes a set of D descriptors for each of N items. A value Kxe2x80x2 representing a number of descriptors, and a value K for a number of items that should support an hypothesis are generated, preferably via user input. Collections involving K or more items for which there is an association involving a selection of values across at least Kxe2x80x2 of the D descriptors are identified, and preferably reported to the user.