A. Field of the Invention
The present invention relates to extracting attributes from sequence strings and from information representing biopolymer materials, and more particularly to a method and apparatus which extracts attributes from information representing biopolymer material to create objects useful for analyzing large amounts of data using multivariate analysis.
B. Description of the Related Art
DNA, RNA, and proteins represent key functional units in biological systems. DNA is composed of nucleotide subunits (deoxyadenosine, deoxythymidine, deoxycytidine, and deoxyguanosine) linked together to form an array of biopolymer material. Often, the linked chain is bound to a complementary chain to form a double helix. The code contained within the DNA is of multiple types. Some sequences within the DNA are recognized by regulatory factors and control how the biopolymer information is expressed. Some sequences encode structural attributes that contribute to the overall use of the biopolymer material. And some sequences encode the RNA or proteins that carry out functions within the cell. For simplicity, DNA is usually represented as an ordered string of the deoxynucleotides (e.g., GATTCTAGGA (SEQ ID NO:1)), but that simple string reflects the full function of the molecule. The RNA copy of the DNA is also a chain of nucleotides (adenosine, uridine, cytidine, and guanosine being the major ones) (e.g., AUGGACCAUA (SEQ ID NO:2)). Some RNAs are translated into proteins, which are strings of amino acid building blocks.
There are 20 principal amino acid building blocks, and proteins are often represented simply by an ordered string of sequence letters (e.g., MRKLAGQPS (SEQ ID NO:3)). The function of proteins is not, however, fully contained within this simple string, since the building blocks can be modified in multiple ways within a cell. Nonetheless, the sequence of the amino acids is the primary contributor to the function of the protein.
The realm of bioinformatics is largely focused on trying to predict the function of genomic sequences. This work involves comparing the strings of information (genomic sequences), functional properties, and behavior of known and unknown entities, thereby providing a basis for predicting the similar function of sequences with similar properties. These methods, however, are not usually geared toward simultaneous analysis of a large number of sequences. Thus, it is difficult to get an overview of how all the unknown and known sequences relate to each other from these methods.
A number of multivariate analysis methods, including those geared toward data visualization and data mining, are available. In each case, a data object is represented as a high-dimensional vector, where the number of dimensions is equal to the number of independent attributes required to describe the data object.
Relatively few methods, however, have been applied to represent data strings, such as genome sequences, as high-dimensional vectors. One method creates a signature for protein sequences based on the occurrence of all possible amino acid dimers (or pairs of amino acids). See van Heel M., A New Family of Powerful Multivariate Statistical Sequence Analysis Techniques, 220 J. Mol. Biol. 877-887 (1991). Application of this method with 20 amino acids resulted in a 20×20 or 400-dimensional representation for each protein for comparison using cluster analysis.
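By way of illustration only, the dimer signature described above can be sketched as follows. The sketch assumes that overlapping adjacent pairs are counted and that pairs containing a letter outside the 20 principal amino acids are skipped; neither detail is taken from the cited reference, and the names dimer_vector, DIMERS, and DIMER_INDEX are hypothetical.

```python
from itertools import product

# The 20 principal amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 20 x 20 = 400 ordered pairs, each mapped to a fixed vector position.
DIMERS = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
DIMER_INDEX = {dimer: i for i, dimer in enumerate(DIMERS)}


def dimer_vector(sequence):
    """Return a 400-element list of dimer counts for a protein string.

    Assumption: overlapping adjacent pairs are counted, and any pair
    containing a non-standard letter is ignored.
    """
    counts = [0] * len(DIMERS)
    for i in range(len(sequence) - 1):
        index = DIMER_INDEX.get(sequence[i:i + 2])
        if index is not None:
            counts[index] += 1
    return counts


# Example using the protein string given earlier (SEQ ID NO:3).
vec = dimer_vector("MRKLAGQPS")
```

Each protein, whatever its length, is thereby reduced to a vector of the same dimensionality, which is what allows the vectors to be compared by cluster analysis.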
Another method additionally includes information about individual amino acids (composition) and descriptive information such as the length of the sequence and the pI (isoelectric point). These composite vectors were then used for searching data sets to identify similar sequences. See Hobohm U. and Sander C., A Sequence Property Approach to Searching Protein Databases, 251 J. Mol. Biol. 390-399 (1995).
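As a rough sketch of such a composite vector, the following combines fractional amino acid composition with sequence length and pI. It is not the cited authors' implementation: the function name property_vector is hypothetical, the ordering of the components is an assumption, and the pI is taken as an externally supplied value rather than being computed.

```python
# The 20 principal amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def property_vector(sequence, isoelectric_point):
    """Return a 22-element list: 20 composition fractions, length, pI.

    Assumption: the pI is supplied by the caller (it is not derived
    here), and composition is expressed as a fraction of the length.
    """
    n = len(sequence)
    composition = [
        (sequence.count(residue) / n) if n else 0.0
        for residue in AMINO_ACIDS
    ]
    return composition + [float(n), float(isoelectric_point)]


# The pI of 10.5 is a placeholder value chosen only for illustration.
v = property_vector("MRKLAGQPS", 10.5)
```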
While the goal in creating vectors in the above methods was to create a surrogate for functional information in the proteins, these methods do not provide sufficient discrimination to represent the subtle differences between most genomic sequences.
A different approach for mathematical representation of sequences for multivariate analysis is to use an ordination method. See Higgins D. G., Sequence Ordinations: a Multivariate Analysis Approach to Analyzing Large Sequence Data Sets, 8 Comput. Appl. Biosci. 15-22 (1992). Such a method uses the square root of the percentage difference between two sequences as a Euclidean distance. Then, each protein is represented within a distance matrix derived from all comparisons. The usefulness of percentage differences as a distance measure, however, is limited to closely related sequences. U.S. Pat. No. 5,930,784 to Hendrickson, issued Jul. 27, 1999, provides an example of using geometric distances among all items in a data set for data mining.
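The distance measure used in the ordination approach can be illustrated with a minimal sketch. It assumes the sequences are already aligned and of equal length, and that the percentage difference is the percentage of mismatched positions; the function names are hypothetical, and the second and third example strings are invented variants of SEQ ID NO:1 used only for illustration.

```python
import math


def percent_difference(a, b):
    """Percentage of positions at which two aligned sequences differ.

    Assumption: both strings are pre-aligned and of equal length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    if not a:
        return 0.0
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return 100.0 * mismatches / len(a)


def distance_matrix(sequences):
    """Symmetric matrix of sqrt(percentage difference) for all pairs."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(percent_difference(sequences[i], sequences[j]))
            matrix[i][j] = matrix[j][i] = d
    return matrix


# First string is SEQ ID NO:1; the other two are invented variants.
seqs = ["GATTCTAGGA", "GATTCTAGGT", "GATACTAGGT"]
matrix = distance_matrix(seqs)
```

Because every pair of sequences must be compared, the matrix grows with the square of the number of sequences.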
These methods are quite limited, however, when comparing proteins with limited similarity or when analyzing a large number of proteins simultaneously.