The present invention generally relates to data comparison methodologies and, more particularly, to methods and apparatus for performing an alignment free and rapid determination of differences between a test data set, e.g., a probe protein sequence, and known data sets, e.g., a given protein sequence data base.
It is known that there are various deficiencies in existing methods associated with certain applications in bioinformatics and computational genomics. These deficiencies are of increasing concern as the level of genomic data mounts. Of particular concern are existing algorithms designed to detect similar protein sequences in a known database given a test sequence. For example, many of such existing algorithms implicitly or explicitly require a sequence alignment operation despite the fact that it is desirable to avoid such an operation. Further, none of the existing algorithms are particularly well suited for efficient utilization on a relational data base. Still further, current protein sequence search and sequence-pattern recognition methods depend on very sophisticated mathematical methodologies.
Examples of such existing methodologies include Smith-Waterman (M. S. Waterman, xe2x80x9cIntroduction to Computational Biology,xe2x80x9d Chapman and Hall, London (Smith-Waterman), 1988), BLAST (S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, J. Mol. Biol., 215, 403-410 (BLAST), 1990), FASTA (W. R. Pearson and D. J. Lipman, Proc. Nat. Acad. Sci.uS, 85, 2444-2448 (FASTA), 1988). While the above methodologies are approaches inherently based on, and aiding, protein sequence alignment in order to abstract common features, more recent approaches such as TERESIAS and SPLASH eschew the need for alignment by directly abstracting common patterns, see, e.g., I. Rigoustos, A. Floratos, and C. Ouzounis, IBM RC20803(92166), Apr. 26, 1976; A. Califano and I. Rigoustos, In Proc. Symp. on Intelligent Systems for Mol. Biol., Washington; G. Stolovitsky and A. Califano, xe2x80x9cDiscrete Applied Mathematics Series,xe2x80x9d ed. P. Penver; and I. Rigoustos and A. Floratos, In Proc. 2nd Annual ACM Intnl. Conf. of Comp. Mol. Biol., 1993. By assembling directories of patterns in known protein sequences and pointers to their source proteins, the latter three approaches can also be used to sensitively identify protein sequences which are related. However, these latter approaches require the derivation, continued updating and use of large dictionaries of patterns or xe2x80x9cmotifs.xe2x80x9d Also, none of the approaches are especially well suited to relational databases.
The present invention provides methods and apparatus for performing an alignment free and rapid determination of differences between a test data set and known data sets. In a broad aspect of the invention, a method for generating data characterizing an item described by an ordered string of characters, comprises the steps of: (i) for a set of separation metrics each representing a unique number of positions of separation between arbitrary characters in a character group in the ordered string of characters, associating first with each separation metric; generating a set of character groups, wherein each character group comprises at least two characters contained within the ordered string of characters; and (ii) for at least one given character group in the set of character groups, for each given separation metric in the set of separation metrics, generating second data representing number of occurrences that the given character group satisfies the given separation metric; generating third data associated with the given character group, wherein the third data is based upon the second data and the first data; and storing the third data in memory for subsequent use.
Preferably, said item is a protein sequence, and each character in the ordered string of characters represents at least one amino acid. Further, each character group may comprise a character pair contained within the ordered string of characters, and wherein each separation metric represents a unique number of positions of separation between arbitrary characters in a character pair in the ordered string of characters. The third data may be stored in an entry associated with the given character pair, wherein the entry is part of table of entries stored in memory, wherein each entry corresponds to a unique character pair. The first data may comprise a set of unique prime numbers corresponding to set of separation metrics. The set of unique prime numbers may be an ordered set beginning with the number 2. The step of generating the third data associated with the given character pair may be generated by: for each given separation metric in the set of separation metrics, transforming the second data using a function based upon the unique prime number associated with the given separation metric; and adding the transformed first data to a running sum associated with the given character pair. The transforming step may comprise multiplying the second data by the natural logarithm of the unique prime number associated with the given separation metric.
Further, the method may provide that the table entries for a given item be compared with the table entries for another item to provide a measure of similarity between the two items. The comparison between table entries for the two items is preferably based upon the numerical difference between corresponding table entries. Still further, the measure of similarity between the two items is preferably computed by: summing the absolute value of the numerical difference between corresponding table entries, and normalizing the resultant sum.
In a specific embodiment in the context of genetic information processing, the present invention provides a methodology of representing protein sequences or parts thereof by a xe2x80x9cfingerprintxe2x80x9d or xe2x80x9crecognition silhouettexe2x80x9d which identifies a protein as belonging to a particular class. To achieve this, the present invention enables rapid estimation of degrees of relatedness in a form of xe2x80x9cpattern contentxe2x80x9d sense, which also has more specific routine applications. The particular significance is that a data base searched can consist of fingerprints representing whole sequences or parts of sequences (including segments of sequences sometimes called xe2x80x9cblocks,xe2x80x9d domainsxe2x80x9d or xe2x80x9csub-domains,xe2x80x9d and such fingerprints may serve as an alternative to storage of specific sequence, or can represent whole families, or recreate parts of proteins such as domains or even relatively small fragments and consensus sequences.
To do this, pattern matching and its application in genomics to searching for homologousprotein sequences, is expressed in some of the concepts of prime number theory. In accordance with the invention, patterns can be seen as recurrent prime factors which one may deem irreducible in string data, or relatively prime in the context of the problem addressed, and their isolation from the embedding string data corresponds to factorization. Though this may be simple analogy rather than a xe2x80x9cdeep relationship,xe2x80x9d comparison leads to some useful preliminary screening tools for showing the relationships between two sequences of symbols.
To do this, patterns in a special representational form are identified. This special representational form is referred to as a xe2x80x9cfinger matrix.xe2x80x9d A simple mapping by pattern data to prime numbers is defined, and is used, in turn, to encode multiple symbol-separation data into the simple scalar elements of a 20xc3x9720 residue (amino acid type) array, in one embodiment, this matrix being characteristic of individual protein sequences (assuming a sliding window of 10 residues).
Generally, a homology refers to a similarity, likeness, or relation between two or more sequences or strings. However, as will be explained, the invention provides homology detection through a methodology of identifying differences between a given sample and a known data base. In a specific application, the method is employed to measure the differences between sequences, rather than looking for common features, and thus is very different from other methods, especially TEIRESIAS and SPLASH, and with a different purpose. This is because the invention was also in part developed to obtain estimates of the effort required (or xe2x80x9cscope of problemxe2x80x9d) in modeling a protein from a homologue of known conformation. To that end, the difference between the finger matrix (as the pattern representational form) is measured, such that the points of sequence difference are counted. While that aspect of emphasizing differences must be born in mind, implying a non-classical kind of homology metric, the matrices have utility as a finger matrix or xe2x80x9cfingerprintxe2x80x9d characteristic of specified protein sequences, and the difference matrix for two specified sequences is a measure of percent homology by classical definition. This relation with classical measure holds when homology is distributed throughout the length though not necessarily between segments in the same order, and it is not affected by insertions or deletions with one sequence with respect to the other. Importantly, therefore, alignment is not required. The finger matrices optionally can be calculated in advance for a protein sequence data bank, allowing very high speed matching and screening prior to more diligent analyses. The nature of the finger matrix also lends itself readily to treatment of sets, including fuzzy sets, of amino acid types, and definitions of sets can be introduced at the time of comparing or subtracting finger matrices.
It is to be appreciated that the methodologies of the invention may be employed in various other applications. For example, the present invention finds application in determining relationships associated with deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). More generally, the invention may be used in any situation involving a series of text or symbols, or data which may be converted to text or symbols, in order to look for relationships associated therewith. By way of example only, data may be processed in accordance with the present invention which represents peaks in graphs, features such as mountain ranges, and fluctuations in the stock exchange. Given the inventive teachings provided herein, one of ordinary skill in the art will realize other and varied applications of the invention.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.