A. Field of the Invention
The present invention relates generally to storage, indexing, and retrieval of database sequences, and more particularly to a method and system for generating and searching a tree-structured index of window vectors that represent database sequences.
B. Background of the Invention
Sequence-similarity finding programs identify sequences in DNA and protein databases that are similar to a query sequence. Because of the recent explosion in the amount of DNA sequence information available in public and private databases as a result of the human genome project and other large-scale DNA sequencing efforts, such sequence-similarity finding programs have become increasingly important in modern biology.
Generally, there are two classes of sequence-similarity searching programs: global comparison methods, e.g., the Needleman and Wunsch method, and local comparison methods, e.g., the FASTA method (Pearson and Lipman) and the BLAST method (Altschul). Global comparison methods have a high degree of accuracy but are extremely slow. Local comparison methods such as FASTA and BLAST identify candidate similar sequences based on shared k-tuples and therefore are faster than global methods. However, local comparison methods are less accurate, i.e., they may return sequences that are not actually similar to the query sequence. Moreover, with both the global and local comparison methods, the computational complexity usually increases linearly with the number of sequences to be searched. This is due in part to the fact that most prior art methods search all, or at least a very large part, of the sequence database. To improve the searching time, several prior art methods have organized sequence databases into clusters of sequences, including tree-structured indexes. However, many of these methods use conventional sequence alignment methods such as BLAST and FASTA to determine pairwise distances between the sequences. Thus, such cluster and tree-structured methods are also limited by the speed of the alignment methods described above.
Accordingly, what is needed is a system and method for organizing and searching database sequences that is fast and efficient, and at the same time provides a high degree of accuracy, that is, one that identifies sequences similar to a query sequence.
The present invention overcomes the limitations of conventional sequence-similarity-searching programs by using window vectors that represent database sequences in a sequence storage and retrieval system. The window vectors associated with the database sequences are organized into a tree-structured index for faster and more efficient searching of the database sequences. A query sequence is used to search the tree-structured index for database sequences that are similar to the query sequence.
In one embodiment, each database sequence is partitioned into a plurality of overlapping windows or fragments of fixed length. Each database sequence window has a fixed length W and the degree of offset between windows is determined by a parameter Δ. Each database sequence window comprises a subsequence of elements from the database sequence beginning at position j*Δ from the start of the database sequence and ending at position j*Δ+W, where j=0, 1, 2, . . . (L−W)/Δ, and L is the length of the database sequence. In other words, each database sequence window has a fixed length W and is advanced down the length of the original database sequence every Δ elements.
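The windowing described above can be illustrated with a minimal Python sketch. The function name `partition_windows` and its parameter names are hypothetical and are not taken from this disclosure. The sketch assumes zero-based positions, assumes that the last window index is rounded down when the sequence length minus the window length is not a multiple of the offset, and assumes that a sequence shorter than one window yields no windows.

```python
def partition_windows(sequence, window_len, offset):
    """Split a sequence into overlapping fixed-length windows.

    Window j starts at position j * offset and holds window_len elements.
    Assumption: a sequence shorter than window_len produces no windows.
    """
    if window_len <= 0 or offset <= 0:
        raise ValueError("window_len and offset must be positive")
    last_start = len(sequence) - window_len
    if last_start < 0:
        return []
    # Start positions 0, offset, 2*offset, ... up to and including last_start.
    return [sequence[start:start + window_len]
            for start in range(0, last_start + 1, offset)]


# A sequence of length 10 with window length 4 and offset 2 gives
# windows starting at positions 0, 2, 4, and 6.
windows = partition_windows("ACGTACGTAC", 4, 2)
```

With an offset smaller than the window length, consecutive windows overlap, so a region of similarity that straddles a window boundary still falls wholly inside some neighboring window.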
For each database sequence window, a database sequence window vector is computed. A database sequence window vector represents the occurrence of each k-tuple in the database sequence window. In one embodiment, the database sequence refers to a DNA or protein sequence, and the occurrence of each k-tuple in the DNA or protein sequence window is represented by a vector of length 4^k. Each position in the vector represents a unique k-length sequence (i.e., k-tuple). If a k-tuple occurs more than one time in a database sequence window, either the number of times that the k-tuple occurs may be recorded in the corresponding position in the vector, or the value "1" may be recorded in the corresponding position in the vector to indicate that the k-tuple occurs at least once in the database sequence window. If a k-tuple does not occur in a database sequence window, a zero in the corresponding position in the vector may be used to indicate that the k-tuple does not occur in the window. In one embodiment, database sequence window vectors are stored in a tree-structured index to reduce searching time.
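The mapping from a window to its window vector can be sketched as follows for a DNA window. This is an illustrative sketch only: the names `ALPHABET`, `ktuple_index`, and `window_vector` are hypothetical, the ordering of k-tuples within the vector (base-4 encoding with A, C, G, T as digits 0 to 3) is an assumption, and skipping k-tuples that contain a character other than A, C, G, or T is likewise an assumption. The vector has 4**k positions, one per possible k-tuple over the four nucleotides.

```python
ALPHABET = "ACGT"  # assumed nucleotide ordering for indexing


def ktuple_index(ktuple):
    """Map a k-tuple to its vector position by reading it as a base-4 number."""
    index = 0
    for ch in ktuple:
        index = index * len(ALPHABET) + ALPHABET.index(ch)
    return index


def window_vector(window, k, binary=False):
    """Return the k-tuple occurrence vector of a DNA window.

    binary=False records how many times each k-tuple occurs;
    binary=True records 1 if the k-tuple occurs at least once.
    """
    vector = [0] * (len(ALPHABET) ** k)
    for i in range(len(window) - k + 1):
        ktuple = window[i:i + k]
        # Assumption: k-tuples with ambiguous characters (e.g. N) are skipped.
        if any(ch not in ALPHABET for ch in ktuple):
            continue
        position = ktuple_index(ktuple)
        vector[position] = 1 if binary else vector[position] + 1
    return vector


# The window "ACGTAC" with k = 2 contains AC (twice), CG, GT, and TA.
counts = window_vector("ACGTAC", 2)
flags = window_vector("ACGTAC", 2, binary=True)
```

The count form preserves more information about repeated k-tuples, while the binary form only records presence or absence; which form is preferable depends on the distance measure used when the vectors are compared.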
Database sequences are retrieved using a query sequence. In one embodiment, a query sequence is partitioned into a plurality of windows. For each query sequence window, a query sequence window vector is computed. A query sequence window vector represents the occurrence of each k-tuple in the query sequence window. Each query sequence window vector is compared against the tree-structured index of the database sequence window vectors to locate the nearest neighbors of the query sequence, i.e. database sequences similar to the query sequence. In other words, for each query sequence window vector, the tree-structured index is traversed from a root node of the tree to a terminal node which contains the nearest neighbor for that query sequence window vector. In one embodiment, the list of sequences with at least one significant window hit is returned.
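The root-to-terminal-node traversal described above can be sketched in Python as follows. The tree construction shown here (a binary split around two pivot vectors), the squared Euclidean distance, and the distance threshold used to decide that a window hit is significant are all assumptions made for illustration; this disclosure does not prescribe them at this point. The names `sq_dist`, `build_tree`, `nearest_in_tree`, and `similar_sequences` are hypothetical. Because the descent follows a single path, the vector found is the nearest neighbor within the terminal node reached, which may differ from the nearest neighbor in the whole index.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_tree(items, leaf_size=2):
    """Build a binary tree over (vector, sequence_id) pairs.

    Assumed scheme: pick two pivots and send each vector to the closer one.
    """
    if len(items) <= leaf_size:
        return {"leaf": items}
    left_pivot = items[0][0]
    right_pivot = max(items, key=lambda it: sq_dist(it[0], left_pivot))[0]
    left, right = [], []
    for item in items:
        closer_left = sq_dist(item[0], left_pivot) <= sq_dist(item[0], right_pivot)
        (left if closer_left else right).append(item)
    if not right:  # all vectors are identical, so no split is possible
        return {"leaf": items}
    return {"pivots": (left_pivot, right_pivot),
            "children": (build_tree(left, leaf_size),
                         build_tree(right, leaf_size))}


def nearest_in_tree(tree, query):
    """Descend from the root to one terminal node and scan it for the closest vector."""
    node = tree
    while "leaf" not in node:
        left_pivot, right_pivot = node["pivots"]
        go_left = sq_dist(query, left_pivot) <= sq_dist(query, right_pivot)
        node = node["children"][0 if go_left else 1]
    if not node["leaf"]:
        return None
    return min(node["leaf"], key=lambda it: sq_dist(it[0], query))


def similar_sequences(tree, query_vectors, max_sq_dist):
    """Return ids of sequences with at least one window hit within the threshold."""
    hits = set()
    for query in query_vectors:
        best = nearest_in_tree(tree, query)
        if best is not None and sq_dist(best[0], query) <= max_sq_dist:
            hits.add(best[1])
    return sorted(hits)


index = build_tree([([0, 0], "a"), ([0, 1], "b"), ([10, 10], "c"), ([10, 11], "d")])
result = similar_sequences(index, [[9, 10], [0, 2], [5, 5]], max_sq_dist=2)
```

In this toy example the query vector [5, 5] reaches a terminal node but its closest vector lies outside the threshold, so it contributes no sequence to the returned list.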
In one embodiment, the present invention is used to generate and search a tree-structured index of window vectors that represent biological database sequences. Each database sequence may represent a DNA sequence comprising a fixed number of nucleotides. The DNA database sequence is then partitioned into a plurality of overlapping windows. Each DNA database sequence window has a fixed length W comprising a fixed number of nucleotides, and the degree of offset among windows is determined by a parameter Δ. Each DNA database sequence window is then mapped into a database sequence window vector. The DNA database sequence window vector indicates the frequency of appearance of each k-tuple in the corresponding DNA database sequence window. A tree-structured index is then generated using the DNA database sequence window vectors. To search the tree-structured index, a query sequence (e.g., a DNA sequence) is partitioned into a plurality of windows. Each query sequence window is then mapped into a query sequence window vector. Each query sequence window vector is then compared against the tree-structured index to locate the DNA database sequence window vectors that are closest to that query sequence window vector. The list of DNA database sequences that are similar to the DNA query sequence is then returned.