1. Field of the Invention
The present invention relates to pattern matching, and more specifically, to pattern matching in streams of characters. Even more specifically, the invention, in its preferred embodiment, relates to methods and systems for pattern matching in biological sequences represented as streams of amino acid or nucleic acid codes.
2. Background Art
Recent years have witnessed an increased focus on creating methodologies that can lower the cost of genomic sequencing while increasing throughput. Several methods for high throughput sequencing already are (or will soon be) commercially available through companies such as 454, Illumina, Helicos, and others.
The specific methodology notwithstanding, a typical output of a high throughput sequencing run consists of a long list of ‘reads’. Each read corresponds to a fragment of sequence from the DNA (or RNA) that is analyzed. A list of such reads can contain from a few hundred thousand to several million entries. For the sake of simplicity, in what follows, ‘read’ is taken to mean the ‘payload’ sequence, i.e., a sequence that is devoid of 5′ and 3′ linkers. Each such read can be thought of as a sequence S whose length LS can vary. The sequence S comprises letters selected from an alphabet Σ of possible letters. For example, in the case of DNA, four possibilities exist: Σ={A, C, G, T}. As part of the sequencing process, each position within the sequence S is associated with a quality measure that estimates the confidence in the letter that is being reported for that location within S: if the quality value that is associated with a given position of S falls below a threshold, then the corresponding letter is likely to represent a ‘sequencing error’. In light of such sequencing errors, one would like to determine the location within the genome at hand that gives rise to the sequence S. One way of handling this problem is to replace those positions of S that have low quality estimates with a ‘wild card’ that can match any (exactly one) letter of the allowed alphabet Σ. In the general case, it can be assumed that enough information may be available to restrict the possible candidates at an affected position: in this case, the candidates are denoted using a bracketed expression such as [ACT], which means ‘a choice of exactly one letter among A, C and T’; similarly, [AT] means ‘either an A or a T’, etc.
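By way of illustration only, the replacement step just described can be sketched as follows. The function name mask_low_quality, the numeric quality scale, the threshold value, and the candidates mapping are all assumptions introduced for this sketch; they are not taken from any particular sequencing platform or from the invention itself.

```python
def mask_low_quality(read, qualities, threshold, candidates=None):
    """Replace low-confidence letters of a read with '.' or a bracketed set.

    read       -- the payload sequence S, e.g. "ACGT"
    qualities  -- one numeric quality value per position of S (assumed scale)
    threshold  -- positions whose quality falls below this value are masked
    candidates -- optional, hypothetical mapping {position: allowed letters}
                  used when the candidates at a position can be restricted
    """
    if len(read) != len(qualities):
        raise ValueError("read and qualities must have the same length")
    candidates = candidates or {}
    out = []
    for i, (letter, q) in enumerate(zip(read, qualities)):
        if q >= threshold:
            # Confident call: keep the reported letter.
            out.append(letter)
        elif i in candidates:
            # Restricted candidates: emit a bracketed expression such as [AT].
            out.append("[" + "".join(sorted(candidates[i])) + "]")
        else:
            # No extra information: emit a wild card matching any one letter.
            out.append(".")
    return "".join(out)


if __name__ == "__main__":
    # Positions 1 and 3 (0-based) fall below the assumed threshold of 20.
    print(mask_low_quality("ACGT", [30, 5, 30, 5], 20, {3: "TA"}))
```

In this sketch positions are 0-based, and a position whose quality equals the threshold is kept, consistent with masking only values that fall below it.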
For example, let S=CAAAAGACGAGGGTCTCAGGAAAAACC and let the letters at positions 2, 3, 4, 7, 16, 22 and 23 (counting from 1) be the ones corresponding to low confidence values. If each of the presumed ‘sequencing errors’ is replaced by either a wild card, denoted by ‘.’, or a bracketed expression, a new sequence S′ is obtained. One such sequence S′ could be, for example, S′=C.[AT].AG.CGAGGGTC[ACG]CAGGA.[GT]AACC. If this operation is repeated for each of the numerous sequences in the list of reads of a typical run, a list of patterns is generated with ‘rigid gaps’ (captured by the various wild cards and bracketed expressions) whose counterparts in the genome at hand need to be identified. In a realistic setting, one will be presented at this stage with a collection of tens of thousands of patterns that may or may not contain rigid gaps and which will need to be located in a target genomic sequence. If a pattern has multiple instances in the genomic sequence, all such instances will need to be identified and reported. In the general case, the patterns will have variable lengths.
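To make the matching task concrete, a minimal brute-force sketch is given below. It is a naive baseline written for illustration, not the method of the present invention: it tests a single pattern at every offset of the target and therefore does not address the speed requirement for large pattern collections. The names parse_pattern and find_all, and the default alphabet argument, are assumptions of this sketch.

```python
def parse_pattern(pattern, alphabet="ACGT"):
    """Turn a rigid pattern into a list with one set of allowed letters per position."""
    positions = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == ".":
            # Wild card: any single letter of the alphabet.
            positions.append(frozenset(alphabet))
            i += 1
        elif ch == "[":
            # Bracketed expression: exactly one of the listed letters.
            close = pattern.index("]", i)
            positions.append(frozenset(pattern[i + 1:close]))
            i = close + 1
        else:
            # Ordinary letter: matches only itself.
            positions.append(frozenset(ch))
            i += 1
    return positions


def find_all(pattern, target, alphabet="ACGT"):
    """Return the 0-based start offsets of every instance of pattern in target."""
    positions = parse_pattern(pattern, alphabet)
    m = len(positions)
    hits = []
    for start in range(len(target) - m + 1):
        # The gaps are rigid, so every instance spans exactly m letters.
        if all(target[start + k] in positions[k] for k in range(m)):
            hits.append(start)
    return hits


if __name__ == "__main__":
    s_prime = "C.[AT].AG.CGAGGGTC[ACG]CAGGA.[GT]AACC"
    print(len(parse_pattern(s_prime)))    # number of positions covered by S'
    print(find_all("A.[CG]", "AACATGAAG"))  # overlapping instances are all reported
```

Because each wild card or bracketed expression consumes exactly one letter, a pattern always covers a fixed number of target positions, which is what distinguishes rigid gaps from variable-length gaps. Note also that the example pattern S′ does not match S itself, since the bracketed candidates at positions 16 and 23 exclude the letters originally reported there.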
Even though a specific context is used to introduce it, the problem of quickly locating in a target database all instances of a potentially large collection of variable-length rigid patterns containing wild cards and bracketed expressions arises in many settings. The present invention provides a method for solving this problem.