Recognizing patterns within a set of data is important in many fields, including speech recognition, image processing, and seismic data analysis. Some image processors collect image data and then pre-process the data to prepare it for correlation with reference data. Other systems, such as speech recognition systems, operate in real time, comparing the input data to reference data as it arrives in order to recognize patterns. Once the patterns are “recognized,” or matched to a reference, the system may output the reference. For example, a speech recognition system may output text corresponding to input speech patterns. Other systems, such as biological systems, may use similar techniques to determine sequences in molecular strings such as DNA.
In some systems there is a need to find reference patterns (RPs) that are imbedded in a continuous data stream. In non-aligned data streams, occurrences of the RP may be missed if only a single byte-by-byte comparison is implemented. This situation arises when the input stream, or the input pattern (IP) being matched, contains repeated or nested repeating substring patterns. In a single-pointer comparison, an RP containing the desired sequence is loaded into storage where each element of the sequence has a unique address. An address register is loaded with the address of the first element of the RP that is to be compared with the first element of the IP. This address register is called a “pointer.” In the general case, a pointer may be loaded with an address that may be either incremented (increased) or decremented (decreased). The value of the element pointed to by the pointer is retrieved and compared with input elements that are clocked or loaded into a comparator.
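The storage, pointer, and comparator described above can be modeled in a few lines of Python. This is only an illustrative sketch: the names load_reference_pattern and PatternPointer, and the base address 0x100, are assumptions made for the example and do not come from the apparatus described here.

```python
def load_reference_pattern(pattern, base_address=0):
    """Store each element of the RP at its own unique address (hypothetical layout)."""
    return {base_address + offset: element for offset, element in enumerate(pattern)}


class PatternPointer:
    """Hypothetical model of the address register ("pointer") into RP storage."""

    def __init__(self, storage, first_address):
        self.storage = storage        # maps address -> RP element
        self.address = first_address  # address currently held in the register

    def increment(self):
        """Move the pointer to the next RP element."""
        self.address += 1

    def decrement(self):
        """Move the pointer to the previous RP element."""
        self.address -= 1

    def compare(self, input_element):
        """Comparator: True when the pointed-to RP element equals the input element."""
        return self.storage[self.address] == input_element


# Load the RP "ABCD" and point at its first element.
storage = load_reference_pattern("ABCD", base_address=0x100)
pointer = PatternPointer(storage, first_address=0x100)
```

Each call to compare stands in for one input element being clocked into the comparator. Whether the pointer is then incremented, decremented, or reloaded is left to the caller.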
FIG. 1 illustrates an IP 101 and an RP 102 where letters are used to illustrate data. The problem is to determine if the sequence “ABCD”, found in the RP 102, appears in the sequence “ABCABCDAC” of IP 101. One technique for doing this is to have a pointer (PTR) 108 that cycles through the RP 102 on a byte-by-byte basis. This is shown in the frames 103–107. Sequence 109 represents clock cycles relative to reading IP 101. At clock cycle 1, the input is an “A” and PTR 108 points to the first entry “A” in the sequence “ABCD” of the RP 102 for comparison. In this case, there is a match, so PTR 108 moves to the second entry (B) in the RP 102 to determine if it matches the letter “B” in the IP 101 read at clock cycle 2. In this case, there is again a match. PTR 108 moves to the third entry (C) in the RP 102 to determine if it matches the letter “C” in the IP 101 read at clock cycle 3. Through three clock cycles, the sequence ABC in RP 102 matches the sequence ABC in IP 101. PTR 108 is incremented and moves to the fourth and last entry “D” in the RP 102, which is compared to the letter “A” read in the IP 101 at clock cycle 4. At clock cycle 4, the entry “D” of the RP 102 does not match the “A” of the IP 101. Since there is no match, PTR 108 returns to the first entry (A) of RP 102 to repeat the process of trying to find the entire RP “ABCD” in the IP. At clock cycle 5, the input is a “B” and this does not match the first entry “A” of the RP 102 accessed by PTR 108. The sequence “ABCD” in IP 101, which starts with the “A” read at clock cycle 4, has been missed because that “A” was compared with the entry “D” rather than with the first entry of the RP 102. Using a single PTR 108 requires the pointer to be pointing at the first entry in the RP when the same sequence is starting in the IP 101. Because of this, there is no guarantee that all imbedded patterns in an input data stream that match an RP will be found.
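The single-pointer behavior traced above can be reproduced with a short Python simulation. This is a minimal sketch under stated assumptions: the function name single_pointer_scan and the trace format are illustrative, the pointer is modeled as a list index rather than a hardware address register, and one input element is consumed per clock cycle with no re-examination after a mismatch, as in the description of FIG. 1.

```python
def single_pointer_scan(reference, stream):
    """Scan `stream` with one pointer into `reference`.

    Returns (matches, trace), where `matches` lists the 1-based clock cycles at
    which a complete match started, and `trace` records, per clock cycle,
    (cycle, input element, RP element under the pointer, matched?).
    """
    ptr = 0        # index of the RP entry currently under the pointer
    matches = []
    trace = []
    for cycle, element in enumerate(stream, start=1):
        expected = reference[ptr]
        matched = element == expected
        trace.append((cycle, element, expected, matched))
        if matched:
            ptr += 1
            if ptr == len(reference):
                # Whole RP seen; record the cycle where it began and start over.
                matches.append(cycle - len(reference) + 1)
                ptr = 0
        else:
            # Return to the first RP entry; the current input is not compared again.
            ptr = 0
    return matches, trace


matches, trace = single_pointer_scan("ABCD", "ABCABCDAC")
for cycle, element, expected, matched in trace:
    print(cycle, element, expected, "match" if matched else "no match")
print("matches found:", matches)
```

With the RP “ABCD” and the IP “ABCABCDAC”, the scan reports no matches even though the RP occurs in the IP beginning at clock cycle 4: the “A” at that cycle is consumed by the failed comparison with “D”, so the pointer is never at the first entry when the pattern begins.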
There is, therefore, a need for a method and an apparatus to ensure that imbedded patterns in an input data stream are not missed because the position of the pointer to the RP does not coincide with the start of the desired pattern in the input pattern.