An index is a list of key elements and associated information that points to a location containing more comprehensive data. A book index, for example, contains word entries and associated page numbers pointing to the detailed information in the book. In the electronic realm, indexes are used to locate particular files or data entries in a data storage system. The amount of indexing memory above and beyond that required to store the original text or data will be referred to as the "memory overhead." The amount of time required to find a particular sequence in the data or text will be referred to as the "time overhead."
Various techniques exist to reduce the memory overhead. An obvious approach is to store no index at all. The text is simply scanned serially for any pattern desired. This technique and related methods require access time which grows linearly with the size of the text. As the text size doubles, the typical time required to find a pattern likewise grows twofold. Indexing schemes, such as a conventional book index, provide much faster access but with memory overhead which grows linearly or faster with the size of the text. If the size of the text doubles, then the index grows at least twofold.
Accordingly, there is a need for an indexing scheme which provides both smaller than linear time overhead and smaller than linear memory overhead.
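The trade-off described above can be illustrated with a minimal Python sketch. It is not taken from the original text: the function names `serial_scan` and `build_word_index` are hypothetical, and the word-offset table is only one simple stand-in for a conventional index. The scan stores nothing extra but may visit every position in the text, while the index answers word lookups quickly at the cost of one stored offset per word occurrence.

```python
from collections import defaultdict


def serial_scan(text, pattern):
    """Return the first offset of pattern in text, or -1 if absent.

    No index is stored (zero memory overhead), but in the worst case
    every starting position in the text is examined.
    """
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        if text[start:start + m] == pattern:
            return start
    return -1


def build_word_index(text):
    """Map each space-separated word to the offsets where it starts.

    Lookups become dictionary accesses, but the table holds one entry
    per word occurrence, so it grows with the size of the text.
    """
    index = defaultdict(list)
    offset = 0
    for word in text.split(" "):
        if word:
            index[word].append(offset)
        offset += len(word) + 1  # +1 accounts for the separating space
    return dict(index)
```

Doubling the text roughly doubles both the number of positions `serial_scan` may have to visit and the number of offsets `build_word_index` has to store.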
One well-known and useful data structure is a sorted list of all data records. This kind of data structure has applications ranging from data storage systems to pattern matching algorithms. For unstructured text, this corresponds to a sorted list of all suffixes of the text stream or of all rotations of the text. A "rotation" of a sequence of characters is a new sequence created by repeatedly taking the first character and placing it at the end of the previous sequence.
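As a rough illustration of these definitions, the following Python sketch builds the rotations and suffixes of a short sequence and sorts them. The helper names and the sample string "abc#" are hypothetical, and the sort uses Python's default code-point ordering, which is an assumption; the ordering intended in the original may treat the end marker differently.

```python
def rotations(seq):
    """Return all len(seq) rotations of seq.

    Rotation i is what remains after moving the first character to the
    end i times, which equals seq[i:] + seq[:i].
    """
    return [seq[i:] + seq[:i] for i in range(len(seq))]


def suffixes(seq):
    """Return every non-empty suffix of seq, longest first."""
    return [seq[i:] for i in range(len(seq))]


def sorted_rotations(seq):
    """Sorted list of all rotations (assumes code-point ordering)."""
    return sorted(rotations(seq))


def sorted_suffixes(seq):
    """Sorted list of all suffixes (assumes code-point ordering)."""
    return sorted(suffixes(seq))
```

Note that this naive construction stores every rotation in full, so its memory use grows quadratically with the sequence length; it is meant only to make the definitions concrete.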
For example, consider the ten-character sequence "test_text#". Rotating the original sequence all possible times yields the following ten rotations: