Computer databases for storing full text indexes have become common for text storage and retrieval. These databases enable a user to search the index for particular data strings within the stored text. Typically, the index data is stored in a data structure separate from the text data of the database and, therefore, constitutes memory overhead. The memory overhead is justified since the index enables the user to quickly search the text data for the desired data string. However, it is desirable to minimize the memory overhead required for the index.
Many prior art methods provide an index by identifying each data string and associating with the data string an identifier of each location within the database at which the data string appears. These indexes are cumbersome and incur a large memory overhead. Other prior art methods apply data compression techniques to such indexes to reduce the memory overhead required. Nonetheless, these methods require memory for the index equal to between 50% and 100% of the memory required for the database, i.e., 50%-100% overhead.
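By way of illustration only, the location-list style of index described above can be sketched as follows. The sketch assumes that a "data string" is a whitespace-delimited word and that a "location" is a (document number, word offset) pair; the function name `build_positional_index` and the sample documents are hypothetical and are not taken from any particular prior art system.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each word to the list of (doc_id, word_offset) locations where it appears."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # Assumption: data strings are lower-cased, whitespace-delimited words.
        for offset, word in enumerate(text.lower().split()):
            index[word].append((doc_id, offset))
    return dict(index)


# Hypothetical two-document database.
documents = ["the cat sat", "the dog sat down"]
index = build_positional_index(documents)

print(index["sat"])  # [(0, 2), (1, 2)]

# One location entry is stored per word occurrence, so the index
# grows in step with the text it describes.
total_entries = sum(len(locations) for locations in index.values())
print(total_entries)  # 7
```

Because such a structure stores one location entry for every occurrence of every data string, its size tends to scale with the size of the text itself, which is consistent with the substantial overhead noted above.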
Other methods for providing a text index assign codes to certain data sequences so that each coded data sequence can be indexed as discussed above. Although this approach works well for databases that exhibit strong patterns in data sequences, it is not acceptable for databases having relatively few such patterns. Therefore, it is desirable to provide a method and apparatus for storing full text indexes wherein the memory overhead required for the index is less than 20% of the storage required for the database.