Table data structures, and particularly tables in database management systems, are used to store large amounts of data. The demand for efficient data storage for a variety of data-intensive applications continues to grow. However, for many such applications, table data structures have been assumed to be an inappropriate mechanism for storing much of the data those applications generate or obtain. Furthermore, there appears to be little appreciation that the paradigms associated with table data structures would be very useful in those applications.
For instance, the web pages downloaded during a crawl of the World Wide Web (WWW) are typically stored as a set of files, not as entries in a table or a database table. Similarly, RSS feeds (which can be considered documents) downloaded by RSS feed aggregators and other computers are typically stored as files. Storing web content, RSS feed content, and the like as individual files is traditional, and the mechanisms for managing and accessing such files are well established. The present invention, on the other hand, provides a data model and a set of data management processes and mechanisms for storing large amounts of data in one or more table data structures, thereby providing an alternative to the traditional model of storing individual content items in individual files.
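For purposes of illustration only, the following sketch shows one way such a table-based data model might be organized: each cell, identified by a row key and a column, holds a list of timestamped versions of a content item, so that successive crawls of the same page accumulate as versions rather than as separate files. The names used here (ContentTable, put, get_versions) and the reversed-domain row key are hypothetical and are not drawn from any particular embodiment.

```python
import time
from collections import defaultdict


class ContentTable:
    """Toy table: each (row key, column) cell holds timestamped versions.

    Illustrative only; not the data model of any particular system.
    """

    def __init__(self):
        # (row_key, column) -> list of (timestamp, value) pairs, newest first.
        self._cells = defaultdict(list)

    def put(self, row_key, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        versions = self._cells[(row_key, column)]
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)

    def get_versions(self, row_key, column, max_versions=None):
        versions = self._cells[(row_key, column)]
        return versions if max_versions is None else versions[:max_versions]


# Successive crawls of the same page become versions in one cell,
# rather than separate files on disk.
table = ContentTable()
table.put("com.example.www/index.html", "contents", b"<html>v1</html>", timestamp=1)
table.put("com.example.www/index.html", "contents", b"<html>v2</html>", timestamp=2)
print(table.get_versions("com.example.www/index.html", "contents", max_versions=1))
```

Keeping versions together under a single key is what later makes it possible to compress related content as a group, a point taken up next.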
A challenge associated with storing massive amounts of data in table data structures is achieving efficient compression, and very high-speed decompression, of compressed data that is infrequently, if ever, updated. Numerous data compression methods for individual files are well established. However, the stored data may include data items having numerous versions, some of which may be identical or highly similar to each other, as well as distinct but related data items that share significant amounts of overlapping content. In such cases there is a need for new data compression and decompression processes that take advantage of the large amount of redundancy in the data, that provide good compression despite the fact that some of the data items may be large (e.g., multiple megabytes in length) and thus have common content that is separated by large “distances” in memory, and that provide extremely fast decompression, so that access to the compressed data is not significantly hampered by its storage in compressed form.
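As an illustration of how cross-version redundancy can be exploited, the following sketch uses zlib's preset-dictionary facility to compress one version of a document relative to an earlier version. This is a standard, well-known technique offered for context only; it is not the compression process of any particular embodiment, and zlib's 32 KiB match window in particular cannot by itself exploit common content separated by the large distances discussed above. The sample content is hypothetical.

```python
import zlib

# Two versions of the same (hypothetical) document share most of their content.
v1 = (b"<html><body>"
      + b"The quick brown fox jumps over the lazy dog. " * 200
      + b"</body></html>")
v2 = v1.replace(b"lazy dog", b"sleepy cat", 1)  # one small edit

# Compressing each version independently cannot exploit cross-version overlap.
independent = len(zlib.compress(v1)) + len(zlib.compress(v2))

# Supplying the earlier version as a preset dictionary lets the compressor
# encode the later version largely as references into the shared content.
comp = zlib.compressobj(zdict=v1)
delta = comp.compress(v2) + comp.flush()
shared = len(zlib.compress(v1)) + len(delta)

print(f"independent: {independent} bytes; with shared dictionary: {shared} bytes")

# Decompression must supply the same dictionary it was compressed against.
decomp = zlib.decompressobj(zdict=v1)
assert decomp.decompress(delta) == v2
```

The delta stream for the second version is typically much smaller than an independently compressed copy, at the cost of requiring the reference version at decompression time; schemes of this general kind must therefore also manage which data items serve as references for which others.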