The invention relates to computer systems, and more particularly to a method and mechanism for storing and retrieving data in a computer system.
As many modern businesses and organizations continually increase their need to access greater amounts of information, the quantity of data that must be stored in databases and computer systems likewise increases. A significant portion of the expense for storing a large quantity of information is related to the costs to purchase and maintain data storage systems. Given this expense, approaches have been suggested to reduce the amount of space that is needed to store a given quantity of data.
Data compression is a technique suggested in many modern computer systems to reduce the storage costs for data. A common approach is to compress data at the granularity of the file. For example, traditional compression utilities such as the Unix-based gzip or DOS-based zip compress an entire file into a more compact version of that file. A drawback of this approach is that if an entire file is compressed, all or a large part of the file must be decompressed before any part of it can be used, even if a user actually needs only a small part of the file. This problem is particularly acute when compressing files in database systems, in which a single database file may contain large quantities of database records, but only a small portion of the individual records may be needed at any moment in time. Thus, the granularity of compression/decompression may not realistically match the granularity at which data is used and accessed in the system.
Moreover, compression granularities for other traditional compression algorithms could result in storage inefficiencies. For example, certain page-at-a-time compression approaches could lead to compressed pages of different sizes that are inefficiently mapped onto physical pages. Furthermore, many traditional compression techniques do not even guarantee that data size will not increase after compression.
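The possibility that compression enlarges data can be illustrated with a short sketch. It uses Python's standard zlib module purely as a stand-in for a traditional general-purpose compressor; the choice of zlib, the 1000-byte input, and the fixed random seed are assumptions made for illustration and are not part of the described approach.

```python
import random
import zlib

# Illustrative assumption: pseudo-random bytes stand in for data with no
# redundancy, which a general-purpose compressor cannot shrink.
rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(1000))

compressed = zlib.compress(data)

# The stream format adds header and checksum bytes, so the "compressed"
# form of incompressible input ends up larger than the input itself.
grew = len(compressed) > len(data)
print(len(data), len(compressed), grew)
```

A storage layer that compresses every page unconditionally would therefore need extra space for such inputs unless it records, per page, whether compression was actually applied.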
In addition, the very acts of compressing and decompressing data could incur excessive overhead. The overhead typically depends on the specific compression algorithm being used as well as the quantity of data being compressed or decompressed. This overhead could contribute significant latency when storing or retrieving information in a computer system. Given the latency problem as well as less-than-certain compression gains, the trade-off between time and space for compression is not always attractive in a database or other type of computing system.
Embodiments of the present invention provide a method and mechanism for implementing storage and retrieval of data in a computing system. According to an embodiment of the invention, data compression is performed on stored data by reducing or eliminating duplicate values in a database block or other storage unit. In this embodiment, duplicated values are eliminated within the set of data that is to be stored within a particular data storage unit. Rather than writing the duplicated data values to the data storage unit, the on-disk data is configured to reference a single copy of each duplicated data value through a symbol table. Because only duplicated data values are removed, and data values are not individually subject to potentially useless data compression algorithms, the invention can be configured, at the expense of a single structure or bit in a block header, to ensure that the on-disk data size will not exceed the original data size. Moreover, since a reference to a symbol table is all that is required to access duplicated data, data access is not significantly impaired in this approach. An embodiment also provides recursive referencing of values in the symbol table. Column reordering may be performed in an embodiment to further improve compression efficiency. The column reordering may also be performed to allow efficient removal of trailing NULL values from on-disk storage. Further details of aspects, objects, and advantages of the invention are described below in the detailed description, drawings, and claims.
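The duplicate-elimination idea summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the disclosed on-disk format: the names Ref, REF_SIZE, size_of, compress_block, and decompress_block are hypothetical, a block is modeled as a list of row tuples of strings, sizes are estimated from string lengths, and a "compressed" flag stands in for the header bit. Recursive referencing and column reordering are not shown.

```python
from collections import Counter
from dataclasses import dataclass

REF_SIZE = 2  # assumed size, in bytes, of one symbol table reference


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an on-disk pointer into the symbol table."""
    index: int


def size_of(value):
    """Rough size estimate: references cost REF_SIZE, values their text length."""
    return REF_SIZE if isinstance(value, Ref) else len(str(value))


def compress_block(rows):
    """Store each duplicated value once and replace its uses with references."""
    rows = [tuple(row) for row in rows]
    counts = Counter(value for row in rows for value in row)
    # Only values that occur more than once go into the symbol table.
    symbols = [value for value, n in counts.items() if n > 1]
    slot = {value: i for i, value in enumerate(symbols)}
    packed = [tuple(Ref(slot[v]) if v in slot else v for v in row) for row in rows]

    original = sum(size_of(v) for row in rows for v in row)
    compressed = sum(size_of(v) for v in symbols)
    compressed += sum(size_of(v) for row in packed for v in row)
    if compressed >= original:
        # No gain: keep the rows as they are and clear the header flag,
        # so the stored data is never larger than the original.
        return {"compressed": False, "symbols": [], "rows": rows}
    return {"compressed": True, "symbols": symbols, "rows": packed}


def decompress_block(block):
    """Rebuild the original rows by following symbol table references."""
    if not block["compressed"]:
        return list(block["rows"])
    symbols = block["symbols"]
    return [tuple(symbols[v.index] if isinstance(v, Ref) else v for v in row)
            for row in block["rows"]]


# Example data (made up for illustration).
example_rows = [
    ("1251", "DOE", "CALIFORNIA"),
    ("1252", "DOE", "CALIFORNIA"),
    ("1253", "SMITH", "CALIFORNIA"),
]
example_block = compress_block(example_rows)
```

In this sketch, reading a value costs at most one list lookup through the symbol table, which mirrors the point that data access is not significantly impaired. The fallback branch shows how a single flag is enough to bound the stored size by the original size.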