The invention relates to computer systems, and more particularly to a method and mechanism for implementing compression in a computer system.
Data compression is a commonly used technique in many modern computer systems. One advantage that is provided by compressing data is the reduced cost of storing data on storage media. Another advantage that is provided by compression techniques is an increase in I/O and transmission efficiency by reducing the amount of data to be sent/received between computing entities or to/from storage devices. The acts of compressing and decompressing data themselves incur overhead that is often related to the specific compression algorithm being used and the quantity of data being compressed/decompressed.
A common approach for implementing compression is to compress data at the granularity of the object or file. For example, traditional compression approaches such as the Unix-based gzip or the DOS-based zip commands compress an entire file into a more-compact version of that file. A drawback with this type of approach is that if an entire file is compressed, all or a large part of the file must be decompressed before any part of it can be used, even if only a small part of the file is actually needed by a user. This is a problem that particularly exists with respect to compressing files in database systems, in which a single database file may contain large quantities of database records, but only a small portion of the individual records may be needed at any moment in time. Thus, the granularity of compression/decompression may not realistically match the granularity at which data is desirably used and accessed in the system. Moreover, compression granularities for traditional compression algorithms could result in storage inefficiencies. For example, page-at-a-time compression approaches could lead to compressed pages of different sizes that are inefficiently mapped onto physical pages.
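The access cost of file-granularity compression can be illustrated with a minimal sketch. It assumes fixed-width records and uses Python's standard zlib module as a stand-in for a gzip-style compressor; the record layout and the helper name read_record are hypothetical and chosen only for illustration.

```python
import zlib

RECORD_SIZE = 32  # assumed fixed record width, for illustration only

# Hypothetical "database file": 10,000 fixed-width records stored back to back.
records = [f"record-{i:06d}".encode().ljust(RECORD_SIZE, b".") for i in range(10_000)]
file_bytes = b"".join(records)

# File-granularity compression: the entire file becomes one compressed stream.
compressed_file = zlib.compress(file_bytes)


def read_record(compressed, index):
    """Return (record, bytes_decompressed) for one record of the compressed file.

    The stream has no random access, so everything that precedes the
    requested record must be decompressed first.
    """
    inflater = zlib.decompressobj()
    needed = (index + 1) * RECORD_SIZE
    out = bytearray()
    data = compressed
    while len(out) < needed and data:
        out += inflater.decompress(data, needed - len(out))
        data = inflater.unconsumed_tail
    start = index * RECORD_SIZE
    return bytes(out[start:start + RECORD_SIZE]), len(out)


first, cost_first = read_record(compressed_file, 0)
last, cost_last = read_record(compressed_file, len(records) - 1)
```

In this sketch, reading the last record decompresses the whole file even though only 32 bytes are wanted, while reading the first record is cheap only because it happens to sit at the start of the stream.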
Another approach is to employ content-specific or language-specific granularities when compressing data. In a database context, this approach allows compression and decompression at the level of a tuple or of individual fields/columns of a database object. In implementation, the “language” layer of a computer system (e.g., the computing layer that processes Structured Query Language or SQL commands in a database system) can be modified to perform compression or decompression based upon the known structure or schema of the data. An advantage of this approach is that smaller granularities of data can be decompressed when accessing data, rather than requiring an entire file of data to be decompressed to access a small portion of the desired data records. However, this approach requires the compression scheme to be directly influenced by, and possibly specific to, the particular data schema used to organize the data. This can significantly affect the maintainability of that data, since the compression scheme may require updating whenever the corresponding data schema changes, e.g., when modifications are made to the type, number or order of fields in a database table. The query operators may also need to change if there is a change to the compression scheme or if the data is changed from a compressed state to an uncompressed state, or vice-versa.
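The coupling between a schema-specific compression scheme and the schema itself can be sketched as follows. The schema, the column names, and the helpers encode_row and read_field are all hypothetical; the sketch only assumes that some columns are marked for compression and that zlib stands in for the compressor.

```python
import zlib

# Hypothetical schema: (column name, compress this column?) in table order.
SCHEMA = [("id", False), ("name", False), ("notes", True)]


def encode_row(row):
    """Compress only the columns that the schema marks as compressible."""
    stored = []
    for (_column, compress), value in zip(SCHEMA, row):
        stored.append(zlib.compress(value.encode()) if compress else value)
    return stored


def read_field(stored_row, column):
    """Decompress a single field; the other fields of the row are not touched."""
    for position, (name, compress) in enumerate(SCHEMA):
        if name == column:
            value = stored_row[position]
            return zlib.decompress(value).decode() if compress else value
    raise KeyError(column)


stored = encode_row((7, "widget", "long free-text note " * 50))
notes = read_field(stored, "notes")
```

Both helpers consult SCHEMA to locate and interpret each field, so adding, removing, or reordering a column in this sketch would require the encoder, the reader, and any previously stored rows to be revisited, which is the maintainability concern described above.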
Embodiments of the present invention provide a method and mechanism for implementing compression in a computer system. In one embodiment, each granular portion of a file can be individually stored in either a compressed storage unit or in an uncompressed storage unit. The storage units can be allocated a priori or on an as-needed basis. In one embodiment, a directory structure is employed to track storage units for a file. Further details of aspects, objects, and advantages of the invention are described below in the detailed description, drawings, and claims.
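One way to picture the per-portion storage and the directory structure is the minimal sketch below. It is not the claimed implementation: the portion size, the compressed unit size, the rule that a portion goes into a compressed storage unit only when its compressed form fits, the use of zlib, and the names FileStore, write, and read_portion are all assumptions made for illustration. Storage units are allocated on an as-needed basis here.

```python
import os
import zlib

PORTION_SIZE = 8192          # assumed size of one granular portion of a file
COMPRESSED_UNIT_SIZE = 2048  # assumed capacity of a compressed storage unit


class FileStore:
    """Hypothetical store that keeps each portion compressed or uncompressed."""

    def __init__(self):
        self.units = []      # storage units, allocated as needed
        self.directory = []  # one entry per portion: (unit index, is_compressed)

    def write(self, data):
        for offset in range(0, len(data), PORTION_SIZE):
            portion = data[offset:offset + PORTION_SIZE]
            packed = zlib.compress(portion)
            # Assumed rule: use a compressed unit only if the result fits in it.
            is_compressed = len(packed) <= COMPRESSED_UNIT_SIZE
            self.units.append(packed if is_compressed else portion)
            self.directory.append((len(self.units) - 1, is_compressed))

    def read_portion(self, number):
        """Read one portion; only that portion is ever decompressed."""
        unit_index, is_compressed = self.directory[number]
        payload = self.units[unit_index]
        return zlib.decompress(payload) if is_compressed else payload


compressible = b"A" * PORTION_SIZE         # compresses well
incompressible = os.urandom(PORTION_SIZE)  # random bytes do not compress
store = FileStore()
store.write(compressible + incompressible)
```

In this sketch the directory records, for every portion, which storage unit holds it and whether that unit is compressed, so a reader can fetch any single portion without touching the rest of the file.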