1. Field of the Invention
The present invention relates to data compression in data storage systems. More particularly, the invention concerns a method, apparatus, and article of manufacture for analyzing data compression efficacy and modifying data compression in accordance with the results of such analysis, where the analysis and regulation are performed on-line.
2. Description of the Related Art
Many data storage systems achieve improved storage efficiency by employing data compression. Rather than simply storing data exactly as received from a user, data can be stored in a compressed format. Often, this compression is achieved by substituting shorter codes for lengthier data that frequently occur in the database. As a simple example, each occurrence of the address “1000 Maple Street” in a database may instead be represented in the database by “*”. The stored database is therefore considerably shorter, since each occurrence of “1000 Maple Street” is reduced to “*”. Translations between expanded data and compressed codes are stored in a compression/decompression dictionary, known to those in the art simply as a “dictionary”.
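The substitution described above can be illustrated with a minimal sketch. The sketch is offered only as an example and not as any particular known implementation; the names DICTIONARY, compress, and decompress are hypothetical, and it is assumed that the code “*” never appears in the uncompressed data.

```python
# Hypothetical dictionary: maps lengthy, frequently occurring data to a short code.
# Assumption: the code "*" never occurs in the uncompressed data.
DICTIONARY = {"1000 Maple Street": "*"}


def compress(text, dictionary):
    """Substitute each dictionary entry's short code for the lengthier data."""
    for phrase, code in dictionary.items():
        text = text.replace(phrase, code)
    return text


def decompress(text, dictionary):
    """Reverse the substitution, expanding each code back to the original data."""
    for phrase, code in dictionary.items():
        text = text.replace(code, phrase)
    return text


record = "Ship to 1000 Maple Street; bill to 1000 Maple Street."
packed = compress(record, DICTIONARY)      # "Ship to *; bill to *."
restored = decompress(packed, DICTIONARY)  # identical to the original record
```

In this sketch, the same dictionary serves both directions of translation, so the stored record can be shortened without losing any of the original data.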
In many known applications, data compression techniques are successfully applied to databases and their log records, significantly increasing the data storage efficiency of these applications. In some cases, however, implementation of known data compression techniques fails to provide the theoretically envisioned level of data storage efficiency.
This can occur for a number of reasons. For example, the dictionary may be created based upon a subset of interrogated data, which turns out to poorly represent the database as a whole. Or, in the case of a “static” dictionary, the dictionary may be accurate when created, but the nature of the data changes, causing the dictionary to become stale. This phenomenon frequently occurs with time-dependent data. The static dictionary may be created, for example, in February, when much of the underlying data includes the word “February”. When March arrives, most of the underlying data includes the word “March” rather than “February”; since the original static dictionary does not contain “March”, its compression activities are poorly guided with respect to the current data sought to be compressed.
As a result of the foregoing conditions, certain data cannot be compressed because the data is missing from the dictionary. Yet, compression is still attempted for this data, albeit unsuccessfully. And these frustrated “compression calls” still require time to perform, occupying valuable processor time, which could otherwise be spent performing other tasks.
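The stale-dictionary condition and the resulting frustrated compression calls can be illustrated with a brief sketch. It is an example only; the dictionary contents, the sample records, and the function names compress and count_frustrated_calls are hypothetical, and a call is treated here as “frustrated” when it leaves a record no shorter than before.

```python
# Hypothetical static dictionary, built in February.
STATIC_DICTIONARY = {"February": "#"}


def compress(text, dictionary):
    """Substitute each dictionary entry's short code for the lengthier data."""
    for phrase, code in dictionary.items():
        text = text.replace(phrase, code)
    return text


def count_frustrated_calls(records, dictionary):
    """Count compression calls that leave a record no shorter than before."""
    frustrated = 0
    for record in records:
        # The call is still made, and still costs processor time,
        # even when the dictionary has no entry matching the record.
        if len(compress(record, dictionary)) >= len(record):
            frustrated += 1
    return frustrated


february_records = ["Invoice dated February 3", "Payment due February 28"]
march_records = ["Invoice dated March 3", "Payment due March 28"]

feb_frustrated = count_frustrated_calls(february_records, STATIC_DICTIONARY)  # 0
mar_frustrated = count_frustrated_calls(march_records, STATIC_DICTIONARY)     # 2
```

Under these assumptions, every call against the March records performs the full dictionary scan yet yields no reduction in length.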
One option is to simply deactivate compression. Although this approach saves processor time otherwise spent on frustrated compression attempts, the input/output efficiency suffers because data is now stored in its full-length, uncompressed form. Another option is to rebuild the dictionary anew, in accordance with the current data. Rebuilding the dictionary, however, typically requires taking the data storage subsystem off-line. This is simply not an option for certain applications, where continuous data availability is crucial, such as automated teller machines, internationally accessible financial data, twenty-four hour telephone directory services, etc.
Consequently, known compression schemes are not completely adequate for some applications due to certain unsolved problems.