1. Field of the Invention
The present invention relates generally to computer information technology systems and storage systems for storing data.
2. Description of Related Art
According to recent trends, a large amount of digital data is being archived in computer storage systems, such as disk array systems, in order to comply with federal and state regulations, industry standards and practices, in addition to basic data archiving. For example, companies retain copies of email communications, data files, check images, and the like in archive storage systems. When a company has to retain and manage a massive volume of data over a long period of time, special purpose storage systems for data archiving are often used to reduce data management costs.
These storage systems have several functions that make it easier to safely retain and manage data for long periods of time. One of these functions is to keep management information (referred to hereafter as “metadata”) related to the archived data. Some metadata, such as keywords used for searching the data, is determined and set by clients of the storage system, such as through an archive application. This type of metadata is called “user” metadata. Other types of metadata are set automatically by the storage system itself. These types of metadata are called “system” metadata. For example, some types of storage systems might automatically calculate and store a hash value as part of the metadata for each data entry. The hash value is calculated by a cryptographic hash function, such as MD5, SHA1, SHA256, or the like, as is known in the art. By periodically recalculating the hash values for the stored data and comparing the newly calculated hash values with the stored hash values calculated when the data was first stored, a storage system can automatically check whether there has been an unexpected change in the stored data, such as one due to degradation of the storage media or other equipment after a long period of time. Additionally, some storage systems use the hash value, a part of the hash value, or a value derived from the hash value as an address of the archived data. In this case, the address of the data is called a content address, and these storage systems are referred to as CAS (Content Addressed Storage) systems. Related art includes U.S. Pat. No. 6,807,632, to Carpentier et al., entitled “Content Addressable Information Encapsulation, Representation, and Transfer”, the entire disclosure of which is incorporated herein by reference.
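The hash-based integrity check and content addressing described above can be illustrated with a minimal sketch. This is not taken from any particular storage system: the class name `ArchiveStore`, its methods, the in-memory dictionary, and the choice of SHA-256 with the full digest as the content address are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Optional


def compute_hash(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of data under the named hash function."""
    return hashlib.new(algorithm, data).hexdigest()


class ArchiveStore:
    """Hypothetical in-memory archive keyed by content address."""

    def __init__(self) -> None:
        self._entries: Dict[str, dict] = {}

    def put(self, data: bytes, user_metadata: Optional[dict] = None) -> str:
        # Assumption: the full hash value serves as the content address.
        address = compute_hash(data)
        self._entries[address] = {
            "data": data,
            # "User" metadata is supplied by the client.
            "user_metadata": dict(user_metadata or {}),
            # "System" metadata is set by the store itself.
            "system_metadata": {"hash": address},
        }
        return address

    def get(self, address: str) -> bytes:
        return self._entries[address]["data"]

    def verify(self, address: str) -> bool:
        # Recalculate the hash and compare it with the value recorded
        # when the data was first stored.
        entry = self._entries[address]
        return compute_hash(entry["data"]) == entry["system_metadata"]["hash"]
```

In this sketch, a periodic integrity check would amount to calling `verify` for every stored address and reporting any entry for which it returns `False`.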
However, while the above-described systems help reduce management costs, owners of large archive systems would also like to reduce hardware costs. The fundamental solution for reducing hardware costs is to reduce the total amount of data stored in the archive systems, so that the required overall storage capacity is reduced.
Data compression is one way to reduce the amount of data stored in a storage system. However, even after data is compressed, the hash value of the original data should not be removed because some applications might use that hash value as a content address, search key, or the like. Also, a hash value of the compressed data should be generated and maintained so that the storage system can use it to check the integrity of the stored data. If the storage system does not have the hash value of the compressed data, then the storage system must expand all the compressed data during each integrity check. Additionally, not every type of archived data is suitable for compression because some types of data, for example, images, audio files, and movies, are already compressed before they are written to the storage system. Furthermore, it is not always effective to compress very small files because the amount of capacity actually saved is limited when compared with the CPU cycles consumed and the increase in access latency. Accordingly, there is a need for an ability to define and specify how and which data should be compressed, and then to effectively manage the compressed data along with the non-compressed data.
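The considerations above can be sketched as a simple policy that decides which data to compress and keeps two hash values per entry. This is only an illustration under stated assumptions: the function names, the extension list, the 4096-byte size threshold, and the use of zlib and SHA-256 are hypothetical choices, not details given in this description.

```python
import hashlib
import os
import zlib

# Assumptions for illustration: file types treated as already compressed,
# and a minimum size below which compression is not attempted.
ALREADY_COMPRESSED_EXTENSIONS = {".jpg", ".png", ".mp3", ".mp4", ".zip"}
MIN_COMPRESS_SIZE = 4096  # bytes


def should_compress(name: str, size: int) -> bool:
    """Hypothetical policy: skip very small files and compressed formats."""
    extension = os.path.splitext(name)[1].lower()
    return size >= MIN_COMPRESS_SIZE and extension not in ALREADY_COMPRESSED_EXTENSIONS


def archive_entry(name: str, data: bytes) -> dict:
    """Build an entry that keeps hashes of both original and stored bytes."""
    # The hash of the original data is retained because clients may use it
    # as a content address or search key.
    original_hash = hashlib.sha256(data).hexdigest()
    compressed = should_compress(name, len(data))
    stored = zlib.compress(data) if compressed else data
    return {
        "name": name,
        "stored": stored,
        "compressed": compressed,
        "original_hash": original_hash,
        # The hash of the bytes as stored allows an integrity check
        # without expanding compressed data.
        "stored_hash": hashlib.sha256(stored).hexdigest(),
    }


def check_integrity(entry: dict) -> bool:
    """Verify the stored bytes directly against the stored hash."""
    return hashlib.sha256(entry["stored"]).hexdigest() == entry["stored_hash"]


def read_original(entry: dict) -> bytes:
    """Return the original data, expanding it only when it was compressed."""
    return zlib.decompress(entry["stored"]) if entry["compressed"] else entry["stored"]
```

Keeping `stored_hash` separate from `original_hash` is the design point of the sketch: the periodic check touches only the stored bytes, while lookups by the original content hash continue to work.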