Data compression reduces the storage space required for large files or data structures by encoding the information using fewer bits than the original data set. Data compression can be either lossy or lossless. Lossy compression reduces the number of bits, at least in part, by discarding less important or less critical information or by approximating the information. Lossless compression reduces the number of bits by identifying and reducing statistically redundant bits. Compression in either form is useful for reducing the required storage capacity of systems, the required transmission rates of network connections, and similar computational resources. Some compression processes are not well suited for compressing certain data sets. A user can manually select a compression process that is well suited for a particular data set to improve the compression level of the data set. One example application of data compression is in backup storage management systems.
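The dependence of lossless compression on statistical redundancy can be illustrated with a minimal sketch. It assumes Python's standard zlib module as a stand-in for a general lossless compression process; the helper name compression_ratio and the sample inputs are hypothetical and chosen only for illustration.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


# Highly redundant input: the same short byte pattern repeated many times.
redundant = b"backup block " * 4096

# Input with essentially no statistical redundancy (random bytes),
# standing in for a data set the process is not well suited to.
random_like = os.urandom(len(redundant))

print(f"redundant:   {compression_ratio(redundant):.3f}")
print(f"random-like: {compression_ratio(random_like):.3f}")
```

Under these assumptions, the redundant input shrinks to a small fraction of its original size, while the random input does not shrink and may grow slightly because of container overhead.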
Backup storage management involves collecting client data from a set of client systems and storing it remotely as a backup for use in the case of a failure or corruption of the client data on those client systems. For large scale systems, backup storage management systems attempt to minimize the storage, processing and communication resources required to back up the client data. This optimization includes the compression and de-duplication of the client data. However, the compression and de-duplication processes also increase the overhead of the backup storage process. In some cases, the type of compression utilized in the backup process is ineffective, either because the client data set is not amenable to that type of compression process or because that type of compression process requires more resources than are available to compress a given client data set. Efforts to improve the compression process are based on matching the content type with a particular type of compression process that is efficient at compressing that content type. This involves the manual correlation of the content type with a type of compression process or with a specific compression process. This approach may only be applicable where the data to be compressed is of a known type and is discretely separated from other, differing types of content.
For example, some compression processes are efficient at compressing image data. If the data to be compressed is identified as image data, then the administrator managing the compression process can select a compression process that is efficient at compressing such image data. However, some applications of compression are in the context of a general environment, such as a large scale backup management system, which may service a wide variety of clients and back up a wide variety of content, even within specific client data sets. In such environments, the process of individually specifying or correlating content with efficient compression algorithms is too time and resource intensive, because it requires manual correlation and sufficient information about the client data set. As a result, general environments such as backup storage management systems use general compression processes that can compress many different types of content with a moderate level of efficiency (i.e., moderate compression levels and moderate resource consumption). However, because of the use of these general compression processes, there are many instances where resources could have been allocated more efficiently by using a different compression process.
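The manual correlation described above, together with the fallback to a general compression process, can be sketched as follows. This is an illustrative sketch only, not a description of any particular backup system: the table MANUAL_CHOICES, the content-type labels, and the function names are hypothetical, and the pairings of content types with Python's standard bz2, lzma and zlib modules are placeholders rather than recommendations.

```python
import bz2
import lzma
import zlib
from typing import Callable, Dict, Optional

Compressor = Callable[[bytes], bytes]

# Hypothetical, hand-maintained table: an administrator must know the
# content type in advance and choose a process for it manually.
MANUAL_CHOICES: Dict[str, Compressor] = {
    "text/plain": bz2.compress,
    "application/x-database-dump": lzma.compress,
}


def general_purpose(data: bytes) -> bytes:
    """General process: moderate compression at moderate resource cost."""
    return zlib.compress(data, 6)


def compress_by_content_type(data: bytes, content_type: Optional[str]) -> bytes:
    """Use the manually chosen process if one exists, else the general one."""
    compressor = MANUAL_CHOICES.get(content_type, general_purpose)
    return compressor(data)
```

In this sketch, any data whose content type is unknown, or is absent from the table, receives the general process. That mirrors the limitation noted above: the table helps only where content is of a known type and someone has done the manual correlation.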