1. Technical Field
This application relates to optimizing compression based on data activity.
2. Description of Related Art
To reduce the storage space taken by data that is stored on a data storage device, the data may be compressed at the time of writing, i.e., compressed data, rather than uncompressed data, is written to the data storage device.
Some algorithms perform compression by replacing often-repeated sequences of text or data with a substitute placeholder that is smaller in size than the sequence that it replaces. The larger the sequences that are replaced, the more efficient the algorithm becomes, allowing the compressed file to be smaller relative to the size of the uncompressed file. Such compression algorithms will first search the entirety of the data to be compressed, called the compression domain, to find the largest sequences that are repeated within the compression domain. The larger the compression domain, the more opportunities there are for the compression algorithm to find larger repeating sequences. For this reason, compression algorithms tend to be more efficient as the compression domain gets larger.
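The effect of compression domain size on compression efficiency can be illustrated with a minimal sketch. The sketch assumes Python's zlib module (DEFLATE, which uses a 32 KiB sliding window rather than a search of the entire domain) as a stand-in for the algorithms described above; the helper name compressed_size and the test data are hypothetical and chosen only so that every repeated sequence spans a 1 KiB boundary.

```python
import random
import zlib


def compressed_size(data: bytes, domain_size: int) -> int:
    """Total compressed size when data is split into independent domains."""
    total = 0
    for start in range(0, len(data), domain_size):
        # Each domain is compressed on its own, so repeats that cross a
        # domain boundary cannot be replaced by a shorter reference.
        total += len(zlib.compress(data[start:start + domain_size]))
    return total


# Hypothetical data: one 1 KiB pseudo-random pattern repeated 16 times.
# The pattern has essentially no internal repetition, so the only
# repeats available are between copies of the pattern.
rng = random.Random(0)
pattern = bytes(rng.getrandbits(8) for _ in range(1024))
data = pattern * 16

one_large_domain = compressed_size(data, len(data))  # a single 16 KiB domain
many_small_domains = compressed_size(data, 1024)     # sixteen 1 KiB domains
print(one_large_domain, many_small_domains)
```

Under these assumptions the single large domain should compress to a small fraction of the total for the sixteen small domains, because only the large domain allows the compressor to see the repeated pattern.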
However, in systems where the content of the compression domain changes, the compressed data must be decompressed, modified, and recompressed. For this reason, system performance tends to be less efficient as the compression domain gets larger, because a change to any portion of a large compression domain requires that the entire compression domain be decompressed, modified, and recompressed. If, instead, the dataset is divided into smaller compression domains, a change to one portion of the dataset requires decompression and recompression of a smaller portion of the dataset.
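The cost of a change under different compression domain sizes can be sketched as follows. This is an illustration only, assuming fixed-size domains compressed independently with Python's zlib module; the class DomainStore and its counter bytes_recompressed are hypothetical names, and writes are assumed to overwrite existing bytes rather than extend the dataset.

```python
import zlib


class DomainStore:
    """Hypothetical store that compresses each fixed-size domain independently."""

    def __init__(self, data: bytes, domain_size: int):
        self.domain_size = domain_size
        self.length = len(data)
        self.domains = [
            zlib.compress(data[i:i + domain_size])
            for i in range(0, len(data), domain_size)
        ]
        # Uncompressed bytes that had to pass through recompression.
        self.bytes_recompressed = 0

    def write(self, offset: int, payload: bytes) -> None:
        """Overwrite existing bytes, touching only the domains the write spans."""
        end = offset + len(payload)
        if offset < 0 or end > self.length:
            raise ValueError("write must stay within the existing data")
        if not payload:
            return
        first = offset // self.domain_size
        last = (end - 1) // self.domain_size
        for index in range(first, last + 1):
            base = index * self.domain_size
            plain = bytearray(zlib.decompress(self.domains[index]))  # decompress
            lo = max(offset, base)
            hi = min(end, base + len(plain))
            plain[lo - base:hi - base] = payload[lo - offset:hi - offset]  # modify
            self.domains[index] = zlib.compress(bytes(plain))  # recompress
            self.bytes_recompressed += len(plain)

    def read_all(self) -> bytes:
        """Return the full uncompressed contents."""
        return b"".join(zlib.decompress(d) for d in self.domains)


original = bytes(64 * 1024)                   # 64 KiB of zero bytes
large = DomainStore(original, 64 * 1024)      # one 64 KiB domain
small = DomainStore(original, 4 * 1024)       # sixteen 4 KiB domains
large.write(5000, b"changed")
small.write(5000, b"changed")
print(large.bytes_recompressed, small.bytes_recompressed)
```

In this sketch the same seven-byte write forces the single large domain to be decompressed and recompressed in full, while the store with smaller domains touches only the one domain that contains the affected bytes.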
Choosing the size of the compression domain is therefore a balance between compression efficiency and system performance. Choosing the boundaries of the compression domains can also have a significant effect on compression efficiency and system performance. A compression domain may be very large if the data within its boundaries—i.e., the contents of that compression domain—do not change very often.
One approach to selecting the boundary of a compression domain is to attempt to encompass data that will be likely to change together or not change at all. That is, if one portion of the data within the compression domain changes, other portions of the data within the compression domain are also likely to change, so that all of the changes can be handled by a single decompress-modify-modify-modify-recompress sequence. This increases efficiency because the decompress and recompress operations are typically more resource-intensive than the modify operations. In other words, if the system has to go to the trouble of decompressing and recompressing, the overhead caused by multiple modifications is relatively small in comparison.
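The benefit of grouping modifications within one compression domain can be sketched by counting decompress-recompress cycles. This is a minimal illustration assuming Python's zlib module; the function names apply_one_at_a_time and apply_batched, and the representation of an edit as an (offset, payload) pair, are hypothetical.

```python
import zlib
from typing import List, Tuple

Edit = Tuple[int, bytes]  # hypothetical edit: (offset, replacement bytes)


def apply_one_at_a_time(blob: bytes, edits: List[Edit]) -> Tuple[bytes, int]:
    """Run a full decompress-modify-recompress cycle for every edit."""
    cycles = 0
    for offset, payload in edits:
        plain = bytearray(zlib.decompress(blob))
        plain[offset:offset + len(payload)] = payload
        blob = zlib.compress(bytes(plain))
        cycles += 1
    return blob, cycles


def apply_batched(blob: bytes, edits: List[Edit]) -> Tuple[bytes, int]:
    """Decompress once, apply every edit, then recompress once."""
    plain = bytearray(zlib.decompress(blob))
    for offset, payload in edits:
        plain[offset:offset + len(payload)] = payload
    return zlib.compress(bytes(plain)), 1


domain = zlib.compress(b"-" * 4096)
edits = [(0, b"first"), (100, b"second"), (2000, b"third")]
separate_blob, separate_cycles = apply_one_at_a_time(domain, edits)
batched_blob, batched_cycles = apply_batched(domain, edits)
print(separate_cycles, batched_cycles)
```

Both functions leave the domain with the same contents; the batched form simply pays for the decompress and recompress steps once instead of once per modification.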
One conventional example of this approach is what is herein referred to as “file-based” compression, in which the file construct is the boundary of the compression domain: one file is one compression domain, another file is another compression domain, and so on. When one portion of a file changes, it is likely that other portions of the same file will also change, but such a change does not increase the likelihood that another file will also change.
There are disadvantages to file-based compression, however. Storage devices may not operate at the file level, and thus may not even be aware of the file construct. For example, a hard disk drive may respond to requests for logical blocks of available storage space, without knowing to which file, if any, those logical blocks belong. Thus, file-based compression cannot be implemented by a low-level entity, such as the storage device, but must be controlled by a higher-level entity, i.e., one that is aware of the file structure and the mapping of files to logical or physical addresses within the storage device. In file-based compression, the file itself is compressed, but the meta-data that describes the file or its location, such as the directory entry for the file, is not compressed. For this reason, the file contents must be compressed before being sent to the data storage device: the data storage system receives commands to write the already compressed data to the data storage device. Furthermore, the file system must maintain information with each file to indicate whether the file data that is stored on the data storage device is compressed or uncompressed data.
Another conventional approach is for the compression domain to be equal to the unit of reservation or unit of allocation used by the data storage system. For example, multiple data storage devices may collectively provide a pool of data storage blocks that may be allocated to logical units or reserved by processes. In this scenario, each logical unit or portion thereof may be a separate compression domain. Under this approach, the compression domain is not based on a storage block's membership in a file, but on the storage block's membership in a unit of reservation or a unit of allocation. This approach has the advantage that compression can be performed at a low level, e.g., by the allocation or reservation entity or even by the storage device itself, without having to know the higher-level file or directory structures. Furthermore, the file system operates as if every file is uncompressed, and will send and receive uncompressed file data, which is silently compressed before being written to the data storage device and decompressed upon being read from the data storage device.
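The transparency described above can be sketched as a low-level layer that knows only about allocation units and nothing about files. This is an illustration under stated assumptions: the class CompressingUnitStore, its methods, and the use of an in-memory dictionary in place of a physical device are all hypothetical, and Python's zlib module stands in for whatever compression the storage system would use.

```python
import zlib


class CompressingUnitStore:
    """Hypothetical low-level store: one compression domain per allocation unit."""

    def __init__(self):
        # Maps an allocation unit number to its compressed contents.
        # No file or directory information is kept at this level.
        self._units = {}

    def write_unit(self, unit: int, data: bytes) -> None:
        """Accept uncompressed data and silently compress it before storing."""
        self._units[unit] = zlib.compress(data)

    def read_unit(self, unit: int) -> bytes:
        """Return uncompressed data; the caller never sees the compressed form."""
        return zlib.decompress(self._units[unit])

    def stored_size(self, unit: int) -> int:
        """Number of bytes actually held for the given unit."""
        return len(self._units[unit])


store = CompressingUnitStore()
payload = b"A" * 4096            # uncompressed data as a file system would send it
store.write_unit(7, payload)
round_trip = store.read_unit(7)
print(len(payload), store.stored_size(7))
```

The caller in this sketch sends and receives only uncompressed data, which mirrors the file system's view in the approach described above.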
However, there are disadvantages to this approach, as well. In systems where the logical unit has been selected as the compression domain, any write into the logical unit can potentially require (and probably will require) the decompress-modify-recompress operation to be performed. In addition, regardless of the size of the compression domain, the decompress and recompress steps are resource-intensive (and therefore also time-intensive). Furthermore, the compression operation is multiple times more resource-intensive than the decompression operation. For systems that perform multiple decompress-modify-recompress operations, this can cause a severe bottleneck in performance when reading from and writing to a compressed logical unit.