Technical Field
The present invention relates to data reduction and, more particularly, to reduction of data at the storage and network levels by prioritizing data according to compressibility and duplication.
Description of the Related Art
Data reduction is used at both the storage level and the network level to reduce the amount of data that is stored or moved over a connection. Various data reduction techniques are available, depending on the scenario and the goal at hand. In most storage cases, particularly in archival systems, the goal is to obtain the highest compression ratio and thus reduce the amount of storage needed. In network transmissions, the goal is to complete data transfers as fast as possible.
When time is a factor, such as in network transmissions, compression time may play a role alongside compression ratio in determining how best to compress the data, because a time-consuming compression process adds substantially to the latency of transmissions. Compression engines have been introduced with the intent of obtaining both meaningful compression and high speed, attempting to find the optimal tradeoff between the two. Other approaches to data reduction include de-duplication, which quickly identifies whether data already exists at the target location and, if so, forgoes the actual transfer, instead identifying the portions of data at the destination that may be used to reconstruct the data being transmitted.
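By way of illustration only, the de-duplication described above can be sketched as follows. The sketch assumes fixed-size chunks identified by SHA-256 digests; the function names, the 4096-byte chunk size, and the record layout are hypothetical choices made for this example and are not taken from any particular de-duplication system.

```python
import hashlib


def deduplicate(data: bytes, known_hashes: set, chunk_size: int = 4096):
    """Split data into fixed-size chunks and emit one record per chunk.

    A chunk whose digest is already known at the target becomes a
    ('ref', digest, None) record, so its bytes are not transferred.
    Any other chunk becomes a ('raw', digest, chunk) record.
    """
    records = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in known_hashes:
            records.append(("ref", digest, None))
        else:
            records.append(("raw", digest, chunk))
            known_hashes.add(digest)  # the target will hold it after transfer
    return records


def reconstruct(records, store: dict) -> bytes:
    """Rebuild the original data at the target from records and its chunk store."""
    out = bytearray()
    for kind, digest, chunk in records:
        if kind == "raw":
            store[digest] = chunk  # remember newly received chunks
        out += store[digest]
    return bytes(out)
```

In this sketch, a repeated chunk costs only the size of its digest on the wire, while the target reconstructs the full data from chunks it already holds.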
However, existing data reduction approaches often fail to take into account the actual content of the data itself. In addition, existing approaches fail to re-organize data handling when large amounts of data are to be transferred. In a naïve approach, the data is cut into chunks and each chunk is compressed with multiple engines to find the best fit to transmit. However, this approach is resource intensive and slow, and it expends effort on data that is incompressible. As a result, the existing approaches to data reduction for network transmissions are sub-optimal.
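For illustration, the naïve approach can be sketched as follows. The sketch assumes three general-purpose engines from the Python standard library (zlib, bz2, and lzma) and a 64 KiB chunk size; these choices, along with the function names, are hypothetical and stand in for whatever engines and chunking a given system might use.

```python
import bz2
import lzma
import zlib

# Hypothetical engine table: name -> (compress, decompress).
ENGINES = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def naive_best_fit(data: bytes, chunk_size: int = 65536):
    """Compress every chunk with every engine and keep the smallest output.

    Every engine runs on every chunk, including chunks that cannot be
    compressed at all, which is the source of the wasted time and resources.
    """
    results = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        candidates = {name: fns[0](chunk) for name, fns in ENGINES.items()}
        best = min(candidates, key=lambda name: len(candidates[name]))
        results.append((best, candidates[best]))
    return results


def restore(results) -> bytes:
    """Decompress each (engine name, payload) pair and rejoin the chunks."""
    return b"".join(ENGINES[name][1](payload) for name, payload in results)
```

The cost of this sketch grows with the number of engines multiplied by the number of chunks, and a chunk of random or already-compressed bytes yields no size reduction under any of the engines despite having been processed by all of them.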