Data compression is a widely used process for reducing the number of bytes needed to represent a block of data of a given length. This can be done in many ways; however, the fundamental method is to recognise patterns in the data and represent them with shorter lookups or symbols. This removes as much redundant information as possible without compromising data integrity or losing information. It is most effective when the original data block contains highly repetitive patterns. For example, in a database containing dates and addresses, common words such as January, February or March can be symbolised, so that instead of storing the words, the symbols ‘1’, ‘2’ or ‘3’ are stored. Common number sequences, such as postal codes or years, can be symbolised in the same way.
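The month example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (day, month, year) and the names MONTHS, encode_date and decode_date are hypothetical and do not describe any particular compression product.

```python
# Hypothetical symbol table: each month name is replaced by its position (1-12).
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
TO_SYMBOL = {name: index + 1 for index, name in enumerate(MONTHS)}


def encode_date(day, month_name, year):
    """Replace the month word with its short numeric symbol."""
    return (day, TO_SYMBOL[month_name], year)


def decode_date(day, month_symbol, year):
    """Restore the original month word from its symbol (no information is lost)."""
    return (day, MONTHS[month_symbol - 1], year)


record = (14, "March", 2011)
encoded = encode_date(*record)
print(encoded)                # the month is now stored as a small integer
print(decode_date(*encoded))  # the original record is recovered exactly
```

Because the decoder knows which field holds the symbol, the substitution is unambiguous and fully reversible. A general-purpose compressor has to build and store such a table itself, which is why it benefits from data with few, frequently repeated patterns.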
Compression can be used to reduce the amount of storage required to store data; to increase the perceived throughput when transferring data from one location to another; and to reduce repeating patterns before performing an encryption process on a particular data set.
Compression can be used wherever data contains redundant information, such as in emails, documents and databases, all of which have repetitive elements: the spaces between words in a document, formatting information, or repeating data patterns such as numbers in a spreadsheet. A good example from industry is newspaper companies, where words and sentences are used and reused hundreds or thousands of times, making it easy to take these commonly used words and store them in an optimised manner.
The benefits of data compression depend on the quality of the data being compressed. If the data has many repeating patterns, compression will be faster and the resulting data payload smaller. For example, if a data payload contains only one recognised pattern, it is easy to identify that pattern repeatedly and re-encode the data. However, if the payload is full of hundreds of patterns to be recognised, every pattern lookup will take longer, and the re-encoded data will have to contain all of these patterns and the order in which they occurred within the payload. If the data has no repeating patterns, the compression time will be longer because the compression algorithm will be continuously searching for and storing potential patterns, and as a result the re-encoded data could be even larger than the original.
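The effect of data quality on the output size can be observed with a short Python sketch. It uses the standard-library zlib module purely for convenience; the text does not name a specific algorithm, and the helper name compressed_fraction and the sample payloads are assumptions made for illustration.

```python
import os
import zlib


def compressed_fraction(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


# A payload built from a single repeating pattern (highly redundant).
repetitive = b"January " * 10_000

# A payload of the same length with no repeating patterns (random bytes).
patternless = os.urandom(len(repetitive))

print(f"repetitive payload:  {compressed_fraction(repetitive):.4f} of original size")
print(f"patternless payload: {compressed_fraction(patternless):.4f} of original size")
```

The repetitive payload shrinks to a very small fraction of its original size, whereas the random payload typically comes out marginally larger than it went in, because the compressor still has to add its own framing to data it could not reduce.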
Data compression is generally not used on data where the level of redundancy is low. This includes data that has been encrypted or is already compressed, because both operations leave very little redundant information for a further compression pass to exploit. Video files, for example, have usually gone through a compression process and so do not compress well. Another example of data that is not easily compressible is medical images, which are generally encrypted for data protection.
The rate at which data can be compressed depends on three elements: the algorithm used for the process, the hardware performing the process and the quality of the data being processed.
There are many different compression algorithms available, each with a different level of efficiency and performance. Although this level differs from one algorithm to the next, it is constant for any given algorithm. This means that if the same data set were processed multiple times, it would take the same number of operations each time and output the same data payload. Even if the data set differed, as long as it is of the same quality or compressibility, the number of operations required and the compression efficiency of the output will be relatively similar for a given algorithm.
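The repeatability described above can be checked with a minimal Python sketch. It again assumes the standard-library zlib module as a stand-in for "a given algorithm"; the function name is_repeatable and the sample data are hypothetical.

```python
import zlib


def is_repeatable(data: bytes, level: int = 6, runs: int = 5) -> bool:
    """Compress the same data several times and check every output is identical."""
    outputs = {zlib.compress(data, level) for _ in range(runs)}
    return len(outputs) == 1


sample = b"postal code 90210, year 2011; " * 2_000
print(is_repeatable(sample))
```

The check only confirms that the output payload is the same on each run with the same settings; it does not measure the number of operations, which would need a profiler or hardware counters.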
The hardware performing a compression operation could have a single processing core, 32 processing cores or even hardware offloading to improve system performance. Regardless of what the hardware is or how fast it can run, the amount of time that a given piece of hardware takes to run a given operation is constant.
The data that goes through a compression process is the most variable element influencing performance. On a given system, using a predefined algorithm, the amount by which the data can be compressed will differ from one data set to the next. As stated above, if a data set has a lot of redundant information, the overall process will be faster than if the data had very little redundancy.
In summary, the rate of compression depends on the performance of the hardware, the algorithm used and the quality (or compressibility) of the data.
Performing data compression is a very computationally intensive task. If a compression algorithm has the ability to run on multiple CPUs then it has the potential to consume a large amount of a system's resources, leaving little or no processing time for performing other operations in the system. This can be a problem if a system has multiple roles to perform, especially if any of these roles are time critical.
Some compression methods can only utilise a single core because they process a file or a stream of data sequentially. Where such a compression algorithm is used, the maximum rate of compression is limited to the speed of that core, irrespective of how many cores are available within the system. Under this scenario, running other operations in the system is generally not a problem. Limitations occur when other operations, such as a transmission system, are dependent on the compressed data.
Some servers use compression in conjunction with a network to perform backup or remote copy operations. These systems sometimes work in pairs, where the data is compressed, sent, received and then decompressed. Other systems may compress the data before sending it to a network storage device, such as NFS, Block iSCSI or a Windows file share. In each of these cases the server depends on the data being compressed before transmitting it.
Network interface cards (NICs) are increasing in speed, for example from 1 Gb/s to 10 Gb/s or 40 Gb/s, yet the input/output (I/O) interface and protocol to the computer system generally remain the same. This means that a computer administrator can simply replace a NIC, either to gain network performance or to keep compatibility with an evolving network infrastructure.
If a system that compresses data prior to sending it, such as one of the examples above, were to have its NIC upgraded to accommodate a faster network infrastructure, the actual gain may be negligible. This can happen if all of the available processing time is already being used for compression, leaving only a small amount of resources to perform the network operations. Similarly, because the data can only be transmitted once it has been compressed, replacing the NIC could show zero increase in performance if the limiting factor is the time taken to compress the data prior to transmission.
Tests have shown that a computer system running a six-core Intel Xeon X5660 2.8 GHz processor can compress data to 24% of its original size at a rate of 411 MB/s when utilising all cores. This means that compressing this data before transmitting it over a 1 Gb/s network link can increase the perceived user throughput by a factor of up to about 4, making the network link look like a 4 Gb/s link simply by sending less data. Doing this is beneficial to the overall system. However, if the same system has its NIC upgraded to 10 Gb/s, the throughput would still be 411 MB/s, utilising none of the additional network performance available. This is simply because the limiting factor is now the compression process. Under this scenario the compression process is no longer a benefit but a problem, because it slows down the potential transfer rate of the data: a 10 Gb/s network link can carry uncompressed data at up to 1280 MB/s, yet the user throughput is limited to 411 MB/s (32% of the network's capacity).
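The arithmetic behind this example can be expressed as a simple bottleneck model, sketched below in Python. The model is an assumption made for illustration, not the test method used above: it supposes that compression and transmission overlap perfectly, ignores protocol overhead, and converts link speeds at 128 MB/s per Gb/s, the convention implied by the 1280 MB/s figure. The function name perceived_throughput is hypothetical.

```python
MB_PER_GBIT = 128  # assumed conversion, matching 10 Gb/s = 1280 MB/s in the text


def perceived_throughput(compress_rate, compressed_fraction, link_rate):
    """Uncompressed MB/s seen by the user when all data is compressed before sending.

    compress_rate       -- MB/s of uncompressed data the compressor can consume
    compressed_fraction -- compressed size / original size (0.24 for 24%)
    link_rate           -- MB/s the network link can carry
    """
    # The link carries compressed bytes, so it can absorb uncompressed data
    # at link_rate / compressed_fraction.
    link_limit = link_rate / compressed_fraction
    # The slower of the two stages sets the overall rate.
    return min(compress_rate, link_limit)


for gbit in (1, 10):
    link = gbit * MB_PER_GBIT
    rate = perceived_throughput(411, 0.24, link)
    print(f"{gbit:>2} Gb/s link: uncompressed {link} MB/s, with compression {rate:.0f} MB/s")
```

Under these assumptions the compressor sets the rate on both links. On the 1 Gb/s link, 411 MB/s is a little over three times the 128 MB/s the link carries uncompressed, close to the ceiling of roughly four set by the 24% ratio. On the 10 Gb/s link the same 411 MB/s is only about 32% of the 1280 MB/s available without compression.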
This problem can be seen all over the world as long-distance network links become both cheaper and faster. When network links were slower and more expensive, compression was used in series with network transmission to give users the perception of a faster link by reducing the amount of data being transferred. As network links have increased in speed, that compression process has become a limiting factor. Compression now limits performance to such a point that network utilisation drops below 100% despite the reduction in data quantity.