In modern computer systems, data traffic between an integrated circuit chip and external memory is a major performance bottleneck and accounts for a significant portion of the energy the system dissipates. As such, memory bandwidth requirements can limit the thermal design point (TDP) of the system and inhibit performance scaling. Cache hierarchies avoid much of this memory traffic, because the most recently used data can be found close to the processor (e.g., a central processing unit (CPU) or graphics processing unit (GPU)). However, even with infinite caches, compulsory cache misses still lead to memory traffic, because the first access to any piece of data must fetch it from memory.
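As a rough illustration of the compulsory-miss floor, the following minimal sketch models an idealized cache of unlimited capacity and counts how many cache lines must still be fetched from memory. The function name `count_memory_fetches`, the 64-byte line size, and the access trace are assumptions chosen for illustration only; they do not describe any particular system.

```python
def count_memory_fetches(addresses, line_size=64):
    """Count line fetches from memory for an idealized, unlimited-capacity cache.

    Nothing is ever evicted, so every fetch counted here is a compulsory
    (first-touch) miss. line_size is an assumed value, in bytes.
    """
    cached_lines = set()
    fetches = 0
    for addr in addresses:
        line = addr // line_size
        if line not in cached_lines:
            # First access to this line: it must come from memory.
            fetches += 1
            cached_lines.add(line)
    return fetches


# Hypothetical trace: read 1 MiB in 4-byte words, then read it all again.
trace = list(range(0, 1 << 20, 4)) * 2
fetches = count_memory_fetches(trace)
print(fetches, "line fetches,", fetches * 64, "bytes from memory")
```

In this sketch the second pass over the buffer hits entirely in the cache, but the first pass still moves the full buffer across the memory interface. A larger cache cannot remove that traffic; only reducing the number of bytes transferred per line, for example through compression, can.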
Current systems can compress data from only certain caches, and the CPU and the GPU do not handle the compressed data in a uniform way. For example, if the GPU has compressed parts of a render target and the CPU later needs to read that render target, the entire render target must first be decompressed and sent to the CPU caches.
Previous methods rely on specific handling of compression for specific types of data. This is necessary because of the nature of compression: different types of data lend themselves to different methods of compression. However, previous methods also handle compression control specifically and separately for each case, making it hard (in terms of design and validation) to introduce compression at new locations in the system.