A computer system spends a significant portion of its time performing bulk data operations. Bulk data operations degrade both system performance and energy efficiency because they require a large number of transfers over the memory channel that couples a memory chip to a memory controller. For example, a typical memory system today (e.g., using Double Data Rate 3 (DDR3)-1066) takes roughly a microsecond (i.e., 1046 nanoseconds) to copy 4 kilobytes (KB) of data by transferring that data over a memory channel. For today's high-speed memories, one microsecond is a high latency that degrades the performance of the computing system. Such high latency can also degrade the performance of concurrently running applications that share the bandwidth of the memory channel.
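The order of magnitude of the copy latency above can be checked with a back-of-envelope estimate. The following is a minimal sketch, not a model of any particular memory controller. It assumes a 64-bit (8-byte) data bus, transfers at the peak DDR3-1066 rate, and a copy in which every byte crosses the channel twice (one read, one write). The function name `copy_time_ns` and its parameters are hypothetical.

```c
#include <assert.h>

/* Lower-bound time, in nanoseconds, to copy `bytes` bytes over a memory
 * channel. Assumptions (not from the source text): the channel runs at its
 * peak rate, and each byte crosses the channel twice (read, then write). */
double copy_time_ns(double bytes, double mega_transfers_per_s, double bus_bytes)
{
    assert(mega_transfers_per_s > 0.0 && bus_bytes > 0.0);

    /* MT/s * bytes per transfer = MB/s; dividing by 1000 gives bytes per ns. */
    double bytes_per_ns = mega_transfers_per_s * bus_bytes / 1000.0;

    /* Factor of 2: the data is read from the source and written to the destination. */
    return 2.0 * bytes / bytes_per_ns;
}
```

Under these assumptions, `copy_time_ns(4096.0, 1066.0, 8.0)` gives a lower bound of roughly 960 nanoseconds. This is the same order as the 1046 nanoseconds cited above; the remainder would plausibly be accounted for by memory access overheads that this sketch ignores.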
Another type of data operation that may cause high latency (i.e., due to an increase in the number of data transfers over the memory channel) is presetting or resetting the contents of a block of memory. Presetting and resetting operations are typically used in graphics or display applications that need to clear or wipe some or all of the display contents (e.g., to make a portion of the displayed image completely black or white). One way to zero the contents of a block of memory (i.e., to reset the memory) is to write zeros to the block of memory by transferring data indicating zero over the memory channel. Such a method of resetting the block of memory uses a large number of data transfers over the memory channel.
Another way to zero the contents of a block of memory is to use high-level software programming functions such as “memset(ptr, 0, nbyte)” and “calloc()”. These software functions are usually implemented as programming loops of store instructions. Store or write instructions cause a large number of data transfers over the memory channel. With Advanced Vector Extensions (AVX), it is possible to clear or set 256 bits (i.e., 32 bytes) at a time using a single instruction. However, to clear a whole page (e.g., 4 KB), the AVX instruction must be executed in a loop 128 times, which consumes both time and power.
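The loop count for clearing a page can be illustrated with a short sketch. This is a portable illustration and not the actual implementation of any library routine. It assumes a 4 KB page and a 256-bit (32-byte) store width, and each `memcpy` of one chunk stands in for a single vector store. The names `zero_in_chunks`, `PAGE_BYTES`, and `CHUNK_BYTES` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES  4096u  /* one 4 KB page, as in the text */
#define CHUNK_BYTES 32u    /* assumed width of one 256-bit store */

/* Zero `nbytes` bytes at `dst` in CHUNK_BYTES pieces and return the number
 * of chunk-sized stores issued. Each memcpy below stands in for one
 * 256-bit vector store; no actual AVX instruction is used here. */
size_t zero_in_chunks(uint8_t *dst, size_t nbytes)
{
    static const uint8_t zeros[CHUNK_BYTES] = {0};
    size_t stores = 0;

    assert(nbytes % CHUNK_BYTES == 0);
    for (size_t off = 0; off < nbytes; off += CHUNK_BYTES) {
        memcpy(dst + off, zeros, CHUNK_BYTES);  /* one chunk-wide store */
        stores++;
    }
    return stores;
}
```

Calling `zero_in_chunks(page, PAGE_BYTES)` on a 4096-byte buffer issues 4096 / 32 = 128 stores, and every one of those stores ultimately produces write traffic over the memory channel.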
Another example of a data operation that may cause high latency (i.e., due to an increase in the number of data transfers over the memory channel) is inverting or complementing a large amount of raw data in a block of memory. The process of inverting or complementing a large amount of raw data is typically used in image processing, where it is often desired to obtain a negative of an image. One way to invert or complement a large amount of raw data is to transfer the inverted or complemented data over the memory channel to the memory chip, and then write that inverted or complemented data to the block of memory.
One such image processing operation is performed by a digital camera. In this case, the digital camera creates the image, stores the image in raw format, and creates the negative image for further image processing. When creating the negative image, the particular hardware (in this example, the digital camera) must go through the steps of reading the data word by word, complementing each word, and then storing the inverted word back in the image format. It takes both time and power to go through the image one word at a time.
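The word-by-word procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the firmware of any actual camera. It assumes 32-bit words and an in-place buffer, and the function name `invert_words` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Complement `nwords` 32-bit words in place and return the number of words
 * processed. Each iteration performs one read and one write of the word. */
size_t invert_words(uint32_t *buf, size_t nwords)
{
    assert(buf != NULL || nwords == 0);

    for (size_t i = 0; i < nwords; i++) {
        uint32_t w = buf[i];  /* read one word */
        w = ~w;               /* complement it */
        buf[i] = w;           /* store the inverted word back */
    }
    return nwords;
}
```

The work grows linearly with the image size: an image of N words requires N reads and N writes, each of which crosses the memory channel when the data is not cached.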