The present invention relates generally to the field of computer memory, and more particularly to optimizing cache bandwidth consumption.
Memory latency and memory bandwidth limitations are two important factors that limit the performance of some applications. Memory latency is the time between a processor issuing a request for data from memory and the memory returning the requested data. Memory bandwidth is a measure of how fast data flows from memory to the processor. However, memory bandwidth and memory latency trade off against each other: the greater the bandwidth, the longer it takes to assemble all of the memory data that is being sent to the processor. For example, assembling 64 bits of data only slows down the overall transaction when the processor requires just one byte.
Memory bandwidth limitations are likely to become worse with the current trend towards multithreaded and multi-core processors, since the memory bandwidth is increasing much more slowly than the speed of the processors. Various optimization techniques have been proposed to reduce memory latency and to improve the memory bandwidth utilization. One such technique is data splitting performed by a compiler operation.
A compiler translates a software program written in a high-level programming language that is suitable for human programmers into the low-level machine language that is required by computers. Data splitting has been proven to be an effective compiler transformation to improve data locality and reduce the memory footprint, resulting in better data cache efficiency, especially for loop iterations that only manipulate certain fields of the array. In existing production compilers, an array of data structures is split, field by field, into two or more arrays of smaller data structures, and the splitting is applied across the entire program by modifying all the references to that structure type. When two different regions in an application access the same hot fields with different code patterns, this data splitting mechanism may not realize the full performance potential. Consider the following example code abstracted from the memory-bound CPU2006 benchmark libquantum (gates.c):
for (i = 0; i < reg->size; i++) {
    if ((reg->node[i].state & ((MAX_UNSIGNED) 1 << control)))
        reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
}
This is one of the hottest loops in the benchmark. The issue with this loop is poor cache utilization as a result of accesses to the 16-byte struct “node” (which in turn is part of the struct “reg”). Every time the struct “node” is accessed, only one or two bits of the variable “state” are used, whereas the other half of the struct “node” and the remaining bits of the variable “state” are wasted in the cache, since they are eventually evicted without being used. Moving unwanted data into the cache is a waste of both memory bandwidth and cache capacity. Existing compilers improve the cache utilization by splitting the two fields of the struct “node” into two separate arrays. This may improve cache utilization, but may still fall far short of the optimal cache utilization. Splitting the data further, however, may result in bit manipulation overhead in other regions of the program where the variable “state” is accessed differently.
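The field-level splitting described above can be illustrated with a minimal C sketch. It is not the benchmark's actual source: the type names `node_t`, the field name `amplitude` (standing in for the unused half of the struct), the definition of `MAX_UNSIGNED` as `unsigned long long`, and the helper functions `cnot_aos`, `cnot_split`, and `layouts_agree` are all assumptions introduced for illustration. The sketch shows the same loop running over the original array of structs and over a split-out array that holds only the hot field “state”.

```c
#include <stdlib.h>

/* Assumed definition; chosen so that "state" occupies 8 bytes. */
typedef unsigned long long MAX_UNSIGNED;

/* Original layout: one 16-byte struct per element.
 * "amplitude" is a hypothetical stand-in for the field the loop never reads. */
typedef struct {
    float amplitude[2];   /* cold field, 8 bytes (assumed) */
    MAX_UNSIGNED state;   /* hot field, 8 bytes */
} node_t;

/* Loop over the array of structs: every iteration pulls the cold
 * field into the cache alongside "state". */
void cnot_aos(node_t *node, int size, int control, int target) {
    for (int i = 0; i < size; i++) {
        if (node[i].state & ((MAX_UNSIGNED)1 << control))
            node[i].state ^= ((MAX_UNSIGNED)1 << target);
    }
}

/* Same loop after field splitting: "state" lives in its own dense
 * array, so only hot data is streamed through the cache. */
void cnot_split(MAX_UNSIGNED *state, int size, int control, int target) {
    for (int i = 0; i < size; i++) {
        if (state[i] & ((MAX_UNSIGNED)1 << control))
            state[i] ^= ((MAX_UNSIGNED)1 << target);
    }
}

/* Runs both layouts on identical inputs; returns 1 if the resulting
 * states match, 0 on mismatch or allocation failure. */
int layouts_agree(int size, int control, int target) {
    node_t *aos = malloc((size_t)size * sizeof *aos);
    MAX_UNSIGNED *state = malloc((size_t)size * sizeof *state);
    int same = 1;

    if (aos == NULL || state == NULL) {
        free(aos);
        free(state);
        return 0;
    }
    for (int i = 0; i < size; i++) {
        aos[i].amplitude[0] = 0.0f;
        aos[i].amplitude[1] = 0.0f;
        aos[i].state = (MAX_UNSIGNED)i;
        state[i] = (MAX_UNSIGNED)i;
    }
    cnot_aos(aos, size, control, target);
    cnot_split(state, size, control, target);
    for (int i = 0; i < size; i++) {
        if (aos[i].state != state[i])
            same = 0;
    }
    free(aos);
    free(state);
    return same;
}
```

Under these assumed field sizes, the split loop streams 8 bytes per element instead of 16, which corresponds to the improvement existing compilers obtain. Each iteration still loads a full 64-bit “state” word to test and flip individual bits, which is the remaining inefficiency noted above.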