This application relates in general to data caches for computer systems, and in particular to the movement of data into and out of the caches.
Prior art computer systems have assorted problems with data movement into and out of cache. When these problems occur, system performance is drastically curtailed.
One problem is cache bank conflicts, which relate to the number of stores to cache that can be executed in a clock cycle. Many prior art computer systems allow a maximum of one load and one store to be executed per clock cycle. To perform a store, data is moved from a processor register to a store queue and then to the data cache. If the store queue fills up, the processor stalls and system performance is reduced.
Some prior art systems also have an implementation constraint in which a store immediately following a load to the same cache bank incurs additional delay compared to any other combination of loads and stores to the cache banks. Prior art commonly interleaves loads and stores, so the store queue cannot empty to cache as efficiently as desired, which degrades performance.
If alternating loads and stores are used where the loads and stores are on the same cache line, a cache miss will occur on every load and store, and thus overlapping cannot be used. A cache miss occurs when the data to be loaded is not in the cache, but is in the main memory.
Another problem is cache collision. A cache collision occurs when two memory addresses map to the same cache address. Typically, a cache has a much smaller address space than the full memory, so the mapping is many to one, i.e., many real memory addresses map to the same cache address. For example, if there are 1 million entries in the cache and 100 million entries in the main memory, then 100 main memory entries map to every cache entry. In a typical direct mapped cache, memory addresses 1, 1 million+1, 2 million+1, etc., all map to cache address 1, and memory addresses 2, 1 million+2, 2 million+2, etc., all map to cache address 2. Therefore, when a data copy is performed where the source and destination addresses are exactly a multiple of the cache size apart, cache collisions will occur.
Cache thrashing occurs when cache collisions happen repeatedly. There are two important special cases in which cache thrashing occurs. The first is where two Unix processes, which share a parent process, attempt to communicate by means of Unix supported shared memory. The processes allocate shared memory for similar purposes at similar times. Thus, their data buffers for copying often have the same logical addresses and map to the same cache lines. The second case occurs when two large arrays of data that are used in a computation are declared together, and the user has declared their sizes to be a multiple of the cache size. Then similar indexes in the arrays will map to the same cache lines. In other words, if an array is declared with a size of exactly one million and is copied to another array, also of size one million, then each element of the source array will systematically map to the same cache address as the corresponding element of the destination array.
The effect of a cache collision is the loss of data from the cache, as the data is overwritten before the system has finished using it. Moreover, a typical protocol will involve a mix of loads and stores for the two arrays. Each access for either array will wipe out several pieces of information for the other array. In addition, because operations are performed in parallel, an exact address match does not have to occur for a cache collision to occur; the addresses need only be close to each other. Thus, when data is to be copied and the source and destination addresses are nearly the same, but not exactly the same, performance is still reduced by cache collisions. Prior solutions do not make special provisions for handling nearly aligned source and destination addresses.
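The many-to-one mapping described above can be illustrated with a short sketch. It assumes a direct mapped cache in which the cache address is simply the memory address modulo the number of cache entries, which reproduces the 1, 1 million+1, 2 million+1 example; the function names are hypothetical and real caches index by line and set rather than by individual entry.

```python
CACHE_ENTRIES = 1_000_000  # cache size taken from the example in the text


def cache_index(address: int, cache_entries: int = CACHE_ENTRIES) -> int:
    """Cache address for a memory address (assumed modulo mapping)."""
    return address % cache_entries


def collides(src: int, dst: int, cache_entries: int = CACHE_ENTRIES) -> bool:
    """True when two memory addresses map to the same cache address."""
    return cache_index(src, cache_entries) == cache_index(dst, cache_entries)


def copy_thrashes(src_base: int, dst_base: int, count: int) -> bool:
    """True when every element of a copy collides with its destination."""
    return all(collides(src_base + i, dst_base + i) for i in range(count))
```

Under this assumed mapping, copying an array that starts at address 0 to one that starts at address 1,000,000 collides on every element, while shifting the destination by a single entry removes the exact collisions (though not the near collisions discussed above).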
Thus, there is a need in the art for a mechanism which keeps the store queue nearly empty, while minimizing both cache bank conflicts and cache collisions.
These and other objects, features and technical advantages are achieved by a system and method which determine whether the memory source and destination addresses have nearly matching cache addresses or exactly matching cache addresses. If the source and destination addresses have different cache addresses, then the data flow into and out of the cache operates in an enhanced mode. If the source and destination addresses have nearly matching cache addresses, then the data flow into and out of the cache operates in a near cache collision avoidance mode. If the source and destination addresses have exactly matching cache addresses, then the data flow into and out of the cache operates in an exact cache collision avoidance mode.
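The three-way mode selection can be sketched as follows. This is only an illustration: the modulo mapping, the `near_window` threshold that decides what counts as "nearly matching", and all names are assumptions, since the text here does not state how close two cache addresses must be.

```python
from enum import Enum


class CopyMode(Enum):
    ENHANCED = "enhanced"
    NEAR_COLLISION = "near cache collision avoidance"
    EXACT_COLLISION = "exact cache collision avoidance"


def select_mode(src: int, dst: int, cache_size: int, near_window: int) -> CopyMode:
    """Pick a copy mode from the cache addresses of source and destination.

    near_window is a hypothetical tuning parameter, not a value from the text.
    """
    # Offset of the destination from the source, measured in cache addresses.
    delta = (dst - src) % cache_size
    if delta == 0:
        return CopyMode.EXACT_COLLISION
    # The cache address space wraps, so measure the shorter way around.
    distance = min(delta, cache_size - delta)
    if distance <= near_window:
        return CopyMode.NEAR_COLLISION
    return CopyMode.ENHANCED
```

For example, with a cache of 1,000,000 entries and an assumed window of 64, a destination exactly 1,000,000 addresses after the source selects the exact mode, one 1,000,032 addresses after it selects the near mode, and one 500,000 addresses after it selects the enhanced mode.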
The enhanced mode of cache bank conflict avoidance carefully orders loads and stores so that loads to one memory bank are performed on the same clock cycles as the stores to the other memory bank. After a group of loads and stores is completed, the operations for each bank are switched, so that a group of stores is performed on one bank and a group of loads is performed on the other bank. Thus, delays from switching between loads and stores are minimized, since groups of operations are performed sequentially, while the data flow into and out of the cache as a whole stays balanced, since one bank is performing stores while the other is performing loads.
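The ordering used by the enhanced mode can be pictured as a per-cycle schedule. The sketch below assumes two banks numbered 0 and 1, one load and one store per clock cycle, and a caller-chosen `group_size`; these details and the function name are illustrative assumptions rather than parameters given in the text.

```python
def enhanced_schedule(num_groups: int, group_size: int) -> list[tuple[int, int, int]]:
    """Return (cycle, load_bank, store_bank) tuples for a two-bank cache.

    Within a group, every cycle loads from one bank and stores to the other.
    The banks swap roles only at group boundaries.
    """
    schedule = []
    cycle = 0
    for group in range(num_groups):
        load_bank = group % 2       # bank receiving loads during this group
        store_bank = 1 - load_bank  # the other bank receives the stores
        for _ in range(group_size):
            schedule.append((cycle, load_bank, store_bank))
            cycle += 1
    return schedule
```

In such a schedule no cycle ever loads from and stores to the same bank, and a bank changes between loading and storing only once per group instead of once per cycle.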
The near cache collision avoidance mode determines the relative locations of the source and destination addresses within the cache. If the source address is slightly after the destination, then the enhanced mode will function properly. If the destination address is slightly after the source address, then groups of cache lines are loaded into registers, and then the registers are stored to memory without any interleaving of other loads and stores.
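A minimal sketch of the grouped copy follows. It models memory as a Python list and registers as a local list, assumes the source and destination regions do not overlap, and uses a hypothetical half-cache test to decide whether the destination follows the source; none of these details come from the text.

```python
def destination_follows_source(src: int, dst: int, cache_size: int) -> bool:
    """True when the destination's cache address lies after the source's.

    Assumed test: the forward distance is nonzero and at most half the cache.
    """
    delta = (dst - src) % cache_size
    return 0 < delta <= cache_size // 2


def grouped_copy(memory: list, src: int, dst: int, count: int, group_size: int) -> list:
    """Copy count elements, loading a whole group before storing any of it."""
    trace = []
    for start in range(0, count, group_size):
        end = min(start + group_size, count)
        registers = []
        # Load the whole group into registers first.
        for i in range(start, end):
            registers.append(memory[src + i])
            trace.append(("load", src + i))
        # Then store the group, with no loads interleaved.
        for offset, value in enumerate(registers):
            memory[dst + start + offset] = value
            trace.append(("store", dst + start + offset))
    return trace
```

The returned trace shows runs of loads followed by runs of stores, which is the property the near mode relies on: a store never lands between two loads of the same group.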
The exact cache collision avoidance mode restructures the data-moving loop to begin with a series of loads that stage several cache lines; each element of data is moved not only into the cache, but also into registers. The loop is pipelined so that, after this initial set of loads is performed, additional loads are interleaved with non-cache-conflicting stores to move new values into memory. By separating in time the loads and stores for matching cache lines, full pipelining and multi-entry cache benefits are obtained.
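The restructured loop can be sketched as a software pipeline with a prologue, a steady state, and an epilogue. The sketch models memory as a Python list and registers as a queue, and treats `depth` (how many elements are staged before the first store) as a hypothetical parameter; it illustrates the load and store ordering only, not the hardware behavior.

```python
from collections import deque


def pipelined_copy(memory: list, src: int, dst: int, count: int, depth: int) -> list:
    """Copy count elements with loads running depth elements ahead of stores.

    The trace records element indexes, so ("store", i) next to ("load", i + depth)
    shows that interleaved operations never touch matching elements.
    """
    trace = []
    registers = deque()
    staged = min(depth, count)
    # Prologue: stage the first elements in registers before any store.
    for i in range(staged):
        registers.append(memory[src + i])
        trace.append(("load", i))
    # Steady state: store the oldest staged element, then load a new one.
    for i in range(count - staged):
        memory[dst + i] = registers.popleft()
        trace.append(("store", i))
        registers.append(memory[src + i + staged])
        trace.append(("load", i + staged))
    # Epilogue: drain the remaining registers.
    for i in range(count - staged, count):
        memory[dst + i] = registers.popleft()
        trace.append(("store", i))
    return trace
```

Because the source and destination elements with the same index map to the same cache address in this mode, keeping each store `depth` elements behind the loads means the interleaved loads and stores fall on different cache lines, provided `depth` spans more than one line.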
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and the specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.