1. Technical Field
The present invention relates in general to data processing systems and, in particular, to data cache accesses. More particularly, the present invention relates to an address generation circuit that utilizes history-based predicted carry-in values for partial sum adders that are utilized for generating data cache addresses.
2. Description of the Related Art
The use of data caches to improve performance in computing systems is well known and widespread. A cache is a high speed buffer which holds recently used memory data. Because programs exhibit locality of reference, most data accesses can be satisfied from the cache, in which case slower accesses to bulk memory are avoided. In typical high performance processor designs, the cache access path forms a critical path. That is, the cycle time of the processor is affected by how fast cache accessing can be carried out.
A cache may logically be viewed as a table of data blocks or data lines in which each table entry covers a particular block or line of memory data. The implementation of a cache is normally accomplished through three major portions: directory, arrays and control. The directory contains the address identifiers for the cache line entries, plus other necessary status tags suitable for particular implementations. The cache arrays store the actual data bits, with additional bits for parity checking or for error correction as required in particular implementations. Cache control circuits provide necessary logic for the management of cache contents and accessing. Upon an access to the cache, the directory is accessed or “looked up” to identify the residence of the requested data line. A cache hit results if it is found in the cache, and a cache miss results otherwise. Upon a cache hit, the data may be accessed from the array if there is no prohibiting condition, e.g., protection violation. Upon a cache miss, the data line is normally fetched from the bulk memory and inserted into the cache first, with the directory updated accordingly, in order to satisfy the access through the cache.
Since a cache only has capacity for a limited number of line entries and is relatively small compared with the bulk memory, replacement of existing line entries is often needed. The replacement of cache entries in a set associative cache is normally based on algorithms such as the Least-Recently-Used (LRU) scheme. That is, when a cache line entry needs to be removed to make room for, i.e., replaced by, a new line, the line entry that was least recently accessed will be selected. In order to facilitate efficient implementations, a cache is normally structured as a 2-dimensional table. The number of columns is called the set-associativity, and each row is called a congruence class. For each data access, a congruence class is selected using certain address bits of the access, and the data may be accessed at one of the line entries in the selected congruence class if it hits there. It is usually too slow to have the cache directory searched first, e.g., with parallel address compares, to identify the set position (within the associated congruence class) and then to have the data accessed from the arrays at the found location. Such sequential processing normally requires two successive machine cycles to perform, which degrades processor performance significantly.
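The congruence class selection and LRU replacement described above can be illustrated with a minimal software sketch. It is a toy model under stated assumptions, not a description of any particular hardware: the class name SetAssociativeCache, its parameters, and the use of a modulo operation to select the congruence class are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy model: rows are congruence classes, columns are set positions."""

    def __init__(self, num_classes=4, associativity=2, line_bytes=128):
        self.num_classes = num_classes
        self.associativity = associativity
        self.line_bytes = line_bytes
        # One LRU-ordered directory per congruence class: tag -> line data.
        self.classes = [OrderedDict() for _ in range(num_classes)]

    def access(self, address):
        """Return True on a cache hit, False on a miss (the line is then filled)."""
        line_number = address // self.line_bytes
        index = line_number % self.num_classes   # selects the congruence class
        tag = line_number // self.num_classes    # identifies the line within it
        entries = self.classes[index]
        if tag in entries:
            entries.move_to_end(tag)             # mark as most recently used
            return True
        if len(entries) >= self.associativity:
            entries.popitem(last=False)          # evict the least recently used line
        entries[tag] = None                      # placeholder for the line data
        return False
```

In this sketch the directory lookup and the data access are a single dictionary operation. In hardware they are separate structures, which is why performing them sequentially costs additional cycles.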
Generally, most, if not all, conventional computer architectures require that cache storage addresses be generated by adding a displacement, or index, to a base register value or address. This addition requires at least one additional pipeline cycle, thus increasing the latency of a data cache access. Sum address and zero delay arithmetic and operand address generation (AGEN) schemes limit the delay penalty by implementing only a few bits of the address adder at a time and generating only a partial sum, e.g., 2-4 bits at a time, assuming that there is no carry-in to the addition, to start a cache access. However, the bits that are utilized to start an access are not the least significant bits, but are higher order bits. These bits are of higher order than the bits that address bytes within the cache line, which is typically 64-256 bytes and therefore requires 6-8 address bits. Thus, for example, if bits 57-63 of a 64 bit address are utilized to address the bytes within the cache line, bits 50-56 could be used as the address index to begin the data cache access.
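The bit ranges in the example above can be made concrete with a short sketch. It assumes that bit 0 denotes the most significant bit and bit 63 the least significant bit, which is the only numbering under which bits 57-63 can address the bytes within a line. The helper name field and the sample address are hypothetical.

```python
def field(address, first_bit, last_bit, width=64):
    """Extract bits first_bit..last_bit, numbering bit 0 as the most significant."""
    shift = width - 1 - last_bit
    length = last_bit - first_bit + 1
    return (address >> shift) & ((1 << length) - 1)


address = 0x0000_0000_0001_2345      # arbitrary sample address
byte_offset = field(address, 57, 63)  # 7 bits: byte within a 128 byte line
cache_index = field(address, 50, 56)  # 7 bits: index used to begin the access
```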
A basic scheme for partial addition groups without carry propagation will herein be described in conjunction with FIG. 1 that illustrates a 128 byte data cache line partial sum address generation example. As shown, the effective address addition is broken down into multiple, i.e., two or more, small adder portions comprising 2-3 bits each. To improve the access time, either the carry from the 7 bit Line Access Select (LAS) adder is ignored and assumed to be zero, which is true about 80-90% of the time, or multiple read access paths must be implemented in the data cache to account for the carry-in and no-carry-in cases. However, although either of the above described schemes is better than performing the entire address generation routine and taking another pipeline cycle, each has inherent limitations.
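The effect of assuming a zero carry-in can be sketched in software. The following is an illustrative model only, not the circuit of FIG. 1: it treats the low-order 7 bits as one adder and the next 7 bits as a single index field, and the constant and function names are hypothetical.

```python
LOW_BITS = 7     # assumed width of the low-order adder whose carry-out is guessed
INDEX_BITS = 7   # assumed width of the index field used to start the access


def speculative_index(base, displacement, carry_in=0):
    """Add only the index fields of the two operands, with a guessed carry-in."""
    mask = (1 << INDEX_BITS) - 1
    base_field = (base >> LOW_BITS) & mask
    disp_field = (displacement >> LOW_BITS) & mask
    return (base_field + disp_field + carry_in) & mask


def actual_carry(base, displacement):
    """Carry-out of the low-order addition, known only after it completes."""
    low_mask = (1 << LOW_BITS) - 1
    return ((base & low_mask) + (displacement & low_mask)) >> LOW_BITS


def true_index(base, displacement):
    """Index field of the full sum, as a complete address adder would produce it."""
    return ((base + displacement) >> LOW_BITS) & ((1 << INDEX_BITS) - 1)
```

With a base of 0x1000 and a displacement of 0x10, the low-order addition produces no carry and the speculative index equals the true index. With a base of 0x10F0 and a displacement of 0x20, the low-order addition carries out, so the index computed with a carry-in of zero is one less than the true index.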
In the case where the 7 bit LAS addition carry-out is simply assumed to be 0, errors are introduced the 10-20% of the time that this assumption is incorrect. In this scenario, the resulting address indices utilized for the Row Access Select (RAS) and the Column Access Select (CAS) are incorrect, and the cache must be re-accessed with the correct RAS and CAS address index. Conventionally, a single cycle stall and retry would suffice to access the cache with the correct address index. However, future microprocessor architectures are anticipated to have deeper pipelines and operating frequencies that scale much faster than circuit and wire delays. In these environments, it may take, for example, three or more processor cycles to stop the pipeline process and retry the data cache access, thus negating any time savings in the address generation routine from assuming that the carry-out from the 7 bit LAS addition is zero.
For the case where one read access path is created to access the cache with a RAS and CAS index without a carry-in and a second read access path is utilized to access the cache with a RAS and CAS index with a carry-in, i.e., an extra two-way late selection mechanism, a delay to the data cache array itself is introduced. More importantly, however, accommodating the multiple read paths increases the power dissipation and the chip area of the data cache design on the order of 50-100%. In systems operating at or above an operating frequency of, e.g., 5 GHz, power considerations are among the most important design limitations. A large power dissipation over a large area, such as the data cache, may ultimately force the operating frequency down due to the lowering of the supply voltage caused by the increased power dissipation.
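The two-way late selection alternative can be modeled in the same illustrative style. The sketch below is hypothetical: it assumes a 7 bit low-order adder and a 7 bit index field, represents the cache array as a Python sequence, and uses the invented name late_select_read. It shows only that both candidate rows are read on every access, which is the source of the extra power and area.

```python
LOW_BITS = 7     # assumed width of the low-order adder
INDEX_BITS = 7   # assumed width of the index field


def late_select_read(array, base, displacement):
    """Read both candidate rows, then pick one once the real carry is known."""
    mask = (1 << INDEX_BITS) - 1
    partial = ((base >> LOW_BITS) + (displacement >> LOW_BITS)) & mask
    row_no_carry = array[partial]                  # first read path
    row_with_carry = array[(partial + 1) & mask]   # second read path
    low_mask = (1 << LOW_BITS) - 1
    carry = ((base & low_mask) + (displacement & low_mask)) >> LOW_BITS
    reads_performed = 2                            # both paths are always exercised
    selected = row_with_carry if carry else row_no_carry
    return selected, reads_performed
```

The result is always correct without a retry, but every access pays for two array reads.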
Accordingly, what is needed in the art is an improved address generation methodology that mitigates the limitations discussed above. More particularly, what is needed in the art is a more effective carry prediction scheme.