1. Field of the Invention
The present invention generally relates to computer systems, and more particularly to a method and device for arbitrating cache operations in a multi-port cache of a processing unit, i.e., a cache memory that is capable of being accessed by multiple input addresses within a single processor cycle.
2. Description of Related Art
The basic structure of a conventional computer system includes one or more processing units connected to various input/output devices (such as a display monitor, keyboard, and permanent storage device), and a system memory device (such as random access memory or RAM) that is used by the processing units to carry out program instructions. The processing units communicate with the other devices by various means, including one or more generalized interconnects. A computer system may have many additional components such as serial and parallel ports for connection to, e.g., modems or printers, and other components that might be used in conjunction with the foregoing; for example, a display adapter might be used to control a video display monitor, a memory controller can be used to access the system memory, etc.
A typical processing unit includes various execution units and registers, as well as branch and dispatch units which forward instructions to the appropriate execution units. Caches are commonly provided for both instructions and data that are loaded into these logic units and registers, to temporarily store values that might be repeatedly accessed by a processor. The use of a cache thus speeds up processing by avoiding the longer step of loading the values from the system memory (RAM) or from some other distant component of the memory hierarchy. These caches are referred to as "on-board" when they are integrally packaged with the processor core on a single integrated chip. Each cache is associated with a cache controller or bus interface unit that manages the transfer of values between the processor core and the cache memory.
A processing unit can include additional caches, such as a level 2 (L2) cache which supports the on-board (level 1) caches. In other words, the L2 cache acts as an intermediary between system memory and the on-board caches, and can store a much larger amount of information (both instructions and data) than the on-board caches can, but at a longer access penalty. Multi-level cache hierarchies can be provided where there are many levels of interconnected caches, as well as caches that are grouped in clusters to support a subset of processors in a multi-processor computer.
A cache has many blocks which individually store the various instruction and data values. The blocks in any cache are divided into groups of blocks called "sets" or "congruence classes." A set is the collection of cache blocks that a given memory block can reside in. For any given memory block, there is a unique set in the cache that the block can be mapped into, according to preset mapping functions. The number of blocks in a set is referred to as the associativity of the cache, e.g., 2-way set associative means that for any given memory block there are two blocks in the cache that the memory block can be mapped into; however, several different blocks in main memory can be mapped to any given set.
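By way of illustration only (the specific parameters here, 64-byte blocks and 128 congruence classes, are hypothetical and not part of the claimed invention), the mapping of a memory block to its set in a 2-way set-associative cache can be sketched as:

```python
# Hypothetical parameters for a 2-way set-associative cache.
BLOCK_SIZE = 64    # bytes per cache block
NUM_SETS   = 128   # congruence classes
WAYS       = 2     # blocks per set (the associativity)

def map_address(addr):
    """Split a memory address into the fields used for cache lookup."""
    block_number = addr // BLOCK_SIZE        # which memory block
    set_index    = block_number % NUM_SETS   # the unique set it maps into
    tag          = block_number // NUM_SETS  # distinguishes blocks sharing a set
    return set_index, tag

# Two addresses whose block numbers differ by NUM_SETS map to the same set
# but carry different tags, so they compete for the WAYS blocks of that set:
s1, t1 = map_address(0x0000)
s2, t2 = map_address(NUM_SETS * BLOCK_SIZE)
assert s1 == s2 and t1 != t2
```

The sketch shows why several different blocks in main memory can be mapped to any given set: every address whose block number is congruent modulo the number of sets lands in the same congruence class.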
An exemplary cache line (block) includes an address tag field, a state bit field, an inclusivity bit field, and a value field for storing the actual instruction or data. The state bit field and inclusivity bit fields are used to maintain cache coherency in a multi-processor computer system (indicating the validity of the value stored in the cache). The address tag is a subset of the full address of the corresponding memory block. A compare match of an incoming address with one of the tags within the address tag field indicates a cache "hit." The collection of all of the address tags in a cache is referred to as a directory (and sometimes includes the state bit and inclusivity bit fields), and the collection of all of the value fields is the cache entry array.
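The directory lookup described above can be sketched as follows; this is a minimal illustration (the dictionary-based directory structure is an assumption for clarity, not the claimed hardware):

```python
# Minimal sketch of a directory lookup: each set holds up to WAYS entries,
# each entry keeping an address tag and a valid (state) bit.
def lookup(directory, set_index, tag):
    """Return the way index on a cache hit, or None on a miss."""
    for way, entry in enumerate(directory.get(set_index, [])):
        if entry is not None and entry["valid"] and entry["tag"] == tag:
            return way  # compare match of the incoming tag -> cache "hit"
    return None  # no matching valid tag -> cache miss

directory = {5: [{"tag": 0x3A, "valid": True}, None]}
assert lookup(directory, 5, 0x3A) == 0    # hit in way 0
assert lookup(directory, 5, 0x7F) is None # miss: tag not present
```

On a hit, the matching way selects the corresponding entry in the cache entry array, from which the value field is read out.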
For a high-speed processor device such as a superscalar, reduced instruction set computing (RISC) processor wherein more than one instruction can be executed during a single processor cycle, demands for simultaneous multiple accesses to the cache memory are increasing. The processor device may have to access more than one effective address and/or real address of the cache memory in a single processor cycle, in order to take full advantage of the RISC performance. Hence, a cache memory is often partitioned into multiple subarrays (interleaved) in order to achieve single-cycle, multi-port access. An interleaved cache memory has the potential of being accessed by more than one address and producing more than one output value in a single processor cycle.
Although various arrangements of subarrays allow simultaneous multiple accesses to the cache memory, each of these accesses must still be in a separate subarray of the cache memory, because only one cache line within a single subarray can be driven by the wordline driver circuit at a given time. Thus, if more than one access to the cache lines in a single subarray is attempted, arbitration logic of the cache memory must be used to select one of the accesses to proceed before the rest. Prior art caches, however, can require an excessive amount of time to arbitrate between the blocks, due to the manner in which a conflict (contention) is handled. Partial addresses are analyzed when sent to a cache, and compared to determine if they are accessing the same block. This approach requires added cycles after generation of the effective address.
Conventional conflict detection/resolution is depicted in FIG. 1. A load/store unit 1, associated with a processor, generates a store cache address which is forwarded to a queue 2. In the depicted embodiment, queue 2 holds up to eleven store operations which have been generated by load/store unit 1. At some point in time, one or more operations will be pending in queue 2, and load/store unit 1 will execute new load operations, generating another cache address. It is possible that the new operation from load/store unit 1 can be executed at the same time that the next store operation in queue 2 is executed, since the cache 3 is interleaved, having a first block 4 and a second block 5. Simultaneous execution of the operations can only occur, however, if the two operations are directed to values which are stored in the two different blocks (subarrays) 4 and 5. Conflict detection logic 6 (in the data unit control) evaluates the subarray selection bit(s) within the effective addresses to determine if a conflict exists. Conflict detection logic 6 is part of the cache control logic (hit/miss logic).
For example, in a 2-subarray PowerPC™ cache, bit 56 of the effective address field is evaluated. Conflict detection logic 6 is connected to the same buses that connect load/store unit 1 and queue 2 to each of the subarrays 4 and 5 of cache 3. If a conflict does exist, the data unit control stalls the load access (from load/store unit 1), by forcing a "retry" at a data unit load miss queue 7, and allows the other access (from queue 2) to proceed. The delayed operation is re-issued at the next available time slot, which is at least three cycles after the original access attempt. DU load miss queue 7 is similar to a typical miss queue in front of a cache, that keeps addresses (real and effective) of load accesses which missed the cache. When the miss is "resolved," the address is transmitted back through the cache, a hit occurs, and the data are forwarded to their respective destinations. "Resolving" the miss can mean the typical case, as in a miss in the cache, wherein data is reloaded from the L2 cache or beyond, allocated to the cache, and then the operation is retried from miss queue 7. Alternatively, in the case of a conflict, the effective address is placed in the miss queue, and a retry occurs when possible. Arbitration into a multiplexer (not shown) above each of the cache blocks 4 and 5 is completed in one cycle; the address is driven into the cache during the next cycle using a latch (also not shown), and data are returned the following cycle.
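The conflict detection just described amounts to comparing the subarray selection bit of the two effective addresses. A minimal sketch follows, assuming a 2-subarray cache selected by a single bit and PowerPC big-endian bit numbering, under which bit 56 of a 64-bit effective address is bit position 63 − 56 = 7 counting from the least significant bit; the example addresses are hypothetical:

```python
def subarray_select(ea):
    """Extract the subarray selection bit from a 64-bit effective address.
    PowerPC numbers bits from the most significant end, so bit 56 of a
    64-bit address is bit position 7 counting from the LSB (assumption
    for this sketch)."""
    return (ea >> 7) & 1

def conflict(load_ea, store_ea):
    """A conflict exists when both accesses target the same subarray,
    since only one wordline per subarray can be driven in a cycle."""
    return subarray_select(load_ea) == subarray_select(store_ea)

# Addresses differing in bit 56 land in different subarrays and may
# proceed simultaneously; addresses agreeing in bit 56 must arbitrate.
assert not conflict(0x1000, 0x1080)  # different subarrays: no stall
assert conflict(0x1000, 0x1040)      # same subarray: stall one access
```

When `conflict` is true, the data unit control would stall one access (forcing the "retry" path through the miss queue) and let the other proceed, as described above.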
Another drawback with conventional contention logic relates to simultaneous access (loads only) of a single cache address. Such simultaneous access is detected as a conflict, requiring arbitration, but this condition should not be considered contention since accesses of the same word or double word in the form of load operations can be satisfied by forwarding the value to multiple units at the same time.
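The distinction drawn above can be sketched as a check performed before the stall decision; the double-word granularity (ignoring the low 3 bits of an address for an 8-byte double word) and the bit-56 subarray selection are assumptions carried over for illustration:

```python
def same_doubleword(ea1, ea2):
    """Two addresses target the same double word when they agree in all
    but the byte-offset bits (low 3 bits for an 8-byte double word)."""
    return (ea1 >> 3) == (ea2 >> 3)

def must_arbitrate(ea1, ea2, both_loads):
    """Stall only on genuine contention: same subarray (selection bit at
    position 7 from the LSB, per the earlier sketch) and not a shared
    double word between two load operations."""
    if both_loads and same_doubleword(ea1, ea2):
        return False  # forward the same value to both units instead
    return ((ea1 >> 7) & 1) == ((ea2 >> 7) & 1)

assert must_arbitrate(0x2000, 0x2004, both_loads=True) is False  # shared DW
assert must_arbitrate(0x2000, 0x2040, both_loads=True) is True   # real conflict
```

Conventional logic, by contrast, treats the first case as a conflict and needlessly delays one of the loads.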
In light of the foregoing, it would be desirable to devise an improved method for resolving address contention in a multi-port cache, which decreases the delays associated with conflict detection. It would be further advantageous if the conflict resolution circuits could detect that the different cache addresses were accessing the same word, and were therefore non-contending, and forward the word to those ports so that simultaneous multiple-access capability can be greatly enhanced.
It is therefore one object of the present invention to provide an improved cache memory for a high-speed data processing system.
It is another object of the present invention to provide such an improved cache memory which allows multiple accesses in a single processor cycle.
It is yet another object of the present invention to provide an improved cache memory which more efficiently handles actual and apparent address conflicts.
The foregoing objects are achieved in a processing unit, generally comprising a first cache access circuit generating a first cache address, a second cache access circuit generating a second cache address, an interleaved cache connected to said first and second cache access circuits, and means for stalling the first cache access circuit in response to detection of a conflict between the first cache address and the second cache address. The stalling means includes means for detecting contention based on one or more subarray selection bits in each of said first and second cache addresses, and further preferably includes a common contention logic unit for both the first and second cache access circuits. The stalling means retains the first cache address within the first cache access circuit so that the first cache access circuit does not need to re-generate the first cache address.
The present invention further conveniently provides means for determining that the same word or double word is being accessed by both cache access units. This condition is not considered contention, since the word or double word being accessed can be forwarded to both units at the same time; both operations are therefore allowed to proceed, even though they are directed to the same subarray of the interleaved cache.
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.