This invention relates in general to computer systems, and in particular to an arrangement for a cache memory system.
Computer systems may employ a multi-level hierarchy of memory, with relatively fast, expensive but limited-capacity memory at the highest level of the hierarchy and proceeding to relatively slower, lower cost but higher-capacity memory at the lowest level of the hierarchy. The hierarchy may include a small fast memory called a cache, either physically integrated within a processor or mounted physically close to the processor for speed. The computer system may employ separate instruction caches and data caches. In addition, the computer system may use multiple levels of caches. The use of a cache is generally transparent to a computer program at the instruction level and can thus be added to a computer architecture without changing the instruction set or requiring modification to existing programs.
Computer processors typically include cache for storing data. When executing an instruction that requires access to memory (e.g., read from or write to memory), a processor typically accesses cache in an attempt to satisfy the instruction. Of course, it is desirable to have the cache implemented in a manner that allows the processor to access the cache in an efficient manner. That is, it is desirable to have the cache implemented in a manner such that the processor is capable of accessing the cache (i.e., reading from or writing to the cache) quickly so that the processor may be capable of executing instructions quickly. Caches have been configured in both on-chip and off-chip arrangements. On-processor-chip caches have less latency, since they are closer to the processor, but since on-chip area is expensive, such caches are typically smaller than off-chip caches. Off-processor-chip caches have longer latencies since they are remotely located from the processor, but such caches are typically larger than on-chip caches.
A prior art solution has been to have multiple caches, some small and some large. Typically, the smaller caches would be located on-chip, and the larger caches would be located off-chip. Typically, in multi-level cache designs, the first level of cache (i.e., L0) is first accessed to determine whether a true cache hit for a memory access request is achieved. If a true cache hit is not achieved for the first level of cache, then a determination is made for the second level of cache (i.e., L1), and so on, until the memory access request is satisfied by a level of cache. If the requested address is not found in any of the cache levels, the processor then sends a request to the system's main memory in an attempt to satisfy the request. In many processor designs, the time required to access an item for a true cache hit is one of the primary limiters for the clock rate of the processor if the designer is seeking a single-cycle cache access time. In other designs, the cache access time may be multiple cycles, but the performance of a processor can be improved in most cases when the cache access time in cycles is reduced. Therefore, optimization of access time for cache hits is critical for the performance of the computer system.
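The level-by-level search described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each cache level and main memory are modeled as dictionaries keyed by address; the function name `lookup` and the returned level labels are hypothetical and do not appear in the original description.

```python
from typing import Dict, List, Tuple


def lookup(address: int,
           levels: List[Dict[int, int]],
           main_memory: Dict[int, int]) -> Tuple[int, str]:
    """Probe L0, then L1, and so on; fall back to main memory on a miss everywhere."""
    for n, cache in enumerate(levels):
        if address in cache:          # hit at this level: request is satisfied
            return cache[address], f"L{n}"
    # No cache level holds the address, so main memory services the request.
    return main_memory[address], "memory"


levels = [{0x10: 1}, {0x20: 2}]                  # L0 and L1 contents (illustrative)
main_memory = {0x10: 1, 0x20: 2, 0x30: 3}
print(lookup(0x20, levels, main_memory))
```

Each additional level probed adds to the access time, which is why reducing the hit latency of the earliest levels matters most.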
Prior art cache designs for computer processors typically require "control data" or tags to be available before a cache data access begins. The tags indicate whether a desired address (i.e., an address required for a memory access request) is contained within the cache. Accordingly, prior art caches are typically implemented in a serial fashion, wherein upon the cache receiving a memory access request, a tag is obtained for the request, and thereafter if the tag indicates that the desired address is contained within the cache, the cache's data array is accessed to satisfy the memory access request. Thus, prior art cache designs typically generate tags indicating whether a true cache "hit" has been achieved for a level of cache, and only after a true cache hit has been achieved is the cache data actually accessed to satisfy the memory access request. A true cache "hit" occurs when a processor requests an item from a cache and the item is actually present in the cache. A cache "miss" occurs when a processor requests an item from a cache and the item is not present in the cache. The tag data indicating whether a "true" cache hit has been achieved for a level of cache typically comprises a tag match signal. The tag match signal indicates whether a match was made for a requested address in the tags of a cache level. However, such a tag match signal alone does not indicate whether a true cache hit has been achieved.
As an example, in a multi-processor system, a tag match may be achieved for a cache level, but the particular cache line for which the match was achieved may be invalid. For instance, the particular cache line may be invalid because another processor has snooped out that particular cache line. As used herein, a "snoop" is an inquiry from a first processor to a second processor as to whether a particular cache address is found within the second processor. Accordingly, in multi-processor systems a MESI signal is also typically utilized to indicate whether a line in cache is "Modified, Exclusive, Shared, or Invalid." Therefore, the control data that indicates whether a true cache hit has been achieved for a level of cache typically comprises a MESI signal, as well as the tag match signal. Only if a tag match is found for a level of cache and the MESI protocol indicates that such tag match is valid, does the control data indicate that a true cache hit has been achieved. In view of the above, in prior art cache designs, a determination is first made as to whether a tag match is found for a level of cache, and then a determination is made as to whether the MESI protocol indicates that a tag match is valid. Thereafter, if a determination has been made that a true tag hit has been achieved, access begins to the actual cache data requested.
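The combination of the tag match signal with the MESI state can be sketched as follows. The names `Mesi` and `is_true_hit` are hypothetical, and the sketch assumes that any state other than Invalid makes a tag match valid.

```python
from enum import Enum


class Mesi(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def is_true_hit(stored_tag: int, requested_tag: int, state: Mesi) -> bool:
    """A true hit needs both a tag match and a line that is not Invalid."""
    tag_match = stored_tag == requested_tag
    return tag_match and state is not Mesi.INVALID


# A matching tag on a line that another processor has snooped out is not a true hit.
print(is_true_hit(0x2A, 0x2A, Mesi.SHARED))
print(is_true_hit(0x2A, 0x2A, Mesi.INVALID))
```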
Turning to FIG. 7, an example of a typical cache design of the prior art is shown. Typically, when an instruction requires access to a particular address, a virtual address is provided from the processor to the cache system. As is well-known in the art, such virtual address typically contains an index field and a virtual page number field. The virtual address is input into a translation look-aside buffer ("TLB") 710. TLB 710 is a common component of modern cache architectures that is well known in the art. TLB 710 provides a translation from the received virtual address to a physical address. Within a computer system, the virtual address space is typically much larger than the physical address space. The physical address space is the actual, physical memory address of a computer system, which includes cache, main memory, a hard drive, and anything else that the computer can access to retrieve data. Thus, for a computer system to be capable of accessing all of the physical address space, a complete physical mapping from virtual addresses to physical addresses is typically provided.
Once the received virtual address is translated into a physical address by the TLB 710, the index field of such physical address is input into the cache level's tag(s) 712, which may be duplicated N times for N "ways" of associativity. As used herein, the term "way" refers to a partition of the cache. For example, the cache of a system may be partitioned into any number of ways. Caches are commonly partitioned into four ways. The physical address index is also input to the cache level's data array(s) 716, which may also be duplicated N times for N ways of associativity.
From the cache level's tag(s) 712, a way tag match signal is generated for each way. The way tag match signal indicates whether a match for the physical address was made within the cache level's tag(s) 712. As discussed above, in multi-processor systems, a MESI protocol is typically utilized to indicate whether a line in cache is modified, exclusive, shared, or invalid. Accordingly, in such multi-processor systems the MESI protocol is combined with the way tag match signal to indicate whether a "true" tag hit has been achieved for a level of cache. Thus, in multi-processor systems a true tag hit is achieved when both a tag match is found for tag(s) 712 and the MESI protocol indicates that such tag match is a valid match. Accordingly, in FIG. 7, MESI circuitry 714 is utilized to calculate a "true" tag hit signal to determine whether a true tag hit has been achieved for that level of cache. Once it is determined from the MESI 714 that a "true" tag hit has been achieved for that level of cache, then that cache level's data array(s) 716, which may also be duplicated N times for N ways of associativity, are accessed to satisfy the received memory access request. More specifically, the true tag hit signal may be used to control a multiplexer ("MUX") 718 to select the appropriate data array way to output data to satisfy the received memory access request. The selected data from data array(s) 716 is output to the chip's core 720, which is the particular execution unit (e.g., an integer execution unit or floating point execution unit) that issued the memory access request to the cache.
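The serial flow of FIG. 7 (translate, compare tags per way, qualify with MESI, then select a data way) can be approximated in software as below. This is only a sketch under stated assumptions: the page size, line size, number of sets, and all names (`Line`, `translate`, `serial_lookup`) are hypothetical, the TLB is modeled as a dictionary from virtual page number to physical page number, and a single `valid` flag stands in for "MESI state is not Invalid".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

PAGE_BITS = 12    # assumed 4 KiB pages
LINE_BITS = 6     # assumed 64-byte cache lines
INDEX_BITS = 6    # assumed 64 sets per way


@dataclass
class Line:
    tag: int
    valid: bool   # stands in for the MESI check (True means not Invalid)
    data: str


def translate(tlb: Dict[int, int], vaddr: int) -> int:
    """Model of TLB 710: swap the virtual page number for a physical one."""
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    return (tlb[vpn] << PAGE_BITS) | offset


def serial_lookup(tlb: Dict[int, int],
                  ways: List[Dict[int, Line]],
                  vaddr: int) -> Optional[str]:
    paddr = translate(tlb, vaddr)
    index = (paddr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)
    tag = paddr >> (LINE_BITS + INDEX_BITS)
    # Step 1: tags 712 and MESI 714 yield one true-hit signal per way.
    true_hit = []
    for way in ways:
        line = way.get(index)
        true_hit.append(line is not None and line.tag == tag and line.valid)
    # Step 2: only now is the data array read; the hit signals act as the
    # select input of MUX 718.
    for w, hit in enumerate(true_hit):
        if hit:
            return ways[w][index].data
    return None   # miss at this cache level


tlb = {0x5: 0x2}
ways = [{}, {3: Line(tag=2, valid=True, data="payload")}, {}, {}]   # four ways
print(serial_lookup(tlb, ways, 0x50C4))
```

Because the data read in step 2 cannot begin until step 1 completes, the two delays add, which is the serialization the passage describes.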
In view of the above, prior art caches are typically implemented in a serial fashion, with each subsequent cache being connected to a predecessor cache by a single port. Thus, prior art caches have been only able to handle limited numbers of requests at one time. Therefore, the prior art caches have not been able to provide high enough bandwidth back to the CPU core, which means that the designs of the prior art increase latency in retrieving data from cache, which slows the execution unit within the core of a chip. That is, while an execution unit is awaiting data from cache, it is stalled, which results in a net lower performance for a system's processor.
These and other objects, features and technical advantages are achieved by a system and method which uses an L1 cache that has multiple ports. The inventive cache uses separate queuing structures for data and instructions, thus allowing out-of-order processing. The inventive cache uses ordering mechanisms that guarantee program order when there are address conflicts and architectural ordering requirements. The queuing structures are snoopable by other processors of a multiprocessor system. This is required because the tags are before the queues in the pipeline. Note that this means the queue contains tag state including hit/miss information. When a snoop is performed on the tags, if it is not also performed on the queue, the queue would believe it has a hit for a line no longer present in the cache. Thus, the queue must be snoopable by other processors in the system.
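The need for a snoopable queue can be illustrated with a small sketch. It assumes, for illustration only, that the tags are a dictionary from line address to a valid flag and that each queue entry records the hit result obtained when it passed the tags; `QueuedAccess` and `snoop_invalidate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QueuedAccess:
    line_address: int
    hit: bool   # tag result captured before the access entered the queue


def snoop_invalidate(line_address: int,
                     tags: Dict[int, bool],
                     queue: List[QueuedAccess]) -> None:
    """Apply a snoop to the tags and to every queued access for the same line."""
    if line_address in tags:
        tags[line_address] = False        # line is no longer valid in this cache
    for entry in queue:
        if entry.line_address == line_address:
            entry.hit = False             # the stale hit must not be trusted


tags = {0x40: True, 0x80: True}
queue = [QueuedAccess(0x40, True), QueuedAccess(0x80, True)]
snoop_invalidate(0x40, tags, queue)
print(queue)
```

If only the tags were updated, the first queued access would still carry `hit=True` for a line the cache no longer holds.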
The inventive cache has a tag access bypass around the queuing structures, to allow for speculative checking by other levels of cache and for lower latency if the queues are empty. The inventive cache allows for at least four accesses to be processed simultaneously. The results of the access can be sent to multiple consumers. The multiported nature of the inventive cache allows for a very high bandwidth to be processed through this cache with a low latency.
The inventive cache uses a queuing structure which provides out-of-order cache memory access support for multiple accesses, as well as support for managing bank conflicts and address conflicts. The inventive cache manages architectural ordering support. In the prior art, it has been difficult to provide multiple concurrent access support. The inventive cache can support four data accesses that are hits per clock, support one access that misses the L1 cache every clock, and support one instruction access every clock. The responses, for example, fills and write-backs, are interspersed in the pipeline, so that conflicts in the queue are minimized. Non-conflicting accesses are not inhibited; however, conflicting accesses are held up until the conflict clears. Thus, the inventive cache has better access conflict management in the issuing from the queuing structure. An essential component of this cache is the out-of-order support. The inventive cache provides significant out-of-order support after the retirement stage of a pipeline, which is different from other out-of-order pipeline implementations. This implementation can operate on cache accesses known to be needed by the CPU core. Other out-of-order implementations may have to stop servicing an access if an older access faults.
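The issue behavior described above (up to four accesses per clock, with conflicting accesses held while non-conflicting ones proceed) can be sketched as a simple selection loop. The bank count, the 8-byte word size, the specific conflict rules, and the names `Access` and `select_issue` are all assumptions made for illustration; the actual conflict logic is not specified in this passage.

```python
from dataclasses import dataclass
from typing import List

MAX_ISSUE = 4     # four data accesses per clock, per the text
NUM_BANKS = 16    # assumed number of data-array banks
WORD_BYTES = 8    # assumed bank width


@dataclass
class Access:
    ident: int
    address: int


def select_issue(queue: List[Access]) -> List[Access]:
    """Pick the accesses to issue this clock from a queue ordered oldest first."""
    issued: List[Access] = []
    busy_banks = set()     # banks claimed by accesses already picked this clock
    older_words = set()    # words touched by any older access still in the queue
    for access in queue:
        word = access.address // WORD_BYTES
        bank = word % NUM_BANKS
        address_conflict = word in older_words   # must stay behind the older access
        bank_conflict = bank in busy_banks       # bank already in use this clock
        older_words.add(word)
        if len(issued) < MAX_ISSUE and not address_conflict and not bank_conflict:
            issued.append(access)
            busy_banks.add(bank)
    return issued


queue = [Access(0, 0x00), Access(1, 0x80), Access(2, 0x08), Access(3, 0x80),
         Access(4, 0x10), Access(5, 0x18), Access(6, 0x20)]
print([a.ident for a in select_issue(queue)])
```

In this example, access 1 is held by a bank conflict with access 0, access 3 is held by an address conflict with the older access 1, and access 6 is held only because four accesses have already been selected; younger non-conflicting accesses are issued around the held ones.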
It is a technical advantage of the invention to be able to issue four accesses per clock and retire four accesses per clock on the data queue, and be able to issue one instruction access per two clocks and retire one instruction access per clock.
It is another technical advantage of the invention to embed bank conflict and address conflict mechanisms in the queue in order to be able to more efficiently issue four accesses per clock.
It is a further technical advantage of the invention to embed architectural ordering support in the queue so that accesses that are not currently able to be issued due to ordering constraints can be skipped and accesses that can be done based on their ordering constraints are issued.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.