This application relates generally to cache memory subsystems, and specifically to on-chip caches with queuing structures and out-of-order caches.
Computer systems may employ a multi-level hierarchy of memory, with relatively fast, expensive but limited-capacity memory at the highest level of the hierarchy and proceeding to relatively slower, lower cost but higher-capacity memory at the lowest level of the hierarchy. The hierarchy may include a small fast memory called a cache, either physically integrated within a processor or mounted physically close to the processor for speed. The computer system may employ separate instruction caches and data caches. In addition, the computer system may use multiple levels of caches. The use of a cache is generally transparent to a computer program at the instruction level and can thus be added to a computer architecture without changing the instruction set or requiring modification to existing programs.
Computer processors typically include a cache for storing data. When executing an instruction that requires access to memory (e.g., read from or write to memory), a processor typically accesses the cache in an attempt to satisfy the instruction. It is desirable to have the cache implemented in a manner that allows the processor to access the cache efficiently. That is, it is desirable to have the cache implemented in a manner such that the processor is capable of accessing the cache (i.e., reading from or writing to the cache) quickly so that the processor may be capable of executing instructions quickly. Caches have been configured in both on-chip and off-chip arrangements. On-processor-chip caches have less latency, since they are closer to the processor, but since on-chip area is expensive, such caches are typically smaller than off-chip caches. Off-processor-chip caches have longer latencies since they are remotely located from the processor, but such caches are typically larger than on-chip caches.
A prior art solution has been to have multiple caches, some small and some large. Typically, the smaller caches would be located on-chip, and the larger caches would be located off-chip. Typically, in multi-level cache designs, the first level of cache (i.e., L0) is first accessed to determine whether a true cache hit for a memory access request is achieved. If a true cache hit is not achieved for the first level of cache, then a determination is made for the second level of cache (i.e., L1), and so on, until the memory access request is satisfied by a level of cache. If the requested address is not found in any of the cache levels, the processor then sends a request to the system's main memory in an attempt to satisfy the request. In many processor designs, the time required to access an item for a true cache hit is one of the primary limiters for the clock rate of the processor if the designer is seeking a single-cycle cache access time. In other designs, the cache access time may be multiple cycles, but the performance of a processor can be improved in most cases when the cache access time in cycles is reduced. Therefore, optimization of access time for cache hits is critical for the performance of the computer system.
Prior art cache designs for computer processors typically require "control data" or tags to be available before a cache data access begins. The tags indicate whether a desired address (i.e., an address required for a memory access request) is contained within the cache. Accordingly, prior art caches are typically implemented in a serial fashion, wherein upon the cache receiving a memory access request, a tag is obtained for the request, and thereafter if the tag indicates that the desired address is contained within the cache, the cache's data array is accessed to satisfy the memory access request. Thus, prior art cache designs typically generate tags indicating whether a true cache "hit" has been achieved for a level of cache, and only after a true cache hit has been achieved is the cache data actually accessed to satisfy the memory access request. A true cache "hit" occurs when a processor requests an item from a cache and the item is actually present in the cache. A cache "miss" occurs when a processor requests an item from a cache and the item is not present in the cache. The tag data indicating whether a "true" cache hit has been achieved for a level of cache typically comprises a tag match signal. The tag match signal indicates whether a match was made for a requested address in the tags of a cache level. However, such a tag match signal alone does not indicate whether a true cache hit has been achieved.
As an example, in a multi-processor system, a tag match may be achieved for a cache level, but the particular cache line for which the match was achieved may be invalid. For instance, the particular cache line may be invalid because another processor has snooped out that particular cache line. As used herein, a "snoop" is an inquiry from a first processor to a second processor as to whether a particular cache address is found within the second processor. Accordingly, in multi-processor systems a MESI signal is also typically utilized to indicate whether a line in cache is "Modified, Exclusive, Shared, or Invalid." Therefore, the control data that indicates whether a true cache hit has been achieved for a level of cache typically comprises a MESI signal, as well as the tag match signal. Only if a tag match is found for a level of cache and the MESI protocol indicates that such tag match is valid, does the control data indicate that a true cache hit has been achieved. In view of the above, in prior art cache designs, a determination is first made as to whether a tag match is found for a level of cache, and then a determination is made as to whether the MESI protocol indicates that a tag match is valid. Thereafter, if a determination has been made that a true tag hit has been achieved, access begins to the actual cache data requested.
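The two-part check described above (tag match, then MESI validity) can be illustrated with a minimal Python sketch. The names `Mesi` and `is_true_hit` are hypothetical and chosen only for illustration, and treating every state other than Invalid as valid is an assumption of this sketch rather than something the text spells out.

```python
from enum import Enum


class Mesi(Enum):
    """The four MESI line states named in the text."""
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def is_true_hit(tag_match: bool, state: Mesi) -> bool:
    """Return True only when the tag matched AND the line is valid.

    Assumption: any state other than INVALID counts as valid.
    """
    return tag_match and state is not Mesi.INVALID
```

Under these assumptions, a tag match on a line that another processor has snooped out (state Invalid) is reported as a miss, not as a true hit.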
An example of a prior art, multi-level cache design is shown in FIG. 4. The exemplary cache design of FIG. 4 has a three-level cache hierarchy, with the first level referred to as L0, the second level referred to as L1, and the third level referred to as L2. Accordingly, as used herein L0 refers to the first-level cache, L1 refers to the second-level cache, L2 refers to the third-level cache, and so on. It should be understood that prior art implementations of multi-level cache design may include more than three levels of cache, and prior art implementations having any number of cache levels are typically implemented in a serial manner as illustrated in FIG. 4. As discussed more fully hereafter, multi-level caches of the prior art are generally designed such that a processor accesses each level of cache in series until the desired address is found. For example, when an instruction requires access to an address, the processor typically accesses the first-level cache L0 to try to satisfy the address request (i.e., to try to locate the desired address). If the address is not found in L0, the processor then accesses the second-level cache L1 to try to satisfy the address request. If the address is not found in L1, the processor proceeds to access each successive level of cache in a serial manner until the requested address is found, and if the requested address is not found in any of the cache levels, the processor then sends a request to the system's main memory to try to satisfy the request.
Typically, when an instruction requires access to a particular address, a virtual address is provided from the processor to the cache system. As is well-known in the art, such virtual address typically contains an index field and a virtual page number field. The virtual address is input into a translation look-aside buffer ("TLB") 510 for the L0 cache. The TLB 510 provides a translation from a virtual address to a physical address. The virtual address index field is input into the L0 tag memory array(s) 512. As shown in FIG. 4, the L0 tag memory array 512 may be duplicated N times within the L0 cache for N "ways" of associativity. As used herein, the term "way" refers to a partition of the lower-level cache. For example, the lower-level cache of a system may be partitioned into any number of ways. Lower-level caches are commonly partitioned into four ways. As shown in FIG. 4, the virtual address index is also input into the L0 data array structure(s) (or "memory structure(s)") 514, which may also be duplicated N times for N ways of associativity. The L0 data array structure(s) 514 comprise the data stored within the L0 cache, which may be partitioned into several ways.
The L0 tag 512 outputs a physical address for each of the ways of associativity. That physical address is compared with the physical address output by the L0 TLB 510. These addresses are compared in compare circuit(s) 516, which may also be duplicated N times for N ways of associativity. The compare circuit(s) 516 generate a "hit" signal that indicates whether a match is made between the physical addresses. As used herein, a "hit" means that the data associated with the address being requested by an instruction is contained within a particular cache. As an example, suppose an instruction requests an address for a particular data labeled "A." The data label "A" would be contained within the tag (e.g., the L0 tag 512) for the particular cache (e.g., the L0 cache), if any, that contains that particular data. That is, the tag for a cache level, such as the L0 tag 512, represents the data that is residing in the data array for that cache level. Therefore, the compare circuitry, such as compare circuitry 516, basically determines whether the incoming request for data "A" matches the tag information contained within a particular cache level's tag (e.g., the L0 tag 512). If a match is made, indicating that the particular cache level contains the data labeled "A," then a hit is achieved for that particular cache level.
Typically, the compare circuit(s) 516 generate a single signal for each of the ways, resulting in N signals for N ways of associativity, wherein such signal indicates whether a hit was achieved for each way. The hit signals (i.e., "L0 way hits") are used to select the data from the L0 data array(s) 514, typically through multiplexer ("MUX") 518. As a result, MUX 518 provides the cache data from the L0 cache if a way hit is found in the L0 tags. If the signals generated from the compare circuitry 516 are all zeros, meaning that there was no hit generated in the L0 cache, then "miss" logic 520 is used to generate an L0 cache miss signal. Such L0 cache miss signal then causes the memory instruction requesting access to a particular address to be sent to the L1 instruction queue 522, which queues (or holds) memory instructions that are waiting to access the L1 cache. Accordingly, if it is determined that the desired address is not contained within the L0 cache, a request for the desired address is then made in a serial fashion to the L1 cache.
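The L0 lookup path just described (index into the tag and data arrays, TLB translation, per-way compare, MUX selection, miss logic) can be sketched in Python as follows. This is a simplified software model, not the circuit of FIG. 4: the field width `INDEX_BITS`, the function names, the dictionary-based TLB, and the omission of line-offset bits and TLB misses are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple

INDEX_BITS = 6  # assumed width of the index field (64 sets per way)


def split_virtual_address(vaddr: int) -> Tuple[int, int]:
    """Split a virtual address into (virtual page number, index)."""
    index = vaddr & ((1 << INDEX_BITS) - 1)
    vpn = vaddr >> INDEX_BITS
    return vpn, index


def l0_lookup(
    vaddr: int,
    tlb: Dict[int, int],
    tag_ways: List[List[Optional[int]]],
    data_ways: List[List[Any]],
) -> Tuple[List[bool], Any]:
    """Return (way hit signals, selected data or None on a miss)."""
    vpn, index = split_virtual_address(vaddr)
    physical = tlb[vpn]  # TLB: virtual -> physical (assumes a TLB hit)
    # One compare per way: tag array output vs. TLB output.
    way_hits = [tags[index] == physical for tags in tag_ways]
    if not any(way_hits):
        return way_hits, None  # all zeros: the miss logic would fire
    hit_way = way_hits.index(True)  # the MUX selects the hitting way
    return way_hits, data_ways[hit_way][index]
```

In this model, a result of `None` corresponds to the L0 miss signal that forwards the request to the L1 instruction queue.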
In turn, the L1 instruction queue 522 feeds the physical address index field for the desired address into the L1 tag(s) 524, which may be duplicated N times for N ways of associativity. The physical address index is also input to the L1 data array(s) 526, which may also be duplicated N times for N ways of associativity. The L1 tag(s) 524 output a physical address for each of the ways of associativity to the L1 compare circuit(s) 528. The L1 compare circuit(s) 528 compare the physical address output by L1 tag(s) 524 with the physical address output by the L1 instruction queue 522. The L1 compare circuit(s) 528 generate L1 hit signal(s) for each of the ways of associativity indicating whether a match between the physical addresses was made for any of the ways of L1. Such L1 hit signals are used to select the data from the L1 data array(s) 526 utilizing MUX 530. That is, based on the L1 hit signals input to MUX 530, MUX 530 outputs the appropriate L1 cache data from L1 data array(s) 526 if a hit was found in the L1 tag(s) 524. If the L1 way hits generated from the L1 compare circuitry 528 are all zeros, indicating that there was no hit generated in the L1 cache, then a miss signal is generated from the "miss" logic 532. Such an L1 cache miss signal generates a request for the desired address to the L2 cache structure 534, which is typically implemented in a similar fashion as discussed above for the L1 cache. Accordingly, if it is determined that the desired address is not contained within the L1 cache, a request for the desired address is then made in a serial fashion to the L2 cache. In the prior art, additional levels of hierarchy may be added after the L2 cache, as desired, in a similar manner as discussed above for levels L0 through L2 (i.e., in a manner such that the processor accesses each level of the cache in series, until an address is found in one of the levels of cache).
Finally, if a hit is not achieved in the last level of cache (e.g., L2 of FIG. 4), then the memory request is sent to the processor system bus to access the main memory of the system.
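The serial, level-by-level access pattern described above can be summarized by a short Python sketch. The function `serial_lookup` and the representation of each cache level as a dictionary are hypothetical simplifications for illustration; timing, queuing, and line fills are not modeled.

```python
from typing import Any, Dict, List, Tuple


def serial_lookup(
    address: int,
    cache_levels: List[Tuple[str, Dict[int, Any]]],
    main_memory: Dict[int, Any],
) -> Tuple[str, Any]:
    """Probe L0, L1, L2, ... in order; fall back to main memory.

    Each level is consulted only after the previous level has missed,
    which is the serial behavior attributed to the prior art.
    """
    for name, contents in cache_levels:
        if address in contents:
            return name, contents[address]
    return "main memory", main_memory[address]
```

For example, with levels `[("L0", {...}), ("L1", {...}), ("L2", {...})]`, a request that misses in L0 but hits in L1 returns `("L1", data)` only after the L0 probe has completed.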
In view of the above, prior art caches are typically implemented in a serial fashion, with each subsequent cache being connected to a predecessor cache by a single port. Thus, prior art caches have been only able to handle limited numbers of requests at one time. Therefore, the prior art caches have not been able to provide high enough bandwidth back to the Central Processing Unit (CPU) core, which means that the designs of the prior art increase latency in retrieving data from cache, which slows the execution unit within the core of a chip. That is, while an execution unit is awaiting data from cache, it is stalled, which results in a net lower performance for a system's processor.
These and other objects, features and technical advantages are achieved by a system and method which uses an L1 cache that has multiple ports. The inventive cache uses separate queuing structures for data and instructions, thus allowing out-of-order processing. The inventive cache uses ordering mechanisms that guarantee program order when there are address conflicts and architectural ordering requirements. The queuing structures are snoopable by other processors of a multiprocessor system. This is required because the tags are before the queues in the pipeline. Note that this means the queue contains tag state including hit/miss information. When a snoop is performed on the tags, if it is not also performed on the queue, the queue would believe it has a hit for a line no longer present in the cache. Thus, the queue must be snoopable by other processors in the system.
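The reason the queue must be snoopable can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each queue entry stores the line address and the hit/miss result captured when the access passed the tags, and that a snoop clears the hit flag of matching entries; the names `QueueEntry` and `snoop_invalidate` are hypothetical, and the text does not specify how a queue entry is actually updated.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class QueueEntry:
    line_address: int
    hit: bool  # tag state captured before the access entered the queue


def snoop_invalidate(
    line_address: int, valid_lines: Set[int], queue: List[QueueEntry]
) -> None:
    """Apply a snoop to both the tags and the queued accesses."""
    # Snoop the tags: the line is no longer present in the cache.
    valid_lines.discard(line_address)
    # Snoop the queue: otherwise a queued access would still believe
    # it has a hit for the line that was just removed.
    for entry in queue:
        if entry.line_address == line_address:
            entry.hit = False  # assumed handling: downgrade to a miss
```

If the second step were skipped, a queued entry would carry stale hit information for a line that is no longer in the cache, which is the hazard described above.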
The inventive cache has a tag access bypass around the queuing structures, to allow for speculative checking by other levels of cache and for lower latency if the queues are empty. The inventive cache allows for at least four accesses to be processed simultaneously. The results of the access can be sent to multiple consumers. The multiported nature of the inventive cache allows for a very high bandwidth to be processed through this cache with a low latency.
The inventive cache uses an issuing mechanism to determine which entries in the queue should issue first and which are ready to issue. The inventive cache uses circuitry that "finds the first one" to determine which access will issue from the queue. Since the cache has multiple ports, more than one access can issue, e.g., having four ports allows for four accesses to issue in the same cycle. Thus, multiple "find first one" circuits operate in parallel to determine the issuing accesses. Note that the multiple circuits may be viewed as a single "find first four" circuit. These circuits also determine resource conflicts among issuing accesses. The inventive cache can also issue accesses that require more than one cycle to complete. The "find first one" circuits also generate a signal, attached to each of those accesses, which indicates whether the access has all the resources it needs to complete in the issuing clock cycle or whether additional clock cycles will be needed. This signal is referred to as the oversubscribed signal. For example, suppose there are four issuing accesses, of which two are oversubscribed and two are not. The two that are not oversubscribed are issued normally, and the two oversubscribed accesses are saved until the resource conflicts clear, and then they are sent to their respective consumers. Further issues that require the same resources are held up until the oversubscribed accesses have been issued, i.e., have used the resources that they require. However, other accesses that do not use the same resources, e.g., stores, are allowed to issue on the next clock.
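The issue selection described above can be modeled by a short Python sketch. It is a software approximation under stated assumptions, not the disclosed circuit: each queue entry is assumed to need exactly one named resource (the labels such as `"b1"` are hypothetical), the "find first four" behavior is modeled as a sequential scan in queue order, and an access is marked oversubscribed when an earlier access issuing in the same cycle has already claimed its resource.

```python
from typing import List, Tuple


def find_first_n(ready_bits: List[bool], n: int = 4) -> List[int]:
    """Indices of the first n ready entries, in queue order."""
    picked = []
    for i, ready in enumerate(ready_bits):
        if ready:
            picked.append(i)
            if len(picked) == n:
                break
    return picked


def issue(
    ready_bits: List[bool], resources: List[str], ports: int = 4
) -> List[Tuple[int, bool]]:
    """Return (entry index, oversubscribed) for each issuing access."""
    claimed = set()
    issued = []
    for i in find_first_n(ready_bits, ports):
        # Oversubscribed: the needed resource is already taken this
        # cycle, so the access needs additional clock cycles.
        oversubscribed = resources[i] in claimed
        claimed.add(resources[i])
        issued.append((i, oversubscribed))
    return issued
```

With ready bits `[False, True, True, True, True, True]` and resources `["b0", "b1", "b1", "b2", "b2", "b3"]`, entries 1 through 4 issue; entries 2 and 4 are flagged oversubscribed because entries 1 and 3 claimed the same resources first, matching the two-and-two example in the text.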
It is a technical advantage of the invention to be able to have at least four accesses at a time going out to the data arrays.
It is another technical advantage to be able to issue resource-conflicted accesses in parallel and still be able to perform them in the next clock.
It is a further technical advantage of the invention to be able to issue more accesses than can be completed with the available resources in parallel. This provides more efficient accesses into memory and, given that multiple resource-conflict areas can exist, it allows the issuing of accesses in the next clock that do not have resource conflicts with the accesses that are now delayed.
It is a still further technical advantage of the invention to provide the capability to pack more accesses into a fixed amount of time.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.