Prior art cache designs for processors typically implement one or two level caches. More recently, multi-level caches having three or more levels have been designed in the prior art. Of course, it is desirable to have the cache implemented in a manner that allows the processor to access the cache in an efficient manner. That is, it is desirable to have the cache implemented in a manner such that the processor is capable of accessing the cache (i.e., reading from or writing to the cache) quickly so that the processor may be capable of executing instructions quickly and so that dependent instructions can receive data from cache as soon as possible.
An example of a prior art, multi-level cache design is shown in FIG. 1. The exemplary cache design of FIG. 1 has a three-level cache hierarchy, with the first level referred to as L0, the second level referred to as L1, and the third level referred to as L2. Accordingly, as used herein L0 refers to the first-level cache, L1 refers to the second-level cache, L2 refers to the third-level cache, and so on. It should be understood that prior art implementations of multi-level cache design may include more than three levels of cache, and prior art implementations having any number of cache levels are typically implemented in a serial manner as illustrated in FIG. 1. As discussed more fully hereafter, multi-level caches of the prior art are generally designed such that a processor accesses each level of cache in series until the desired address is found. For example, when an instruction requires access to an address, the processor typically accesses the first-level cache L0 to try to satisfy the address request (i.e., to try to locate the desired address). If the address is not found in L0, the processor then accesses the second-level cache L1 to try to satisfy the address request. If the address is not found in L1, the processor proceeds to access each successive level of cache in a serial manner until the requested address is found, and if the requested address is not found in any of the cache levels, the processor then sends a request to the system's main memory to try to satisfy the request.
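The serial access pattern described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each cache level and the main memory can be modeled as a dictionary mapping an address to its data; the function name `serial_lookup` and the sample addresses are hypothetical and do not appear in FIG. 1.

```python
# Hypothetical model: each cache level is a dict mapping address -> data.
def serial_lookup(levels, main_memory, address):
    """Probe each cache level in order; fall back to main memory on a full miss.

    Returns (data, source, levels_probed).
    """
    for depth, level in enumerate(levels):
        if address in level:
            # Found only after every lower-numbered level has already missed.
            return level[address], f"L{depth}", depth + 1
    return main_memory[address], "memory", len(levels)


# Illustrative contents only.
l0 = {0x10: "a"}
l1 = {0x20: "b"}
l2 = {0x30: "c"}
memory = {0x10: "a", 0x20: "b", 0x30: "c", 0x40: "d"}

hit_l2 = serial_lookup([l0, l1, l2], memory, 0x30)    # found in L2 after 3 probes
miss_all = serial_lookup([l0, l1, l2], memory, 0x40)  # all 3 levels miss
```

In this sketch a request that resides in L2 is returned only after L0 and L1 have each been probed and have each missed.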
Typically, when an instruction requires access to a particular address, a virtual address is provided from the processor to the cache system. As is well-known in the art, such virtual address typically contains an index field and a virtual page number field. The virtual address is input into a translation look-aside buffer ("TLB") 10 for the L0 cache. The TLB 10 provides a translation from a virtual address to a physical address. The virtual address index field is input into the L0 tag memory array(s) 12. As shown in FIG. 1, the L0 tag memory array 12 may be duplicated N times within the L0 cache for N "ways" of associativity. As used herein, the term "way" refers to a partition of the lower-level cache. For example, the lower-level cache of a system may be partitioned into any number of ways. Lower-level caches are commonly partitioned into four ways. As shown in FIG. 1, the virtual address index is also input into the L0 data array structure(s) (or "memory structure(s)") 14, which may also be duplicated N times for N ways of associativity. The L0 data array structure(s) 14 comprise the data stored within the L0 cache, which may be partitioned into several ways.
The L0 tag 12 outputs a physical address for each of the ways of associativity. That physical address is compared with the physical address output by the L0 TLB 10. These addresses are compared in compare circuit(s) 16, which may also be duplicated N times for N ways of associativity. The compare circuit(s) 16 generate a "hit" signal that indicates whether a match is made between the physical addresses. As used herein, a "hit" means that the data associated with the address being requested by an instruction is contained within a particular cache. As an example, suppose an instruction requests an address for a particular data labeled "A." The data label "A" would be contained within the tag (e.g., the L0 tag 12) for the particular cache (e.g., the L0 cache), if any, that contains that particular data. That is, the tag for a cache level, such as the L0 tag 12, represents the data that is residing in the data array for that cache level. Therefore, the compare circuitry, such as compare circuitry 16, basically determines whether the incoming request for data "A" matches the tag information contained within a particular cache level's tag (e.g., the L0 tag 12). If a match is made, indicating that the particular cache level contains the data labeled "A," then a hit is achieved for that particular cache level.
Typically, the compare circuit(s) 16 generate a single signal for each of the ways, resulting in N signals for N ways of associativity, wherein each such signal indicates whether a hit was achieved for the corresponding way. The hit signals (i.e., "L0 way hits") are used to select the data from the L0 data array(s) 14, typically through multiplexer ("MUX") 18. As a result, MUX 18 provides the cache data from the L0 cache if a way hit is found in the L0 tags. If the signals generated from the compare circuitry 16 are all zeros, meaning that there are no hits, then "miss" logic 20 is used to generate an L0 cache miss signal. Such L0 cache miss signal then triggers control to send the memory instruction to the L1 instruction queue 22, which queues (or holds) memory instructions that are waiting to access the L1 cache. Accordingly, if it is determined that the desired address is not contained within the L0 cache, a request for the desired address is then made in a serial fashion to the L1 cache.
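The tag compare, way-hit, and MUX selection just described can be sketched as follows. This is a simplified model under stated assumptions, not the circuit of FIG. 1: the TLB translation is omitted, the address is split into tag and index by a hypothetical `split_address` helper, and the sizes `NUM_WAYS` and `NUM_SETS` are chosen only for illustration.

```python
NUM_WAYS = 4   # assumed four ways of associativity
NUM_SETS = 8   # assumed number of index positions per way


def split_address(physical_address):
    """Hypothetical split of an address into (tag, index)."""
    return physical_address // NUM_SETS, physical_address % NUM_SETS


class CacheLevel:
    def __init__(self):
        # One tag array and one data array per way.
        self.tags = [[None] * NUM_SETS for _ in range(NUM_WAYS)]
        self.data = [[None] * NUM_SETS for _ in range(NUM_WAYS)]

    def fill(self, way, physical_address, value):
        tag, index = split_address(physical_address)
        self.tags[way][index] = tag
        self.data[way][index] = value

    def way_hits(self, physical_address):
        """Compare step: one hit signal per way."""
        tag, index = split_address(physical_address)
        return [self.tags[w][index] == tag for w in range(NUM_WAYS)]

    def read(self, physical_address):
        """Return (data, miss). Every way of the data array is read, then one is selected."""
        _, index = split_address(physical_address)
        hits = self.way_hits(physical_address)
        candidates = [self.data[w][index] for w in range(NUM_WAYS)]  # all ways read
        if not any(hits):
            return None, True            # all hit signals zero: miss
        return candidates[hits.index(True)], False  # MUX selects the hitting way


l0 = CacheLevel()
l0.fill(2, 0x53, "A")
```

Here `read` returns a miss flag when every way-hit signal is false, which corresponds to the condition under which the request would be forwarded to the next level.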
In turn, the L1 instruction queue 22 feeds the physical address index field for the desired address into the L1 tag(s) 24, which may be duplicated N times for N ways of associativity. The physical address index is also input to the L1 data array(s) 26, which may also be duplicated N times for N ways of associativity. The L1 tag(s) 24 output a physical address for each of the ways of associativity to the L1 compare circuit(s) 28. The L1 compare circuit(s) 28 compare the physical address output by L1 tag(s) 24 with the physical address output by the L1 instruction queue 22. The L1 compare circuit(s) 28 generate an L1 hit signal for each of the ways of associativity, indicating whether a match between the physical addresses was made for any of the ways of L1. Such L1 hit signals are used to select the data from the L1 data array(s) 26 utilizing MUX 30. That is, based on the L1 hit signals input to MUX 30, MUX 30 outputs the appropriate L1 cache data from L1 data array(s) 26 if a hit was found in the L1 tag(s) 24. If the L1 way hits generated from the L1 compare circuitry 28 are all zeros, indicating that there was no hit generated in the L1 cache, then a miss signal is generated from the "miss" logic 32. Such an L1 cache miss signal generates a request for the desired address to the L2 cache structure 34, which is typically implemented in a similar fashion as discussed above for the L1 cache. Accordingly, if it is determined that the desired address is not contained within the L1 cache, a request for the desired address is then made in a serial fashion to the L2 cache. In the prior art, additional levels of hierarchy may be added after the L2 cache, as desired, in a similar manner as discussed above for levels L0 through L2 (i.e., in a manner such that the processor accesses each level of the cache in series, until an address is found in one of the levels of cache). Finally, if a hit is not achieved in the last level of cache (e.g., L2 of FIG. 1), then the memory request is sent to the processor system bus to access the main memory of the system.
Multi-level cache designs of the prior art are problematic in that such designs require each level of cache to be accessed in series until a "hit" is achieved. That is, when an address is requested, each level of cache is accessed in series until the requested address is found within the cache (or it is determined that the requested address does not reside within cache, wherein a request for the address is then made to the system's main memory). Accordingly, if a requested address is residing in the L2 cache structure, the request must first be checked in the L0 cache, and then next in the L1 cache, in a serial manner, before it can begin the access into the L2 cache. Therefore, implementing more levels of cache within a design generally increases the amount of time required to access a higher-level cache (e.g., the third-level cache L2 or higher) because of the serial nature of accessing each cache level one by one.
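The cost of serial access can be made concrete with a small arithmetic sketch. The per-level latencies below are hypothetical values chosen only for illustration; no cycle counts are given for the design of FIG. 1.

```python
# Hypothetical per-level access latencies in cycles (illustrative, not from FIG. 1).
LEVEL_LATENCY_CYCLES = [1, 4, 10]


def serial_access_cycles(hit_level, latencies=LEVEL_LATENCY_CYCLES):
    """Cycles until data returns when levels 0..hit_level are accessed one after another."""
    return sum(latencies[: hit_level + 1])


# Latency to reach data residing in L0, L1, and L2 respectively.
cycles_by_level = [serial_access_cycles(n) for n in range(3)]
```

Under these assumed numbers, a request satisfied by L2 pays the latency of L0 and L1 before the L2 access even begins, so each added level lengthens the path to every level above it.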
A further problem with prior art multi-level cache designs is that such prior art designs typically look up the tag and the data for a particular cache level in parallel in an attempt to improve the access time to that cache level. For example, in an attempt to improve the access time, a prior art implementation would typically perform a tag lookup for cache level L1 utilizing the L1 tag 24, while also looking up the desired data in L1 data array 26 in parallel. Accordingly, if the desired address is found in the L1 tag 24, the data from L1 data array 26 may be readily available because the lookup in the L1 data array 26 was performed in parallel with the tag lookup. However, with such prior art design, more of the data array (e.g., the L1 data array 26) is powered up than is necessary. For example, assume that a four-way associative cache data structure is implemented. Prior art designs power up all four ways of the data array to look up the desired data in parallel with performing the tag lookup. At best, only one of the four ways will need to be accessed for the desired address (assuming that the desired address is found within that level of cache), and possibly none of the four ways will need to be accessed for the desired address (if the desired address is not found within that level of cache). Accordingly, such prior art design wastes the power that is utilized to power up every way of a data array unnecessarily. Moreover, the resources of the data array are wasted in such prior art designs because each way of the data array is accessed without fully utilizing the resources of each way of the data array. That is, prior art designs typically access every way of a data array, thereby tying up the resources of every way of the data array (i.e., preventing those resources from being accessed by other instructions), while at best only utilizing one way of the data array and possibly not utilizing any of the ways of the data array (i.e., if the desired address is not found within that level of cache). Therefore, prior art designs tie up cache resources unnecessarily, thereby wasting resources that may potentially be used to satisfy other instructions.
Additionally, certain instructions encountered within a system only need to access the tags of the cache, without requiring access to the cache's data array. For example, snoops off of a system bus need to access the tags to find out if a certain cache line is resident in any of the levels of cache. As used herein, a "snoop" is an inquiry from a first processor to a second processor as to whether a particular cache address is found within the second processor. A high percentage of the time, the tag access for a snoop will indicate that the cache line is not present, so no data access is necessary. However, as discussed above, prior art cache designs are typically implemented such that the cache data array is accessed in parallel with the tag lookup. Therefore, a snoop access of the tag typically wastes the resources of the data array because most of the time access to the data array is not needed. Furthermore, system bus snoops generally require a very quick response. Accordingly, the serial access design of prior art caches may result in a greater response time than is required by the system for responding to a snoop. Therefore, the multi-level cache designs of the prior art may negatively impact the time required to satisfy requests, such as snoop requests, that only require access to the cache's tags.
In view of the above, a desire exists for a cache design that allows for upper-level caches (e.g., level two or higher) to be accessed in a timely manner. A further desire exists for a cache design that does not unnecessarily access the cache's data array, thus not wasting power or resources of the data array. Yet a further desire exists for a cache design that performs requests that require access only to the cache's tags, such as snoops, in a timely manner and in a manner that does not unnecessarily waste power or resources of the cache's data array.
These and other objects, features and technical advantages are achieved by a system and method which determine in parallel for multiple levels of a multi-level cache whether any one of such multiple levels is capable of satisfying a memory access request. That is, tags for multiple levels of a multi-level cache are accessed in parallel to determine whether the address for a memory access request is contained within any of the multiple levels. Thus, the tag accesses for multiple levels of a multi-level cache are performed early in the pipeline of the cache hierarchy to provide an early lookup of the tags. For instance, in a preferred embodiment the tags for the first level of cache and the tags for the second level of cache are accessed in parallel. Also, additional levels of cache tags up to N levels may be accessed in parallel with the first-level cache tags. As a result, as tags for the first-level cache are being accessed, tags for other levels of the cache are being accessed in parallel, such that by the end of the access of the first-level cache tags it is known whether a memory access request can be satisfied by the first-level, second-level, and any additional N-levels of cache that are accessed in parallel.
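The parallel tag determination described above can be sketched in software as a single step that produces one hit flag per level. This is only a behavioral model: the tag stores are modeled as sets of addresses, the names `probe_tags_in_parallel` and `select_level` are hypothetical, and choosing the lowest-numbered hitting level is an assumption made for the sketch.

```python
def probe_tags_in_parallel(tag_stores, address):
    """Return one hit flag per level; models all tag arrays being read in the same step."""
    return [address in tags for tags in tag_stores]


def select_level(hit_flags):
    """Pick the lowest-numbered level that hit (an assumption), or None if no level hit."""
    for level, hit in enumerate(hit_flags):
        if hit:
            return level
    return None


# Illustrative tag contents for a first-level and a second-level cache.
l0_tags = {0x10}
l1_tags = {0x10, 0x20}

flags = probe_tags_in_parallel([l0_tags, l1_tags], 0x20)   # L0 misses, L1 hits
level = select_level(flags)
```

Because every flag is available after the one probe, a request that misses in all probed levels can be identified at the same point and forwarded beyond those levels without waiting for a chain of serial misses.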
Additionally, in a preferred embodiment, the multi-level cache is arranged such that the data array of a level of cache is accessed only if it is determined that such level of cache is needed to satisfy a received memory access request. Accordingly, in a preferred embodiment, the data arrays of the multi-level cache are not unnecessarily accessed. For instance, in a preferred embodiment, the tag access is performed separate from the data array access for a level of cache, and the data array for that level of cache is not accessed if a hit is not achieved within that level's tags (i.e., the data is not present in that level of the cache). This has the advantage of saving power because the data arrays are not powered up unnecessarily. That is, unused memory banks (or data arrays) may not be powered up in a preferred embodiment, thereby reducing power consumption. Of course, in alternative embodiments, the unused memory banks may still be powered up, and any such embodiment is intended to be within the scope of the present invention. Also, it results in preserving the data array resources so that they may be used by other instructions, rather than unnecessarily wasting them by accessing them when the data is not present in that level of cache. Also, requests that require access only to the cache tags, such as snoop requests, do not cause the data array resources to be accessed because the tag access is separated from the data array access, in a preferred embodiment. Accordingly, in a preferred embodiment, requests that require access only to the cache's tags are performed in a timely manner and in a manner that preserves power and data array resources because the cache is implemented such that an access to the data array is not performed for such requests.
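A minimal sketch of the separation between tag access and data array access is given below. The class `DecoupledCacheLevel`, its methods, and the access counter are hypothetical constructs used only to show that a miss or a snoop leaves the data array untouched in such an arrangement.

```python
class DecoupledCacheLevel:
    """Hypothetical cache level whose tags are consulted before its data array."""

    def __init__(self):
        self.tags = {}                 # address -> slot in the data array
        self.data = []                 # data array
        self.data_array_accesses = 0   # counts reads of the data array

    def fill(self, address, value):
        self.tags[address] = len(self.data)
        self.data.append(value)

    def snoop(self, address):
        """Tag-only query: reports presence without touching the data array."""
        return address in self.tags

    def read(self, address):
        slot = self.tags.get(address)
        if slot is None:
            return None                # tag miss: data array is never accessed
        self.data_array_accesses += 1  # data array accessed only after a tag hit
        return self.data[slot]


level = DecoupledCacheLevel()
level.fill(0x40, "x")
snoop_present = level.snoop(0x40)
snoop_absent = level.snoop(0x41)
missed = level.read(0x41)
accesses_after_miss = level.data_array_accesses
```

In this model the counter remains at zero after two snoops and a missed read, and increases only when a read follows a tag hit.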
Additionally, in a preferred embodiment the multi-level cache is partitioned into N ways of associativity. Most preferably, the multi-level cache is partitioned into four ways. In a preferred embodiment, only a single way of a data array is accessed to satisfy a memory access request. That is, in a preferred embodiment, the cache tags for a level of cache are accessed to determine whether a requested memory address is found within a level of cache before such level's data array is accessed. Accordingly, when the data array for a level of cache is accessed, it is known in which way the desired data is residing. Therefore, only the one data array way in which the data resides is powered up and accessed to satisfy the access request. Because the other ways of the data array that are not capable of satisfying the request are not accessed, a saving in power and data array resources may be realized.
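The single-way access described above may be sketched as follows, assuming the way-hit signals from the earlier tag lookup are already available. The function `read_single_way` and the layout of `data_ways` (one list per way, indexed by set) are hypothetical and serve only to illustrate the idea.

```python
def read_single_way(way_hits, data_ways, index):
    """Access only the way flagged by the tag lookup; return (data, ways_accessed)."""
    if True not in way_hits:
        return None, []            # no tag hit: no way of the data array is accessed
    way = way_hits.index(True)
    return data_ways[way][index], [way]


# Four ways, one set each, with illustrative contents.
data_ways = [["w0"], ["w1"], ["w2"], ["w3"]]

hit_result = read_single_way([False, False, True, False], data_ways, 0)
miss_result = read_single_way([False, False, False, False], data_ways, 0)
```

In this sketch exactly one way is touched on a hit and none on a miss, leaving the remaining ways free, whereas a lookup that reads the data array alongside the tags would touch all four ways in either case.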
In a most preferred embodiment, the multi-level cache may be implemented to allow maximum utilization of the cache data arrays. As discussed above, in a preferred embodiment, cache data arrays are not accessed unnecessarily. Accordingly, in a preferred embodiment, cache data arrays that are not capable of satisfying one request remain free to be accessed for another request. In a most preferred embodiment, the multi-level cache may be implemented such that multiple instructions may be satisfied by the cache data arrays in parallel. That is, the resources that remain free may be utilized to satisfy other memory access requests in an efficient manner.
It should be understood that in a preferred embodiment a multi-level cache structure is provided that is capable of receiving and satisfying any type of access request from an instruction, including a read request, write request, and read-modify-write request. It should be appreciated that a technical advantage of one aspect of the present invention is that a multi-level cache structure is implemented to allow for multiple levels of cache tags to be accessed in parallel, thereby allowing a determination to be made quickly and efficiently as to whether a memory access request can be satisfied by any one of such multiple levels of cache. By implementing the tag access of upper-level caches (e.g., level two or higher) early in the pipeline, fast access to higher-level cache or beyond cache (e.g., access to main memory) is achieved. A further technical advantage of one aspect of the present invention is that a multi-level cache design of a preferred embodiment does not unnecessarily access the cache's data array, thus preserving power and resources of the data array. Yet a further technical advantage of one aspect of the present invention is that a multi-level cache design of a preferred embodiment performs requests that require access only to the cache's tags, such as snoops, in a timely manner and in a manner that does not unnecessarily waste power or resources of the cache's data array.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.