1. Field of the Invention
This invention relates generally to caches in computer architectures and more specifically to multiple cache sets with specialized functionality.
2. Description of the Related Art
Microprocessor systems include various types of memory which store the instructions and data by which the microprocessor operates. The memory is organized along the lines of a general hierarchy which is illustrated in FIG. 1. The hierarchy is organized in order of increasing memory access time with the memory level having the fastest access time being positioned relatively closer to the central processing unit (CPU) of the microprocessor system. Registers are the fastest memory devices and are generally internal architecture units within the microprocessor. Toward the middle level is main memory which is typically constructed using semiconductor memory devices, such as random access memory (RAM) chips, which are directly accessed by the microprocessor through an external bus. Mass storage, such as magnetic disks or CD-ROM, represents relatively large amounts of memory that are not directly accessed by the microprocessor and that are typically much slower to access than main memory. Archival storage represents long-term memory which typically requires human intervention for access, such as the loading of a magnetic tape.
In addition, microprocessor systems typically include cache memory at a level in between the registers and main memory which contains copies of frequently used locations in main memory. For each entry in a cache memory, there is a location to store the data and a tag location that identifies the corresponding location in main memory with which the data is associated. When the microprocessor outputs an address value on the memory bus at the beginning of a data access cycle, the address value is compared to the tags in cache memory to determine whether a match exists. A match of an address value to a cache tag is called a cache hit and the data is accessed in cache rather than main memory.
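The tag comparison described above can be sketched as follows. This is a minimal illustration only: the direct-mapped organization, the eight-entry size, and the names TaggedCache, fill and lookup are assumptions made for the example and are not taken from the description.

```python
NUM_LINES = 8  # assumed cache size, chosen only for illustration


class TaggedCache:
    """Hypothetical direct-mapped cache: one tag and one data slot per entry."""

    def __init__(self, num_lines=NUM_LINES):
        self.num_lines = num_lines
        self.tags = [None] * num_lines   # tag identifies the main memory location
        self.data = [None] * num_lines   # copy of the data at that location

    def fill(self, address, value):
        """Store a copy of main memory data together with its tag."""
        index = address % self.num_lines
        self.tags[index] = address // self.num_lines
        self.data[index] = value

    def lookup(self, address):
        """Compare the address against the stored tag; return (hit, data)."""
        index = address % self.num_lines
        tag = address // self.num_lines
        if self.tags[index] == tag:
            return True, self.data[index]   # cache hit: data served from cache
        return False, None                  # cache miss: main memory must be used


cache = TaggedCache()
cache.fill(0x2A, "payload")
hit, value = cache.lookup(0x2A)   # same address: tag matches
miss, _ = cache.lookup(0x2B)      # different entry, nothing stored there
```

An address that maps to the same entry but carries a different tag, such as 0x2A + 8 in this sketch, also results in a miss.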
Cache memory is relatively small and fast as compared to main memory, but is also more expensive on a per bit basis. When a microprocessor can operate at higher speeds than main memory, then processor cycles can be saved and performance improved by including cache in the memory hierarchy of the microprocessor subsystem. To improve performance and reduce cost, the local memory in a microprocessor typically includes one or more cache devices.
FIG. 2A illustrates an example of a conventional microprocessor 10 whose local memory includes a cache 50 and main memory 80. In the course of operation of microprocessor 10, small portions of data from main memory 80 are moved into cache 50 for fast access by CPU 20 via CPU data bus 22. Subsequent accesses by CPU 20 to the same data are made to the cache 50 rather than main memory 80. A cache controller 30 monitors the data accesses made by CPU 20 and determines whether the desired data is resident in cache 50, in main memory 80, or in other storage devices such as CD-ROMs or mass storage disks. The cache controller 30 also moves data between cache 50 and main memory 80 based upon the data accesses requested by CPU 20 and the cache replacement policy designed into the cache controller. There is overhead time associated with the data management activities of cache controller 30, but, ideally, the cache overhead is outweighed by the advantage gained from the lower access time of the cache devices.
Typically, cache controller 30 is connected to main memory 80 via a main memory data bus 82 and to cache 50 via a separate cache data bus 32. In response to a data access from CPU 20, the cache controller 30 will generally attempt to find the data in cache 50. If the data is not found in cache 50, i.e. a cache miss occurs in cache 50 and is communicated back to cache controller 30, then cache controller 30 will attempt to find the data in main memory 80. CPU 20 can also be configured to perform a cache bypass memory access wherein the CPU sends a bypass control directive to cache controller 30 which causes the data access to go directly to main memory 80 to find the data, thereby bypassing cache 50.
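The access flow of the cache controller, including the bypass directive, can be sketched as follows. This is a simplified model under stated assumptions: the class name CacheController, the bypass parameter and the log list are hypothetical, the cache is modeled as an unbounded dictionary, and replacement is omitted.

```python
class CacheController:
    """Hypothetical model of the hit / miss / bypass decision."""

    def __init__(self, main_memory):
        self.main_memory = main_memory   # dict: address -> data
        self.cache = {}                  # dict: address -> cached copy
        self.log = []                    # records the path taken by each access

    def read(self, address, bypass=False):
        if bypass:
            # Bypass directive: go directly to main memory, leave cache untouched.
            self.log.append("bypass")
            return self.main_memory[address]
        if address in self.cache:
            self.log.append("hit")
            return self.cache[address]
        # Cache miss: fetch from main memory and keep a copy in cache.
        self.log.append("miss")
        value = self.main_memory[address]
        self.cache[address] = value
        return value


controller = CacheController({0x10: "a", 0x20: "b"})
controller.read(0x10)               # first access misses and populates the cache
controller.read(0x10)               # second access hits
controller.read(0x20, bypass=True)  # bypass access never enters the cache
```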
Microprocessors are sometimes designed with multiple sub-layers of cache, as is also illustrated in FIG. 2A. Cache 50 is divided into a first level cache 52 and a second level cache 56. The first level cache 52 will typically be a smaller, faster and more expensive per bit device than the larger second level cache 56. The first level cache will also typically maintain data at a finer level of granularity than second level cache 56. Cache devices are typically arranged in terms of lines, where the line is one or more data words and is the unit of data brought in on a miss. Thus, the first level cache may have a line length of just one or two words, while the second level cache will have a line length on the order of eight or sixteen words. In the multiple level cache structure, cache controller 30 controls the population and replacement of data between the two levels of cache and main memory.
Caches are typically designed to exploit temporal and spatial locality in the program under execution by the microprocessor 10. Temporal locality is the tendency of programs to access a piece of data multiple times within a relatively short interval of time. By moving the piece of data from main memory 80 to cache 50, the microprocessor can take advantage of temporal locality to reduce the time required to access the piece of data for later accesses. Spatial locality is the tendency of programs to make subsequent accesses to data which is located near data which has recently been accessed, i.e. an access to one portion of a block or line of data will likely be followed by accesses to other portions of the same block or line of data.
However, different types of data can exhibit highly divergent access characteristics. For instance, some types of data, such as image or audio data, get processed by walking through the data once without repetitive access. This highly spatial data also tends to be in the form of blocks or pages of relatively large size. As the spatial data is sequentially accessed by CPU 20, cache controller 30 will stream the spatial data into cache 50 thereby replacing the data already present in the cache. Streaming in a block of this spatial data tends to occupy the entire cache space with data which will not be subsequently accessed or will not be accessed for a significant period of time. Other data which would have been beneficial to keep in cache is thus flushed out and the efficiency and efficacy of the cache function is undermined.
As a simple example, consider the case where the size of cache 50 is 32 Kbytes and the block size of some highly spatial data is 16 Kbytes. Access to a first block of spatial data will overwrite 1/2 (i.e. 16 Kbytes divided by 32 Kbytes) of the contents of cache 50. The first block of spatial data is likely to be retained based upon a cache replacement policy which assumes temporal locality, even though the first block may not be accessed again or may not be accessed for a significant period of time. Access to a second block of spatial data then causes the remaining 1/2 of the contents of cache 50 to be overwritten. Thus, by accessing two blocks of spatial data, cache 50 is completely flushed of its previous contents.
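The 32 Kbyte example above can be reproduced with a small simulation. The sketch below is illustrative only: the 1 Kbyte line size, the least-recently-used replacement policy, and the names LRUCache, access and count are assumptions that do not appear in the description.

```python
from collections import OrderedDict

LINE_BYTES = 1024            # assumed line size, for illustration only
CACHE_BYTES = 32 * 1024      # cache size from the example
BLOCK_BYTES = 16 * 1024      # spatial block size from the example


class LRUCache:
    """Hypothetical fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()   # keys are (kind, line_number) tuples

    def access(self, line_id):
        if line_id in self.lines:
            self.lines.move_to_end(line_id)      # refresh recency on a hit
            return
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)       # evict the least recently used line
        self.lines[line_id] = True

    def count(self, kind):
        """Number of resident lines belonging to the given kind of data."""
        return sum(1 for (k, _n) in self.lines if k == kind)


cache = LRUCache(CACHE_BYTES // LINE_BYTES)
lines_per_block = BLOCK_BYTES // LINE_BYTES

# Fill the cache with previously useful (temporal) data.
for n in range(cache.capacity):
    cache.access(("temporal", n))

# Stream in the first spatial block: half of the earlier contents are lost.
for n in range(lines_per_block):
    cache.access(("block1", n))
after_first = cache.count("temporal")    # 16 of 32 lines remain

# Stream in the second spatial block: the earlier contents are gone entirely.
for n in range(lines_per_block):
    cache.access(("block2", n))
after_second = cache.count("temporal")   # 0 lines remain
```

Under these assumptions the first block, although never accessed again, remains resident while the second block is streamed in, because the replacement policy treats it as the more recently used data.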
The cache flushing problem is quite pronounced for specialized data types having very large block sizes. For instance, image data commonly has block sizes of 512 Kbytes to 100 Mbytes. Each block not only flushes other data from the cache, but also flushes its own lines of data when the block size is larger than the cache size. Another example of the cache flushing problem arises with regard to the tables that are used in processing an image stream. The tables will typically be replaced by the data of the image stream unless separate buffers are used for the table and image data. Processing an image will generally require several stream buffer sets because multiple streams are used to process a single stream. For example, when an image is scaled, a current line of data and a previous line of data are used to interpolate the lines in between. In other cases, several lines may be merged to produce a new line of data. It is conceivable that eight or sixteen large stream cache sets may be useful for processing image data.
In addition, some types of spatial data have the characteristic of being accessed at regular, though relatively long, intervals. For instance, the data for a particular image may be read out to a display from start to finish in order to feed a raster scan. The subsequent image in a series of images may then only require that a relatively small subset of data positions be updated in the data for the predecessor image. It is therefore advantageous to maintain the relatively large, infrequently accessed image data in cache in preparation for output of the next image. However, in between times when the image data is being accessed, accesses by CPU 20 to other types of data can cause cache controller 30 to replace some of the blocks of the image data, resulting in much of the same data having to be reloaded into cache 50.
One solution to this conflict between temporal and spatial data in a cache is to include "spatial locality only" hints in load and store instructions to indicate that the data exhibits spatial locality but not temporal locality. Kurpanek et al. describe such a solution in "PA7200: A PA-RISC Processor with Integrated High Performance MP Bus Interface", 1063-6390/94, pp. 375-382, IEEE, 1994. When data marked as "spatial locality only" is accessed, it is read into an assist cache, just as is temporal data. Upon replacement, however, the "spatial locality only" data is flushed back to main memory whereas the temporal data is moved into a main data cache for subsequent access. This prevents spatial data from polluting the main data cache, but makes no effort to cache the spatial data for further use. The assist cache, however, is still polluted with spatial data.
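The replacement behavior described for the assist cache can be sketched as follows. This is a loose model based only on the description above, not on the PA7200 design itself: the first-in, first-out replacement order, the capacity of two entries, and the names AssistCacheModel, load and written_back are assumptions made for illustration.

```python
from collections import deque


class AssistCacheModel:
    """Hypothetical model: all loads enter the assist cache first."""

    def __init__(self, assist_capacity):
        self.capacity = assist_capacity
        self.assist = deque()        # entries are (address, spatial_only) pairs
        self.main_cache = set()      # temporal data promoted on replacement
        self.written_back = []       # spatial-only data flushed to main memory

    def load(self, address, spatial_only=False):
        if len(self.assist) >= self.capacity:
            # Assumed FIFO replacement: the oldest assist entry is displaced.
            old_address, old_spatial = self.assist.popleft()
            if old_spatial:
                self.written_back.append(old_address)   # flushed, not kept
            else:
                self.main_cache.add(old_address)        # moved to main data cache
        self.assist.append((address, spatial_only))


model = AssistCacheModel(assist_capacity=2)
model.load(1)                      # temporal data
model.load(2, spatial_only=True)   # data hinted as spatial locality only
model.load(3)                      # displaces address 1 into the main data cache
model.load(4)                      # displaces address 2 back to main memory
```

In this sketch the spatial-only entry never reaches the main data cache, but it does occupy an assist cache slot until it is displaced.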
Another solution to cache pollution caused by spatial data is to provide multiple caches at the same level in the hierarchy of the microprocessor architecture. By dividing the cache into multiple caches which are then assigned to support particular types of data, the characteristics of each cache (e.g. line size, number of lines, replacement policy/priority, write policy, prefetch capability, access latency, etc.) can be optimized for the type of data stored within the cache. For instance, Rivers and Davidson describe a non-temporal streaming (NTS) cache in their article entitled "Reducing Conflicts in Direct-Mapped Caches with a Temporality-Based Design", pp. I-154 to I-163, International Conf. on Parallel Processing, IEEE, 1996.
FIG. 2B illustrates a microprocessor 210 where cache 250 has multiple cache devices at the same level. Data cache 252 and stream buffer cache 254 reside at the same hierarchical level. However, stream buffer cache 254 is assigned to support the streaming of large blocks of data which are known to be spatial in nature. When a block of data is known to be spatial by the programmer, compiler or hardware, then the block is loaded into stream buffer cache 254 instead of data cache 252. The result is that the spatial data stored in stream buffer cache 254 is prevented from polluting the contents of data cache 252.
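The separation of data cache 252 and stream buffer cache 254 can be illustrated with the following sketch. The four-line capacities, the least-recently-used replacement, the spatial flag on the load function, and the name SmallCache are assumptions chosen for the example; the description leaves open whether the programmer, compiler or hardware supplies the indication that a block is spatial.

```python
from collections import OrderedDict


class SmallCache:
    """Hypothetical cache holding a fixed number of lines with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def load(self, address):
        if address in self.lines:
            self.lines.move_to_end(address)
            return
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict least recently used line
        self.lines[address] = True


data_cache = SmallCache(4)      # stands in for data cache 252
stream_buffer = SmallCache(4)   # stands in for stream buffer cache 254


def load(address, spatial=False):
    """Route the access to the stream buffer when the data is known to be spatial."""
    target = stream_buffer if spatial else data_cache
    target.load(address)


for address in range(4):
    load(address)                   # temporal working set

for address in range(100, 120):
    load(address, spatial=True)     # large streamed block

surviving = list(data_cache.lines)  # temporal working set is untouched
```

Although the streamed block is five times the capacity of either cache in this sketch, it displaces only its own earlier lines in the stream buffer.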
For example, stream buffer cache 254 can be used to store the data for a sequence of image frames. The data for a first frame is streamed into stream buffer cache 254 and then scanned to output the frame to a display device. The next frame in the sequence is largely the same as the first frame, so only a portion of the data from the first frame needs to be overwritten with the data for the second frame. Thus, only portions of the data for the second frame need to be read into stream buffer cache 254. Meanwhile, temporal data is cached in data cache 252 without overwriting the data for the first image frame and without data for the first or second image frames overwriting the temporal data in data cache 252. It can therefore be seen that caches assigned to specific functions can enhance cache function in a microprocessor.
However, the presence of multiple caches can lead to data concurrency problems. Since data can have temporal and spatial access characteristics in different portions of processing, the same data may end up residing in several caches at the same time which can result in different versions of the data existing concurrently, if the data in one or more caches has been modified.
Rivers and Davidson avoid problems with data coherency in their NTS structure by maintaining only one version of the data in cache. A given set of data can exist in the main cache or the NTS cache, but not both. The NTS scheme requires the use of a non-temporal data detection unit which monitors the references to data and maintains an NT bit which indicates whether the data is non-temporal or not based upon whether the data was re-referenced while in cache. The NTS scheme also requires a secondary cache wherein data blocks that are replaced in the main cache or NT cache are maintained along with their NT bits. A subsequent hit on data in the secondary cache results in the block being moved into the main cache when the NT bit is clear, and into the NT cache when the NT bit is set. If the data is not in the secondary cache, then the line is brought in from main memory and no NT information is known. Thus, the NTS scheme solution is based upon dynamic monitoring of the references to a block of data in cache and involves overhead in the form of the storage of an NT bit for each block as well as a secondary cache.
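The block movement described for the NTS scheme can be sketched as follows. This is a simplified model built only from the description above and is not the published design: the class name NTSModel, the explicit evict method, the returned path labels, and the choice to place blocks with no NT information in the main cache are all assumptions made for illustration.

```python
class NTSModel:
    """Hypothetical model of the main cache, NT cache and secondary cache."""

    def __init__(self):
        self.main_cache = {}   # block -> True once re-referenced while resident
        self.nt_cache = {}     # same bookkeeping for the non-temporal cache
        self.secondary = {}    # block -> NT bit saved at replacement time

    def access(self, block):
        for cache in (self.main_cache, self.nt_cache):
            if block in cache:
                cache[block] = True          # re-referenced while in cache
                return "hit"
        if block in self.secondary:
            nt_bit = self.secondary.pop(block)   # block moves, so only one copy exists
            target = self.nt_cache if nt_bit else self.main_cache
            target[block] = False
            return "secondary-nt" if nt_bit else "secondary-main"
        # No NT information is known; placement in the main cache is an assumption.
        self.main_cache[block] = False
        return "memory"

    def evict(self, block):
        """Replace a block, saving its NT bit in the secondary cache."""
        for cache in (self.main_cache, self.nt_cache):
            if block in cache:
                rereferenced = cache.pop(block)
                self.secondary[block] = not rereferenced   # NT set if never reused
                return


model = NTSModel()
paths = [model.access("A"), model.access("A")]   # A is re-referenced: temporal
model.evict("A")
paths.append(model.access("A"))                  # returns to the main cache

paths.append(model.access("S"))                  # S is touched once only
model.evict("S")
paths.append(model.access("S"))                  # returns to the NT cache
```

The sketch makes the overhead noted above visible: every block carries a reuse flag while resident and an NT bit while held in the secondary cache.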
The data concurrency problem can also arise due to the existence of different interfaces to the data present in the software architecture. Generalized data interface routines and instructions may access data at the lowest level in the memory hierarchy in order to ensure that the routine or instruction can be used with the widest variety of calling programs. On the other hand, in order to exploit specialized caches, routines or instructions with specialized data access may be necessary and give rise to a separate interface to the data.
FIG. 3 illustrates an example of the cache coherency problem which arises with the introduction of a separate specialized buffer into the microprocessor architecture. A software data architecture 300 is shown where a main program 310 can call two separate subroutines which interpret data accesses in different ways. Subroutine A 322 interprets a data access data(Z) from main program 310 to be a non-specialized data access to a data set Z and the routine therefore looks first to data cache 252 for data Z and then to main memory 80. Data set Z, if resident in data cache 252, may have been modified so that a modified data set Z' resides in data cache 252.
Subroutine B 324, on the other hand, is structured to interpret a data access data(Z) from main program 310 to be a reference to data in the stream buffer. The subroutine therefore looks first to stream buffer cache 254 and then to main memory 80 for the existence of data set Z. If data set Z is resident in stream buffer cache 254 and has been modified without a corresponding update to main memory, then another modified version Z" resides in the stream buffer.
Thus, there are two paths, path A and path B, which main program 310 can take to access a given set of data. The different types of caches therefore create data coherency problems which must be managed at the program level through data or code design. Maintaining data coherency at the program level is a complex and error-prone task which has discouraged the use of specialized caches and hindered their proliferation in microprocessor architectures.
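The divergence between path A and path B can be demonstrated with a short sketch. The write-back behavior (modified data is not propagated to main memory), the integer value of data set Z, and the function names read_via, write_via, subroutine_a and subroutine_b are assumptions made for illustration.

```python
main_memory = {"Z": 0}   # original data set Z
data_cache = {}          # stands in for data cache 252
stream_buffer = {}       # stands in for stream buffer cache 254


def read_via(cache, key):
    """Look first in the given cache, then in main memory."""
    if key not in cache:
        cache[key] = main_memory[key]
    return cache[key]


def write_via(cache, key, value):
    """Assumed write-back behavior: only the cached copy is modified."""
    cache[key] = value


def subroutine_a(key, delta):
    # Path A: non-specialized access through the data cache.
    write_via(data_cache, key, read_via(data_cache, key) + delta)


def subroutine_b(key, delta):
    # Path B: specialized access through the stream buffer.
    write_via(stream_buffer, key, read_via(stream_buffer, key) + delta)


subroutine_a("Z", 1)    # produces Z' in the data cache
subroutine_b("Z", 10)   # produces Z" in the stream buffer, unaware of Z'

versions = (main_memory["Z"], data_cache["Z"], stream_buffer["Z"])
```

After the two calls, three different versions of the same data set exist at once, and neither subroutine has any means of detecting the other's modification.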
The data coherency problem described above is further aggravated by the proliferation of different types of specialized caches in microprocessor architectures. FIG. 4 illustrates a few examples of the types of caches which now appear in conventional microprocessors. In addition to the normal data cache and stream buffer cache, a scratch pad 456 can be included to provide a calculation workspace for a process under execution. Furthermore, a code assignable cache 458 can be provided with characteristics which can be flexibly assigned by the process under execution to, for example, hold tables used in computation. In addition, a cache bypass operation can be viewed abstractly as yet another type of specialized cache set 457 which always results in a cache miss and an access to main memory.
Accordingly, the need remains for a system and method for accessing data which may reside in multiple specialized caches in a microprocessor architecture that is simple and efficient to use.