In a data processing system, instructions and associated data are transferred from memory to one or more processors for processing, and then resulting data generated by the processor is returned to memory for storage. Thus, typical processing operations involve frequent and repetitive reading and writing from memory. As a result, memory access delays are often a primary limitation in the performance of a data processing system. Preferably, therefore, memory access speed should be maximized to maximize performance. However, often cost and other constraints require that the main memory be comprised of relatively long access time circuitry. To overcome the resulting performance drawbacks, memory caches are typically used.
A memory cache typically includes a relatively small, but high speed, bank of memory, which can be more rapidly accessed by the processor(s) than the main memory. Memory locations in the main memory are duplicated in the cache. When a particular memory location being accessed by the processor is duplicated in the cache--event which is known as a cache "hit"--the processor may rapidly access the cache instead of waiting for access to main memory. The cache is managed with the goal of maximizing the fraction of accesses which are hits in the cache.
Caches are typically organized into "lines", which are relatively long sequences of memory locations found in main memory. Typically, when a memory location accessed by a processor is not duplicated in the cache--an event which is known as a cache "miss"--an entire line containing the missed memory location, and neighboring memory locations, is brought into the cache as part of retrieving the missed location from other caches or main memory--an event which is known as a "linefill" into the cache.
Typically, each cache line is associated with multiple groups of locations in the main memory. Each cache line stores duplicates of associated groups of memory locations, as well an indication of which groups of memory locations are currently stored in that line. Thus, when a processor requests access to a particular memory location, the cache line corresponding to that memory location is accessed to determine whether that cache line is storing the group of memory locations which includes the requested location. If so, the requested memory location is accessed in the cache. If not, a group of memory locations including the requested location is linefilled into the cache.
Typically, an n-way associative cache stores n of the several groups of locations corresponding to a cache line in the cache at one time. When a group of memory locations is linefilled into the cache, memory contents in the same cache line may need to be replaced. If the contents of the replaced cache line have been modified, then the line has to be stored back into the corresponding group of locations in the main memory--an event which is known as a "writeback" from the cache.
In high performance data processing systems, often there are two or more caches, organized so that a processor attempts to access a memory location by first attempting to locate a duplicate of that location in a "level 1" or L1 cache. If there is a miss in the L1 cache, then an attempt is made to locate a duplicate of the desired memory location in a "level 2" or L2 cache. If there is a miss in the L2 cache, each lower level cache is sequentially checked in the same manner. If there is a hit in one of the caches, then the desired memory locations are obtained from that cache, and typically, the accessed memory locations are duplicated, along with neighboring locations completing a cache line, into the appropriate line of at least the L1 cache--although in some cases an access may be "cache-inhibited", in which case the data is not stored in the L1 cache after retrieval. If there are misses in all of the caches, the missed location, along with neighboring locations completing a cache line, is retrieved from main memory, and filled into one or more of the caches if the access is not cache-inhibited. Similarly, if a line is written back from a cache, the line may be written to a lower level cache, main memory, or both.
Typically, lines of instructions and data are transferred from caches and processors to other caches and processors using buffers. For instance, in one architecture two buffers are respectively connected to a level 1 cache and a level 2 cache. These buffers are also connected to main memory, a host processor, and possibly other processors via a system bus. The buffers allow for a smooth transition of data or instructions between components having different transfer rates. Each line in a conventional cache buffer strictly handles either fill commands or write back commands, and includes memory space which can store a finite number of cache lines, e.g., four. Each cache line in a buffer is, therefore, designated as a fill cache line or a write back cache line. In a multi-way associative cache, cache buffer lines may be used for fills or writebacks, and are dynamically configured for the appropriate purpose.
In addition to the use of caches to improve memory access performance, other well known techniques have been used to improve the performance of data processing systems. One technique is to divide a processing task into independently executable sequences of instructions called threads. Using this technique, when a processor, for any number of reasons, cannot continue processing or execution of a thread, the processor switches to another thread and continues processing of that thread, rather than stalling. For example, when a cache miss stalls processing of one thread, the processor may switch to other threads which are able to continue processing. By the time the processor returns to the stalled thread, the missed location may have been linefilled into the cache, and that thread can resume with minimal additional delay.
The term "multithreading", when used in the context of software, is used to refer to a particular organization scheme which can be used in writing computer programs. In this software context, therefore, "multithreading" does not relate to the manner in which computer programs are physically executed by the processor. Thus, software "multithreading" is different from the kind of "multithreading" discussed in this application. The kind of multithreading to which this application relates, may be referred to as "hardware multithreading", i.e., processor configurations permitting a processor to switch between multiple threads of instructions upon various conditions. Thus, in this application, the terms "thread" and "multithreading" will be understood to refer to "hardware multithreading", and not methods for software organization.
While the technique of multithreading, and the use of memory caches, both enhance the performance of a data processing system, combining these techniques raises substantial complexities. In particular, as noted above, when a memory access misses in the L1 cache, the required memory contents must be accessed from a higher level cache or main memory. When the memory contents are located, they must be delivered to the requesting processor, and also filled into one of the caches. While this process appears straightforward, in a multithreading environment or other environments where there may be multiple outstanding cache misses, efficient management of this process becomes substantially more difficult.
In particular, when multiple threads each have outstanding cache misses, and the content of a memory location is returned in response to a miss, several determinations must be made. First, a determination must be made which cache miss the memory location is for and which thread is awaiting that memory location. This first determination is complicated by the fact that two different memory locations may be associated with the same cache line; thus, two threads may be awaiting delivery of different memory locations into the same cache line. It is therefore necessary to confirm that memory contents being returned to a cache line relate to the specific memory location that caused a thread's cache miss.
Furthermore, once it is known which thread the memory contents are intended for, a determination must be made as to how to deliver the memory contents to the requesting thread, i.e., directly from the returning higher level cache or memory, via the cache buffer, or from the L1 cache. Memory access times can be substantially reduced if, at least in some circumstances, memory contents returning from a higher level cache or main memory can be directly delivered to the awaiting processor, or delivered from the cache buffer, without first waiting for an entire cache line to be filled into the L1 cache. However, for this to be possible, the determination of which thread the memory contents are intended for, must be made essentially simultaneously with the return of the cache line from main memory, which may not be possible using known cache management schemes.