1. Technical Field
Embodiments of the present invention generally relate to computer processors. More particularly, embodiments of the invention relate to the handling of pointer load cache misses.
2. Discussion
In the highly competitive computer industry, the trend toward faster processing speeds and increased functionality is well documented. While this trend is desirable to the consumer, it presents significant challenges to processor designers as well as manufacturers. A particular challenge relates to the management of processor requests to load data items. In modern-day processors, a hierarchical memory architecture is used to increase the speed at which data can be retrieved and instructions can be executed. For example, the memory architecture typically has an off-chip portion and an on-chip portion. The on-chip portion can be accessed at relatively high speeds and is often referred to as a cache system, such as the cache system 20 shown in FIG. 2. Cache system 20 may be operatively coupled to a processor 100 and a processor bus 102. The processor 100 may be an N-bit processor and typically includes a decoder (not shown) and one or more N-bit registers (not shown). The processor bus 102 may also be coupled to a system logic 104 and a system (or off-chip) memory 106, where the system logic 104 and system memory 106 may communicate directly via bus 108.
The conventional cache system 20 has a level one (L1) cache 22 and a level two (L2) cache 24. By storing items such as instructions, pointer data and computational data in the cache system 20, significant time savings can be achieved for a number of reasons. For example, the cache memory is commonly made of static random access memory (SRAM), which can be accessed much faster than the structures used for off-chip memory. Furthermore, the cache memory is in closer physical proximity to the processor 100. The L1 cache 22 can typically be accessed at a higher rate than the L2 cache 24, but is smaller than the L2 cache 24. Thus, if a data access request is received from one of the execution units (not shown) of the processor 100, a memory access request is issued to the L1 cache 22 in order to rapidly return a result for the request. If the data item corresponding to the request is not found in the L1 cache 22, an L1 cache “miss” has occurred and a request is issued to the L2 cache 24. This process is shown in greater detail in the flowchart 26 of FIG. 3. A difficulty arises, however, when the data being operated upon is organized in a linked list of data structures such as the list 28 shown in FIG. 4.
Specifically, each data structure 30 in the list 28 often includes a pointer 32 to the address of the next data structure. The problem becomes acute when a first data item such as pointer 32a is not found in either the L1 cache or the L2 cache. In such a case, the pointer 32a must be retrieved from off-chip memory 106 (FIG. 2), which typically consumes an undesirably large amount of time. Furthermore, since the data structure 30b corresponding to the address defined by pointer 32a also includes a pointer 32b, Address Z cannot be calculated until data structure 30b has itself been retrieved all the way from off-chip memory. While certain pre-fetching schemes, such as the approach described in U.S. Pat. No. 6,055,622 to Spillinger, can be useful when there is a predictable regularity in the sequence of addresses in the list 28, no such regularity exists in the linked list case described here. In such cases, it has been determined that latency can become an issue of particular concern.
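By way of illustration only, the serialized latency described above can be sketched with a toy software model. The following Python sketch is not part of any described embodiment: the class name CacheSystem, the function traverse, the cycle counts, and the example addresses are all hypothetical, and the model ignores cache capacity, line size, and eviction. It merely shows that each pointer load must complete before the next address is known, so that the miss penalties add up one after another.

```python
# Hypothetical latencies in cycles; illustrative values only, not from the source.
L1_LATENCY = 4
L2_LATENCY = 12
MEMORY_LATENCY = 200


class CacheSystem:
    """Toy two-level cache: an L1 miss falls through to L2, then to off-chip memory."""

    def __init__(self, memory):
        self.memory = memory  # address -> stored value (models off-chip memory)
        self.l1 = {}
        self.l2 = {}

    def load(self, address):
        """Return (value, cycles) for a single load request."""
        if address in self.l1:
            return self.l1[address], L1_LATENCY
        if address in self.l2:
            value = self.l2[address]
            self.l1[address] = value  # fill L1 on an L2 hit
            return value, L1_LATENCY + L2_LATENCY
        value = self.memory[address]  # miss in both caches: go off-chip
        self.l2[address] = value
        self.l1[address] = value
        return value, L1_LATENCY + L2_LATENCY + MEMORY_LATENCY


def traverse(cache, head):
    """Follow next-pointers from head until None; return (addresses visited, total cycles)."""
    visited, total = [], 0
    address = head
    while address is not None:
        visited.append(address)
        # The next address is unknown until this load completes,
        # so the per-node latencies accumulate serially.
        address, cycles = cache.load(address)
        total += cycles
    return visited, total


# Each address holds the pointer to the next node; None ends the list.
# The addresses are deliberately irregular, as in a heap-allocated list.
memory = {0x1000: 0x7F40, 0x7F40: 0x2A10, 0x2A10: None}
cache = CacheSystem(memory)
cold = traverse(cache, 0x1000)  # every pointer load misses both caches
warm = traverse(cache, 0x1000)  # every pointer load now hits L1
```

Under these assumed latencies, the first (cold) traversal pays the full off-chip penalty once per node, while the second (warm) traversal pays only the L1 access time per node. Because the example addresses follow no fixed stride, a pre-fetcher that relies on address regularity would have nothing to predict from.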