1. Field of the Invention
This invention relates in general to processor caching methods, and more particularly to a method and apparatus for increasing processing speed in a computing system by optimizing the hit ratio of requests from requesting devices by providing the processor with a non level one (L1) information cache that prefetches information stored therein to increase the hit ratio.
2. Description of Related Art
There has been a dramatic increase of late in the amount and type of data that computing systems process. Computing systems routinely process two dimensional and three dimensional images, graphics, audio and video media. Networking has allowed information to be shared throughout the world, and consumers demand seamless access to data and a high level of performance from media containing vast quantities of data. Thus, computing systems are being required to perform more demanding tasks to satiate consumers' hunger for media.
In order to increase performance, processors may be provided with embedded caches to store data logically and physically closer to the processor. An embedded cache operates at the processor frequency and therefore allows access to information, such as instructions or data, more quickly than external caches.
Many computing systems like storage controllers, routers and servers use processors to control various hardware components. The processors run real time operating systems, handle interrupts, set up direct memory access transfers, check control information for validity, translate addresses and perform other functions. Because these functions are in the critical functional path, the overall performance of these routines is greatly influenced by processing speed.
Numerous major factors contribute to processing speed. One such factor is the core operating frequency of the processor. Another factor is the amount and type of level 1 (L1) data and instruction caches resident on the processor. Caches are classified by the level they occupy in the memory hierarchy. Early computers employed a single, multichip cache that occupied one level of the hierarchy between the processor and the main memory. Two developments made it desirable to introduce two or more cache levels in a high performance system: the feasibility of including part of the real memory space on a microprocessor chip and growth in the size of main memory in computers. A level one (L1) or primary cache is an efficient way to implement an on-chip memory.
An additional factor influencing processor speed is the amount and type of level 2 (L2) caches present, if any. An additional memory level can be introduced via either on-chip or off-chip level two (L2) secondary cache. The desirability of an L2 cache increases with the size of main memory. As main memory size increases further, even more cache levels may be desirable. The L1 cache is higher in the cache hierarchy than the L2 cache. The L1 cache contains less information than the L2 cache and all the data and/or instructions that are stored on the L1 cache are also stored on the L2 cache.
The type and stages of the data transfer pipeline within the processor are another important factor affecting processing speed, as is the number of instructions that the processor can execute simultaneously.
Effective cache subsystems will desirably provide instruction and data availability with minimum latency. A processor or another information requesting device requests a specific access (piece of information or data). If the access is immediately available in the cache, the request is considered a hit. However, if the access is not already present and available in the cache, this is considered a miss.
By way of definition, a hit ratio is a measure of the probability that an access will be resident in a particular cache. High hit ratios result in lower processing times for similar units of work. That is, if L1 caches run at processor speeds and have the capacity to contain the entire code load, including all necessary peripheral data and instructions, then the resulting processing time would be the smallest time possible. The processor would then be operating at maximum or peak performance.
However, the reality is that modern code loads for complex programs and systems are very large, often many megabytes. Therefore, it is impractical to provide processors with embedded L1 caches having such large capacities. For example, practical constraints have limited L1 caches in processors to 32K bytes or less in most cases. A split L1 cache contains both a 32K data cache and a 32K instruction cache. Instruction hit ratios using economically feasible L1 capacities currently available have tended to be disappointingly low. The probability that the first access to a cache line is a hit is very low. Once the cache line is fetched, then there may be up to N consecutive hits, where N represents the average number of sequential instructions processed before a taken branch is executed.
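By way of illustration only, the behavior described above (a miss on the first access to a cache line, followed by hits on the remaining sequential fetches from that line) can be sketched as follows. The 32-byte line size, the 4-byte instruction size, the unbounded cache and the name `replay` are assumptions made for the example and are not taken from any particular processor.

```python
LINE_BYTES = 32  # assumed cache line size, for illustration only


def replay(addresses):
    """Replay byte addresses against an initially empty, unbounded cache.

    Returns (hits, misses). A miss fills the whole line, so later
    accesses to the same line are hits.
    """
    cached_lines = set()
    hits = 0
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        if line in cached_lines:
            hits += 1
        else:
            misses += 1
            cached_lines.add(line)
    return hits, misses


# Eight sequential 4-byte instruction fetches starting on a line boundary.
sequential_run = [0x1000 + 4 * i for i in range(8)]
hits, misses = replay(sequential_run)
hit_ratio = hits / (hits + misses)
```

In this sketch the first fetch misses and the remaining seven hit, giving a hit ratio of 7/8 for the run. A taken branch to a line not yet in the cache would start the pattern again with another miss.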
To overcome this disadvantage, processors having embedded L2 caches, in addition to the smaller capacity embedded L1 caches disposed therein and which run at processor speed, are desirable. Processors having embedded L2 caches running at processor speeds provide significant increases in performance while meeting requirements for cost, power and space. Bearing the power, cost and space requirements in mind, an L2 cache having 256K to 512K bytes of memory can be placed on a processor. Unfortunately, many L2 subsystems are only two-way set associative. This means that for a given tag there are only two addresses stored in the cache for that tag. The stored addresses may be referred to as the way or the index. In a complex program or system having many branches and subroutine calls, this sort of cache can significantly reduce the hit ratio, because a large number of the addresses fetched have the same tag and thereby compete for the very limited number of address slots or ways.
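The competition for a limited number of ways can be illustrated with a minimal simulation. This is a sketch under stated assumptions: the class name `TwoWaySetCache`, the four-set geometry, the 32-byte line size and the LRU replacement within a set are all hypothetical choices for the example. The sketch also uses the common naming in which a "set index" selects the pair of slots and a "tag" distinguishes the addresses stored there, which differs slightly from the wording above.

```python
from collections import OrderedDict


class TwoWaySetCache:
    """Toy set-associative cache with LRU replacement inside each set."""

    def __init__(self, num_sets, ways=2, line_bytes=32):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # One ordered map per set: oldest entry first, newest last.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        index = line % self.num_sets   # which set the line maps to
        tag = line // self.num_sets    # identifies the line within the set
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)   # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(entries) >= self.ways:
            entries.popitem(last=False)  # cast out the least recently used way
        entries[tag] = True
        return False


# Three addresses that all map to set 0 (stride = num_sets * line_bytes).
conflicting = TwoWaySetCache(num_sets=4)
for _ in range(4):
    for addr in (0, 128, 256):
        conflicting.access(addr)

# Only two such addresses fit in the two ways without conflict.
fitting = TwoWaySetCache(num_sets=4)
for _ in range(4):
    for addr in (0, 128):
        fitting.access(addr)
```

With three addresses cycling through a two-way set, every access in this sketch is a miss, because each new line casts out the one that is needed next. With only two addresses, the first two accesses miss and the remaining six hit.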
Therefore, due to size limitations and the type of L2 cache, the misses may still represent a sizable portion of the fetches done by the processor. A miss will result in fetching from the next level of memory. This can mean significantly more CPU cycles, e.g., as many as 75 CPU cycles or more, to fetch a cache line. Of course, the cycle time is longer for accesses from main memory than for access from embedded caches.
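The effect of the miss penalty on the average cost of a fetch can be estimated with a simple weighted average. The sketch below assumes a hit costs 1 CPU cycle (an assumption for illustration) and uses the 75-cycle figure mentioned above as the total cost of a miss; the function name `average_fetch_cycles` is hypothetical.

```python
def average_fetch_cycles(hit_ratio, hit_cycles, miss_cycles):
    """Expected cycles per fetch: hits and misses weighted by probability."""
    return hit_ratio * hit_cycles + (1.0 - hit_ratio) * miss_cycles


HIT_CYCLES = 1    # assumed cost of a hit, for illustration only
MISS_CYCLES = 75  # example miss cost taken from the text above

at_90_percent = average_fetch_cycles(0.90, HIT_CYCLES, MISS_CYCLES)
at_99_percent = average_fetch_cycles(0.99, HIT_CYCLES, MISS_CYCLES)
```

Under these assumptions, a 90% hit ratio gives roughly 8.4 cycles per fetch and a 99% hit ratio gives roughly 1.74, which is why even a modest share of misses can dominate fetch time.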
Further complicating the main memory access times is the desire for these systems to have a shared memory between the processor(s) and data moving components (input/output devices). When designing complex systems, there are also competing design constraints. The systems are required to be accepted into standard slots provided in computer hardware. In such environments, there are also power and cost considerations that often prevent the use of the fastest processors available in servers or desktop PCs.
For these environments where space, cost and power are limitations, the system designers are faced with very limited options regarding how to minimize main memory accesses while meeting the power dissipation and cost budgets and also meeting physical space constraints.
In addition to having high hit ratios on embedded L1 and L2 caches, it is often desirable to design additional caches, which can be used to reduce data access times and to minimize the number of data requests made to main memory. There are also specialized caches used by virtual memory systems to keep frequently accessed virtual page translation tables in memory with short access times.
Traditional caching and cast-out schemes rely on some sort of algorithm, e.g., Least Recently Used (LRU), to determine which cache line to invalidate or cast out in favor of a newly accessed item. Unfortunately, such algorithms do not have access to information such as how often a certain cache line is fetched, whether a particular address seems to get cast out frequently, and which addresses are likely to be accessed once a given address has been fetched. Such information is very difficult to manage and act upon with traditional caching hardware.
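As a minimal sketch of this limitation, the following models a small fully associative LRU cache and records which lines it casts out. The function name `lru_victims`, the two-line capacity and the trace are assumptions chosen for the example.

```python
from collections import OrderedDict


def lru_victims(trace, capacity):
    """Replay a trace of line identifiers and return the lines cast out.

    Only recency is tracked: a hit moves the line to the newest position,
    and the oldest line is cast out when room is needed.
    """
    cache = OrderedDict()
    victims = []
    for line in trace:
        if line in cache:
            cache.move_to_end(line)
            continue
        if len(cache) >= capacity:
            victim, _ = cache.popitem(last=False)
            victims.append(victim)
        cache[line] = True
    return victims


# Line "A" is fetched five times, then "B" and "C" once each.
trace = ["A"] * 5 + ["B", "C"]
victims = lru_victims(trace, capacity=2)
```

In this sketch the heavily used line "A" is the one cast out when "C" arrives, because the policy sees only that "A" was touched least recently and keeps no record of how often it was fetched.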
Attempts have been made to control a processor's internal and external cache memory with a cache controller situated logically and physically external to the data requesting device or processor. However, an external cache controller is severely disadvantaged in performing this function. Because the cache controller is located external to, and at some distance from, the processor, it is unable to operate at processor speeds. The processor issues data requests faster than the external cache controller can service them. The result is that the CPU may encounter stalls in its pipeline as the latency increases.
Also, according to current methods, a program that fetches sequential data brings in a cache line and then has hits against the data in that cache line. When the program reaches the next cache line, it must bring that line in as well and suffer the long latency involved in fetching from main memory. If the cache system is sophisticated enough to perform a speculative read so that the data is already in cache, there is still the chance that the data will never be used.
However, since there is no mechanism to indicate that the line is speculative, it will age just like the other cache lines in that set. A cache line that has been accessed before may have a higher probability of being accessed again than one that was simply prefetched. Unfortunately, unless there is a way for the cache controller to differentiate between the two, the prefetched line may clutter the cache until it is eventually cast out.
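The clutter caused by undifferentiated speculative lines can be sketched with a toy model. The assumptions here are a fully associative LRU cache, a prefetcher that speculatively reads the next sequential line on every access, and the hypothetical function name `run_with_next_line_prefetch`. The per-line flag exists only so the simulation can count unused prefetches; the replacement decision ignores it, mirroring a controller that cannot tell the two kinds of line apart.

```python
from collections import OrderedDict


def run_with_next_line_prefetch(trace, capacity):
    """Return (hits, unused_prefetch_castouts) for a trace of line numbers."""
    cache = OrderedDict()  # line -> True while prefetched and never used
    hits = 0
    unused_prefetch_castouts = 0

    def insert(line, speculative):
        nonlocal unused_prefetch_castouts
        if line in cache:
            return
        if len(cache) >= capacity:
            # The oldest line is cast out regardless of how it got here.
            _, was_unused = cache.popitem(last=False)
            if was_unused:
                unused_prefetch_castouts += 1
        cache[line] = speculative

    for line in trace:
        if line in cache:
            hits += 1
            cache[line] = False      # a prefetched line has now been used
            cache.move_to_end(line)
        else:
            insert(line, False)      # demand fetch
        insert(line + 1, True)       # speculative read of the next line
    return hits, unused_prefetch_castouts


sequential = run_with_next_line_prefetch([0, 1, 2, 3], capacity=4)
scattered = run_with_next_line_prefetch([0, 10, 20, 30], capacity=4)
```

For the sequential trace the speculative reads pay off: three of the four accesses hit and no unused prefetch is cast out. For the scattered trace there are no hits, and two prefetched lines occupy slots until they are cast out without ever being used.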
It can be seen then that there is a need for a method and apparatus providing non-L1 instruction caching using prefetch to increase the hit ratio of a computing system.