1. Field of the Invention
This invention relates in general to processor caching methods, and more particularly to a method and apparatus for increasing processor performance in a computing system by optimizing the hit ratio of data requests from data requesting devices by providing the data requesting devices with a caching assistant embedded thereon.
2. Description of Related Art
There has been a dramatic increase in the amount and type of data that computing systems are processing of late. Computing systems routinely process two dimensional and three dimensional images, graphics, audio and video media. Networking has allowed information to be shared throughout the world. The consumer demands seamless access to data and a high level of performance of media containing vast quantities of data. Computing systems are being required to perform more demanding tasks to satiate consumer's media hunger.
In order to increase performance, processors may be provided with embedded caches to store data logically and physically closer to the processor. An embedded cache operates at the processor frequency and therefore allows access to data more quickly than external caches.
Many computing systems like storage controllers, routers and servers use processors to control various hardware components. The processors run real time operating systems, handle interrupts, set up direct memory access transfers, check control information for validity, translate addresses and perform other functions. Because these functions are in the critical functional path, the overall performance of these routines is greatly influenced by processor performance.
Numerous major factors contribute to processor performance. One such factor is the core operating frequency of the processor. Another factor is the amount and type of level 1 (L1) data and instruction caches resident on the processor. Caches are classified by the level they occupy in the memory hierarchy. Early computers employed a single, multichip cache that occupied one level of the hierarchy between the processor and the main memory. Two developments made it desirable to introduce two or more cache levels in a high performance system: the feasibility of including part of the real memory space on a microprocessor chip and growth in the size of main memory in computers. A level one (L1) or primary cache is an efficient way to implement an on-chip memory.
An additional factor is the amount and type of level 2 (L2) caches present, if any. An additional memory level can be introduced via either on-chip or off-chip level two (L2) secondary cache. The desirability of an L2 cache increases with the size of main memory. As main memory size increases further, even more cache levels may be desirable. The L1 cache is higher in the cache hierarchy than the L2 cache. The L1 cache contains less data than the L2 cache and all the data and information that is stored on the L1 cache is also stored on the L2 cache.
The type and stages of the data transfer pipeline within the processor is another important factor affecting processor performance. Another important factor contributing to processor performance is the number of instructions which can be executed simultaneously by the processor.
Effective cache subsystems will desirably provide instruction and data availability with minimum latency. A processor or another data requesting device requests a specific access (piece of information or data). If the access is immediately available in the cache, the request is considered a hit. However, if the access is not already present and available in the cache, this is considered a miss.
By way of definition, a hit ratio is a measure of the probability that an access will be resident in a particular cache. High hit ratios result in lower processing times for similar units of work. That is, if L1 caches run at processor speeds and have the capacity to contain the entire code load, including all necessary peripheral data and instructions, then the resulting processing time would be the smallest time possible. The processor would then be operating at maximum or peak performance.
However, the reality is that modern code loads for complex programs and systems are very large, often many megabytes. Therefore, it is impractical to provide processors with embedded L1 caches having such large capacities. For example, practical constraints have limited L1 caches in processors to 32K bytes or less in most cases. A split L1 cache contains both a data cache and a instruction cache. Sizes of each cache vary considerably, but are commonly 32K each or less. Instruction hit ratios, using economically feasible L1 capacities currently available, have tended to be disappointingly low. To overcome this disadvantage, processors having embedded L2 caches, in addition to the smaller capacity embedded L1 caches disposed therein and which run at processor speed, are desirable.
Processors having embedded L2 caches running at processor speeds provide significant increases in performance while meeting requirements for cost, power and space. Bearing the power, cost and space requirements in mind, an L2 cache having 256K to 512K bytes of memory can be placed on a processor. Unfortunately, many L2 subsystems are only 2 way set associative. This means that for a given tag there are only 2 addresses stored in the cache for that tag. In a complex program or system having lots of branches and/or lots of subroutine calls, this sort of cache can detract significantly from the hit ratio.
Therefore, due to size limitations and the type of L2 cache, the misses may still represent a sizable portion of the fetches done by the processor. A miss will result in fetching from the next level of memory in the memory hierarchy. This can mean significantly more CPU cycles, e.g., as many as 75 CPU cycles or more, to fetch a cache line. Of course, the cycle time is longer for data accesses from main memory than for data access from embedded caches.
Further complicating the main memory access times is the desire for these systems to have a shared memory between the processor and data moving components (input/output devices). When designing complex systems there are also competing design constraints. The systems are required to be accepted into standard slots provided in computer hardware. In such environments, there are power and cost considerations which often prevent the use of the fastest processors available in servers or desktop PCs.
For these environments where space, cost and power are limitations, the system designers are faced with very limited options regarding how to minimize main memory accesses while meeting the power dissipation and cost budgets and also meeting physical space constraints.
It is desirable to design additional caches residing external to the processor which can be used to reduce data access times and make data requests to the main memory as few as possible. There are also specialized caches used by virtual memory systems to keep virtual page translation tables which are accessed frequently in memory with short access times.
Traditional caching and cast out algorithms involve some sort of algorithm, e.g., Least Recently Used (LRU), in order to determine which cache line to invalidate or cast out in favor of a newly accessed item. Unfortunately, such algorithms do not have access to information such as: how often a certain cache line is fetched, does a particular address seem to get cast out frequently, what addresses are likely to get accessed once a given address has been fetched. Such information is very difficult to manage and make decisions upon given traditional caching hardware.
Controlling a processor's internal and external cache memory has been attempted via use of a cache controller being situated logically and physically external to the data requesting device or processor. However, an external cache controller is severely disadvantaged in performing the function of controlling the internal and external cache memory of a processor. Because the cache controller is located external to the processor and at some distance from the processor, the cache controller is unable to operate at processor speeds. The processor performs data requests faster than the external cache controller is able to comply with. The result is that the CPU may encounter stalls in its pipeline as latency increases.
It can be seen that there is a need for a method and apparatus for increasing the processor performance of a computing system.