1. Field of the Invention
The present invention relates generally to memory systems, and more particularly to optimizing memory utilization for a requesting processor.
2. Related Art
Processors are becoming exceptionally fast, often operating at speeds of hundreds, if not thousands, of millions of cycles per second. Memory speed has also increased, but is still slow relative to processor speed. In order to allow processors to use all of their power and speed, a high-speed memory known as cache is used as the interface between the fast processor and the slower main memory. When cache memory is built into the processor itself, it usually runs at the same speed as the processor. External caches typically run slower than the processor, but faster than the speed of main memory. Without cache, the processor must read and write directly to the main system memory, which limits the processor's maximum speed to that of the memory.
Cache memory is most often divided into an Instruction Cache (ICache) and Data Cache (DCache). Because the processor tends to access instructions (which comprise the program that is being executed) in a different manner than it accesses the data that is used by the program, keeping the ICache separate from the DCache improves system efficiency.
For a given capacity, cache memory is generally physically much larger than standard memories, such as DRAM, due to the requirements for more speed and extra tag information. It is difficult to meet the timing, routing and power requirements of the processor when very large caches are used. The size of the chip usually increases when a larger cache is integrated, which increases the cost of the chip. System and processor designers must strike a balance between the performance gained from a larger cache and the total cost of the processor or system. As a result, the cache is usually kept to a fairly small size of 4 to 256 kilobytes, especially for the cost-sensitive embedded systems market. As the cache size shrinks, issues like cache utilization and efficiency become very important. In some cases, additional hardware can be added to address the problem, but in most cases, hardware size must be limited. Therefore, there is a need to find more efficient ways to use the available cache.
While the size and cost of the processors are being driven down, the size and complexity of the applications running on those processors are growing. The demands for multimedia and broadband communications applications are stressing system components to their maximum, fueling the demand for more power and speed. Except for the most trivial applications and the most high-end processors, the size of the application almost always exceeds the capacity of the cache memory.
It is not uncommon to see embedded systems applications that exceed two megabytes. PC-based applications can be tens of megabytes or larger. However, a very small percentage of the code usually executes most frequently. Quite often this code can fit within the cache. Even so, it is possible for this small portion of code to make very inefficient use of the cache. The inefficiency can be so bad that performance is almost as low as if there had been no cache at all. In that case, the processor gets only the benefit of the burst reads and writes to main system memory.
As the processor executes an application program, it fetches instructions from the ICache. The ICache is responsible for ensuring that the instructions being fetched are present in the cache and for reading instructions from the main memory, or a second level cache, when they are not present. The same is true for data in the DCache. The program causes the processor to read or write data through the DCache, which is responsible for fetching or flushing information from the main memory as needed. The cache may also pre-fetch from the main memory using various prediction algorithms in an attempt to minimize the amount of time that the processor has to wait for instructions or data to be fetched into the cache.
When the processor tries to read or write an address, the cache must map the address to a cache line and determine whether or not the cache line contains the requested information. There are a number of algorithms for this mapping. One common algorithm is to use some number of low-order bits from the address to form an index for the cache line. For example, if the cache line size is sixteen bytes and there are 256 cache lines, then the lower four bits of the address could be used as the byte address within the cache line and the next eight bits could be used as the cache line index. Additionally, the set associativity comes into play. Given a mapping to a cache line, the cache must then check to see which set the address has been mapped to. For a two-way set associative cache with 256 lines, there are effectively 256 pairs of cache lines.
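The sixteen-byte, 256-line example above can be sketched in a few lines of Python. This is only an illustration of the low-order-bit mapping described in the text, not a description of any particular cache; the function name split_address is hypothetical, and treating the remaining high-order bits as the tag is an assumption made for the sketch.

```python
LINE_SIZE = 16    # bytes per cache line (from the example in the text)
NUM_LINES = 256   # number of cache line indices (from the example in the text)

OFFSET_BITS = LINE_SIZE.bit_length() - 1  # 4 bits select the byte within a line
INDEX_BITS = NUM_LINES.bit_length() - 1   # 8 bits select the cache line


def split_address(addr):
    """Split a byte address into (tag, index, offset).

    Assumption: the bits above the index are kept as the tag.
    """
    offset = addr & (LINE_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


# Two addresses exactly 4096 bytes apart share a cache line index
# and differ only in the tag.
print(split_address(0x0345))  # (0, 52, 5)
print(split_address(0x1345))  # (1, 52, 5)
```

Addresses that differ only in the high-order bits compete for the same cache line index, which is why the placement of frequently executed code in memory affects how well the cache is used.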
The cache uses a single algorithm to map the requested address to a cache line, regardless of the number of sets, but the address could be placed in any of the sets within that cache line. The cache usually contains extra information (tags) that determines which set within the cache line contains the requested address. The more sets there are in the cache, the more addresses it can map to a cache line without causing existing data to be flushed. However, given a constant total cache size, adding more sets will decrease the number of cache lines. In other words, an 8k cache with 4 sets will have half as many cache lines as an 8k cache with 2 sets.
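The trade-off between the number of sets and the number of cache lines can be checked with simple arithmetic. The sketch below follows the text's usage of "set" (what many references call a "way"); the function name lines_per_cache is hypothetical, and the sixteen-byte line size is an assumption carried over from the earlier example, since the text does not state a line size for the 8k cache.

```python
def lines_per_cache(total_bytes, line_size, num_sets):
    """Number of cache line indices for a fixed total cache size.

    Each index holds num_sets lines of line_size bytes.
    """
    return total_bytes // (line_size * num_sets)


EIGHT_K = 8 * 1024
LINE_SIZE = 16  # assumed line size, in bytes

print(lines_per_cache(EIGHT_K, LINE_SIZE, 2))  # 256
print(lines_per_cache(EIGHT_K, LINE_SIZE, 4))  # 128
```

Under these assumptions, doubling the number of sets from 2 to 4 halves the number of cache lines from 256 to 128, as stated above.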
If the requested information is present in one of the sets of the computed cache line, the cache provides the data to the processor, and everything proceeds at full speed. If the information is not present, the cache must block the processor as it fills one or more cache lines in order to satisfy the request. Filling the cache lines causes instructions or data already present to be flushed and/or discarded.
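The hit and miss behavior described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a model of any real cache: the class TinyCache and the helper run are hypothetical names, "set" follows the text's usage (often called a "way" elsewhere), and least-recently-used replacement is assumed because the text does not name a replacement policy.

```python
from collections import OrderedDict


class TinyCache:
    """Toy cache: num_lines indices, each holding up to num_sets tags."""

    def __init__(self, num_lines=256, line_size=16, num_sets=2):
        self.num_lines = num_lines
        self.line_size = line_size
        self.num_sets = num_sets
        # One ordered tag store per cache line index, oldest entry first.
        self.lines = [OrderedDict() for _ in range(num_lines)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return True on a hit, False on a miss (which fills the line)."""
        block = addr // self.line_size
        index = block % self.num_lines
        tag = block // self.num_lines
        entries = self.lines[index]
        if tag in entries:
            entries.move_to_end(tag)      # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(entries) >= self.num_sets:
            entries.popitem(last=False)   # discard least recently used (assumed policy)
        entries[tag] = True
        return False


def run(cache, addrs, repeats):
    """Access addrs in order, repeats times; return (hits, misses)."""
    for _ in range(repeats):
        for addr in addrs:
            cache.access(addr)
    return cache.hits, cache.misses


# Three addresses 4096 bytes apart all map to cache line index 0.
conflicting = [0x0000, 0x1000, 0x2000]
print(run(TinyCache(num_sets=2), conflicting, 10))  # (0, 30)
print(run(TinyCache(num_sets=4), conflicting, 10))  # (27, 3)
```

With two sets, the three conflicting addresses evict one another on every access, so the loop never hits even though the cache is almost entirely empty. With four sets, only the first three accesses miss.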
As with most software and hardware systems, the underlying architecture of an application is very important. Some programmers rely on the compiler and linker to do a reasonable job organizing the application code in cache, and live with the results. In very special circumstances (with very small programs), the programmers may hand-code the application for efficiency in memory, CPU, and cache utilization. The programmers may choose to rewrite and re-architect the software so that the important code is all in one module and is guaranteed to be adjacent and minimally overlapping in cache. Of course, this is not feasible for most systems and is extremely difficult and tedious in even the most limited of cases.
In other words, rewriting the code to place the performance-critical functions in a single monolithic file is one way to obtain cache-efficient code. However, a code rewrite would destroy the architecture and modularity of the software, reduce the flexibility to use it in other application domains, and increase the difficulty of maintenance. A code rewrite would also cause unwanted delays in the product delivery schedule and increase the risk of introducing bugs. Additionally, this approach cannot accommodate code that is not immediately part of a software application, such as third-party libraries and operating system components. Obviously, this approach is extremely undesirable.
Therefore, what is needed is a way to ensure cache efficiency that overcomes the aforementioned problems.