1. Field of the Invention
This invention relates generally to increasing cache efficiency, and particularly to a system and method employing intelligent data allocation to processor cache to increase the probability of getting cache hits while reducing paging.
2. Description of the Related Art
Modern high-performance data processors use a private high-speed hardware-managed buffer memory in front of the main data store to reduce average memory access delay at the Central Processing Unit (CPU). This high-speed buffer is denominated a "cache" because it is usually transparent to the applications programmer. Generally, the theory behind such a cache is that data which was once referenced by a program is likely to be needed again. By keeping such likely-to-be-needed data in high-speed memory, overall performance is improved.
Since cache is transparent and operating systems typically have built-in memory allocation processes, most programmers do not design their programs with cache allocation in mind. Unfortunately, most cache allocation procedures fail to anticipate accurately the sizes of data that a program is most likely to need. Yet, allocation and use of memory have a significant effect on program execution performance. Recent empirical results from testing by computer scientists show that caching allocation inefficiencies, such as cache misses, can increase program execution time by up to 25%. Such empirical results as well as popular allocation procedures are discussed in Dirk Grunwald, Benjamin Zorn, and Robert Henderson, "Improving the Cache Locality of Memory Allocation", Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, Albuquerque, N. Mex., June 1993, pp. 177-195.
The performance of cache memory is often characterized by "hit/miss" ratios. A "hit" means that a reference to the cache generated by a requesting CPU executable process locates the desired data item in high-speed cache memory, rather than in lower speed main memory. A "cache miss" is registered if the data is unavailable in cache memory. The performance of cache memory can also be measured indirectly in terms of paging. Paging refers to the process of transferring data blocks from secondary storage (e.g., a disk drive) to main memory. If requested data is not in main memory, then paging is required to get the data. When the desired data is not located in main memory, a page fault is said to have occurred. In terms of cache, a cache miss and a page fault may be related. A significant increase in paging indicates poor cache allocation and inefficient program performance. However, because of the direct relationship between cache performance and cache miss activity, the greatest benefit is derived from reducing cache misses.
Recently, processor speeds have been greatly increasing, and this enhances the effects of cache memory allocation. An example of the radical increase in processor speeds can be seen by considering processors in personal computers. Early versions of processors for personal computers ran at about 4 MHz, but speeds of 100 MHz are now commercially available to the general public. Processor speed is expected to continue this rapid ascent. Faster processor speeds have shifted interest from slower main memory to high-speed cache memory. Cache memory must be able to provide instructions and data to processors at very fast rates to keep up with the processor. New processors commonly use a smaller on-chip primary cache (processor cache) and a larger secondary cache in front of main memory.
Processor cache is small, fast storage that is dedicated to the processor. When the processor executes an instruction that references data in memory, the processor first looks in its on-chip cache for the data. If there is a cache hit, the data in the processor cache is used and there is no reason to access slower main memory cache. If the data is not in the processor cache, the main memory cache is referenced.
Dynamic storage allocators are responsible for allocating and deallocating memory, including processor cache. Generally, such dynamic storage allocators operate according to some computer procedure to achieve a good allocation of available memory. Inherent in the procedure design is tension between the need to conserve memory on the one hand, and the need to minimize the processor overhead associated with implementing the procedure on the other. Such allocators do not allocate processor cache directly; rather, cache management hardware maps the allocation of main memory to a specific cache location based on a hash of the main memory address. Although allocation of cache memory is a secondary effect of the allocation of main memory, for the sake of brevity reference will be made to cache allocation as the ultimate result of such allocation without reference to the steps in between, such as the intervention of cache hardware to map the allocation of main memory to a specific cache location.
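By way of illustration only, the mapping of a main memory address to a cache location may be sketched as follows. The sketch assumes a direct-mapped cache in which the "hash" is simply the block number modulo the number of cache lines; the line size, the line count, and the names cache_line_index and simulate are hypothetical and are not taken from any particular processor.

```python
LINE_SIZE = 32    # bytes per cache line (assumed for illustration)
NUM_LINES = 256   # number of lines in the cache (assumed), 8 KB in total


def cache_line_index(address: int) -> int:
    """Map a main-memory byte address to a cache line index."""
    return (address // LINE_SIZE) % NUM_LINES


def simulate(addresses):
    """Count cache hits and misses for a sequence of byte addresses."""
    tags = [None] * NUM_LINES          # tag currently held by each line
    hits = misses = 0
    for addr in addresses:
        block = addr // LINE_SIZE      # which memory block is referenced
        index = block % NUM_LINES      # which cache line it maps to
        tag = block // NUM_LINES       # distinguishes blocks sharing a line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag          # the new block evicts the old one
    return hits, misses
```

Under these assumptions, addresses 0 and 8192 map to the same cache line, so data that an allocator places at those two addresses evict one another even though the rest of the cache may be empty. This is one way in which main memory allocation decisions indirectly determine cache behavior.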
Procedures for dynamic storage allocation can be divided into three broad categories: sequential-fit procedures; buddy-system methods; and segregated storage procedures. The first two are fairly complicated and therefore consume a lot of processor overhead. Generally, each of these two requires searching a doubly-linked freelist for free blocks sufficiently large for receiving data. Searching such a freelist, in addition to increasing CPU overhead, may also lead to increased paging and cache misses because much more time is spent searching the freelist than allocating cache memory and locating cache data. The reason that the sequential-fit procedures and the buddy-system methods retain favor is that they each allocate cache space fairly efficiently, without much wasted space. On the other hand, because of the complexity involved with each, neither is a very good choice for a small amount of available cache that must be allocated and deallocated extremely fast, such as processor cache. This leaves segregated storage procedures as the primary viable candidate for processor cache.
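The cost of a sequential-fit search may be illustrated by the following minimal sketch of a first-fit procedure. For brevity, it represents the freelist as an ordinary Python list of (offset, size) pairs rather than the doubly-linked list described above; that representation and the names first_fit and allocate are assumptions made for illustration.

```python
def first_fit(free_blocks, request):
    """Return (index, probes) for the first free block that can hold
    `request` bytes, or (None, probes) if no block is large enough."""
    probes = 0
    for i, (offset, size) in enumerate(free_blocks):
        probes += 1                    # each probe touches another freelist entry
        if size >= request:
            return i, probes
    return None, probes


def allocate(free_blocks, request):
    """Carve `request` bytes from the first fitting block.

    Returns the offset of the allocated storage, or None on failure."""
    i, _ = first_fit(free_blocks, request)
    if i is None:
        return None
    offset, size = free_blocks[i]
    if size == request:
        del free_blocks[i]             # exact fit: block leaves the freelist
    else:
        free_blocks[i] = (offset + request, size - request)  # keep remainder
    return offset
```

In this sketch a request for 100 bytes against the freelist [(0, 16), (64, 32), (128, 256)] examines all three entries before succeeding. Each entry examined is a separate memory reference, which suggests why a long search can itself cause cache misses.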
Segregated storage procedures include a wide variety of approaches; however, one that has enjoyed widespread popularity is known as "BSD." The BSD procedure derives its name from the 4.2 BSD Unix software release in which it was first distributed in February, 1982. Essentially, the procedure rounds object size requests to powers of two minus a constant while maintaining a freelist of objects of each size class. If no objects of a particular size class are available, more storage is allocated. Because the procedure is so simple, it can be executed very quickly without much processor overhead. Unfortunately, it also tends to waste a considerable amount of cache space, especially if the size requests are often slightly larger than the size classes provided. A considerable amount of fragmentation results as user data ends up distributed in several different places in cache. Fragmentation violates the well-known principle that it is generally better to cluster similar items in one place to increase efficiency of retrieval and so increase performance of execution. When similar items, such as all of the data of a user data block, are clustered together in memory, the memory is said to have good "spatial locality".
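The rounding behavior described above may be sketched as follows. This is not the 4.2 BSD source code; it is a minimal illustration in which the constant subtracted from each power of two (OVERHEAD), the smallest size class (MIN_SHIFT), and all function and class names are hypothetical.

```python
OVERHEAD = 8     # constant subtracted from each power of two (assumed)
MIN_SHIFT = 4    # smallest class holds 2**4 - OVERHEAD = 8 bytes (assumed)


def size_class(request):
    """Return the exponent k of the smallest class with 2**k - OVERHEAD >= request."""
    shift = MIN_SHIFT
    while (1 << shift) - OVERHEAD < request:
        shift += 1
    return shift


def usable_size(request):
    """Bytes actually reserved for the caller after rounding."""
    return (1 << size_class(request)) - OVERHEAD


def wasted(request):
    """Bytes reserved but never requested."""
    return usable_size(request) - request


class SizeClassAllocator:
    """Keeps one freelist of block identifiers per size class."""

    def __init__(self):
        self.freelists = {}    # exponent k -> list of free block ids
        self.next_id = 0

    def malloc(self, request):
        k = size_class(request)
        freelist = self.freelists.setdefault(k, [])
        if freelist:
            return freelist.pop(), k   # reuse a freed object of this class
        block = self.next_id           # none available: obtain more storage
        self.next_id += 1
        return block, k

    def free(self, block, k):
        self.freelists.setdefault(k, []).append(block)
```

With these assumed constants, a request for 56 bytes fits its size class exactly, while a request for 57 bytes is rounded up to 120 bytes and wastes 63 of them. This illustrates the case in which size requests are slightly larger than the size classes provided.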
Another segregated storage procedure that is relatively fast to execute is known as "QUICKFIT." Unlike the BSD procedure, which rounds size requests to powers of two, QUICKFIT rounds to multiples of word sizes, e.g., 4, 8, or 16 bytes. Internal fragmentation is reduced because the size of the pools allocated usually corresponds more closely to that requested by application programs. However, any round-off procedure is subject to internal fragmentation from the wasted space that is inherent in an approximation routine. Such fragmentation leads to an increased probability of cache misses and paging requirements. The lower processor overhead of easy-to-compute round-off approximations imparts swift execution properties to the segregated storage procedures and makes them the best available allocators for processor cache. Yet, it would clearly be an improvement in the art to reduce the internal fragmentation of dynamic storage procedures without significantly increasing the amount of CPU overhead associated with their execution. It would also be an improvement to do so without violating the principles of spatial locality, to avoid increasing the effort of locating data.
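The difference in internal fragmentation between the two rounding rules may be illustrated by the following sketch. It is not an implementation of QUICKFIT; it compares only the rounding step, uses plain powers of two for simplicity (omitting the constant used by the BSD procedure), and assumes a word size of 8 bytes. All function names are hypothetical.

```python
def round_to_multiple(request, word=8):
    """Round a request up to the next multiple of the word size."""
    return ((request + word - 1) // word) * word


def round_to_power_of_two(request):
    """Round a request up to the next power of two."""
    size = 1
    while size < request:
        size *= 2
    return size


def internal_fragmentation(requests, rounder):
    """Total bytes reserved beyond what was requested."""
    return sum(rounder(r) - r for r in requests)
```

For example, a 65-byte request is rounded to 72 bytes under the word-multiple rule, wasting 7 bytes, but to 128 bytes under the power-of-two rule, wasting 63 bytes. Both rules still waste some space whenever a request is not an exact multiple of the rounding unit.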