This invention relates to microprocessors, and, more particularly, to a method and apparatus which improves the operational efficiency of microprocessors having on-chip cache by enabling software management of at least a portion of the cache.
The electronics industry is in a state of evolution spurred by the seemingly unquenchable desire of the consumer for better, faster, smaller, cheaper and more functional electronic devices. In its attempt to satisfy these demands, the industry must constantly strive to increase the speed at which functions are performed by microprocessors. Videogame consoles are one primary example of an electronic device that constantly demands greater speed and reduced cost. These consoles must be high in performance and low in cost to satisfy the ever increasing demands placed on them. The instant invention is directed to increasing the speed at which microprocessors can process information by improving the efficiency with which data and/or instructions can be loaded for processing.
A cache is a high speed memory that is provided on the microprocessor chip for the purpose of reducing the number of times that data required for executing commands must be retrieved from main memory. Cache devices provide a close and convenient place for storing data and/or instructions to be used by the control unit of the microprocessor in a fast and efficient manner. Today, all high-performance microprocessors incorporate at least one on-chip level one (L1) cache for storing previously used data and/or instructions.
Main memory is external to the microprocessor and access thereto is provided through a bus which connects the microprocessor to the main memory. The bus connecting the microprocessor and the main memory is controlled by a Bus Interface Unit (BIU). Because main memory accesses must go through the BIU and the bus to obtain the requested data from the off-chip memory, accessing this memory is relatively inconvenient and slow as compared to accessing the on-chip cache.
With today's technology, accessing the off-chip main memory can take anywhere from ten to hundreds of CPU clock cycles (a time unit by which the microprocessor or central processing unit (CPU) operates). In contrast, accessing on-chip memory, such as a memory designed to operate as an on-chip cache, can take as few as only one or two CPU clock cycles. Thus, data can be retrieved from a cache at least about ten times faster than the time that it would take to retrieve that same data from main memory. As a result, effective use of the cache can be a critical factor in obtaining optimal performance for applications running on a microprocessor. The drastic time difference between loading desired code or data from an on-chip cache as compared to loading from the main memory is so great (an order of magnitude or more) that effective cache management can be a dominant factor in determining the speed of an application executed by the microprocessor, or even the speed of the entire system built around the microprocessor.
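The effect of this latency gap on overall speed can be illustrated with the usual weighted-average calculation. The following C sketch is only illustrative: the function name and the cycle counts passed to it are assumptions chosen to match the ranges mentioned above, not measurements of any particular microprocessor.

```c
/* Average number of CPU clock cycles per memory access for a given
 * cache hit rate. All inputs are illustrative values supplied by the
 * caller; nothing here is measured from real hardware. */
double average_access_cycles(double hit_rate,
                             double hit_cycles,
                             double miss_cycles)
{
    /* Hits are served by the cache, misses by main memory. */
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles;
}
```

For example, with an assumed 2-cycle cache and 100-cycle main memory, a 50% hit rate gives 51 cycles per access on average, while a 90% hit rate gives roughly 11.8 cycles, which is why the hit rate tends to dominate application speed.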
Generally speaking, a cache operates by storing data and/or instructions that have been previously requested by the control unit and retrieved from main memory in the on-chip cache for possible use again by the control unit at a later time. If a second request is made by the control unit for that same data, the data can be quickly retrieved from the cache rather than having to again retrieve the data from the off-chip main memory. In this manner, the speed of the application can be increased by minimizing the need to access the relatively slow main memory.
One limitation, however, regarding the use of cache is that size and cost factors limit the cache to a size that is small relative to the size of the main memory. As a result, the cache quickly becomes full with data that has been retrieved from main memory or elsewhere, thereby preventing additional data required by the control unit from being stored in the cache. Typically, a microprocessor, such as the microprocessors in IBM's PowerPC (IBM Trademark) family of microprocessors (hereafter "PowerPC"), includes a 32 kilobyte (32K) on-chip level one (L1) instruction (I) cache and a 32K L1 data (D) cache (Harvard Architecture), as well as a level two (L2) cache providing additional on-chip cache functionality. For more information on the PowerPC microprocessors see PowerPC 740 and PowerPC 750 RISC Microprocessor Family User Manual, IBM 1998 and PowerPC Microprocessor Family: The Programming Environments, Motorola Inc. 1994, both of which are hereby incorporated by reference in their entirety.
In view of the size limitation on caches, the microprocessor includes hardware that manages the cache in accordance with an algorithm that attempts to predict which data read from main memory is likely to be needed again in the near future by the processing unit. In other words, the cache control hardware is designed according to an algorithm that tries to predict in advance what data from main memory to maintain in the limited amount of storage space available in the cache for later use by the processing unit. Thus, every microprocessor having such a cache incorporates some type of hardware-implemented algorithm for managing the contents of the cache. An example of such an automatic replacement algorithm used in the PowerPC is a pseudo least-recently-used (PLRU) replacement algorithm.
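By way of illustration only, one common way to approximate least-recently-used replacement is a tree of decision bits per cache set. The C sketch below assumes an 8-way set with seven tree bits; the function names, the bit layout and the associativity are assumptions made for this example and are not asserted to match the hardware of any particular PowerPC implementation.

```c
#include <stdint.h>

/* Tree-based pseudo-LRU for one 8-way set (assumed associativity).
 * Seven decision bits are kept in heap order: node i is stored in bit i
 * and has children 2i+1 (left) and 2i+2 (right).
 * A bit value of 0 means "next victim is in the left subtree",
 * a value of 1 means "next victim is in the right subtree". */

/* Record an access to `way` (0..7): every node on the path from the
 * root to that way is set to point AWAY from the way just used. */
void plru_touch(uint8_t *bits, unsigned way)
{
    unsigned node = 0;
    for (unsigned level = 0; level < 3; level++) {
        unsigned go_right = (way >> (2 - level)) & 1u;
        if (go_right)
            *bits &= (uint8_t)~(1u << node); /* used right, victim left  */
        else
            *bits |= (uint8_t)(1u << node);  /* used left, victim right  */
        node = 2 * node + 1 + go_right;
    }
}

/* Follow the decision bits from the root to find the way to replace. */
unsigned plru_victim(uint8_t bits)
{
    unsigned node = 0;
    for (unsigned level = 0; level < 3; level++) {
        unsigned go_right = (bits >> node) & 1u;
        node = 2 * node + 1 + go_right;
    }
    return node - 7; /* leaves occupy heap positions 7..14 */
}
```

In this sketch, touching ways 0 through 7 in order leaves way 0 as the victim, as true LRU would. Touching way 0 again then selects way 4 rather than way 1, which shows why the scheme is only an approximation of LRU and why its choices cannot suit every access pattern.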
The automatic replacement algorithm used in a particular microprocessor to manage the contents of the cache, however, is not necessarily optimal or even effective for certain applications run by the microprocessor. In other words, the algorithm implemented by cache control hardware does not always result in efficient cache management for all applications designed for the microprocessor. For example, in certain applications the hardware may drop data from the cache right before it is needed a second time, thereby requiring the processor to obtain the dropped data from main memory, even though the desired data was in the cache moments earlier. This problem results from the fact that one cannot predict in advance the needs of every application that may be implemented using the microprocessor. As a result, some applications will not be able to use the cache in an efficient manner, thereby preventing such applications from running as fast as they otherwise could with efficient cache management. In fact, for some applications, the automatic replacement algorithms perform poorly, thereby preventing the desired low-latency memory accesses for which the cache is designed.
One strategy that has been used in the past in connection with caches to improve application performance is to provide in the instruction set of the microprocessor a mechanism that enables software assisted cache management. Most modern microprocessors provide instructions in the instruction set which enable software to assist the cache management hardware to some degree in managing the cache. For example, the PowerPC architecture contains several user-accessible instructions in the instruction set for manipulating the data cache that can significantly improve overall application performance. These instructions are: "block touch" (dcbt); "block touch for store" (dcbtst); "block flush" (dcbf); "block store" (dcbst); and "block set to zero" (dcbz). See Zen and the Art of Cache Maintenance, Byte Magazine, March 1997.
In order to understand the operation of these or similar instructions, it is important to define what a "block" is in this context. A block is the fundamental unit of memory on which the cache operates. The cache handles all memory load and store operations using blocks. The particular block size can vary from one microprocessor to another. For example, the PowerPC 601 uses 64-byte blocks, while the PowerPC 603 and 604 use 32-byte blocks.
Each of the above-identified instructions operates on a pair of general purpose register (GPR) operands whose sum forms the effective address of the memory location(s) to be affected by the instruction. The dcbt and dcbtst instructions provide "hints" to the processor that the addressed data block may be needed by the application in the near future, thereby giving the processor the option of loading, or at least allocating, the memory block in the cache. Thus, when placed appropriately ahead of the anticipated need for data from memory, the dcbt instruction can be used to request that the processor bring data from memory (or from a secondary cache) into the primary data cache, thereby helping to avoid a cache miss, i.e. not finding the desired data in the cache when needed by the processor. It is noted that these instructions only provide "hints" to the processor and do not require that the processor actually load the data block identified by the instructions. In other words, the hints are taken under advisement by the processor which then decides based on other needs whether the data block will be loaded. These hints are treated differently depending on the particular implementation in the PowerPC family. The dcbtst instruction behaves in a similar manner except that it provides the additional hint that the corresponding memory location is going to be overwritten soon. Thus, these two instructions do not provide direct control over the contents of the cache. Moreover, even if the processor does load the identified block, the automatic replacement algorithm may overwrite that data in the cache before it can be used. As a result, these instructions have limited ability to help manage the cache in an efficient manner for a particular application.
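The idea of placing a touch hint ahead of the anticipated need can be sketched in C using the `__builtin_prefetch` builtin offered by GCC and Clang, which a compiler may lower to a touch-style instruction on targets that have one. The 32-byte block size, the prefetch distance and the function name below are assumptions chosen for illustration; as with dcbt, the hint may be ignored by the processor and the result of the computation does not depend on it.

```c
#include <stddef.h>

#define BLOCK_BYTES  32u /* assumed cache block size                    */
#define AHEAD_BLOCKS 4u  /* hypothetical distance to hint ahead, blocks */

/* Sum an array while hinting, once per block, that a block a few
 * blocks ahead will be read soon. The hint is advisory only. */
long sum_with_hints(const int *data, size_t count)
{
    const size_t per_block = BLOCK_BYTES / sizeof(int);
    long total = 0;

    for (size_t i = 0; i < count; i++) {
#if defined(__GNUC__)
        /* Issue one hint at the start of each block, staying in bounds. */
        if (i % per_block == 0 && i + AHEAD_BLOCKS * per_block < count)
            __builtin_prefetch(&data[i + AHEAD_BLOCKS * per_block],
                               0 /* read access */, 3 /* keep in cache */);
#endif
        total += data[i];
    }
    return total;
}
```

The appropriate distance depends on memory latency and on how much work is done per block, and even a well-placed hint gives no guarantee that the block is still in the cache when it is finally used.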
The dcbf and dcbst instructions in the PowerPC instruction set force modified (or dirty) data out of the cache and back into memory. The primary difference between these two instructions is that dcbf not only copies data back to memory (like dcbst), but it also marks the corresponding cache block as invalid. The dcbz instruction in the PowerPC instruction set allocates a block of data in the cache and then initializes the block to a series of zeros. Thus, this instruction modifies data in the cache. This instruction can be a useful tool for boosting performance when zeroing a large block of data. Again, these three instructions have only a limited ability to provide software assistance to the cache management hardware, because the hardware is still in control of the data in the cache and will continue to operate on the entire cache in accordance with the hardware algorithm.
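The differences among these three instructions can be summarized with a small software model of a single cache block. The following C sketch is only a simplified illustration of the behavior described above: the structure, the function names and the 32-byte block size are assumptions, and no real cache hardware is accessed.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 32 /* assumed cache block size */

/* Simplified model of one cache block and its state bits. */
typedef struct {
    bool    valid;             /* block holds usable data               */
    bool    dirty;             /* block was modified since it was loaded */
    uint8_t data[BLOCK_BYTES];
} cache_block;

/* dcbst-like: copy modified data back to memory; block stays valid. */
void model_store_block(cache_block *blk, uint8_t *memory_block)
{
    if (blk->valid && blk->dirty) {
        memcpy(memory_block, blk->data, BLOCK_BYTES);
        blk->dirty = false;
    }
}

/* dcbf-like: copy modified data back, then mark the block invalid. */
void model_flush_block(cache_block *blk, uint8_t *memory_block)
{
    model_store_block(blk, memory_block);
    blk->valid = false;
}

/* dcbz-like: allocate the block and fill it with zeros without first
 * reading main memory; the block now holds modified data. */
void model_zero_block(cache_block *blk)
{
    memset(blk->data, 0, BLOCK_BYTES);
    blk->valid = true;
    blk->dirty = true;
}
```

In this model, none of the three operations prevents the replacement hardware from later choosing the block as a victim, which is the limitation noted above.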
Thus, while some tools are available for programmers to use in order to try and help their particular application utilize the cache in an efficient manner, the cache is still ultimately controlled by the cache management hardware. As a result, if the replacement algorithm implemented by the hardware does not perform well for a particular application, the application will not run in an optimal manner on the microprocessor, regardless of the limited software assistance that can be provided by instructions such as those described above. In other words, the programmer cannot completely control the contents of any portion of the cache with these or similar software techniques. Thus, many applications do not run as fast as they otherwise could if the cache could be managed in a more efficient manner for the particular application. Accordingly, a need exists for improving the use of cache in a microprocessor on an application specific basis. The instant invention satisfies this need.
It is noted that, in addition to cache, some microprocessors have on-chip memory that can be used as a scratch pad memory (SPM), i.e. an on-chip memory that does not perform demand-driven, automatic replacement of memory blocks. An SPM can be used by the application to store, for example, data from a contiguous portion (memory block) of the main memory, wherein a base address of the contiguous memory block is defined in a specific register created for this purpose or is defined using the memory management facilities that exist for accessing other parts of the memory system. However, SPM is a memory, not a cache, and therefore does not provide the advantages or functionality of a cache. Some microprocessors have enabled a certain contiguous range of the available address space to be allocated to the SPM. While use of on-chip SPM can improve the performance of the microprocessor for certain applications, it does not solve the problem of cache management described in detail above.
The instant invention enables more efficient use of on-chip cache, such as a level one data cache in the IBM PowerPC or the like, by enabling the cache to be partitioned into a first portion that operates in a conventional manner under hardware control, and a second portion which can be completely controlled and managed by software.
In a preferred embodiment of the invention, the entire cache can be used as a conventional cache, or the cache can be reconfigured so as to define part of the cache as a locked software controlled cache, referred to herein as a "locked cache" or "scratch pad cache". What is meant by "locked" in this context is that the contents of the locked portion of the cache cannot be altered by the hardware cache management system. By locking the portion of the cache designed for software control, the cache can be managed independently of the hardware cache management system. As a result, poor performance can be avoided even if the hardware algorithm does not work well for the needs of a particular application.
The locking of the cache can be done by, for example, providing a special purpose register (e.g. HID2) having a bit (e.g. 4th bit) which controls whether the cache is to function as a single hardware controlled cache or a partitioned cache including a locked software controlled portion. Other bits in the special purpose register can be used, for example, to control other enhancement options that may be available on the microprocessor.
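The use of a single register bit to select between the two cache configurations can be sketched as plain bit manipulation in C. The sketch below is hypothetical: it assumes a 32-bit register and takes "4th bit" to mean the fourth bit counted from the most significant end (PowerPC documentation conventionally numbers the most significant bit as bit 0). The mask value and function names are assumptions, and the privileged instructions needed to actually read or write a special purpose register are not shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask: fourth bit from the most significant end of an
 * assumed 32-bit register (bit 3 when the MSB is numbered bit 0). */
#define HID2_LOCKED_CACHE_ENABLE (UINT32_C(1) << 28)

/* Return the register value with the partitioning bit set or cleared,
 * leaving all other (unrelated) control bits untouched. */
uint32_t hid2_set_partitioned(uint32_t hid2, bool partitioned)
{
    return partitioned ? (hid2 | HID2_LOCKED_CACHE_ENABLE)
                       : (hid2 & ~HID2_LOCKED_CACHE_ENABLE);
}

/* Report whether the value selects the partitioned (locked) mode. */
bool hid2_is_partitioned(uint32_t hid2)
{
    return (hid2 & HID2_LOCKED_CACHE_ENABLE) != 0;
}
```

A read-modify-write of this kind preserves the other bits of the register, which matters because those bits may control other enhancement options.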
In order to allocate blocks of memory in the locked cache, at least one additional instruction (e.g. dcbz_l) is preferably provided in the instruction set of the microprocessor that implements this invention. This additional instruction is used to control on a block-by-block basis exactly what data is contained in the locked portion of the cache, thereby giving the application programmer complete freedom to keep or remove blocks of data from the cache as required for efficient processing of the application. In this way, the utilization of the cache can be maximized regardless of the particular hardware algorithm used to manage the normal or non-locked portion of the cache. Moreover, by enabling selective partitioning of the cache into a locked and regular cache, the entire cache can still be used in the conventional manner if it is determined that the hardware algorithm will work satisfactorily for a particular application. On the other hand, if it is determined that the particular application would perform better using a portion of the cache as a software controlled cache, the cache may be partitioned and locked so that the locked portion can be managed exclusively by the software.
The locked portion of the cache is essentially a scratch pad memory (SPM) that operates as a cache rather than as a memory. Thus, the invention provides a locked cache or scratch pad cache (SPC) that can be used and managed solely by the application, without interference from the hardware cache management system, to increase the efficiency of the application on a particular microprocessor.
A major advantage of the locked cache as compared to a conventional scratch pad memory is that the locked cache allows flexibility of software management for data that does not exhibit the spatial and temporal locality for which normal caches are designed. One example is a multimedia data stream that is larger than the cache, and is used once and then discarded. Such data would replace all other data in a normally managed cache, and then continually replace itself, yielding very low cache hit rates. Another example, having different characteristics, is high usage instruction libraries or data tables which do not fit in the cache, but have access patterns that would cause thrashing if they are managed by hardware. The invention overcomes these and other problems by allowing 100% software control of a portion of the cache.
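The streaming example can be made concrete with a toy simulation. The C sketch below is an illustration under stated assumptions, not a model of any real microprocessor: it assumes a fully associative 8-block cache with true LRU replacement, a reused table of 6 blocks, and a use-once stream that touches two new blocks between table accesses. All names and sizes are hypothetical. It counts how often the table is found in the cache when the hardware manages everything, and when the table is instead held in a software-controlled locked portion.

```c
#include <stddef.h>

#define CACHE_BLOCKS 8 /* assumed total cache size, in blocks         */
#define TABLE_BLOCKS 6 /* assumed size of the reused table, in blocks */

/* Fully associative cache with true LRU replacement (a simplification). */
typedef struct {
    long   tag[CACHE_BLOCKS]; /* block addresses, most recent first */
    size_t used;
    size_t capacity;
} lru_cache;

/* Access one block: returns 1 on a hit, 0 on a miss. On a miss the
 * least recently used block is replaced if the cache is full. */
static int lru_access(lru_cache *c, long block)
{
    size_t i, pos = 0;
    int hit = 0;

    for (i = 0; i < c->used; i++) {
        if (c->tag[i] == block) { pos = i; hit = 1; break; }
    }
    if (!hit) {
        if (c->used < c->capacity)
            c->used++;
        pos = c->used - 1; /* new slot, or the LRU slot when full */
    }
    for (i = pos; i > 0; i--) /* move the block to the front */
        c->tag[i] = c->tag[i - 1];
    c->tag[0] = block;
    return hit;
}

/* Locked portion: contents change only when software changes them. */
static int locked_lookup(const long *locked, size_t n, long block)
{
    for (size_t i = 0; i < n; i++)
        if (locked[i] == block)
            return 1;
    return 0;
}

/* Count table hits over `steps` iterations of: one table access
 * followed by two never-reused stream accesses. */
long table_hits(int partitioned, long steps)
{
    lru_cache hw = { {0}, 0,
        partitioned ? CACHE_BLOCKS - TABLE_BLOCKS : CACHE_BLOCKS };
    long locked[TABLE_BLOCKS];
    long hits = 0;
    long next_stream = 1000; /* stream addresses never repeat */

    for (size_t i = 0; i < TABLE_BLOCKS; i++)
        locked[i] = (long)i; /* software preloads table blocks 0..5 */

    for (long s = 0; s < steps; s++) {
        long table_block = s % TABLE_BLOCKS;
        if (partitioned)
            hits += locked_lookup(locked, TABLE_BLOCKS, table_block);
        else
            hits += lru_access(&hw, table_block);
        lru_access(&hw, next_stream++); /* stream pollutes hardware part */
        lru_access(&hw, next_stream++);
    }
    return hits;
}
```

Under these assumptions the hardware-managed configuration never finds a table block in the cache, because the stream evicts each one before it is reused, whereas the partitioned configuration finds the table on every access. The stream itself misses in both cases, so the benefit comes entirely from protecting the reused data.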
In accordance with a preferred embodiment of the invention, a direct memory access (DMA) engine is provided in the microprocessor and is used to transfer data between the locked data cache and the external memory. The DMA engine's operation is controlled by two special purpose registers (SPR) used respectively to control one DMA transfer into and out of the locked cache.