1. Field of the Invention
The present invention relates to a memory buffer in a microprocessor, and more particularly to the design of a buffer in a cached memory system for gathering uncached store operations so that the store data is efficiently issued to the system interface utilizing substantially all of the available system interface bandwidth.
2. Related Art
Modern microprocessing systems often employ cache memory as part of an overall memory hierarchy to reduce the amount of time required for memory accesses. The cache occupies the highest position in the memory hierarchy, shuttling data between the central processing unit (CPU) and the main memory unit (MMU). Any performance gains achieved through the use of a cache are premised on the principle of locality: all programs favor a portion of their address space at any instant of time. Cache memory ideally stores this portion of the address space using fast, but expensive, memory hardware. The MMU stores a much larger amount of data used by the application in slower, less expensive memory. If the CPU looks only for data within this favored address space, and the address space fits within the cache, the memory bandwidth available to the CPU is equal to that of the faster cache memory. The model breaks down when the CPU fails to find the data it needs within the cache and is forced to access the slower MMU.
Whereas cached accesses make use of the cache memory for improved performance, certain other types of memory operations bypass the cache completely to directly access external memory locations. These operations, termed "uncached" accesses, are typically used for initialization, diagnostic/error handling routines, and where data is to be written to an external device rather than stored in the MMU.
For instance, graphics software routines often store data to a frame buffer or graphics engine for displaying an image on a video device. A frame buffer is a hardware memory device used to store the image displayed on screen. The image data is stored pixel by pixel in adjacent memory locations and read out to the graphics display at the refresh rate. The term graphics engine refers to a separate unit having its own memory and processor for driving an output video device. The CPU communicates with the graphics engine by writing data to a single address, or port. The uncached store operation bypasses the cache and writes the data to the external device specified by the given address. Uncached stores issued by graphics software routines to a frame buffer often consist of numerous, sequentially-addressed memory accesses. On the other hand, uncached stores to a graphics engine often consist of numerous, identically-addressed memory accesses.
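The two uncached store patterns described above can be sketched as address streams (a minimal illustration only; the base addresses and helper names are hypothetical, not taken from any actual device map):

```python
# Illustrative address streams for the two graphics cases above.
# The base addresses below are made-up placeholders.
FRAME_BUFFER_BASE = 0x2000_0000   # hypothetical frame buffer base address
GRAPHICS_PORT = 0x3000_0000       # hypothetical graphics engine port address

def frame_buffer_stream(pixels):
    """Sequentially addressed uncached stores: one store per pixel,
    each to the next adjacent memory location."""
    return [(FRAME_BUFFER_BASE + i, p) for i, p in enumerate(pixels)]

def graphics_port_stream(words):
    """Identically addressed uncached stores: every store targets the
    same single port address."""
    return [(GRAPHICS_PORT, w) for w in words]
```

A frame-buffer routine thus emits a run of consecutive addresses, while a graphics-engine routine emits a run of identical addresses; both patterns are bursts that a gatherer could exploit.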
Microprocessor designs should ideally provide high uncached store throughput with a minimum of system overhead. Typically, uncached store throughput is limited by the available system interface bandwidth. Microprocessors that implement a multiplexed system interface address/data bus require two system clock cycles to issue one individual uncached store operation. The term "individual" is used in this context to denote operations which store a single data element to a single memory location, as opposed to "block" operations, which store multiple data elements to multiple memory locations beginning at a specified address. All individual store operations using the bus send an address during a first clock cycle, followed by data on the next cycle. Therefore, individual uncached store operations can achieve no better than half the available system bandwidth.
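The half-bandwidth limit follows from simple cycle counting on a multiplexed address/data bus, sketched here as a toy model (the function names and the eight-word example block are illustrative assumptions, not figures from the text):

```python
# Toy cycle-count model of a multiplexed address/data system bus.

def individual_store_cycles(n_stores):
    """Each individual store occupies two cycles: one address cycle
    followed by one data cycle."""
    return 2 * n_stores

def block_store_cycles(n_words):
    """A block store sends one address cycle, then one data cycle per
    word, so the address overhead is amortized over the whole block."""
    return 1 + n_words
```

For eight one-word stores, the individual form costs sixteen cycles while a single eight-word block costs nine, which is why individual stores can use at most half of the available data bandwidth.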
Often, some type of arbitration must be performed before a microprocessor can issue uncached store operations to the system interface. System resources must be expended to resolve this contention each time an operation requests use of the bus. Thus, issuing numerous individual uncached stores is often inefficient, due both to the overhead associated with bus arbitration and to the fact that throughput is limited to half the available bandwidth. Since bursts of sequentially and identically addressed uncached accesses are common, a method to increase their throughput is desirable.
Special purpose memory buffers have been designed to improve the throughput of uncached stores. These buffers, known as store gatherers, collect uncached store operations and issue the store data as a block operation rather than as a series of individual stores. Such uncached store buffers have been implemented in external agent chips, which typically provide an interface between the microprocessor's system interface and a backplane bus. Such store gatherers minimize system overhead, since a block store requires a single backplane bus arbitration, rather than a separate arbitration for each of a series of store operations.
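The gathering behavior described above can be modeled in a few lines (a simplified sketch under stated assumptions, not the patented design: addresses are word-granular, the block size is a parameter, and only sequential runs are gathered):

```python
class StoreGatherer:
    """Toy model of an uncached store gatherer: sequential one-word
    stores are collected until a block fills, then issued as a single
    block operation (one bus arbitration per block)."""

    def __init__(self, block_words=8):
        self.block_words = block_words
        self.base = None          # base address of the block being gathered
        self.data = []            # words gathered so far
        self.issued = []          # issued blocks: (base_address, words)

    def store(self, addr, word):
        if self.base is None:
            self.base = addr
        elif addr != self.base + len(self.data):
            # Not the next sequential word: issue what we have and
            # start a new block at the new address.
            self.flush()
            self.base = addr
        self.data.append(word)
        if len(self.data) == self.block_words:
            self.flush()

    def flush(self):
        """Issue the gathered words as one block operation."""
        if self.data:
            self.issued.append((self.base, tuple(self.data)))
        self.base, self.data = None, []
```

In this model, four sequential stores into a four-word gatherer produce a single issued block, replacing four bus arbitrations with one.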
However, current store gatherers are functionally limited in a number of ways. No provision is made for handling both identically and sequentially addressed streams of uncached store data. Current systems also do not operate efficiently when the gathered data does not precisely fill a complete block. Variable length blocks occur when a block must be sent to the bus before it is completely filled. Variable length blocks also occur when a double-word store straddles the end of one block and the beginning of the next: current store gatherers issue the first, incomplete block and begin collecting the next block with the double-word store data. Requiring the system to handle such variable length blocks increases the complexity of the overall system. Furthermore, current store gatherers are not located on the microprocessor chip itself.
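The double-word straddle case can be illustrated with a toy gathering function (a hypothetical four-word block and word-granular addressing; this sketches the behavior described above, not an actual implementation):

```python
def gather(stores, block_words=4):
    """stores: list of (address, width_in_words) pairs. Returns issued
    blocks as (base_address, length) pairs, issuing a short,
    variable-length block whenever a store is non-sequential or would
    straddle the block boundary."""
    blocks, base, length = [], None, 0
    for addr, width in stores:
        nonseq = base is not None and addr != base + length
        straddle = length + width > block_words
        if base is not None and (nonseq or straddle):
            blocks.append((base, length))   # incomplete, variable-length block
            base, length = None, 0
        if base is None:
            base = addr
        length += width
        if length == block_words:
            blocks.append((base, length))   # complete block
            base, length = None, 0
    if base is not None:
        blocks.append((base, length))
    return blocks
```

Here a double-word store arriving with only one word of space left forces the gatherer to issue a three-word block and start the next block with the double word, so the downstream system must cope with blocks of varying length.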
Thus, what is needed is an uncached store gatherer that resides on the microprocessor chip itself, provides for streams of both sequential and identical addresses, and efficiently handles variable length blocks.