1. Field of the Invention
This invention relates in general to microprocessor systems, and more particularly, to the efficient utilization of write-combining buffers through the implementation of intermediate buffers.
2. Description of Related Art
The use of a cache memory with a processor facilitates the reduction of memory access time. The fundamental idea of cache organization is that by keeping the most frequently accessed instructions and data in the fast cache memory, the average memory access time will approach the access time of the cache. To achieve the maximum possible speed of operation, typical processors implement a cache hierarchy, that is, different levels of cache memory. The different levels of cache correspond to different distances from the processor core. The closer the cache is to the processor, the faster the data access. However, the faster the data access, the more costly it is to store data. As a result, the closer the cache level, the faster and smaller the cache.
The performance of cache memory is frequently measured in terms of its hit ratio. When the processor refers to memory and finds the word in cache, it is said to produce a hit. If the word is not found in cache, then it is in a storage device and the access counts as a miss. If a miss occurs, then an allocation is made at the entry indexed by the access. The access can be for loading data to the processor or storing data from the processor to memory. The cached information is retained by the cache memory until it is no longer needed, made invalid or replaced by other data, in which instances the cache entry is de-allocated.
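By way of illustration only, the hit ratio described above can be modeled in software. The following is a minimal sketch that assumes a direct-mapped cache; the function name `hit_ratio` and the parameters `num_lines` and `line_size` are hypothetical and are chosen for the example, not taken from any particular processor.

```python
def hit_ratio(addresses, num_lines=4, line_size=16):
    """Model a direct-mapped cache and return hits / total accesses.

    Assumption: the entry is selected by (block number mod num_lines),
    and a miss always allocates the entry indexed by the access.
    """
    tags = [None] * num_lines  # None marks an invalid (de-allocated) entry
    hits = 0
    for addr in addresses:
        block = addr // line_size   # which memory block the address is in
        index = block % num_lines   # cache entry indexed by the access
        tag = block // num_lines    # identifies the block held in that entry
        if tags[index] == tag:
            hits += 1               # word found in cache: a hit
        else:
            tags[index] = tag       # miss: allocate, replacing other data
    return hits / len(addresses) if addresses else 0.0
```

In this sketch, four accesses to the same 16-byte block produce one miss followed by three hits, whereas two blocks that map to the same entry and are accessed alternately replace each other on every access.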
When a processor accesses memory for transfer of data between the processor and the memory, that access can be allocated to the various levels of cache, or not allocated to cache memory at all, according to the memory type set up by the system or the locality hint associated with the instruction. Certain instructions are used infrequently. For example, some specific prefetch instructions can preload into a dedicated prefetch buffer data which the processor does not require immediately, but which is anticipated to be required in the near future. Such data is typically used only once or will not be reused in the immediate future, and is termed "non-temporal data". Data which the processor uses frequently, and which instructions load or prefetch into the cache, is termed "temporal data".
Non-temporal write instructions or stores typically utilize a write-combining technique which first combines the data being stored into groups and then sends the combined groups to the external bus. Such combining of the outgoing data increases utilization of the bus bandwidth, which in turn increases the write throughput of the processor.
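The effect of write combining on bus traffic can be sketched with a simplified software model. The model below is an assumption made for illustration, not a description of any actual hardware: the function name `bus_transactions`, the buffer count, the line size, and the oldest-first eviction policy are all hypothetical.

```python
from collections import OrderedDict


def bus_transactions(store_addrs, num_buffers=4, line_size=32):
    """Count bus writes made by a toy write-combining model.

    Each buffer collects byte stores that fall within one line of
    line_size bytes. A buffer costs one bus transaction when it is
    evicted to make room for a new line, or when it is flushed at the
    end of the sequence.
    """
    buffers = OrderedDict()  # line number -> set of byte offsets written
    transactions = 0
    for addr in store_addrs:
        line, offset = divmod(addr, line_size)
        if line not in buffers:
            if len(buffers) == num_buffers:
                buffers.popitem(last=False)  # evict the oldest buffer
                transactions += 1
            buffers[line] = set()
        buffers[line].add(offset)            # combine into the open buffer
    return transactions + len(buffers)       # flush whatever remains
```

Under these assumptions, sixty-four consecutive byte stores occupy two lines and cost two bus transactions, while a short sequence that alternates among more lines than there are buffers forces an eviction on nearly every store.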
However, the implementation of such a write-combining technique suffers from a number of drawbacks. First, the number of write-combining buffers is limited. Second, the buffers are used for both loads and stores. These limitations cause performance reduction in some situations.
Accordingly, there is a need in the technology for a write-combining technique that provides efficient use of the write-combining buffers.
The present invention discloses a method and apparatus for efficient utilization of write-combining buffers for a sequence of non-temporal stores to scattered locations. The method comprises: converting the sequence of non-temporal stores to stores to intermediate buffers; and grouping the stores to intermediate buffers into consecutive non-temporal stores. The consecutive non-temporal stores correspond to adjacent memory locations in the write-combining buffers.
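The two steps of the method can be sketched in software as follows. This is a minimal illustration under stated assumptions and is not the claimed implementation: the function name `group_scattered_stores`, the representation of a store as an (address, value) pair of byte granularity, and the use of one intermediate buffer per line of `line_size` bytes are all hypothetical.

```python
from collections import defaultdict


def group_scattered_stores(stores, line_size=32):
    """Reorder (address, value) stores so each line's stores are consecutive.

    Step 1: convert every store into a store to an intermediate buffer,
            here keyed by the line the address falls in.
    Step 2: drain the intermediate buffers one line at a time in ascending
            address order, yielding the sequence that would be issued as
            consecutive non-temporal stores.
    """
    intermediate = defaultdict(dict)  # line number -> {offset: value}
    for addr, value in stores:
        line, offset = divmod(addr, line_size)
        intermediate[line][offset] = value  # a later store to a byte wins
    grouped = []
    for line in sorted(intermediate):
        for offset in sorted(intermediate[line]):
            grouped.append((line * line_size + offset,
                            intermediate[line][offset]))
    return grouped
```

Because the grouped sequence touches one line at a time, each line can be filled in a single write-combining buffer before the next line is started, rather than competing with other lines for the limited number of buffers.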