The invention is generally related to data processing systems and processors therefor, and in particular to retrieval of data from a multi-level memory architecture.
Computer technology continues to advance at a remarkable pace, with numerous improvements being made to the performance of both microprocessors (the "brains" of a computer) and the memory that stores the information processed by a computer.
In general, a microprocessor operates by executing a sequence of instructions that form a computer program. The instructions are typically stored in a memory system having a plurality of storage locations identified by unique memory addresses. The memory addresses collectively define a "memory address space," representing the addressable range of memory addresses that can be accessed by a microprocessor.
Both the instructions forming a computer program and the data operated upon by those instructions are often stored in a memory system and retrieved as necessary by the microprocessor when executing the computer program. The speed of microprocessors, however, has increased relative to that of memory devices to the extent that retrieving instructions and data from a memory can often become a significant bottleneck on performance. To decrease this bottleneck, it is desirable to use the fastest memory devices available. However, both memory speed and memory capacity are typically directly related to cost, and as a result, many computer designs must balance memory speed and capacity with cost.
A predominant manner of obtaining such a balance is to use multiple "levels" of memories in a memory architecture to attempt to decrease costs with minimal impact on system performance. Often, a computer relies on a relatively large, slow and inexpensive mass storage system such as a hard disk drive or other external storage device, an intermediate main memory that uses dynamic random access memory devices (DRAMs) or other volatile memory storage devices, and one or more high speed, limited capacity cache memories, or caches, implemented with static random access memory devices (SRAMs) or the like. In some instances, instructions and data are stored in separate instruction and data cache memories to permit instructions and data to be accessed in parallel. One or more memory controllers are then used to swap the information from segments of memory addresses, often known as "cache lines", between the various memory levels to attempt to maximize the frequency that requested memory addresses are stored in the fastest cache memory accessible by the microprocessor. Whenever a memory access request attempts to access a memory address that is not cached in a cache memory, a "cache miss" occurs. As a result of a cache miss, the cache line for a memory address typically must be retrieved from a relatively slow, lower level memory, often with a significant performance hit.
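The line-granular swapping described above can be illustrated with a minimal sketch, not taken from the patent itself: a small, fast cache in front of a slower backing store, where any miss pulls in an entire cache line. The line size, cache capacity, and eviction policy here are all hypothetical choices for illustration.

```python
LINE_WORDS = 4        # words per cache line (assumed)
NUM_LINES = 8         # capacity of the small, fast cache (assumed)

class SimpleCache:
    def __init__(self):
        self.lines = {}          # maps line number -> list of cached words
        self.hits = 0
        self.misses = 0

    def read(self, addr, backing):
        line_addr = addr // LINE_WORDS
        if line_addr in self.lines:
            self.hits += 1
        else:
            self.misses += 1
            # On a miss, the entire line is swapped in from the lower level.
            if len(self.lines) >= NUM_LINES:
                self.lines.pop(next(iter(self.lines)))  # evict oldest line
            base = line_addr * LINE_WORDS
            self.lines[line_addr] = backing[base:base + LINE_WORDS]
        return self.lines[line_addr][addr % LINE_WORDS]

backing = list(range(64))        # slow "main memory"
cache = SimpleCache()
for a in range(8):               # sequential accesses spanning two lines
    cache.read(a, backing)
print(cache.hits, cache.misses)  # 6 hits, 2 misses: one miss per line
```

Because a whole line is fetched per miss, the first access to each line misses and the remaining accesses to that line hit, which is exactly the behavior the multi-level architecture is designed to exploit.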
Some conventional cache designs are "pipelined", which permits the caches to start subsequent missed memory access requests prior to completion of earlier missed memory access requests. Pipelining reduces memory access latency, or delay, since the handling of cache misses for subsequent memory access requests often does not need to wait until the processing of an earlier miss is complete.
One limitation to cache pipelining arises from the fact that oftentimes memory access requests that are issued closely in time are for data stored in a narrow range of memory addresses, and often within the same cache line. As a result, pipelining may still not effectively reduce memory access latency in some instances because multiple memory access requests may be waiting on the same cache line to be returned from the cache, thus preventing requests for other cache lines from being processed by the cache.
For example, many processors support the use of complex memory instructions that in essence request data from a sequential range of memory addresses. Conventional processor designs typically handle a complex memory instruction by sequentially issuing memory access requests for each memory address in the range specified by the instruction. Moreover, conventional processor designs typically have request queues that are finite in length, so that only a limited number of requests may be pending at any given time.
In many instances, complex memory instructions request data spanning across more than one cache line. A significant performance hit can occur, however, when a portion of the requested data is stored in a cache line other than the first requested cache line, and when that portion of the requested data is not currently stored in the cache. Specifically, with a fixed length request queue, a request for a subsequent cache line may not be issued until after a first cache miss has been completed. No pipelining of the cache misses occurs, and as a result, performance is degraded.
Put another way, assume a complex memory instruction generates a sequence of memory access requests for 0 . . . N words of data starting, for example, at the beginning of a cache line, and that each cache line is capable of storing n words of data. A conventional processor would issue the 0 . . . N memory access requests as follows:
0, 1, 2, . . . , n-1, n, n+1, n+2, . . . , 2n-1, 2n, 2n+1, 2n+2, . . . , N.
Assuming that memory access request 0 misses the cache, all other memory access requests in the same cache line (requests 1 . . . n-1) will also miss. As a result, the request queue will typically fill up with requests that are waiting on the same cache line to be retrieved, thereby stalling the processor until the cache miss is completed. Once the cache line is retrieved, the requests can then be handled in sequence, until such time as the first request for the next cache line (request n) can be issued. Assuming that this request also misses the cache, the next cache line will need to be retrieved. However, given that the request could not be issued until after the first cache miss was completed, pipelining of the cache misses would not occur. As a result, memory access latency increases and overall performance suffers.
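The serialization just described can be put into a rough timing model. This is a sketch only; the line size, request count, and miss latency are hypothetical parameters, and the model simply forces the miss for each line to start only after all earlier in-order requests have drained.

```python
n = 4                  # words per cache line (assumed)
N = 2 * n              # the instruction requests words 0 .. N-1 (two lines)
MISS = 100             # cycles to service a cache miss (assumed)

time = 0
line_ready_at = {}     # line number -> cycle at which its data arrives
for req in range(N):
    line = req // n
    if line not in line_ready_at:
        # In-order issue: the miss for this line cannot start until all
        # earlier requests, which occupied the finite queue, have drained.
        line_ready_at[line] = time + MISS
    # Each request takes one cycle, but must wait for its line if needed.
    time = max(time + 1, line_ready_at[line])

print(time)  # 206 cycles: the second miss starts only after the first completes
```

With both lines missing, the total cost is roughly twice the miss latency, because the two misses never overlap.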
Therefore, a significant need exists in the art for a manner of reducing the memory access latency associated with sequences of closely-spaced memory access requests, particularly with respect to complex memory instructions and the like.
The invention addresses these and other problems associated with the prior art by providing a data processing system, circuit arrangement, integrated circuit device, program product, and method that reduce the memory access latency of sequences of memory access requests by processing at least one request in a sequence out of order relative to another request based upon whether the other request is likely to be for data maintained in the same organizational block (e.g., a cache line) as data requested by an earlier memory access request in the sequence.
Embodiments of the invention attempt to increase the overall throughput of memory access requests by attempting to process certain memory access requests during times in which other memory access requests might otherwise be stalled. For example, in a multi-level memory architecture where data is passed between memory levels in organizational blocks such as cache lines, it is known that whenever a particular memory access request misses a higher level of memory, any other memory access requests that request data in the same organizational block will likewise miss that level of memory. Consequently, any such other memory access request cannot be processed until the block containing the data requested by the first request is retrieved from a lower level of memory. Therefore, by attempting to process other memory access requests prior to one or more requests directed to data in the same organizational block as a particular request, the likelihood increases that other memory access requests can be processed during any dead time that would otherwise result if that particular request resulted in a miss.
As a result, request throughput and memory access latency are often improved since productive operations can be performed during time periods that would otherwise be wasted while multiple requests were waiting on the same block of data. Moreover, in embodiments that support pipelined processing of misses, processing other memory access requests prior to requests directed to the same organizational block as a particular request oftentimes further reduces memory access latency. In such embodiments, should an out of order request also result in a miss, that later miss can be processed concurrently with the first miss so that the data requested by the out of order request is returned more quickly than would otherwise occur if the out of order request could not be processed until after processing of the first miss was complete.
Among other implementations, request reordering consistent with the invention may be utilized to improve the performance of complex memory instructions that request a plurality of contiguous words spanning more than one organizational block in a multi-level memory architecture. In such implementations, a complex memory instruction may be processed by processing a memory access request for a word in a block other than the first block before processing the memory access requests for all of the words in the first block. Consequently, in the event that any memory access request to the first block results in a miss, the memory access request for the word in the other block is more likely to be processed prior to completion of the processing of the miss, thereby potentially reducing the memory access latency of that request.
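One way to picture the reordering described in this section is the following sketch. The particular hoisting rule (issue the first word of the second line immediately after the first word of the first line) and the latencies are illustrative assumptions, not the claimed implementation; the point is only that an early request to the other block lets a pipelined memory service both misses concurrently.

```python
n = 4                 # words per cache line (assumed)
MISS = 100            # miss service latency in cycles (assumed)

# Conventional order: 0, 1, ..., n-1, n, n+1, ...
in_order = list(range(2 * n))
# Reordered: hoist request n (the first word of the second line) so it
# issues before the remaining requests for the first line.
reordered = [0, n] + [r for r in in_order if r not in (0, n)]

# With pipelined miss handling, both line fetches start almost at once:
# each miss begins at the issue slot of the request that triggered it.
miss_start = {0: reordered.index(0), 1: reordered.index(n)}
finish = max(slot + MISS for slot in miss_start.values())
print(reordered, finish)  # both misses overlap; done near cycle 101
```

Compared with the in-order model, where the second miss waits for the first to complete, the overlapped misses finish in roughly one miss latency instead of two.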
These and other advantages and features, which characterize the invention, are set forth in the claims annexed hereto and forming a further part hereof. However, for a better understanding of the invention, and of the advantages and objectives attained through its use, reference should be made to the Drawings, and to the accompanying descriptive matter, in which exemplary embodiments of the invention are described.