1. Field of the Invention
The present invention relates to computer architecture, and more particularly to the optimization of memory access patterns.
2. State of the Art
Modern high-performance Reduced Instruction Set Computer (RISC) CPUs typically require very high rates of data transfer to and from external memory devices holding program data in order to achieve fast program execution rates. This is because a significant fraction of a typical program consists of memory access (load and store) instructions that transfer data from external memory to internal registers or from internal registers to external memory prior to performing computations; thus, in order to obtain an overall high rate of program execution, it is necessary to make these data transfers as fast as possible.
Modern memory devices are generally capable of relatively high rates of data transfer only when the transfers are conducted in the form of long, uninterrupted "bursts", where data words at consecutive memory locations are read or written in consecutive clock cycles, without any break or gap. This limitation is due to the physical implementation of the memory devices; it is normally very difficult to achieve significantly high data transfer rates to and from memory devices with random accesses to arbitrary memory locations at non-uniform intervals. Instead, most low-cost memories such as dynamic RAMs (DRAMs) offer an operating mode where accesses made to an individual data item will require a relatively long time to complete, but accesses to subsequent data items located at sequentially following memory locations can then be completed quickly, as long as there is no break or gap in performing these additional accesses.
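The difference between burst and random access described above can be illustrated with a simple cost model. The following is a minimal sketch only: the function name `transfer_cycles` and the cycle counts (6 cycles for an initial access, 1 cycle for each sequentially following word) are assumptions chosen for illustration and do not describe any particular memory device.

```python
def transfer_cycles(addresses, first_access_cycles=6, burst_word_cycles=1):
    """Count the cycles needed to access the given word addresses in order.

    Assumed model: an access to the word immediately following the previous
    one continues a burst and costs `burst_word_cycles`; any other access
    starts a new transfer and costs `first_access_cycles`.
    """
    total = 0
    previous = None
    for addr in addresses:
        if previous is not None and addr == previous + 1:
            total += burst_word_cycles      # continues the current burst
        else:
            total += first_access_cycles    # break or gap: pay full latency
        previous = addr
    return total


sequential = list(range(100, 108))                   # eight consecutive words
scattered = [100, 340, 17, 905, 512, 64, 733, 201]   # eight unrelated words

print(transfer_cycles(sequential))  # 6 + 7 * 1 = 13
print(transfer_cycles(scattered))   # 8 * 6 = 48
```

Under these assumed figures, the same eight words take nearly four times as long to transfer when they are not at consecutive addresses.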
Unfortunately, due to the intrinsic structure of computer programs, it is difficult for traditional CPUs to perform accesses to external memory devices using the long burst transfers previously described. It is typical for programs to generate short, random accesses to various memory locations at nonconsecutive addresses. (An exception to this is the class of programs used in scientific and numerical computing: the input and output data for these programs are naturally structured into regular patterns, such as vectors or matrices, that lend themselves to the burst access patterns favored by memory devices. Other classes of programs, however, notably the complex control functions required in communications devices, do not follow this model.) Such nonconsecutive memory access patterns can substantially reduce the data transfer rate possible between the memory and the CPU and, as a consequence, significantly reduce program execution rates. In addition, the path between the CPU and the main memory can be quite long (in terms of the time required to traverse the path), and hence random, unrelated memory accesses will be performed quite slowly.
The standard method of obviating this problem has been to use a hardware element called a data cache. A cache is a small block of random-access memory that is placed in between the CPU and the main memory that temporarily holds data items that are required by the CPU during program execution. The basic operating principle of a cache is that of locality of reference: this refers to the fact that most programs typically only deal with a small group of data items at a time (even though these elements are accessed in a highly random and unstructured manner), and make repeated accesses to data items in this group. Thus, if by some means these localized accesses are trapped and directed to the cache, then they can be satisfied very rapidly, much more so than if they were sent to the main memory instead. As the cache is quite small, it can be constructed from very fast but expensive memory devices; random accesses by the CPU to the cache are therefore much faster than similar accesses to the main memory. A cache replacement algorithm is used to dynamically adjust the set of data items contained within the cache as the focus of the program execution changes over time. Data items can be brought into the cache in groups, using long burst accesses to the main memory; after they are no longer needed by the CPU, the updated data items can be written back to the main memory, again in long bursts. All of this activity takes place without any intervention by the program, or, indeed, any knowledge by the program or programmer that a cache even exists.
Data caches, however, provide a performance improvement only if the program execution performs repeated accesses over a short period of time to a small group of data items (though these may be distributed over arbitrary items). This behavior is true of general-purpose programs. However, if the program being executed were to exhibit a low locality of reference (i.e., program instructions make reference to each data item only once before moving to the next item), then a data cache would be useless. In fact, under these circumstances a data cache may actually be detrimental to performance, because most data caches typically attempt to transfer data to/from memory in small blocks or cache lines of 4 to 64 data words at a time; if the program were, say, to access only one or two of these data words, then the cache would actually consume more memory bandwidth in transferring the unused data words to/from main memory. This phenomenon is known as cache thrashing, and manifests itself in the form of continuous transfer of data between the cache and the main memory accompanied by a very low rate of program execution. Another problem associated with caches is that of memory consistency. Data items are essentially copied into the cache; thus, for every data item in the cache, there is a corresponding copy in the main memory. If the CPU updates a data item in the cache, the copy in main memory will become out-of-date (or stale); if some external device, such as another CPU or a Direct Memory Access (DMA) controller device, were to access the copy in the main memory, then it would receive stale data, which could result in system failures. On the other hand, always forcing the cache to keep the copies in main memory up-to-date would essentially eliminate the benefits of the cache, as program execution would have to be stopped until the cache could transfer the updated copies to main memory. 
(An alternative is to have the external device attempt to access both the cache and the main memory, but this is both difficult and costly.)
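The bandwidth penalty of line-based transfers under low locality, noted above, can be estimated by comparing the number of words moved with the number of words actually used. The following is a rough sketch under simplifying assumptions: the function name `line_fill_traffic` is hypothetical, the line size of 8 words is chosen from within the 4 to 64 word range mentioned above, and each touched line is assumed to be fetched exactly once (no evictions or write-backs are modeled).

```python
def line_fill_traffic(addresses, words_per_line=8):
    """Return (words_transferred, words_used) for a set of word accesses.

    Assumed model: every cache line containing at least one accessed word
    is fetched from main memory in full, exactly once.
    """
    lines_touched = {addr // words_per_line for addr in addresses}
    words_transferred = len(lines_touched) * words_per_line
    words_used = len(set(addresses))
    return words_transferred, words_used


# High spatial locality: 32 consecutive words fill four lines completely.
print(line_fill_traffic(range(0, 32)))        # (32, 32)

# Low locality: four isolated words still cause four full line fills.
print(line_fill_traffic([0, 64, 128, 192]))   # (32, 4)
```

In the second case, only one word in eight of the transferred data is ever used by the program.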
Unfortunately, many programs used in embedded communications and control applications exhibit poor locality of reference and also suffer from memory consistency problems. As an example, consider the case of an Ethernet packet switch controller that contains an embedded CPU for performing control functions as well as a DMA controller for handling data transfers. When an Ethernet packet is to be transmitted by this system, the embedded CPU must inspect and potentially modify the Ethernet packet data in main memory, and then indicate to the DMA controller that the packet data are to be transferred from main memory to the Ethernet physical link. If the CPU contains a cache, then it is very likely that, after the CPU has completed modifying the packet, the copy of the packet in the main memory will be stale, as the up-to-date copy will be residing in the cache. When the DMA controller transfers data from main memory to the Ethernet physical link, it will read and transfer the wrong (old) data, not the modified packet.
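The stale-data hazard in the packet switch example can be reproduced with a toy model. The following is an illustrative sketch only: the class `WriteBackCache`, the function `dma_read`, and the byte values are hypothetical, and the cache is reduced to a per-word write-back store with no line structure or replacement policy.

```python
class WriteBackCache:
    """Toy write-back cache: CPU writes stay in the cache until flushed."""

    def __init__(self, memory):
        self.memory = memory   # shared list standing in for main memory
        self.lines = {}        # address -> cached value
        self.dirty = set()     # addresses modified but not yet written back

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]   # fill on miss
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)   # main memory is now out of date

    def flush(self):
        for addr in sorted(self.dirty):
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()


def dma_read(memory, start, length):
    """The DMA controller sees main memory only, never the cache."""
    return memory[start:start + length]


packet_memory = [0x11, 0x22, 0x33, 0x44]
cache = WriteBackCache(packet_memory)

cache.write(1, 0xAA)                      # CPU modifies one packet word
stale = dma_read(packet_memory, 0, 4)     # DMA transmits the old contents
cache.flush()                             # explicit write-back
fresh = dma_read(packet_memory, 0, 4)     # DMA now sees the modification

print(stale)  # [17, 34, 51, 68]
print(fresh)  # [17, 170, 51, 68]
```

In this model the packet is transmitted correctly only if the write-back completes before the DMA transfer begins, which is the ordering constraint that gives rise to the consistency problem.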
In addition, it is unlikely that the CPU will make more than one access to each of a small number of words of the packet during the modification process. Thus, if the data cache were to read the packet from main memory in the normal manner, there would be considerable wastage of memory bandwidth, as a substantial number of words would be read but only a small number of words would actually be modified by the CPU for each packet being processed. As the Ethernet packet switch is expected to process thousands of such packets per second, with consecutive packets being stored in widely different memory locations, the net loss of efficiency is considerable.
A solution that has been used in some specialized CPUs is to dispense with the data cache altogether and use a queued approach to accessing memory. (This is also referred to as a decoupled access/execute architecture.) In such a system, the CPU is permitted to notify a memory access control unit (which regulates accesses to the main memory) in advance that an access will be made to specific data words. The use of queues within the memory access control unit permits multiple such notifications to be made by the CPU, well in advance of when the data are actually required from the main memory during program execution. The memory access control unit is then responsible for sorting out the requests for memory data, capturing such locality as may exist, and fetching (or storing) the data from the main memory in the most efficient manner possible. The memory access control unit also uses a set of queues to return the data to the CPU, which may then, at some future time, read the data out of these queues. The memory access control unit, in conjunction with the CPU program, thus prevents the latency incurred by random accesses to the main memory from impacting program execution. At the same time, it avoids the memory consistency problem by not maintaining data copies indefinitely.
In effect, the decoupled access/execute architecture renders the memory access mechanism visible to the programmer of the CPU (who has to cause the advance notifications of memory accesses to be generated by inserting appropriate instructions into the program).
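The queued access scheme described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any actual implementation: the class `MemoryAccessUnit` and its methods `notify`, `service`, and `read_data` are hypothetical names, only loads are modeled, and "capturing locality" is reduced to sorting the pending addresses and counting runs of consecutive addresses as single bursts.

```python
from collections import deque


class MemoryAccessUnit:
    """Toy decoupled access/execute model with request and response queues."""

    def __init__(self, memory):
        self.memory = memory
        self.requests = deque()    # addresses announced in advance by the CPU
        self.responses = deque()   # data waiting to be read by the CPU
        self.bursts_issued = 0

    def notify(self, addr):
        """CPU side: announce that `addr` will be needed later."""
        self.requests.append(addr)

    def service(self):
        """Memory side: fetch all pending requests, grouped into bursts."""
        pending = list(self.requests)
        self.requests.clear()
        fetched = {}
        previous = None
        for addr in sorted(set(pending)):
            if previous is None or addr != previous + 1:
                self.bursts_issued += 1    # non-consecutive: new burst
            fetched[addr] = self.memory[addr]
            previous = addr
        for addr in pending:               # return data in request order
            self.responses.append(fetched[addr])

    def read_data(self):
        """CPU side: take the next returned word from the response queue."""
        return self.responses.popleft()


main_memory = list(range(1000, 1016))      # word i holds the value 1000 + i
unit = MemoryAccessUnit(main_memory)

for addr in (5, 2, 3, 4, 9):               # notifications issued out of order
    unit.notify(addr)
unit.service()

values = [unit.read_data() for _ in range(5)]
print(values)              # [1005, 1002, 1003, 1004, 1009]
print(unit.bursts_issued)  # 2 (addresses 2 to 5, then address 9)
```

The explicit `notify` calls correspond to the instructions that the programmer must insert into the program, which is the sense in which the access mechanism becomes visible to the programmer.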
Two problems exist with the decoupled access/execute architecture. The first is the complexity of the scheme: the memory access control unit must be made fairly complex in order to capture locality and improve efficiency. The second is that accesses are made continuously to memory, as there is no data cache; hence those portions of the program data that are actually accessed repeatedly and with high locality will not benefit from the significant speed-up available from a cache, as local copies are not maintained.
To summarize the problem: in high-speed CPUs, it is necessary to optimize, as far as possible, the accesses to memory generated by executing load/store instructions, in order to preserve processing efficiency. This is traditionally done using a data cache, which takes advantage of locality to improve memory transfer patterns and reduce data traffic. However, the data cache approach suffers from the following defects. 1) Many real-time applications (such as communications, networking, graphics, etc.) do not exhibit the high data locality that is necessary for efficient functioning of a data cache. In fact, data caches may actually decrease performance substantially in networking applications. 2) Data caches are normally transparent to the programmer; thus the programmer has very little control over their functioning, and cannot optimize cache accesses to reduce data traffic when the memory access pattern is known. 3) Networking applications typically require access by the CPU firmware to memories designed for burst transactions (e.g., DRAMs and synchronous memories); it is difficult to optimize for these transactions without programmer intervention.