Main memory, such as dynamic random-access memory (DRAM), provides high storage densities. However, main memory is relatively slow. Caching, i.e. storing duplicate data in cache memory, is intended to improve the performance of a processor by providing high speed access to data if the data already is in the cache memory. An effective cache thereby reduces the number of accesses to the main memory. When a processor requests a main memory access (i.e. a read or write) at a new main memory address, the address and associated data will also be stored in the cache memory. Each address and its associated data in the cache are stored as an object called a cache entry. A cache hit occurs if the processor requests information at a main memory address, and a matching cache entry is stored in the cache memory. A cache miss occurs if a matching cache entry is not stored in cache memory.
In order to detect a cache hit or miss for high speed accesses, multiple address comparisons are performed in parallel to determine if any address in a cache entry matches the requested address. Because cache memory contains both cache entry storage and address comparison hardware, cache storage is expensive relative to other memory in terms of the amount of hardware or silicon area per bit of data storage. Therefore, in a balanced system, cache memories are able to cache much less data than can be stored in main memory. The cache memory is invisible to the processor, such that all processor reads and writes appear to operate as if they occurred to the main memory.
If a cache hit occurs on a processor read, the cache memory supplies the requested data to the processor in less time than that which is required for receiving the same data directly from the larger main memory. If a cache hit occurs on a processor write, then a write will be directed to the corresponding item in the cache. In a copy-back cache, the main memory is left unchanged, and the updated value in the cache is marked as changed or dirty with respect to the main memory. The processor is able to operate more efficiently because of the cache memory, as it is able to resume normal processing without waiting for the main memory.
If a cache miss occurs on a processor read, then a copy of the requested data is retrieved from the main memory and stored in the cache memory. The requested data is also sent to the processor. If a cache miss occurs on a processor write, then the cache is updated with the new data.
When the cache is filled with valid data and a cache miss occurs, a new data item displaces an older one in the cache. The displaced data item, or victim data, is flushed out of the cache. If the victim data is dirty, then it must be written back to the main memory. Otherwise, the victim data is identical to its corresponding value in main memory and is simply discarded.
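The hit, miss, and victim write-back behavior described above can be illustrated with a small software model. The following is only a sketch under stated assumptions: it uses a direct-mapped organization with one word per entry (the text does not specify an organization), and all names (cache_sim, sim_read, sim_write, MEM_WORDS, CACHE_SLOTS) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MEM_WORDS   64u  /* size of the simulated main memory (assumption) */
#define CACHE_SLOTS 4u   /* number of cache entries (assumption) */

typedef struct {
    bool     valid;  /* entry holds a copy of a main memory word */
    bool     dirty;  /* copy has been changed relative to main memory */
    uint32_t addr;   /* main memory word address of the copy */
    uint32_t data;   /* cached value */
} cache_entry;

typedef struct {
    uint32_t    mem[MEM_WORDS];
    cache_entry slot[CACHE_SLOTS];
    unsigned    hits, misses, writebacks;
} cache_sim;

static void sim_init(cache_sim *s) { memset(s, 0, sizeof *s); }

/* Return the entry for addr. On a miss, the current occupant of the
 * slot is the victim: it is written back if dirty, otherwise discarded. */
static cache_entry *sim_lookup(cache_sim *s, uint32_t addr)
{
    assert(addr < MEM_WORDS);
    cache_entry *e = &s->slot[addr % CACHE_SLOTS];  /* direct-mapped slot */

    if (e->valid && e->addr == addr) {
        s->hits++;
        return e;
    }
    s->misses++;
    if (e->valid && e->dirty) {       /* dirty victim: copy back to memory */
        s->mem[e->addr] = e->data;
        s->writebacks++;
    }
    e->valid = true;
    e->dirty = false;
    e->addr  = addr;
    e->data  = s->mem[addr];          /* fill the entry from main memory */
    return e;
}

static uint32_t sim_read(cache_sim *s, uint32_t addr)
{
    return sim_lookup(s, addr)->data;
}

/* Copy-back write: only the cache entry changes; main memory is left
 * unchanged until the entry is displaced. */
static void sim_write(cache_sim *s, uint32_t addr, uint32_t value)
{
    cache_entry *e = sim_lookup(s, addr);
    e->data  = value;
    e->dirty = true;
}
```

With four slots, addresses 1 and 5 map to the same slot in this model, so writing address 1 and then reading address 5 displaces the dirty entry and forces a write-back to the simulated main memory.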
An isolated write to main memory is usually a low-delay (or low latency) event, as an address value and the data to be written at the address are presented to the main memory together. In contrast, a read from main memory is relatively high latency, as it starts with the presentation of an address to the main memory, followed by a relatively long wait before the data appears from the main memory.
In relatively slow main memory systems, the minimum time allowed between responses to any memory access events is also relatively long. As such, a memory access may cause the memory system to hold off or delay a subsequent access until it is ready. A rapid sequence of cache misses can result in the main memory holding off the cache for each cache miss, which in turn must hold off the processor. A result is slow memory access rates or low data throughput into main memory.
In most processing systems, a compiler is used to compile high-level language concepts into machine code for execution on a processor. Some calculations can be performed at compile-time, while others are performed at run-time—when the program is running on the processor. Values that may change during run-time are dynamic, whereas values that are compiled as constant during run-time are static. The sizes of many declared storage objects are often static, although their contents are usually dynamic. For example, sometimes data storage size requirements are constant and can be statically calculated at compile-time, while other memory allocation is problem-size dependent and must be dynamically calculated and dynamically allocated at run-time.
An array is an arrangement of information in one or more dimensions, such as a one-dimensional list, or a color component of a two-dimensional image. Multi-dimensional arrays are those arrays with two or more dimensions. The individual data items in the array are called elements. All elements in the array are of the same data type so all elements in the array are also the same size. The array elements are stored contiguously in the computer's memory. The address of the first element in the array is the array base address.
An element within an array can be accessed through indexing. In a higher-level computer language, indexing is applied to an array, usually by following the array name with square brackets containing an index that indicates the offset, or distance, between the cell to be accessed and the first cell of the array. For example, the statement X[0] accesses the first element in the X array, and the statement X[1] accesses the second element in the X array.
In a computer language, the size of an array is declared to reserve an amount of memory within which the array elements are contained. The amount of memory required to hold an array can be measured in machine bytes. A byte is the smallest addressable unit used in a processor—usually equal to 8 bits. The number of bytes required to hold an array is equal to the number of elements in the array times the size of each element in bytes. Processors usually access data from a cache in machine words, which are multiples of bytes and are usually powers-of-two multiples such as 2, 4, or 8 bytes.
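As a brief illustration of the size calculation above, the sketch below multiplies the element count by the element size. The helper name array_size_bytes is hypothetical, and the overflow check is an added precaution not discussed in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes required to hold an array: number of elements times the size
 * of each element in bytes. */
static size_t array_size_bytes(size_t element_count, size_t element_size)
{
    /* Guard against overflow of the product (added precaution). */
    assert(element_size != 0 && element_count <= SIZE_MAX / element_size);
    return element_count * element_size;
}
```

In C, the same value is available for a declared array through the sizeof operator, so array_size_bytes(7, sizeof(double)) equals sizeof applied to an array declared as double d[7].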
For one-dimensional array access used in higher level languages, the compiler generates machine code instructions to: scale the one-dimensional array index by the element size (in machine bytes) to form a byte offset; add the offset to the array base byte address in memory to form an element address into main memory; and read data from or write data to the main memory, starting at the element address, and where the number of bytes transferred equals the element size in bytes. To avoid adding an additional offset to the index in the first step above, most popular higher-level languages take the first element of the array as that element with index 0. For dynamically changing index values, these index calculations must be performed for each array access at run-time, and are relatively slow.
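The three steps above can be written out explicitly. The sketch below is an illustration of what a compiler might generate for a read, expressed in C rather than machine code; the function name load_element and its parameters are hypothetical, and memory is modeled as a plain byte array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Read one element of element_size bytes from the array that starts at
 * byte address base, using zero-based indexing. */
static void load_element(const unsigned char *base, size_t index,
                         size_t element_size, void *out)
{
    assert(base != NULL && out != NULL);

    /* Step 1: scale the index by the element size to form a byte offset. */
    size_t byte_offset = index * element_size;

    /* Step 2: add the offset to the array base address. */
    const unsigned char *element_addr = base + byte_offset;

    /* Step 3: transfer element_size bytes starting at the element address. */
    memcpy(out, element_addr, element_size);
}
```

For an array declared as int x[4], calling load_element((const unsigned char *)x, 2, sizeof x[0], &v) yields the same value as the expression x[2]. Because the first element has index 0, no extra offset is needed before scaling.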
Two-dimensional indexing uses two indices. For example, the index X[2][3] can access a two-dimensional array. By convention, in two-dimensional arrays, the first index value (2 in the example) is regarded as the row index, and the second is the column index value (3 in the example). One way of accessing memory as a multi-dimensional array is to use information about the number and size of dimensions in an array to calculate an index mapping expression. The index mapping expression is then used to map multiple indices applied to an array onto a single index that can then be applied to the memory as if it is a one-dimensional array. In a higher-level language, the number and size of array dimensions are obtained from an array declaration. For example, a two-dimensional array may be declared using code like:

    int arry[height][width];

The corresponding index mapping function is:

    index_1d = row_index * width + col_index

This expression maps row and column indices onto a one-dimensional index. Note that the row index is multiplied by width, whereas the column index is multiplied by 1. In an index mapping function, the larger the scale factor applied to an index, the more major it is. In the example above, a row-major, column-minor indexing scheme is used. Note that the width of the array is used in the index mapping expression in the two-dimensional case. In general, all array dimensions except the most major dimension are required to calculate the index mapping expression. A higher-level language compiler can use the array declaration to generate machine code to evaluate the index mapping expression at run-time as a function of the index values applied to the array.
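The index mapping expression can be sketched in C as follows. The two-dimensional function mirrors the expression given above; the general function index_nd is an illustrative extension to any number of dimensions, and its name and parameter layout are hypothetical. It never reads the size of the most major dimension, consistent with the observation that this dimension is not needed.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major mapping for a two-dimensional array of the given width. */
static size_t index_1d(size_t row_index, size_t col_index, size_t width)
{
    assert(col_index < width);
    return row_index * width + col_index;
}

/* General row-major mapping. indices[0] and dims[0] belong to the most
 * major dimension; dims[0] is never read. */
static size_t index_nd(const size_t *indices, const size_t *dims, size_t rank)
{
    size_t index = 0;
    for (size_t d = 0; d < rank; d++) {
        if (d > 0) {
            assert(indices[d] < dims[d]);
            index *= dims[d];   /* scale by the size of this dimension */
        }
        index += indices[d];
    }
    return index;
}
```

For a declaration such as int arry[3][4], the access arry[2][3] maps to index_1d(2, 3, 4), which is 2 * 4 + 3 = 11, the last of the twelve elements.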
This process of reducing all access to a simple memory address means that the processor data cache has to handle intermingled array data access and non-array data access. The fragmented nature of most non-array data access makes it very difficult to infer any type of data access patterns for arrays within the data cache.
In some caches, wide data paths (compared to the machine word size) between the main memory and cache can sometimes result in newly requested data words being already present in the cache from a previous nearby request. In most programs, because of the fragmented nature of data access in general, the use of wide access paths results in reading data from memory that is often never used. In general, wide paths between the main memory and the cache result in a considerable increase in memory bandwidth consumption, with only a small corresponding reduction in cache miss rates.
The first time each new item is requested, it does not yet exist in the cache, so a cache miss occurs on all new data. If the cached data is then used relatively few times, the number of memory accesses rises in proportion to the number of requests for array data from the processor via the cache. In the extreme case, if the processor uses each array item only once, then the cache does nothing to enhance processor performance. This data re-use factor is algorithm dependent, but is at its worst in very simple algorithms such as those for copying arrays.
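The effect of the data re-use factor can be illustrated by counting first-touch misses in an access trace. The sketch below assumes an idealized cache that is large enough never to displace data, so that only the first request for each item misses; the function name count_first_touch_misses and the MAX_ITEMS limit are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ITEMS 1024u  /* largest item address in a trace (assumption) */

/* Count the misses caused by the first request for each item, assuming
 * an idealized cache in which nothing is ever displaced. */
static size_t count_first_touch_misses(const size_t *trace, size_t n)
{
    bool seen[MAX_ITEMS] = { false };
    size_t misses = 0;

    for (size_t i = 0; i < n; i++) {
        assert(trace[i] < MAX_ITEMS);
        if (!seen[trace[i]]) {   /* new item: not yet in the cache */
            seen[trace[i]] = true;
            misses++;
        }
    }
    return misses;
}
```

Under this assumption, a trace that touches eight items once each, as an array copy would, misses on all eight requests. A trace that requests the same eight items four times each still incurs eight misses, but over thirty-two requests, so only one request in four reaches main memory.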
For low data re-use, the relative frequency of cache misses is the primary cause of slow average memory access performance. In many simpler DSP algorithms, the processor spends as much or more time on memory access than on the actual arithmetic operations performed on the data itself.
Therefore, for large arrays there is a need in the art to provide a system and method that overcomes these problems by providing hardware acceleration of array element access and by speculatively pre-loading array data in a cache memory.