A vector here is defined as an ordered list of scalar values. A simple vector in a computer's memory is defined as having a starting address, a length (number of elements), and a stride (constant distance in memory between elements). For example, an array stored in memory is a vector. Vector processors process vector instructions that fetch a vector of values from a memory sub-system, operate on them and store them back to the memory sub-system. In essence, vector processing is the Single Instruction Multiple Data (SIMD) parallel processing technique known in the art, in which a single instruction acts on many data values. Scalar processing, by contrast, requires one instruction to act on each data value.
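By way of illustration only, the definition of a simple vector given above can be sketched as follows. The class name VectorDescriptor, its field names and the example addresses are hypothetical and are chosen purely for this sketch; addresses are treated as plain integers in arbitrary address units.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VectorDescriptor:
    """Hypothetical descriptor of a simple vector held in memory."""
    start: int   # address of the first element
    length: int  # number of elements
    stride: int  # constant distance in memory between consecutive elements

    def addresses(self) -> List[int]:
        # Element i lives at start + i * stride.
        return [self.start + i * self.stride for i in range(self.length)]


# A contiguous array of 4-unit-wide elements is a vector with stride 4.
row = VectorDescriptor(start=0x1000, length=4, stride=4)

# A column of a row-major matrix with 64-unit rows is a vector with stride 64.
column = VectorDescriptor(start=0x1000, length=4, stride=64)

print([hex(a) for a in row.addresses()])
print([hex(a) for a in column.addresses()])
```

A single vector instruction would operate on every address produced by such a descriptor, whereas scalar processing would issue one instruction per element.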
Vector processor performance is strongly dependent on occurrences of resource conflicts within the memory sub-system that the vector processor accesses. These conflicts render a portion of the peak memory bandwidth unusable and inaccessible to the system containing the vector processor as a whole. Such resource conflicts also increase the average memory access latency of the memory sub-system. In systems where multiple vectors are simultaneously active, conflicts can occur between accesses to the same vector, known as intra-vector conflicts, or between accesses to different vectors, known as inter-vector conflicts.
The causes of memory sub-system resource conflicts are numerous. However, they relate in particular to the use of interleaved memory sub-systems and/or to the use of memory components with heterogeneous architectures. Modern Dynamic Random Access Memory (DRAM) technology, for example, is typically organised hierarchically into banks and pages. The order in which these partitions of the memory array within the memory component are activated significantly influences the performance of the memory component. In addition to the hierarchical structuring of these devices, some technologies, such as RAMBUS™ Direct Random Access Memory (RDRAM™) and Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), support bank or page interleaving. This feature facilitates a pipelined approach to memory access whereby transactions can be issued at a rate not limited by the latency of the memory sub-system, so long as certain requirements are met.
There are two traditional approaches to optimising memory sub-system performance. These are commonly applied according to two possible access policies, namely the open-page policy and the closed-page policy.
Open-Page Policy
Once a page has been opened for access, subsequent accesses to that page can be performed with relatively low latency. Under the open-page policy, a conflict is defined as a page-crossing event. When this happens, several extra cycles are required to restore the currently open page to the DRAM core and to extract (or open) the page containing the next requested data. When repeated accesses to the same page can be sustained, transactions can be issued at an increased rate, but that rate remains a fraction of the system clock frequency. This is because, even in this low-latency mode of operation, several cycles are required to complete each transaction. Furthermore, since interleaving is not possible, each transaction must complete before the next is issued. The peak transaction rate of the memory system is therefore limited by the lowest-latency mode of operation and is achieved by repeatedly accessing the same page of memory.
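The effect of page-crossing events on a non-interleaved, open-page memory can be illustrated with the following minimal sketch. The page size and the cycle counts for a page hit and a page miss are assumed values chosen for illustration only and do not describe any particular DRAM device; the function name open_page_cycles is likewise hypothetical.

```python
def open_page_cycles(addresses, page_size=1024, hit_cycles=2, miss_cycles=8):
    """Return (total_cycles, page_crossings) for a simple open-page model.

    Assumed model: one page is open at a time, each transaction completes
    before the next is issued, an access to the open page costs hit_cycles,
    and any other access costs miss_cycles (close the old page, open the new).
    """
    open_page = None
    total_cycles = 0
    page_crossings = 0
    for addr in addresses:
        page = addr // page_size
        if page == open_page:
            total_cycles += hit_cycles
        else:
            # The very first access opens a page but is not a crossing.
            if open_page is not None:
                page_crossings += 1
            total_cycles += miss_cycles
            open_page = page
    return total_cycles, page_crossings


# Sixteen accesses that all fall within one page: no conflicts.
same_page = list(range(0, 64, 4))

# Sixteen accesses alternating between two pages: every access after the
# first is a page-crossing event.
alternating = [0, 1024] * 8

print(open_page_cycles(same_page))
print(open_page_cycles(alternating))
```

Even in the conflict-free case the model spends hit_cycles on every transaction, which reflects the point made above that the peak rate remains a fraction of the clock frequency.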
The most commonly used approach to exploiting this mode of operation is to burst access each vector. In effect, data that are anticipated to be required in subsequent computations are pre-fetched.
This approach suffers from a number of drawbacks including:
The maximum transaction rate remains relatively low in comparison to the clock frequency;
Relatively large caches are required to buffer the burst data close to the data processing units. In this context, the term data refers to either instructions fetched or operands of said instructions; and
Data-dependencies in the memory access pattern may invalidate pre-fetched data requiring repeated fetches to acquire the correct data.
Closed-Page Policy
As an alternative to the open-page policy, a closed-page policy can be used, especially when the memory sub-system has an interleaved architecture. So long as transactions are issued according to the requirements of the interleaved memory system, they can be issued every clock cycle. For example, the memory system may have a minimum latency of four cycles and a four-fold interleaved architecture. In this case, to maximise transaction issue rate, no single sub-unit of the memory system may be accessed more frequently than once in every four clock cycles. When this is achieved, the peak transaction rate is not limited by memory sub-system latency; instead, it is limited only by the system clock frequency. In this context, a sub-unit of memory refers to the level of hierarchy in the memory sub-system at which interleaving applies. A closed-page policy conflict is defined as a failure to keep the access frequency to an interleaved sub-unit of memory at or below the maximum operating frequency of that sub-unit. These requirements are met by avoiding back-to-back accesses to the same sub-unit of memory and by revisiting any given sub-unit at a frequency no greater than the reciprocal of the memory sub-system latency.
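The four-cycle, four-fold interleaved example above can be sketched as a simple issue model. The model is an assumption made for illustration: transactions are issued in order, at most one per clock cycle, and a sub-unit (here called a bank) cannot accept a new transaction until latency cycles after it accepted the previous one. The function name closed_page_issue_cycles is hypothetical.

```python
def closed_page_issue_cycles(banks, latency=4):
    """Return the clock cycle on which each transaction is issued.

    banks is the sequence of sub-unit (bank) numbers targeted by successive
    transactions. A transaction stalls until its bank is free again.
    """
    ready_at = {}       # bank -> earliest cycle it can accept a transaction
    cycle = 0           # earliest cycle the next transaction could issue
    issue_cycles = []
    for bank in banks:
        # Stall if the target bank is still busy with an earlier access.
        cycle = max(cycle, ready_at.get(bank, 0))
        issue_cycles.append(cycle)
        ready_at[bank] = cycle + latency
        cycle += 1      # at most one transaction per clock cycle
    return issue_cycles


# Visiting the four banks in rotation: each bank is revisited once every
# four cycles, so a transaction issues on every clock cycle.
rotating = [0, 1, 2, 3, 0, 1, 2, 3]

# Repeatedly hitting one bank: every access after the first is a conflict,
# and the issue rate falls to one transaction per latency period.
single_bank = [0, 0, 0, 0]

print(closed_page_issue_cycles(rotating))
print(closed_page_issue_cycles(single_bank))
```

In the rotating case the issue rate is set by the clock alone, whereas in the single-bank case it is set by the latency of the sub-unit.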
One method that is commonly used in an attempt to reduce conflict frequency in interleaved memories is address re-mapping. This technique assumes that each stream is accessed in a linear fashion, usually with a stride of 1. If the assumption holds, then swapping bits of the address bus appropriately ensures that vector accesses are always conflict-free. Effectively, address re-mapping ensures that the vector is distributed across the memory sub-system in a way that meets the requirements of the interleaving. Address re-mapping is generally applied statically but could, in principle, be applied dynamically.
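A minimal sketch of static address re-mapping by bit swapping follows. It assumes, purely for illustration, a memory in which the bank is selected by address bits [11:10], so that consecutive addresses fall in the same bank; swapping that field with bits [1:0] spreads a stride-1 vector across the banks. The bit positions and the names swap_bit_fields, bank_of and remap are hypothetical.

```python
def swap_bit_fields(addr, lo_a, lo_b, width):
    """Swap two non-overlapping width-bit fields of addr.

    lo_a and lo_b are the positions of the least significant bit of each field.
    """
    mask = (1 << width) - 1
    field_a = (addr >> lo_a) & mask
    field_b = (addr >> lo_b) & mask
    cleared = addr & ~((mask << lo_a) | (mask << lo_b))
    return cleared | (field_a << lo_b) | (field_b << lo_a)


BANK_LSB = 10   # assumed: bank selected by address bits [11:10]
BANK_BITS = 2   # assumed: four banks


def bank_of(addr):
    """Bank that the (assumed) memory selects for a physical address."""
    return (addr >> BANK_LSB) & ((1 << BANK_BITS) - 1)


def remap(addr):
    """Static re-mapping: exchange address bits [1:0] with bits [11:10]."""
    return swap_bit_fields(addr, 0, BANK_LSB, BANK_BITS)


stride1 = list(range(8))
banks_before = [bank_of(a) for a in stride1]         # every access, same bank
banks_after = [bank_of(remap(a)) for a in stride1]   # rotates through banks

print(banks_before)
print(banks_after)
```

Because the re-mapping only permutes address bits, it is a one-to-one mapping, so every logical address still corresponds to exactly one physical location.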
There are several deficiencies in this approach including:
The technique can only help to reduce intra-vector conflicts;
Statistically, it cannot improve inter-vector conflict frequency;
In light of the first two points, address re-mapping is really only effective in a burst-oriented pre-fetch mode of operation as in the open-page policy. Therefore, as in the case of the open-page policy, relatively large caches are required close to the processing units, and data-dependencies in access patterns may invalidate some pre-fetches; and
Vectors are not always accessed with a stride of 1. Often the access pattern does not resemble any well-defined stride that could be rendered conflict-free by address re-mapping.
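Two of these deficiencies can be illustrated with a small sketch. It assumes a four-bank memory in which, after re-mapping, the bank is given by the two least significant address bits, and it counts as a conflict any access that targets a bank used within the previous three accesses (a simplification of the four-cycle latency example given earlier, ignoring stalls). All names, addresses and parameters are hypothetical, and the inter-vector case shown is a single example rather than a statistical result.

```python
NUM_BANKS = 4


def bank_of(addr):
    """Assumed post-re-mapping bank selection: low-order address bits."""
    return addr % NUM_BANKS


def vector_addresses(start, length, stride):
    return [start + i * stride for i in range(length)]


def count_conflicts(banks, latency=4):
    """Count accesses whose bank was used within the previous latency-1 accesses."""
    conflicts = 0
    for i, bank in enumerate(banks):
        if bank in banks[max(0, i - (latency - 1)):i]:
            conflicts += 1
    return conflicts


# Stride 1: the case address re-mapping is designed for.
unit_stride = [bank_of(a) for a in vector_addresses(0, 8, 1)]

# Stride 4: every element lands in the same bank despite the re-mapping.
stride_four = [bank_of(a) for a in vector_addresses(0, 8, 4)]

# Two stride-1 vectors accessed alternately: each is conflict-free on its
# own, but their accesses collide with one another (inter-vector conflicts).
vec_a = vector_addresses(0, 4, 1)
vec_b = vector_addresses(256, 4, 1)
interleaved = [bank_of(x) for pair in zip(vec_a, vec_b) for x in pair]

print(count_conflicts(unit_stride))
print(count_conflicts(stride_four))
print(count_conflicts(interleaved))
```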