1. Field of the Invention
This invention relates generally to special purpose memory integrated in general purpose computer systems, and specifically to a memory system for efficient handling of vector data.
2. Description of the Related Art
In the last few years, media processing has had a profound effect on microprocessor architecture design. It is expected that general-purpose processors will be able to process real-time, vectored media data as efficiently as they process scalar data. The recent advancements in hardware and software technologies have allowed designers to introduce fast parallel computational schemes to satisfy the high computational demands of these applications.
Dynamic random access memory (DRAM) provides cost-efficient main memory storage for data and program instructions in computer systems. Static random access memory (SRAM) is faster (and more expensive) than DRAM and is typically used for special purposes such as cache memory and data buffers coupled closely with the processor. In general, far less cache memory is available than DRAM.
Cache memory attempts to combine the advantages of quick SRAM with the cost efficiency of DRAM to achieve the most effective memory system. Most successive memory accesses affect only a small address area; therefore, the most frequently addressed data is held in SRAM cache to provide increased speed over many closely packed memory accesses. Data and code that are not accessed as frequently are stored in slower DRAM.
Typically, a memory location is accessed using a row and column within a memory block. A technique known as bursting allows faster memory access when the requested data is stored in a contiguous sequence of addresses. During a typical burst, memory is accessed using the starting address, the width of each data element, and the number of data words to access, also referred to as “the stream length”. Memory access speed is improved because there is no need to supply an address for each memory location individually to fetch or store data words from the proper address. One shortfall of this technique arises when data is not stored contiguously in memory, such as when reading or writing an entire row in a matrix, since the data is stored by column and then by row. It is therefore desirable to provide a bursting technique that can accommodate data elements that are not contiguous in memory.
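The matrix-row shortfall can be illustrated with a short address calculation. The following is only a sketch: the function name, the base address, the matrix shape, and the 4-byte element width are assumptions chosen for illustration and do not come from the specification.

```python
def strided_addresses(start, stride_elems, width_bytes, count):
    """Byte addresses touched when reading `count` elements that lie
    `stride_elems` elements apart, beginning at byte address `start`."""
    return [start + i * stride_elems * width_bytes for i in range(count)]


# Assumed example: a matrix with 4 rows and 3 columns of 4-byte elements,
# stored column by column starting at byte address 0x1000.
ROWS, COLS, WIDTH, BASE = 4, 3, 4, 0x1000

# A column is contiguous (stride of 1 element), so an ordinary burst fits.
column_0 = strided_addresses(BASE, 1, WIDTH, ROWS)

# A row is not contiguous: consecutive row elements are ROWS elements apart.
row_1 = strided_addresses(BASE + 1 * WIDTH, ROWS, WIDTH, COLS)

print([hex(a) for a in column_0])  # ['0x1000', '0x1004', '0x1008', '0x100c']
print([hex(a) for a in row_1])     # ['0x1004', '0x1014', '0x1024']
```

In this sketch the column can be described by a starting address, an element width, and a stream length alone, whereas the row additionally needs a stride of four elements between successive accesses.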
Synchronous burst RAM cache uses an internal clock to count up to each new address after each memory operation. The internal clock must stay synchronized with the clock for the rest of the memory system for fast, error-free operation. The tight timing required by synchronous cache memory increases manufacturing difficulty and expense.
Pipelined burst cache alleviates the need for a synchronous internal clock by including an extra register that holds the next piece of information in the access sequence. While the register holds the information ready, the system accesses the next address to load into the pipeline. Since the pipeline keeps a supply of data always ready, this form of memory can run as fast as the host system requests data. The speed of the system is limited only by the access time of the pipeline register.
Multimedia applications typically present a very high level of parallelism by performing vector-like operations on large data sets. Although recent architectural extensions have addressed the computational demands of multimedia programs, the memory bandwidth requirements of these applications have generally been ignored. To accommodate the large data sets of these applications, the processors must present high memory bandwidths and must provide a means to tolerate long memory latencies. Data caches in current general-purpose processors are not large enough to hold these vector data sets, which tend to pollute the caches very quickly with unnecessary data and consequently degrade the performance of other applications running on the processor.
In addition, multimedia processing often employs program loops which access long arrays without any data-dependent addressing. These programs exhibit high spatial locality and regularity, but low temporal locality. The high spatial locality and regularity arise because, if an array item n is used, then it is highly likely that array item n+s will be used, where “s” is a constant stride between data elements in the array. The term “stride” refers to the distance between two items of data in memory. The low temporal locality is due to the fact that an array item n is typically accessed only once, which diminishes the performance benefits of the caches. Further, the small line sizes of typical data caches force the cache line transfers to be carried out through short bursts, thereby causing sub-optimal usage of the memory bandwidth. Still further, large vector sizes cause thrashing in the data cache. Thrashing is detrimental to the performance of the system since the vector data spans a space that is beyond the index space of a cache. Additionally, there is no way to guarantee when specific data will be placed in cache, which does not meet the predictability requirements of real-time applications. Therefore, there is a need for a memory system that handles multimedia vector data efficiently in modern computer systems.
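The effect of low temporal locality and thrashing can be made concrete with a toy model. The sketch below simulates a direct-mapped cache; the cache organization, the line size, the number of lines, and the vector length are all illustrative assumptions, not parameters taken from the specification.

```python
def simulate_direct_mapped(addresses, num_lines, line_bytes):
    """Return (hits, misses) for a toy direct-mapped cache."""
    tags = [None] * num_lines          # which memory line each slot holds
    hits = misses = 0
    for addr in addresses:
        line_number = addr // line_bytes
        index = line_number % num_lines
        if tags[index] == line_number:
            hits += 1
        else:
            misses += 1
            tags[index] = line_number  # evict whatever was there
    return hits, misses


LINE_BYTES, NUM_LINES, WIDTH = 32, 8, 4      # a 256-byte toy cache

# A 1024-byte vector of 4-byte elements, read with unit stride.
vector = [i * WIDTH for i in range(256)]
one_pass = simulate_direct_mapped(vector, NUM_LINES, LINE_BYTES)
two_passes = simulate_direct_mapped(vector + vector, NUM_LINES, LINE_BYTES)

# The same number of elements read with a stride of 8 elements (32 bytes).
strided = [i * 8 * WIDTH for i in range(256)]
strided_pass = simulate_direct_mapped(strided, NUM_LINES, LINE_BYTES)

print(one_pass)      # (224, 32)
print(two_passes)    # (448, 64)
print(strided_pass)  # (0, 256)
```

Under these assumptions the second pass over the vector misses exactly as often as the first, because the vector is larger than the cache and every line has been evicted before it is reused. With a stride equal to the line size, every access misses.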
The present invention provides memory management for an extension to a computer system architecture that improves handling of vector data. The extension, known also as a vector transfer unit (VTU), includes a compiler-directed memory interface mechanism by which vector data sets can be transferred efficiently into and out of the processor under the control of the compiler. Furthermore, the hardware architectural extension of the present invention provides a mechanism by which a compiler can pipeline and overlap the movement of vector data sets with their computation.
Accordingly, the VTU provides a vector transfer pipelining mechanism which is controlled by a compiler. The compiled program partitions its data set into streams, also referred to as portions of the vector data, and schedules the transfer of these streams into and out of the processor in a fashion which allows maximal overlap between the data transfers and the required computation. To perform an operation such as y=f(a,b) in which a, b, and y are all large vectors, the compiler partitions vectors a, b, and y into segments. These vector segments can be transferred between the processor and the memory as separate streams using a burst transfer technique. The compiler schedules these data transfers in such a way that previous computation results are stored in memory, and future input streams are loaded in the processor, while the current computation is being performed.
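The overlap of transfers and computation described above can be sketched as a three-stage schedule. This is a minimal illustration only: the function names, the fixed stream length, and the one-step-per-stage timing are assumptions, and an actual compiler could choose variable-sized streams and a different schedule.

```python
def partition(length, stream_len):
    """Split element indices [0, length) into (start, count) segments."""
    return [(start, min(stream_len, length - start))
            for start in range(0, length, stream_len)]


def pipeline_schedule(num_segments):
    """For each time step, report which segment is being loaded,
    computed, and stored. None means the stage is idle."""
    steps = []
    for t in range(num_segments + 2):
        load = t if t < num_segments else None
        compute = t - 1 if 0 <= t - 1 < num_segments else None
        store = t - 2 if 0 <= t - 2 < num_segments else None
        steps.append((load, compute, store))
    return steps


segments = partition(10, 4)
print(segments)  # [(0, 4), (4, 4), (8, 2)]

for load, compute, store in pipeline_schedule(len(segments)):
    print("load:", load, "compute:", compute, "store:", store)
```

In the middle step of this schedule, segment 2 of the inputs is loaded while segment 1 is computed and the results for segment 0 are stored, which is the overlap the paragraph describes.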
The compiler detects the loops within an algorithm, schedules read and write streams to memory, and maintains synchronization with the computation. An important aspect of the VTU is that the vector streams bypass the data cache when they are transferred into and out of the processor. The compiler partitions vectors into variable-sized streams and schedules the transfer of these streams into and out of the processor as burst transactions.
A vector buffer is a fixed-sized partition in the vector buffer pool (VBP) which is normally allocated to a single application program and is partitioned by the compiler among variable-sized streams each holding a vector segment.
Data is transferred into and out of the VBP using special vector data instructions. One set of instructions performs the transfer of data between the memory and the vector buffers. Another pair of instructions moves the data between the vector buffers and the general-purpose registers (both integer and floating-point registers).
In the present invention, one or more application programs are consecutively processed in the computer system. Each application program issues vector data transfer instructions for transferring vector data between the memory and a vector transfer unit. The vector data transfer instructions are posted to an instruction queue in the VTU. In order to perform a burst transfer, the present invention includes program instructions for determining the starting address of the vector data to be transferred, the ending address of the vector data to be transferred, and whether the ending address of the vector data to be transferred is within one memory page of the starting address. The ending address of the vector data to be transferred is determined based on the number of data elements to be transferred, the stride of the vector data to be transferred, and the width of the vector data elements to be transferred.
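The address determination described above can be sketched as follows. The specification does not give an explicit formula here, so the sketch makes several assumptions that should be read as such: the stride is counted in elements, the ending address is taken to be the last byte of the last element, "within one memory page" is interpreted as lying in the same page as the starting address, and the page size is 4096 bytes. All function names are hypothetical.

```python
PAGE_BYTES = 4096  # assumed page size


def ending_address(start, count, stride, width):
    """Address of the last byte of the last element (assumed definition).

    start  -- byte address of the first element
    count  -- number of data elements to transfer
    stride -- distance between elements, counted in elements (assumption)
    width  -- width of each element in bytes
    """
    return start + (count - 1) * stride * width + width - 1


def within_one_page(start, end, page_bytes=PAGE_BYTES):
    """True if both addresses fall in the same page (assumed interpretation)."""
    return start // page_bytes == end // page_bytes


end_ok = ending_address(0x1000, count=8, stride=4, width=4)
end_bad = ending_address(0x1F00, count=64, stride=4, width=4)

print(hex(end_ok), within_one_page(0x1000, end_ok))    # 0x1073 True
print(hex(end_bad), within_one_page(0x1F00, end_bad))  # 0x22f3 False
```

The first transfer stays inside the page that begins at 0x1000, so it could proceed as a single burst under these assumptions. The second crosses a page boundary and would be rejected.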
In one embodiment of the present invention, the ending address of the vector data to be transferred is determined based on shifting the width of the data elements to be transferred by the stride of the vector data to be transferred. In this embodiment, the amount of data to be transferred is divisible by a factor of two, which allows the multiplication of the stride and width of the data elements to be carried out by shifting.
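The shift-based multiplication can be illustrated as below. The sketch assumes that the stride is a power of two, which is one reading of the condition stated above, and that the stride is counted in elements. The function names are hypothetical.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def element_spacing_by_shift(width, stride):
    """Compute width * stride with a left shift instead of a multiply.

    Only valid when `stride` is a power of two (assumption of this sketch).
    """
    if not is_power_of_two(stride):
        raise ValueError("stride must be a power of two for the shift form")
    shift = stride.bit_length() - 1   # log2(stride)
    return width << shift


print(element_spacing_by_shift(4, 8))  # 32, the same as 4 * 8
```

A shifter is simpler and faster than a general multiplier, which is presumably why this embodiment restricts the operand in this way.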
In another embodiment of the present invention, the ending address of the vector data to be transferred is determined in parallel with determining the starting address of the vector data to be transferred.
One feature of the present invention is that an address error exception occurs when the ending address of the vector data to be transferred is not within one memory page of the starting address. When the data processing system is equipped with virtual memory, the memory page is a virtual memory page.
The foregoing has outlined rather broadly the objects, features, and technical advantages of the present invention so that the detailed description of the invention that follows may be better understood.