1. Field of the Invention
This invention relates to computer processors and computer processing system designs. Particularly, the invention relates to central processing units (CPUs) utilizing vector processors.
2. Description of the Related Art
A vector processor is a type of CPU designed to run mathematical operations on a large number of data elements very quickly. This is in contrast to a scalar processor, which handles one element at a time. Currently, most CPUs are scalar processors. Previously, vector processors were common in scientific computing, where they formed the basis of most supercomputers. However, general increases in performance through improved processor designs have made most dedicated vector processors obsolete. Nevertheless, today almost all large production CPU designs include some vector processing instructions, typically known as single instruction/multiple data (SIMD).
In general, CPUs are capable of manipulating only one or two pieces of data at a time. For example, every CPU has some type of instruction to add two numbers and put the result in a particular location. To do this, the number data is usually “pointed to” by passing in an address to a memory location that holds the number data to be operated on. Decoding this address and getting the data out of the memory takes some time. As CPU speeds have increased, this time delay has become more significant.
In order to reduce the decoding delay, most conventional CPUs use a technique known as instruction pipelining, passing instructions through several sub-units in turn. For example, a first sub-unit reads the address and decodes it, a second sub-unit retrieves the values, and a third performs the mathematical operation. With pipelining, the CPU starts decoding the next instruction even before the first has departed the CPU, so the address decoder is constantly in use. Any particular instruction takes the same amount of time to complete (i.e., the latency), but the CPU can process the entire batch much faster than if it performed each instruction completely in a serial fashion.
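The difference between per-instruction latency and batch throughput can be illustrated with a simple cycle count. The following is a minimal sketch only, assuming an idealized pipeline in which every stage takes exactly one cycle and no stalls occur; the function names are hypothetical and are introduced here purely for illustration.

```python
def serial_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when each instruction passes through every stage
    before the next instruction is started."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when a new instruction enters the first stage on
    every cycle (idealized: one cycle per stage, no stalls)."""
    if num_instructions == 0:
        return 0
    # The first instruction still needs num_stages cycles (its latency);
    # every later instruction completes one cycle after its predecessor.
    return num_stages + (num_instructions - 1)
```

Under these assumptions, a batch of 100 instructions on the three-stage example (decode, retrieve, operate) takes 300 cycles serially and 102 cycles pipelined, while the latency of any single instruction remains three cycles.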
Vector processors take the concept of instruction pipelining further. Instead of pipelining only the instructions, vector processors also pipeline the data. For example, vector processors may be provided instructions that direct not only adding two values, but also adding all of the values in a defined range to all of the values in another defined range. Instead of constantly decoding instructions and then retrieving the data needed to complete them, vector processors read a single instruction from memory, and anticipate what the next address is, e.g. the next address increment. This technique yields significant savings in the overall decoding time.
Completing a single vector instruction may take longer than completing an add-two-numbers instruction in a general purpose CPU. However, this single vector instruction represents many instructions of the general purpose CPU. The vector processor avoids much of the address decoding and also has only a single vector instruction to decode. Since the instructions are also stored in general memory, and general memory is typically very slow compared to the CPU, this technique dramatically improves overall performance by allowing the data set to be read from memory very quickly.
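The saving in instruction decoding can be sketched by counting decodes for the same element-wise addition performed both ways. This is a simplified software model, not a description of any particular processor; the function names and the decode counters are assumptions made for illustration.

```python
from typing import List, Tuple


def scalar_add(a: List[int], b: List[int]) -> Tuple[List[int], int]:
    """Model a scalar CPU: one add instruction is decoded per element pair."""
    if len(a) != len(b):
        raise ValueError("ranges must have equal length")
    result = []
    decodes = 0
    for x, y in zip(a, b):
        decodes += 1  # a separate instruction is fetched and decoded each time
        result.append(x + y)
    return result, decodes


def vector_add(a: List[int], b: List[int]) -> Tuple[List[int], int]:
    """Model a vector processor: one instruction covers both whole ranges."""
    if len(a) != len(b):
        raise ValueError("ranges must have equal length")
    decodes = 1  # the single vector instruction is decoded once
    # Operand addresses advance by a fixed increment, so no further
    # instruction decoding is modeled inside the loop.
    result = [a[i] + b[i] for i in range(len(a))]
    return result, decodes
```

Both functions return the same sums; in this model only the decode count differs, growing with the range length for the scalar version and remaining at one for the vector version.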
In addition, the vector processor is typically produced in some form of superscalar implementation. Accordingly, when a range of numbers is added, multiple parts of the CPU (e.g. two or four) perform the addition, not just one. Since the output of a vector command does not rely on the input from any other vector processor, the multiple parts can each add some of the numbers in parallel. Thus, the whole operation is completed in a fraction of the time. Vector processors are particularly suited for applications where large amounts of data are operated on. Accordingly, vector processors have historically been used in supercomputers, which are specifically designed to process huge amounts of data.
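The division of a range among multiple parts of the CPU can be sketched as follows. This is an illustrative model only: the contiguous chunking scheme and the names lane_slices and multi_lane_add are assumptions, and the chunks are processed one after another here, whereas hardware would process them at the same time.

```python
from typing import List, Tuple


def lane_slices(n: int, num_lanes: int) -> List[Tuple[int, int]]:
    """Split n elements into contiguous (start, stop) chunks, one per lane."""
    base, extra = divmod(n, num_lanes)
    slices = []
    start = 0
    for lane in range(num_lanes):
        # The first `extra` lanes take one additional element each.
        size = base + (1 if lane < extra else 0)
        slices.append((start, start + size))
        start += size
    return slices


def multi_lane_add(a: List[int], b: List[int], num_lanes: int = 4) -> List[int]:
    """Add two ranges element-wise, chunk by chunk."""
    if len(a) != len(b):
        raise ValueError("ranges must have equal length")
    out = [0] * len(a)
    for start, stop in lane_slices(len(a), num_lanes):
        # No element of this chunk depends on a result from another chunk,
        # which is why the chunks could be computed in parallel.
        for i in range(start, stop):
            out[i] = a[i] + b[i]
    return out
```

With ten elements and four lanes, this scheme assigns chunks of three, three, two, and two elements.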
In addition, recent advances in CPU design have provided high-performance systems with multiple independent vector processors in other applications as well, such as the cell processor. The cell processor employs nine cores, one acting as a controller while the remaining cores (i.e. attached processing units [APUs]) are very high performance vector processors. Each APU includes a block of very high speed memory. The APUs can operate independently or process a stream of data in combination with other APUs each working on different portions. This capability to function as a stream processor gives rise to the full processing potential of the cell processor.
Cell processors are specifically designed to function together. While they may be directly connected, they may also be connected in other ways, even distributed over a network. Cell processors are not exclusively designed for any particular application; they can be used in a wide range of devices, e.g. personal computers, game consoles, personal digital assistants (PDAs), televisions, and other media devices. In addition, multiple cell processors can be used to effectively act as a single system. The infrastructure for this is built into each cell processor to operate on “software cells” which include routing information as well as programs and data.
Parallel processing is usually complex, requiring specific parallel programming to utilize the hardware. The cell processor does not require reprogramming; the operating system automatically reviews the available resources and distributes tasks. However, in many instances reprogramming may be performed for the cell processor to obtain even greater performance for some applications, similar to the advantages obtained from programming in a low level language, e.g. machine language. Processing power is increased simply by adding more cells. Thus, the cell architecture provides distributed, parallel processing employing very powerful computational engines.
However, to achieve full speed, such processors (cell processors or other vector processor architectures) typically have limited, but extremely high performance, local memory, e.g., 256 Kbytes. Slower-speed access to a larger (e.g. multi-gigabyte) dynamic random access memory (DRAM) is also provided. To operate on problems requiring large (e.g. multi-gigabyte) datasets, some sort of caching is required. Since the hardware does not provide hardware caching capabilities in all cases, software caching may be used instead. In such instances, the processing speed of this software caching is crucial.
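For orientation, the general idea of a software cache that keeps a small, fast local store in front of a larger, slower main memory can be sketched as follows. This is a generic direct-mapped scheme chosen for brevity and is not the caching mechanism of the present invention; the class name, the line size, and the number of lines are hypothetical, and main memory is modeled as a simple list.

```python
from typing import List, Optional


class DirectMappedSoftwareCache:
    """Generic direct-mapped software cache over a list standing in for DRAM."""

    def __init__(self, main_memory: List[int], line_size: int = 4,
                 num_lines: int = 8) -> None:
        self.main_memory = main_memory
        self.line_size = line_size
        self.num_lines = num_lines
        # tags[slot] records which main-memory line currently occupies the slot.
        self.tags: List[Optional[int]] = [None] * num_lines
        self.lines: List[Optional[List[int]]] = [None] * num_lines
        self.hits = 0
        self.misses = 0

    def read(self, address: int) -> int:
        line_number = address // self.line_size
        slot = line_number % self.num_lines
        if self.tags[slot] != line_number:
            # Miss: copy the whole line from slow main memory into local store.
            self.misses += 1
            start = line_number * self.line_size
            self.lines[slot] = self.main_memory[start:start + self.line_size]
            self.tags[slot] = line_number
        else:
            self.hits += 1
        return self.lines[slot][address % self.line_size]
```

Every read pays for the tag comparison and index arithmetic in software, so the cost of this lookup path bounds the speed of the cache as a whole.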
Typical prior art memory caching systems include hardware cache mechanisms and software hash, tree, and dense-array caching structures for mass-storage accesses. However, each of these caching approaches is difficult to directly employ with vector processors. For example, hardware cache mechanisms use dedicated special-purpose hardware to process the cache manipulation. This special-purpose hardware is not available on high-performance vector processors because it would slow down processing, increase power consumption, and reduce the number of vector units that could be placed on a single die. Software hash, tree, and dense-array caching structures are typically employed for mass-storage accesses. These structures require excessive overhead for local-memory-to-main-memory caches. This overhead is acceptable when the secondary memory is on high-latency devices (e.g. mass storage devices), but not when the secondary memory is main memory DRAM. Examples of some prior art memory systems for processors are as follows.
U.S. Pat. No. 5,379,393 by Yang issued Jan. 3, 1995, teaches a cache memory system for use during vector processing in a processor. The processor contains a CPU and a main memory. The system includes a vector cache memory, a first address register, a main memory address calculation unit, and a cache address calculation unit. The first register stores a first address associated with an instruction executed by the CPU. The main memory address calculation unit is coupled to the first address register for calculating a second address utilizing the first address and vector stride data associated with said executed instruction. The second address is utilized to access the main memory. The cache address calculation unit is coupled to both the first address register and the main memory address calculation unit for calculating the third address utilizing portions of the first address and portions of the second address. The third address is utilized to access the vector cache memory.
However, U.S. Pat. No. 5,379,393 merely uses hardware mapping based on primes related to a power of two in order to improve cache performance relative to a direct-mapped cache. The hardware caching approach adds to the hardware complexity and consumes additional power.
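The motivation for a prime-based mapping can be illustrated in software. The sketch below is not taken from U.S. Pat. No. 5,379,393 and does not reproduce its address calculation; it assumes, for illustration only, a cache of 32 slots, a prime modulus of 31 (one less than a power of two), and a vector stride of 32 lines. All function names are hypothetical.

```python
from typing import Callable


def direct_mapped_slot(line_number: int, num_slots: int) -> int:
    """Conventional direct mapping: slot index is the line number modulo
    the (power-of-two) number of slots."""
    return line_number % num_slots


def prime_mapped_slot(line_number: int, prime: int) -> int:
    """Illustrative prime mapping: slot index is the line number modulo a prime."""
    return line_number % prime


def distinct_slots(stride: int, count: int,
                   mapper: Callable[[int, int], int], modulus: int) -> int:
    """Number of different slots touched by `count` accesses spaced `stride` apart."""
    return len({mapper(i * stride, modulus) for i in range(count)})
```

Under these assumptions, 31 strided accesses all collide in a single slot with direct mapping, whereas the prime modulus spreads them over 31 different slots.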
U.S. Pat. No. 5,148,536 by Witek et al. issued Sep. 15, 1992, teaches a load/store pipeline in a computer processor for loading data to registers and storing data from the registers that has a cache memory within the pipeline for storing data. The pipeline includes buffers which support multiple outstanding read request misses. Data from out of the pipeline is obtained independently of the operation of the pipeline, this data corresponding to the request misses. The cache memory can then be filled with the data that has been requested. The provision of a cache memory within the pipeline, and the buffers for supporting the cache memory, speed up loading operations for the computer processor.
However, U.S. Pat. No. 5,148,536 merely merges a hardware cache into a vector pipeline to allow misses to be processed without stalling the pipeline. Similar to U.S. Pat. No. 5,379,393 above, the hardware caching approach adds to the hardware complexity and consumes additional power.
In view of the foregoing, there is a need in the art for a high-speed caching mechanism that can handle main memory (e.g. DRAM) as secondary storage. Accordingly, there is a consequent need for an extremely efficient cache-lookup mechanism to fully leverage vector instructions on vector processors. These and other needs are met by embodiments of the present invention as detailed hereafter.