The present invention relates to data processing and, more particularly, to multiprocessor data processing. A major objective of the invention is to provide for improved performance in a pipelined series of processors.
Much of modern progress is associated with advances in computer technology. A classical computer system includes a processor and memory. The memory includes memory storage locations, each with a unique memory address. The contents of the memory include data and instructions. The instructions constitute one or more computer programs according to which data is to be worked (manipulated). The processor reads the instructions from memory and executes them. According to the instructions executed, the processor reads data from memory, manipulates the data, and writes data to memory.
The processor itself can include an instruction decoder, an execution unit, a set of registers, and a program counter. The instruction decoder decodes instructions for execution by the execution unit. The execution unit uses the registers as a very small and very fast local memory, for example, to store coefficients, operands, partial results and final results for various operations called for by the instructions. The program counter is, in effect, a self-incrementing register so that, by default, instructions are fetched from memory in the order of the addresses of the memory locations at which they are stored. This default order can be changed as called for by various conditional and unconditional "jump" instructions.
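By way of illustration only, the role of the program counter can be sketched in a few lines of Python. The instruction names ("load_const", "add", "jump", "jump_if_nonzero"), the register names, and the function run are hypothetical and are chosen only for this sketch; they do not correspond to any particular processor.

```python
def run(program, max_steps=1000):
    """Execute a toy program and return the final register contents."""
    regs = {}
    pc = 0  # program counter: address of the next instruction to fetch
    steps = 0
    while pc < len(program) and steps < max_steps:
        op, *args = program[pc]  # fetch and decode
        pc += 1                  # default: continue at the next address
        if op == "load_const":
            regs[args[0]] = args[1]
        elif op == "add":
            dst, a, b = args
            regs[dst] = regs[a] + regs[b]
        elif op == "jump":               # unconditional jump
            pc = args[0]
        elif op == "jump_if_nonzero":    # conditional jump
            if regs[args[0]] != 0:
                pc = args[1]
        else:
            raise ValueError(f"unknown instruction: {op}")
        steps += 1
    return regs


# Sum 3 + 2 + 1 using a loop built from a conditional jump.
program = [
    ("load_const", "r0", 3),        # address 0: loop counter
    ("load_const", "r1", 0),        # address 1: accumulator
    ("load_const", "r2", -1),       # address 2: decrement amount
    ("add", "r1", "r1", "r0"),      # address 3: accumulator += counter
    ("add", "r0", "r0", "r2"),      # address 4: counter -= 1
    ("jump_if_nonzero", "r0", 3),   # address 5: repeat from address 3
]
result = run(program)
```

In this sketch, instructions are fetched in address order unless a jump instruction overwrites the program counter.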
The earliest processors had to complete execution of an instruction before the next instruction was fetched. Since an instruction fetch from memory is relatively time consuming, processors are usually designed with instruction buffers to hold instructions, e.g., at successive memory locations, likely to be used next. Still, a delay is involved wherever processing of one instruction has to wait until execution of the previous instruction is complete.
Pipelining, analogous to assembly lines in manufacturing, was introduced to minimize the latency between instruction executions. For example, in a two-stage pipeline, one instruction can be decoded while the previous instruction is being executed. Multi-stage pipelines break instruction execution into stages, with each stage processing a different instruction. The latency between instructions is reduced roughly in proportion to the number of stages in the pipeline.
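The effect of the number of stages can be estimated with a simplified model, sketched below. The model assumes that every stage takes exactly one clock cycle and that the pipeline never stalls; both are idealizations, and the function names are hypothetical.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction passes through every stage before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """The first instruction fills the pipeline; after that, one
    instruction completes per cycle (idealized, no stalls)."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


# 100 instructions on a 4-stage design under the assumptions above.
without_pipeline = unpipelined_cycles(100, 4)
with_pipeline = pipelined_cycles(100, 4)
speedup = without_pipeline / with_pipeline
```

Under these assumptions, the speedup approaches the number of stages as the number of instructions grows.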
Architectural advances such as pipelining, along with increasing circuit densities and clock speeds, have provided dramatic improvements in processor performance over the years. At any given level of processor development, further advances in performance can be achieved by using multiple processors. While parallel arrangements of processors are more widely known, serial arrangements of processors can be used to great advantage in many situations.
In a serial multiprocessor system, a first data set can be worked by a first processor. The data, as worked by the first processor, can then be worked by a second processor; in the meantime, a second data set can be worked by the first processor. Then, the first data set can be worked by a third processor, the second data set worked by the second processor, and a third data set worked by the first processor. For example, the first processor performs error detection and correction, the second processor analyzes the data to determine what further processing should be performed, and the third processor performs the further processing. The data is thus pipelined through the series of processors to reduce the latency between data sets. In this case, the pipeline stages are not execution stages within a processor, but separate processors within the series.
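The progression of data sets through the series of processors can be tabulated with a short sketch. It assumes that every processor takes one time step per data set and that transfers between processors are instantaneous; the function name schedule and the zero-based numbering of data sets and processors are choices made only for this illustration.

```python
def schedule(n_data_sets, n_processors):
    """Return one row per time step; row[p] is the index of the data set
    worked by processor p at that step, or None if processor p is idle."""
    if n_data_sets == 0:
        return []
    steps = []
    for t in range(n_data_sets + n_processors - 1):
        row = []
        for p in range(n_processors):
            d = t - p  # data set d reaches processor p at step d + p
            row.append(d if 0 <= d < n_data_sets else None)
        steps.append(row)
    return steps


# Three data sets through three processors in series.
table = schedule(3, 3)
```

At the third time step (table[2]), the first processor works the third data set, the second processor works the second data set, and the third processor works the first data set, as in the sequence described above.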
The performance of a pipelined series of processors is adversely affected to the extent of any latency involved in transferring data from one processor to the next. If the transfer involves reading and writing to a common main memory, this latency can be quite large. The latency can be much reduced if some or all of the data can be transferred directly from the registers of the upstream processor to those of the downstream processor.
One limitation of this approach is that the memory capacity of a processor's register set is typically very small (so that access speeds can be high). For example, a typical processor might only be able to access sixteen 32-bit registers. (A processor might have a greater number of registers, but only be able to access a limited number of these without switching operating modes.) Nonetheless, there are applications that involve sufficiently small data sets that all or most data that needs to be transferred can be transferred directly between processors. For example, once an upstream processor completes its processing of a data set, it can transfer the data from its registers to a downstream processor using "store" (or "store multiple") commands, while the downstream processor uses "load" (or "load multiple") commands to transfer the data into its registers.
Of course, there is still a latency involved in the inter-processor transfer of data. Even with such efficient instructions as "load multiple" and "store multiple", the latency measured in clock cycles will exceed the number of registers involved in the data transfer. What is needed is a pipelined serial processor system that minimizes this latency to provide for greater system throughput.
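The cost of such a transfer can be illustrated with a rough model, sketched below. The model is an assumption made only for illustration: it charges one clock cycle per register transferred plus a fixed overhead of at least one cycle, and it assumes that a processor cannot work a data set while the transfer is in progress. The overhead value and the processing-cycle figure are hypothetical and are not taken from any particular processor.

```python
def transfer_cycles(n_registers, overhead_cycles=1):
    """Assumed model: one cycle per register plus a fixed overhead."""
    if n_registers < 0 or overhead_cycles < 1:
        raise ValueError("need n_registers >= 0 and overhead_cycles >= 1")
    return n_registers + overhead_cycles


def cycles_per_data_set(processing_cycles, n_registers, overhead_cycles=1):
    """Cycles a pipeline stage spends per data set when the transfer
    does not overlap with processing (an assumption of this sketch)."""
    return processing_cycles + transfer_cycles(n_registers, overhead_cycles)


# Sixteen registers transferred after a hypothetical 100 cycles of work.
transfer = transfer_cycles(16)
period = cycles_per_data_set(100, 16)
```

In this model the transfer always costs more cycles than the number of registers involved, and any reduction in transfer cycles shortens the per-data-set period of the stage directly.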