The field of the invention is digital processor architecture and, particularly, architectures for processing data in parallel operations.
Digital processors may take a variety of forms which depend on the nature of the processing functions that are to be performed. As shown in FIG. 1a, for example, the most common architecture is a microprocessor 100 which is coupled to a memory 101 by a data bus 102 and an address bus 103. The buses 102 and 103 are typically from 4 bits to 32 bits wide, and the memory 101 stores both the control program that directs the microprocessor 100 to perform its functions and the data on which those functions are performed. The control program is comprised of a set of instructions which the microprocessor 100 is designed to recognize and execute. This architecture is convenient to use because of the well defined instruction set, but it is slow because the same memory 101 must be accessed sequentially for both control program instructions and data.
Where higher performance is required, a so-called bit-slice architecture is often employed. As shown in FIG. 1b, the bit-slice processor 104 executes instructions which it receives from a program memory 105 through a microcode bus 106. In response to the microcode instructions, the bit-slice processor 104 operates on data in a data memory 126. The bit-slice processor 104 operates a microcode address bus 107 to sequence through the control program, and it operates an address bus 108 and a data bus 109 to read data from and write data to the data memory 126. Because the control program is stored separately, the microcode instructions can be pre-fetched and retained in an internal register within the bit-slice processor 104 while the bit-slice processor 104 is still carrying out the execution of the previous microcode instruction. The fetching of control program instructions is carried out in parallel with the execution of previous instructions, thus reducing the time required to access external memories. While bit-slice processors are fast and extremely flexible, they are more difficult to work with because the designer must, in essence, define the microcode instruction set and provide all of the design and maintenance tools to program and maintain the microcode.
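The benefit of pre-fetching can be illustrated with a simple cycle count. The following is a minimal sketch only, assuming that each instruction requires a fixed number of fetch cycles and a fixed number of execute cycles; the function names and the one-cycle defaults are hypothetical and are not taken from any particular processor.

```python
def sequential_cycles(n_instructions, fetch=1, execute=1):
    """Shared memory (FIG. 1a style): each instruction is fetched, then
    executed, with no overlap between the two steps."""
    return n_instructions * (fetch + execute)


def overlapped_cycles(n_instructions, fetch=1, execute=1):
    """Separate program memory (FIG. 1b style): the fetch of the next
    instruction overlaps the execution of the current one."""
    if n_instructions == 0:
        return 0
    # The first fetch and the last execute cannot be hidden; every step in
    # between advances at the rate of the slower of the two stages.
    return fetch + (n_instructions - 1) * max(fetch, execute) + execute


if __name__ == "__main__":
    n = 100
    print("no overlap:", sequential_cycles(n))
    print("with pre-fetch:", overlapped_cycles(n))
```

Under these assumptions, the overlapped count approaches one half of the sequential count as the program grows longer, provided the fetch and execute times are equal.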
To overcome the complexities associated with developing microcode, chip sets have been developed which take advantage of the speed of bit-slice processor architecture, but trade off flexibility for a standard instruction set which is easier to use. Such an architecture is shown in FIG. 1c, where the program memory 110 stores control program instructions comprised of a well defined instruction set that is recognized by three separate units: a program sequencing unit (PSU) 111; an integer processor unit (IPU) 112; and a floating point unit (FPU) 113. The PSU 111 functions to address the proper program instruction through a code address bus 114 and to handle branching, calls to subroutines and interrupts, and the return from subroutines and interrupts. The IPU 112 executes certain of the instructions appearing on the code bus 115, including logical operations, Boolean operations and integer arithmetic operations. The FPU 113 responds to instructions calling for floating point arithmetic operations and it is considered an optional device which need not be used in all applications. The IPU 112 and the FPU 113 operate on data stored in a data memory 116 through a data bus 117 and an address bus 118. The parallel fetching of control program instructions is thus achieved, but a standardized instruction set is employed to develop the control program stored in the program memory 110.
The power, or capability, of any of these architectures can be increased in a number of ways. First, the clock speed can be increased so that the control program is executed more quickly. Second, the number of bits in the data bus and processor may be increased so that higher precision operations can be performed in a single instruction execution time. And finally, the functions to be performed may be divided and allocated to separate processing units which operate in parallel with each other. Such parallel processors may employ one or more types of the above processors which are interconnected through shared memories, data links and the like, and which are coordinated by a master, or host, processor to carry out all of the functions to be performed. While such architectures substantially reduce processing time by performing functions simultaneously, or in parallel, the cost of replicating the processor units can be too high for many applications.
There are applications where identical operations are performed on each set of a plurality of sets of data, and significant reductions in processing time can be achieved by assigning processor units to operate on each separate data set. This is particularly true in systems for processing medical images such as those produced by computed axial tomography (CAT) X-ray systems and nuclear magnetic resonance (NMR) systems. Medical imaging systems characteristically acquire many sets of data representing different "views" of the patient. Each view is an array of intensity data which is processed in identical fashion to reconstruct an image. The processing of the data acquired for each view is identical and is very intensive. While such data can easily be processed more quickly by a set of processors, each operating simultaneously and in parallel on data from a view or slice, the cost of replicating processors in large numbers is prohibitive.
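The pattern described above, in which one identical operation is applied independently to every view, can be sketched in software as follows. This is an illustration only: the per-view operation shown (scaling each intensity by the maximum intensity in the view) is a hypothetical placeholder rather than an actual reconstruction step, and the names process_view and process_all_views and the worker count are assumptions introduced for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def process_view(view):
    """Placeholder for the identical processing applied to every view.
    Here each intensity is scaled by the maximum intensity in the view."""
    peak = max(view)
    if not peak:
        # An all-zero view is returned unchanged to avoid dividing by zero.
        return list(view)
    return [value / peak for value in view]


def process_all_views(views, n_workers=4):
    """Apply process_view to each view, one task per view. The views do not
    depend on one another, so the tasks may run concurrently."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns the results in the same order as the input views.
        return list(pool.map(process_view, views))


if __name__ == "__main__":
    views = [[1, 2, 4], [0, 0, 0], [5, 10]]
    print(process_all_views(views))
```

Because no view depends on the result of another, the work divides cleanly among the workers; the sketch shows only the independence of the per-view tasks and is not intended to demonstrate a measured speedup.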