2.1 Applications of the Invention
The invention is primarily intended to assist in the handling and processing of large amounts of numeric data in real time at low cost, while consuming a minimum of power and occupying a minimum of space. Such applications generally fall under the category of real-time digital signal processing and include image and video processing, pattern recognition, multimedia, and audio processing. In addition, many other applications, such as communications, can also benefit from the high rate of data handling and processing provided by the invention.
2.2 Microprocessor Chips
Microprocessor chips, such as the large family of x86 chips from Intel, are generally intended for the processing of data in desktop computing applications. While high processing speed is desirable to minimize the amount of time that the user spends waiting to obtain a result, the processing is generally not in real time because live data sources and sinks are generally not present. Much of the data is character oriented, such as for word processing, although the ability to process large amounts of numerical data in floating-point format for scientific and engineering applications is provided in the most recent microprocessor chips. Additional chips are required to facilitate the transfer of data between the microprocessor chip and input, output and storage devices. In addition, since the microprocessor chips must support vast numbers of software applications that were created many years ago, the chip architectures are oriented toward applications that process a single datum at a time.
Methods for improving the performance of processors include the use of the Reduced Instruction Set Computer (RISC) design philosophy, the use of the Super Scalar architecture, and the use of the Very Long Instruction Word (VLIW) architecture.
With the RISC philosophy, the chip architect attempts to minimize the amount of circuitry required to build the chip while maximizing the speed at which that relatively small amount of circuitry operates. One usual consequence of this approach is that the software tools that prepare programs for execution on the chip must be intimately aware of the allowable flows of operations in the chip, and must modify any sequence of instructions that could not otherwise be executed correctly. Previously, hardware in the chip was required to detect invalid sequences and temporarily suspend operation until the potential for invalid operation had passed.
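The software-side handling of invalid sequences can be sketched as follows. This is a minimal illustration in Python, assuming a hypothetical machine in which a value loaded from memory cannot be read by the immediately following instruction and the hardware does not check for this. The tuple instruction format and the name `fill_load_delay` are illustrative only and are not taken from any real tool chain.

```python
# Hypothetical instruction format: (opcode, destination, source1, source2).
# Assumption: the result of a "load" is not yet available to the very next
# instruction, and the hardware performs no interlock check.

NOP = ("nop", None, None, None)

def fill_load_delay(program):
    """Return a copy of `program` with a NOP inserted after every load
    whose result is read by the immediately following instruction."""
    fixed = []
    for i, instr in enumerate(program):
        fixed.append(instr)
        opcode, dest, _, _ = instr
        if opcode == "load" and i + 1 < len(program):
            _, _, src1, src2 = program[i + 1]
            if dest in (src1, src2):
                fixed.append(NOP)   # pad so the loaded value has time to arrive
    return fixed

program = [
    ("load", "r1", "r9", None),   # r1 <- memory[r9]
    ("add",  "r2", "r1", "r3"),   # reads r1 one instruction too early
    ("load", "r4", "r9", None),
    ("add",  "r5", "r2", "r3"),   # does not read r4, so no padding is needed
]
scheduled = fill_load_delay(program)
```

A production tool would normally try to move an unrelated instruction into the gap rather than insert a no-operation; the padding approach is used here only because it is the simplest correct modification.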
With the Super Scalar and Very Long Instruction Word architectures, the processor architect observes that some portions of some adjacent, generally dissimilar, sequences of operations can be executed simultaneously while preserving proper program function. The instruction set of the processor, and the amount of hardware in the processor, are constructed to facilitate the specification and execution of multiple operations simultaneously.
When using the Very Long Instruction Word architecture, processors such as those built by the now-defunct Multiflow Computer company have instruction words with hundreds of bits, divided into many groups. Each group of bits controls a different portion of the hardware in the processor. In such machines, numerous arithmetic-logic units, each independently controlled, have been provided.
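The division of one wide instruction word into independently interpreted groups can be sketched as follows. The field width (8 bits) and the number of functional units (4) are assumptions chosen to keep the example small; they do not describe the Multiflow machines or any other real processor, and the function names are hypothetical.

```python
FIELD_BITS = 8   # assumed width of each control group
NUM_UNITS = 4    # assumed number of independently controlled units

def pack_word(fields):
    """Pack one control field per functional unit into a single wide word."""
    assert len(fields) == NUM_UNITS
    word = 0
    for unit, field in enumerate(fields):
        assert 0 <= field < (1 << FIELD_BITS)
        word |= field << (unit * FIELD_BITS)   # unit 0 occupies the low bits
    return word

def unpack_word(word):
    """Split a wide word back into the control field for each unit."""
    mask = (1 << FIELD_BITS) - 1
    return [(word >> (unit * FIELD_BITS)) & mask for unit in range(NUM_UNITS)]

fields = [0x12, 0x00, 0xA5, 0xFF]   # one arbitrary control code per unit
word = pack_word(fields)            # 0xFFA50012
```

Because each unit looks only at its own group of bits, all units can be directed by the same word in the same cycle.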
When using the Super Scalar architecture, the instruction unit contains control logic that allows the observation of multiple instruction words simultaneously. The number of bits in each instruction word is usually in the range of 32 to 64 bits, as in most microprocessors, which is much smaller than that found in Very Long Instruction Word processors. The control logic has the ability to determine when it can execute instructions out of sequence while preserving normal program operation, rather than waiting for all previous instructions to execute. Thus multiple scalar operations, such as a memory operation and an operation by the arithmetic-and-logic unit, can sometimes be processed simultaneously rather than sequentially, increasing execution speed.
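The kind of decision made by such control logic can be sketched as a check on a pair of adjacent instructions. This is a deliberately conservative illustration, not the logic of any particular processor; the `Instr` record, the unit names and the register names are hypothetical.

```python
from collections import namedtuple

# Hypothetical instruction record: which hardware unit it needs, which
# register it writes, and which registers it reads.
Instr = namedtuple("Instr", "unit dest srcs")

def can_issue_together(first, second):
    """Return True if `second` may execute in the same cycle as `first`
    without changing the result of the program (conservative check)."""
    if first.unit == second.unit:
        return False   # both need the same piece of hardware
    if first.dest in second.srcs:
        return False   # second reads the result of first
    if first.dest == second.dest:
        return False   # both write the same register
    if second.dest in first.srcs:
        return False   # second would overwrite an input of first
    return True

load      = Instr("mem", "r1", ("r9",))
add_indep = Instr("alu", "r2", ("r3", "r4"))
add_dep   = Instr("alu", "r5", ("r1", "r4"))

pair_ok = can_issue_together(load, add_indep)    # independent: may overlap
pair_blocked = can_issue_together(load, add_dep) # add_dep needs r1 from the load
```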
2.3 Digital Signal Processor Chips
Digital Signal Processor (DSP) chips, such as the Texas Instruments C80, are intended for the processing of data in real time. The rate at which data is processed and moved around must thus be rapid, but the finite processing power and I/O bandwidth of the chip impose a limit upon the amount of data and the complexity of the processing that can be performed in real time. DSP chips generally have a much smaller addressing range than provided by microprocessors because only a relatively small amount of random access memory (RAM) is required for the temporary storage and processing of live data, and because mass storage devices, such as disk drives, are rarely used.
Most DSP chips, like microprocessors, support the processing of only a single datum at one time. An exception is the Texas Instruments C80 which has one control processor and four parallel processors within it. However, these five processors operate substantially independently of one another using the multiple-instruction multiple-data (MIMD) architecture. Thus the use of the five processors in one package is substantially the same as the use of five separate processors.
2.4 Massively and Moderately Parallel Processors
2.4.1 Massively Parallel Processors
Machines with thousands to tens of thousands of processors have been built using the single-instruction multiple-data (SIMD) architecture. Examples are the now-defunct Connection Machine from Thinking Machines Corporation and the long-defunct Illiac-IV, which was developed at the University of Illinois and built by Burroughs. These machines have a single instruction unit that controls the operation of all of the many processors in lock-step. It is often difficult to keep all of the processors busy, both because the amount of parallelism in the hardware does not match the amount of parallelism in the application and because data-dependent operations must be performed that cause large fractions of the machine to become inactive. The physical size of such machines was large, a cubic meter or more, due to the many components required to build them, and very few machines were produced due to the high price.
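The loss of efficiency caused by data-dependent operations can be sketched as follows. In this simplified model, which is an assumption made for illustration and not a description of any specific machine, every element is stepped through both paths of a branch and a mask decides which elements commit a result on each pass. The function name `simd_if_else` is hypothetical.

```python
def simd_if_else(data, condition, then_op, else_op):
    """Apply a data-dependent branch across all elements in lock-step.

    Returns the results and the fraction of element-passes that did
    useful work. Both paths are stepped through by every element."""
    if not data:
        return [], 0.0
    mask = [condition(x) for x in data]
    result = list(data)
    # Pass 1: only elements where the condition holds are active.
    for i, active in enumerate(mask):
        if active:
            result[i] = then_op(data[i])
    # Pass 2: only the remaining elements are active.
    for i, active in enumerate(mask):
        if not active:
            result[i] = else_op(data[i])
    useful = len(data)       # each element is active in exactly one pass
    total = 2 * len(data)    # but every element is stepped through both
    return result, useful / total

values = [3, -1, 4, -5, 9, -2, 6, -8]
out, utilization = simd_if_else(
    values,
    lambda x: x >= 0,   # data-dependent test
    lambda x: x * 2,    # path taken by non-negative elements
    lambda x: 0,        # path taken by negative elements
)
```

In this model a single two-way branch halves the useful work per pass regardless of the data; nested branches reduce it further.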
2.4.2 Moderately Parallel Processors
Machines with tens to thousands of processors have been built using the multiple-instruction multiple-data (MIMD) architecture. Each of the processors is typically a common microprocessor. The many processors communicate with one another over a communications network, typically via some sort of packet-oriented protocol. Since each processor can fetch and execute instructions independently of the others, the fraction of the processors that are busy is generally higher than in large machines using the single-instruction multiple-data architecture. However, some of this improved efficiency is lost by the need to send messages from one processor to another, and it is often difficult to efficiently divide a problem among the many processors. The physical size of such machines ranged from a single, fully populated printed circuit board to one or more large cabinets.
Relatively small parallel processors with tens to hundreds of processors have also been built using the single-instruction multiple-data (SIMD) architecture. The interconnection of these many processors is generally between registers within the processors via serial connections in one or several dimensions. The passing of data between such registers is generally difficult to represent in high-level languages, which purposely hide the presence of registers and focus on the processing of variables in RAM.
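Register-to-register passing between neighboring processors can be sketched in one dimension as follows. Each list position stands for the register of one processor; the zero fill at the two ends and the function names are assumptions made for illustration.

```python
def shift_right(regs, fill=0):
    """Every element passes its register to its right-hand neighbor."""
    return [fill] + regs[:-1]

def shift_left(regs, fill=0):
    """Every element passes its register to its left-hand neighbor."""
    return regs[1:] + [fill]

def three_point_sum(regs):
    """Each element adds its own value to those of its two neighbors."""
    from_left = shift_right(regs)    # value arriving from the left neighbor
    from_right = shift_left(regs)    # value arriving from the right neighbor
    return [a + b + c for a, b, c in zip(from_left, regs, from_right)]

regs = [1, 2, 3, 4, 5]
sums = three_point_sum(regs)   # [3, 6, 9, 12, 9]
```

The shifts have no natural counterpart in a language that exposes only variables in memory, which is the representational difficulty noted above.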
An example of a data path chip, the portion of the processor containing the parallel computation elements, is the CNAPS-64 chip from Adaptive Solutions, Inc. It contains 64 8-/16-bit computation elements, each with its own small-capacity local memory. While high performance could be obtained once data had been moved into the data path chips, the ability to rapidly move data into and out of the data path chips was severely limited, greatly hurting performance in many applications. In addition, the amount of local memory provided to each computation element was fixed at a small value that could not be expanded and was often not optimum for the application.
In these SIMD machines, a single, external instruction unit would drive multiple data path chips simultaneously. Such a machine, like its much larger, massively parallel cousins, often operates inefficiently, both because the amount of parallelism in the hardware does not match the amount of parallelism in the application and because data-dependent operations must be performed that cause large fractions of the machine to become inactive. The physical size of such machines ranged from one to several fully populated printed circuit boards.
In addition, the programming of such SIMD machines generally relies upon a library of data-processing subroutines that have been hand-crafted by the builders of the machine. Only by using such a library can users program the machine relatively easily for specific tasks and create applications that execute relatively efficiently.
2.5 Compilers
A severe limitation in the use of parallel processors has been the difficulty of creating applications for them. After all, computing hardware is useless without software to operate it. A critical problem in the programming of parallel processors has been the difficulty of representing the parallel processing. If few applications are created for new computing hardware, little of the hardware will be sold and the hardware will be a failure in the marketplace. Such failures have occurred many times.
One of the earliest forms of parallel processing was found in the vector execution units of supercomputers, such as the Cray-1 and its next several generations of successors. These execution units were intended for doing matrix arithmetic in floating-point representation on large problems such as are found in aerodynamics and the development of nuclear weapons. Due to the complexity of the vector hardware, the vendor of the supercomputer, who best understood the operation of the hardware, typically developed a library of subroutine calls for common matrix operations. These subroutines were typically incorporated into a program being developed by the user using a FORTRAN compiler.
The programming situation for prior-art moderately and massively parallel processors of the single-instruction multiple-data architecture is little changed from the programming of the vector processors of the Cray-1. The common method for representing data remains the vector, which can have hundreds to thousands or more elements. Due to the complexity of the parallel hardware, the vendor of the parallel processor, who best understands the operation of the hardware, typically develops a library of subroutine calls for common operations. These subroutines are typically incorporated into a program being developed by the user using a C compiler.
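The library-based programming style can be sketched as follows. The routine `vec_scale_add` is hypothetical and stands in for a vendor-supplied subroutine; on real hardware it would be hand-tuned for the parallel machine, whereas here it is ordinary Python written only to show how the user program is structured around whole-vector calls.

```python
# Hypothetical vendor-supplied routine. The user sees only its interface;
# the parallel implementation is hidden inside the library.
def vec_scale_add(scale, x, y):
    """Return scale * x + y, element by element, for equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return [scale * a + b for a, b in zip(x, y)]

# User program: the computation is expressed as one call on whole vectors
# rather than as an explicit loop over individual elements.
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
z = vec_scale_add(2.0, x, y)
```

Any operation that the library does not provide must either be composed from the routines that exist or be written by someone with detailed knowledge of the hardware.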
The programming situation for prior-art parallel processors of the multiple-instruction multiple-data architecture relies upon the ability of programmers to divide a task into pieces suitable for being processed individually by each of the many processors. The use of common microprocessors helps the programmer understand the operation of each processor, since the programming of scalar processors is well known, and enables the programmer to focus on the task-partitioning and inter-processor communications aspects of the application.
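The task-partitioning step can be sketched as follows. Threads from the Python standard library stand in for separate processors, which is an assumption made so that the example is self-contained; the sketch shows only the division of work and the combination of partial results, not a packet-oriented network. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(items, num_workers):
    """Split `items` into `num_workers` contiguous, nearly equal pieces."""
    base, extra = divmod(len(items), num_workers)
    pieces, start = [], 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)   # first `extra` pieces get one more
        pieces.append(items[start:start + size])
        start += size
    return pieces

def parallel_sum_of_squares(items, num_workers=4):
    """Each worker processes its own piece; partial results are then combined."""
    pieces = partition(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(lambda p: sum(x * x for x in p), pieces))
    return sum(partials)

data = list(range(10))
total = parallel_sum_of_squares(data)
```

Each worker runs ordinary scalar code on its piece, so the programmer's effort goes into choosing the partition and collecting the partial results.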