This invention relates to computers. In particular the invention relates to massively parallel computers having processor arrays and methods for using arrays of processors to solve problems. Specific embodiments of the invention are particularly useful for image processing.
Image processing is both computationally intensive and data intensive. By way of example, using an MPEG ("Motion Picture Experts Group") image compression algorithm to compress a 20 Megabytes-per-second television signal in real time may require on the order of 200 billion arithmetic operations per second. The goal of providing cost effective computer systems capable of providing the extremely high throughput required for image processing and similar tasks has so far eluded the computer industry.
One way to achieve higher throughput in computer image processing systems is to use a higher speed processor. The processor could be any of several types commonly in use, such as RISC (reduced instruction set computer), CISC (complex instruction set computer), DSP (digital signal processor), or VLIW (very long instruction word). A basic problem with applying a high speed processor to data intensive applications such as image processing is that the processor typically spends a significant amount of time moving data to and from the memory. Further, when a single processor is used, the inherently parallel nature of many image processing algorithms must be broken down by the programmer into a serial program which works with one or at most a few pixels at a time.
Another common approach to achieving real-time performance in difficult image processing applications is to build custom hardware to perform the image processing. To do so, a problem is typically broken down into its main functional steps, and each step is implemented by different hardware sub systems. The hardware may be provided on an application specific integrated circuit (ASIC) or the like. Such hardware-based solutions do not typically scale up very well to larger image sizes, nor are they readily applicable to other problems.
A further way to achieve higher throughput is to divide the image processing task between many processor elements (PEs). For inherently two-dimensional (2D) problems, such as image processing, which deal with 2-dimensional arrays of data elements, such as pixels, it is natural to arrange a number of processing elements so that each processing element is logically arranged at a node of a 2-dimensional grid. Local connections are provided between neighbouring processors. A natural way to implement many 2D problems is to assign a single processor element to each data element. That is, to provide processor elements arranged at nodes of a mesh which has the same dimensions as the array of data elements that it manipulates. There are many examples of the use of computer processor arrays for solving image processing and other computational problems.
An architecture that assigns only a few data elements per processor element is termed "fine-grained". In contrast, a coarse-grained architecture has many data elements assigned to each processor element. M. J. Flynn, "Very High Speed Computing Systems", Proceedings of the IEEE, Vol. 54, No. 12, pp. 1901-1909 (1966), categorized parallel processing computing systems into three categories: SIMD (single instruction stream, multiple data streams), MIMD (multiple instruction streams, multiple data streams) and MISD (multiple instruction streams, single data stream). In a SIMD system, the same instruction is broadcast to all processor elements. Each processor element has its own set of registers along with some means for it to receive unique data (such as a data value for a particular pixel in an image). In SIMD systems each individual processor element can be simple because it does not require a separate program counter or logic for fetching instructions from memory. Consequently, SIMD arrays can be well suited for fine-grained architectures.
In MIMD architectures every processor element has its own program store and can operate independently of other processor elements. A MIMD processor array may also be termed a "multi-computer", because each processor element is a full computer in its own right. MIMD architectures are not as well suited to fine-grained problems such as image processing because each processor element in a MIMD array is more complicated than, and requires larger circuits than, its counterpart in a SIMD array. Further, inter-processor contention for shared resources is an issue because the processor elements in a MIMD array operate independently.
In MISD architectures a single stream of data is passed along a chain of processors with a different operation performed at each step in the chain. Systems which implement MISD architectures are more commonly referred to as systolic arrays, and are well suited to signal processing and video scan line processing, but not well suited to problems such as image compression that require two-dimensional operations.
In a SIMD array it is difficult to implement algorithms where one group of processor elements is required to operate differently from another group of processor elements. In some SIMD architectures individual processor elements can conditionally skip instructions (SIMD architectures without this capability can achieve the effect of conditional statements through more complicated mathematical expressions).
Models for studying and modelling parallel computing have been proposed in which there are multiple instruction streams each of which is provided to a specific set of processing elements and multiple data streams. Such models are termed MSIMD models. Typically each instruction stream is associated with a specific data stream.
A key problem with using any parallel array of processors is to program the processors in the array in such a way that the parallelism is well utilized (i.e. so that a good proportion of the processors are kept busy most of the time). As a simple example, consider the following conditional branch structure, coded in the C programming language. Such a conditional sequence might occur where the behaviour of some processor elements (e.g. processor elements processing pixels which are located at the boundary of an image) needs to be different from all other processor elements.
if (r0 == 0)
{
    /* Sequence A for non-boundary pixels */
    . . .
}
else
{
    /* Sequence B for boundary pixels */
    . . .
}
In this example, r0 is the symbolic name for a register in each processor element. The processor element executes either sequence A or sequence B depending on the state of its r0 register. It can be appreciated that if sequence A and sequence B are equally long then each processor element will be utilized only 50% of the time because it will have to skip one or other of the conditional branches. The processor elements all receive the same instruction stream. While a processor element is skipping instructions it is not performing useful work.
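The 50% figure can be checked with a small model. The following is a minimal sketch under stated assumptions, not the hardware described here: the function name simd_branch_utilization and its parameters are hypothetical, and the model assumes that every processor element occupies an instruction slot for both sequences while doing useful work only in the sequence selected by its r0 register.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of a SIMD array executing an if/else in lockstep.
 * r0[i] is the r0 register of processor element i; len_a and len_b are
 * the lengths (in instructions) of sequence A and sequence B.
 * Returns the fraction of PE instruction slots spent on useful work.
 */
double simd_branch_utilization(const int *r0, size_t num_pes,
                               unsigned len_a, unsigned len_b)
{
    assert(num_pes > 0 && len_a + len_b > 0);

    unsigned long useful = 0;
    for (size_t i = 0; i < num_pes; i++) {
        /* A PE works during sequence A if r0 == 0, otherwise during B. */
        useful += (r0[i] == 0) ? len_a : len_b;
    }

    /* Every PE occupies a slot for every broadcast instruction. */
    unsigned long total = (unsigned long)num_pes * (len_a + len_b);
    return (double)useful / (double)total;
}
```

With sequences of equal length the result is 0.5 regardless of how many processor elements take each branch, which matches the figure given above.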
A table lookup operation is another example of inefficient utilization of a parallel array. Consider a table lookup operation wherein each processor element is required to retrieve an element from a table based on the contents of a register. Table lookup operations of this type are used commonly, for example, to implement such tasks as colour correction, contrast enhancement, or texture mapping. Typically the table is much larger than the memory available at each processor element. Even if there were sufficient data storage at each processor element it would be a poor use of memory resources to have a copy of the same table in the memory of every processor element. Since each processor element requires access to a specific element of the table, either the table must be stored in an external memory or the entire table must be broadcast to every processor element. If the table is stored in an external memory then there will be contention problems caused by a large number of processor elements attempting simultaneously to access the table. If the table is broadcast to all of the processor elements then each processor element waits until the appropriate table value is broadcast, and stores only this value. It ignores all other values. It can be appreciated that processor utilization is very low during such look-up operations. Even if the contents of a table are broadcast to processor elements in a number of data streams, each processor element must do significant work to obtain the one value from the table that it requires. This increases power consumption of the processor array.
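The broadcast case can be illustrated with a short simulation. This is a sketch only: the function broadcast_lookup and its argument names are hypothetical, and it assumes that one table entry is broadcast per cycle on a single data stream.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of a broadcast table lookup on a single data stream.
 * index[pe] is the register value of processor element pe; result[pe]
 * receives the table entry that the PE keeps.
 * Returns the number of broadcast cycles consumed (always table_len).
 */
size_t broadcast_lookup(const int *table, size_t table_len,
                        const size_t *index, int *result, size_t num_pes)
{
    for (size_t pe = 0; pe < num_pes; pe++)
        assert(index[pe] < table_len);

    for (size_t t = 0; t < table_len; t++) {        /* one entry per cycle */
        for (size_t pe = 0; pe < num_pes; pe++) {   /* every PE sees it */
            if (index[pe] == t)
                result[pe] = table[t];              /* keep the matching value */
            /* otherwise the PE ignores the broadcast value */
        }
    }
    return table_len;
}
```

In this model each processor element stores a value on exactly one of the table_len cycles, so the useful fraction of its time is 1/table_len.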
An important characteristic of massively parallel architectures is the way in which processor elements are interconnected with one another. Various interconnection schemes are known. For example, U.S. Pat. No. 4,314,349 discloses a typical architecture wherein each processor element is connected to its immediate neighbours to the "north", "south", "east", and "west". A problem with such limited connectivity is that any translation operation (combination of horizontal and vertical shifts) can only be implemented one processor element step at a time. This is especially a problem for any algorithm that needs to compute a single result that involves all data elements, such as determining the maximum pixel value in an image. In a "four connected neighbourhood" architecture as exemplified by U.S. Pat. No. 4,314,349, it takes at least R×C operations to obtain such a value, where R is the number of rows in the processor array and C is the number of columns in the processor array. The overall result is that individual processor elements spend a lot of time idle while values propagate through the rest of the array. A further problem with such limited connectivity is that the array cannot readily process volumetric (three dimensional) image data because the PEs cannot be reconfigured into a mesh representing a three dimensional structure.
It is also known to connect processor elements at a border of an array to corresponding processor elements on the opposite border. U.S. Pat. No. 5,590,356 discloses an example of such a "torus" architecture. While improving the efficiency of certain image operations, a torus architecture still does not help the global evaluation problem, and it introduces long wiring paths (from one edge of the array to another) that limit the data transfer rate between processor elements because of the propagation delays along these long paths.
Some architectures have a much higher degree of connectivity. For example, U.S. Pat. No. 4,805,091 describes an array of processor elements logically arranged at nodes of a many-dimensional hyper-cube and a message routing system which permits each processor element to pass packets of data to another processor element with few intervening steps. While it can achieve more efficient processor utilization than the architectures described above, this type of architecture is difficult to implement in a monolithic array. Long path propagation delays adversely affect the scalability of the system.
Large arrays of processors can often be made fault tolerant so that, if one or more processors are defective, their functions can be assumed by spare processors. There are a number of examples of fault tolerant processor arrays in the academic and patent literature, including those disclosed in U.S. Pat. Nos. 4,314,349; 5,625,836; 5,590,356; 5,748,872; 5,956,274; and 4,722,084. Fault tolerance in memory arrays (e.g. as described by U.S. Pat. Nos. 6,032,264 and 5,920,515) has proven very beneficial in reducing their price because fault tolerance greatly increases the yield of operational chips. This is especially important because memories are typically very high density, and so especially sensitive to defects. It is much more difficult to provide a fault tolerant processor array than it is to provide a fault tolerant memory array because the cells in a memory array do not need to communicate with each other as do the processors in a processor array. So if a defect in a memory array is avoided by replacing an entire row or column, it is not necessary for the replacement row or column to be located physically adjacent to the defect. However, in a processor array, any fault correction scheme must replace the defective cell in such a way that all the local interconnections are implemented.
There is a need for cost effective computer systems capable of efficiently handling multi-dimensional problems, such as image processing. There is a particular need for such systems capable of handling streams of data, such as video image data in real time. There is a particular need for such systems which are scalable through a wide range of array sizes with a minimum of software or hardware changes.
This invention provides arrays of processor elements which have advantages over the prior art. One aspect of the invention provides a processor array comprising a plurality of interconnected processor elements, a plurality of instruction buses connected to each of the processor elements, at least one data bus connected to each of the processor elements and an instruction selection switch associated with each of the processor elements. Different processors in the array can be performing instructions in different instruction streams. Each processor element is connected to execute instructions from one of the plurality of instruction buses as selected by its instruction selection switch.
In preferred embodiments each of the processing elements comprises an instruction bus selection register and the instruction selection switch is constructed to select a one of the plurality of instruction buses corresponding to a data value in the instruction bus selection register. The contents of the instruction bus selection register can be changed under software control.
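The selection mechanism can be pictured with a minimal sketch. The type PeState, the function select_instruction, the bus count of four, and the representation of an instruction as an unsigned word are all assumptions made for illustration; the text above does not fix any of them.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_INSTRUCTION_BUSES 4   /* assumed bus count, illustration only */

/* Hypothetical per-PE state: only the selection register is modelled. */
typedef struct {
    unsigned bus_select;   /* instruction bus selection register */
} PeState;

/*
 * Models the instruction selection switch: returns the instruction word
 * present on the bus named by the PE's selection register this cycle.
 */
unsigned select_instruction(const PeState *pe,
                            const unsigned buses[NUM_INSTRUCTION_BUSES])
{
    assert(pe != NULL);
    assert(pe->bus_select < NUM_INSTRUCTION_BUSES);
    return buses[pe->bus_select];
}
```

Because the register is ordinary per-element state, software can, for example, point boundary processor elements at one bus and interior processor elements at another, so that neither group has to skip instructions intended for the other.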
Most preferably the array comprises a plurality of data buses connected to each of the processor elements. A data selection switch associated with each of the processor elements can be used to select one of the data buses. Each processor element can be connected to receive data from a one of the plurality of data buses selected by its data selection switch. The data buses are not necessarily associated with any particular instruction stream.
In preferred embodiments, each of the processor elements is connected to send data to and receive data from other processor elements in a cruciate neighbourhood.
Another aspect of the invention provides a processor array comprising a plurality of interconnected processor elements. Each of the processor elements is logically arranged at an intersection of a row and a column in a grid comprising a plurality of rows and a plurality of columns. Each of the processor elements is connected to transmit data to a plurality of neighbouring processor elements. The plurality of neighbouring processor elements comprises a number N greater than 1 of processor elements in the column on either side of the processor element and a number M greater than 1 of processor elements in the row on either side of the processor element. Preferably N is greater than 4 and M is greater than 4. Most preferably M=N=2n+1, wherein n is an integer and n≥1. In currently preferred embodiments N≥9 and M≥9.
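The cross-shaped neighbourhood defined above can be expressed as a membership test. This is a sketch under stated assumptions: the function name in_cruciate_neighbourhood is hypothetical, rows and columns are plain integer coordinates, and array edges and wrap-around are not modelled.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Returns 1 if the PE at (other_row, other_col) lies in the cruciate
 * neighbourhood of the PE at (row, col): within n positions in the same
 * column, or within m positions in the same row. A PE is not its own
 * neighbour. Hypothetical helper, illustration only.
 */
int in_cruciate_neighbourhood(int row, int col,
                              int other_row, int other_col,
                              int n, int m)
{
    assert(n > 0 && m > 0);

    int dr = abs(other_row - row);
    int dc = abs(other_col - col);

    if (dc == 0)
        return dr >= 1 && dr <= n;   /* same column, up or down */
    if (dr == 0)
        return dc >= 1 && dc <= m;   /* same row, left or right */
    return 0;                        /* diagonal positions are excluded */
}
```

For a processor element away from the array edges, this test admits 2N neighbours in its column and 2M neighbours in its row.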
A further aspect of the invention provides a method for operating a processor array comprising a plurality of processor elements. Each of the processor elements has a plurality of registers which require periodic refreshing at a refresh frequency. The method comprises providing one or more streams of instructions to each of the processor elements for execution by the processor elements and, periodically inserting into the one or more instruction streams register refresh instructions, the register refresh instructions causing the processor elements to rewrite data values in the registers. Preferably the processor element is left in the same state after execution of a refresh instruction as it was before execution of the refresh instruction. This permits refresh instructions to be inserted at any time, as required.
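The insertion step can be sketched as a pass over an instruction stream. The opcode value REFRESH_OPCODE, the function insert_refresh, and the use of an instruction count as a stand-in for the refresh period are assumptions made for illustration; an actual controller would presumably derive the interval from the refresh frequency of the registers.

```c
#include <assert.h>
#include <stddef.h>

#define REFRESH_OPCODE 0xFFFFu   /* hypothetical encoding of a refresh */

/*
 * Copies `in` to `out`, inserting a refresh instruction before every
 * `interval`-th program instruction (but not before the first one).
 * Returns the number of instructions written to `out`.
 */
size_t insert_refresh(const unsigned *in, size_t in_len,
                      unsigned *out, size_t out_cap, size_t interval)
{
    assert(interval > 0);

    size_t n = 0;
    for (size_t i = 0; i < in_len; i++) {
        if (i > 0 && i % interval == 0) {
            assert(n < out_cap);
            out[n++] = REFRESH_OPCODE;   /* rewrites register contents */
        }
        assert(n < out_cap);
        out[n++] = in[i];                /* original program instruction */
    }
    return n;
}
```

Inserting at arbitrary points in this way is only safe because, as stated above, a refresh instruction leaves the processor element in the same state as before it executed.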
A still further aspect of the invention provides a method for operating a processor array having a plurality of interconnected processor elements. The method comprises providing an array of processor elements, each of the processor elements logically arranged at an intersection of a row and a column in a grid comprising a plurality of rows and a plurality of columns. Each of the processor elements is connected to transmit data to a plurality of neighbouring processor elements, the plurality of neighbouring processor elements comprising a number N of processor elements in the column on either side of the processor element and a number M of processor elements in the row on either side of the processor element. The method continues by determining when one or more of the processor elements is defective; and, for each defective one of the processor elements, ignoring either the row or column containing the defective one of the processor elements. The shape of the neighbourhoods permits rows and/or columns to be ignored while preserving the functionality of the processor array.
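One way to picture ignoring rows is as a mapping from logical rows to the physical rows that remain in use. The sketch below is illustrative only: the function map_rows_skipping_defects and the per-row defect flags are hypothetical, and it handles rows only; columns could be treated the same way.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Builds a logical-to-physical row map that skips every physical row
 * flagged as containing a defective PE.
 * row_defective[p] is non-zero if physical row p is to be ignored.
 * logical_to_physical must have room for physical_rows entries.
 * Returns the number of usable (logical) rows.
 */
size_t map_rows_skipping_defects(const int *row_defective,
                                 size_t physical_rows,
                                 size_t *logical_to_physical)
{
    assert(row_defective != NULL && logical_to_physical != NULL);

    size_t logical = 0;
    for (size_t p = 0; p < physical_rows; p++) {
        if (!row_defective[p])
            logical_to_physical[logical++] = p;   /* keep working row */
    }
    return logical;
}
```

In this picture, two logically adjacent rows may be separated by an ignored physical row; a neighbourhood that extends more than one position in each direction is what allows them to remain directly connected.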
A still further aspect of the invention provides a method for implementing a table lookup operation in a processor array. The method comprises: providing a processor array comprising a plurality of processor elements; providing multiple data streams to each processor element; providing a lookup table comprising several parts, each part corresponding to a range of values and each of the parts comprising one or more table values; simultaneously transmitting the several parts of the lookup table on the multiple data streams; at each processor element selecting a data stream to access as a function of a data value in the processor element; and, at each processor element retrieving from the selected data stream a table value corresponding to the data value of the processor element.
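The steps of this method can be sketched as a simulation. The function multistream_lookup and its parameters are hypothetical, and the sketch assumes that the table is divided into contiguous parts of equal length, with one part per data stream and one entry per stream per cycle; the method as stated does not require equal parts.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of a multi-stream table lookup.
 * The table holds num_streams parts of part_len entries each; part s is
 * transmitted on stream s, one entry per cycle. value[pe] is the data
 * value held by processor element pe; result[pe] receives its entry.
 * Returns the number of cycles consumed (part_len).
 */
size_t multistream_lookup(const int *table, size_t num_streams,
                          size_t part_len, const size_t *value,
                          int *result, size_t num_pes)
{
    assert(num_streams > 0 && part_len > 0);

    for (size_t t = 0; t < part_len; t++) {          /* one cycle per offset */
        for (size_t pe = 0; pe < num_pes; pe++) {
            /* Stream selected as a function of the PE's data value. */
            size_t stream = value[pe] / part_len;
            assert(stream < num_streams);

            /* Take the entry on that stream when its offset comes up. */
            if (value[pe] % part_len == t)
                result[pe] = table[stream * part_len + t];
        }
    }
    return part_len;
}
```

Under these assumptions the lookup completes in part_len cycles rather than the full table length, because all parts are transmitted simultaneously.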
Further features and advantages of the invention are described below.