The present invention relates to parallel computing, and more particularly to mesh connected computing.
In a number of technological fields, such as digital signal processing of video image data, it is necessary to perform substantially identical logical or arithmetic operations on large amounts of data in a short period of time. Parallel processing has proven to be an advantageous way of quickly performing the necessary computations. In parallel processing, an array of processor elements, or cells, is configured so that each cell performs logical or arithmetic operations on its own data at the same time that all other cells are processing their own data. Machines in which the logical or arithmetic operation being performed at any instant in time is identical for all cells in the array are referred to by several names, including Single Instruction-Multiple Data (SIMD) machines.
A common arrangement for such a machine is as a rectangular array of cells, with each interior cell being connected to its four nearest neighboring cells (designated north, south, east and west) and each edge cell being connected to a data input/output device. In this way, a mesh of processing elements is formed. Accordingly, the term "Mesh Connected Computer" (MCC) is often applied to this architecture.
In a MCC, each cell is connected as well to a master controller which coordinates operations on data throughout the array by providing appropriate instructions to the processing elements. Such an array proves useful, for example, in high resolution image processing. The image pixels comprise a data matrix which can be loaded into the array for quick and efficient processing.
Although SIMD machines may all be based upon the same generic concept of an array of cells all performing the same function in unison, parallel processors vary in details of cell design. For example, U.S. Pat. No. 4,215,401 to Holsztynski et al. discloses a cell which includes a random access memory (RAM), a single bit accumulator, and a simple logical gate. The disclosed cell is extremely simple and, hence, inexpensive and easily fabricated. A negative consequence of this simplicity, however, is that some computational algorithms are quite cumbersome so that it may require many instructions to perform a simple and often repeated task.
U.S. Pat. No. 4,739,474 to Holsztynski et al., represents a higher level of complexity, in which the logic gate is replaced by a full adder capable of performing both arithmetic and logical functions. This increase in the complexity of the cell's computational logic allows fewer cells to provide higher performance.
U.S. patent application Ser. No. 08/112,540, now U.S. Pat. No. 6,073,185 which was filed on August 27, 1993 in the name of Meeker, describes still further improvements in SIMD architecture computers.
It is important to note that the various improvements in this technological field, such as the substitution of a full adder for a logic gate, while superficially simple, are in reality changes of major consequence. The cell structure cannot be allowed to become too complex. This is because in a typical array, the cell will be repeated many times. The cost of each additional element in terms of money and space on a VLSI chip is therefore multiplied many times. It is therefore no simple matter to identify those functions that are sufficiently useful to justify their incorporation into the cell. It is similarly no simple matter to implement those functions so that their incorporation is not realized at too high a cost.
Parallel processors may also vary in the manner of cell interconnection. As mentioned above, cells are typically connected to their nearest physical neighbors. All cells except those at the edge of the entire array are typically connected to four neighbors. However, the provision of alternate paths of interconnection may produce additional benefits in the form of programmable, flexible interconnection between cells.
As mentioned above, MCCs prove especially useful in applications such as high resolution image processing. Various types of sensors are capable of producing large quantities of data signals (henceforth referred to simply as "data") that, when taken together, constitute an "image" of the sensed object or terrain. The term "image" is used broadly throughout this specification to refer not only to pictures produced by visible light, but also to any collection of data, from any type of sensor, that can be considered together to convey information about an object that has been sensed. In many applications, the object or terrain is sensed repeatedly, often at high speed, thereby creating many images constituting a voluminous amount of data. Very often, the image data needs to be processed in some way, in order to be useful for a particular application. While it is possible to perform this processing "off-line" (i.e., at a time after all of the data has been collected), the application that mandates the collection of image data may further require that the images be processed in "real-time", that is, that the processing of the image data keep up with the rate at which it is collected from the sensor. Further complicating the image processing task is the fact that some applications require the sensing and real-time processing of images that are simultaneously collected from two or more sensors.
Examples of the need for high-speed image processing capability can be found in both military and civil applications. For example, future military weapon platforms will use diverse suites of high-data-rate infrared, imaging laser, television, and imaging radar sensors that require real-time automatic target detection, recognition, tracking, and automatic target handoff-to-weapons capabilities. Civil applications for form processing and optical character recognition, automatic fingerprint recognition, and geographic information systems are also being pursued by the government.
Perhaps the greatest future use of real-time image processing will be in commercial applications like medical image enhancement and analysis, automated industrial inspection and assembly, video data compression, expansion, editing and processing, optical character reading, automated document processing, and many others.
Consequently, the need for real-time image processing is becoming a commonplace requirement in commercial and civil government markets as well in the traditional high-performance military applications. The challenge is to develop an affordable processor that can handle the tera-operations-per-second processing requirement needed for complex image processing algorithms and the very high data rates typical of video imagery.
One solution that has been applied to image processing applications with some success has been the use of high-performance digital signal processors (DSP), such as the Intel i860 or the Texas Instruments (TI) TMS320C40, which have architectures inspired by high-performance military vector processing algorithms, such as linear filters and the fast Fourier transform. However, traditional DSP architectural characteristics, such as floating point precision and concurrent multiply-accumulate (vector) hardware components, are less appropriate for image processing applications since they process with full precision whether it is needed or not.
New hardware architectures created specifically for image processing applications are beginning to emerge from the military aerospace community to satisfy the demanding requirements of civil and commercial image processing applications. Beyond the high input data rates and complex algorithms, the most unique characteristics of image processing applications are the two-dimensional image structures and the relatively low precision required to represent and process video data. Sensor input data precision is usually only 8 to 12 bits per pixel. Shape analysis edge operations can be accomplished with a single bit of computational precision. While it is possible that some other operations may require more than 12 bits, the average precision required is often 8 bits or less. These characteristics can be exploited to create hardware architectures that are very efficient for image processing.
Both hard-wired (i.e., algorithm designed-in hardware) and programmable image processing architectures have been tried. Because of the immaturity of image processing-algorithms, programmable image processing architectures (which, by definition, are more flexible than hard-wired approaches) are the most practical. These architectures include Single Instruction Single Data (SISD) uniprocessors, Multiple Data Multiple Instruction (MIMD) vector processors, and Single Instruction Multiple Data (SIMD) two-dimensional array processors.
Massively parallel SIMD operating architectures, having two-dimensional arrays of processing elements (PE), each operating on a small number of pixels, have rapidly matured over the last 10 years to become the most efficient architecture for high-performance image processing applications. These architectures exploit image processing's unique algorithm and data structure characteristics, and are therefore capable of providing the necessary tera-operation-per-second support to image processing algorithms at the lowest possible hardware cost.
The bit-serial design of most SIMD image processing architectures represents the logical and complete extension of the Reduced Instruction Set Computer (RISC) design concept. Where required by the algorithm suite, the SIMD bit serial PE is flexible enough to perform 1 bit or full precision floating point operations. In all cases, the highest possible implementation efficiencies are achieved because excess hardware in the SIMD architecture is never idle, in contrast to those solutions which employ DSP hardware for image processing. Two-dimensional SIMD image processing architectures also mirror the two-dimensional image data structures to achieve maximum interprocessor communication efficiency. These processors typically use direct nearest neighbor (i.e, north, south, east, and west) PE connections to form fine-grained, pixel-to-processor mapping between the computer architecture and the image data structure. The two-dimensional grid of interconnections provides two-dimensional SIMD architectures with inherent scalability. As the processing array is increased in size, the data bandwidth of the inter-PE bus (i.e, two-dimensional processor interconnect) increases naturally and linearly.
While a SIMD architecture makes available the raw processing power necessary to process image data in real-time, this capability is of little use if the processor is left idle whenever the surrounding hardware is either supplying image data to, or retrieving processed data from, the processor. Thus, it is necessary for the overall architecture of a real-time image processor to efficiently collect data from the sensors, supply it to the processing engine, and just as quickly move processed data out of the processing engine.
A number of these problems are addressed in a real-time image processor as described in U.S. Pat. No. 5,606,707, to Tomassi et al. In the Tomassi et al. processor, a mesh-connected array of processing elements is coupled to other components that perform such tasks as instruction generation, and image management, including the moving of images into and out of the array of processing elements. It is now recognized by the inventors of the present invention that one drawback with the system as described in the Tomassi et al. patent derives from its hardware organization. In particular, the Tomassi et al. system is implemented as five separate Application Specific Integrated Circuits (ASICs) that, together, operate as a complete system. The result is high cost, high complexity and high risk in the design of complete systems. The functional partitioning of the Tomassi et al. system also tends to limit the input/output (I/O) bandwidth, affecting the overall throughput.