The present invention relates to parallel computing, more particularly to mesh connected computing, and even more particularly to the distribution, processing and reconstruction of images by means of a mesh connected computer having fewer processing elements than the size of the image.
In a number of technological fields, such as digital signal processing of video image data, it is necessary to perform substantially identical logical or arithmetic operations on large amounts of data in a short period of time. Parallel processing has proven to be an advantageous way of quickly performing the necessary computations. In parallel processing, an array of processor elements, or cells, is configured so that each cell performs logical or arithmetic operations on its own data at the same time that all other cells are processing their own data. Machines in which the logical or arithmetic operation being performed at any instant in time is identical for all cells in the array are referred to by several names, including Single Instruction-Multiple Data (SIMD) machines.
A common arrangement for such a machine is as a rectangular array of cells, with each interior cell being connected to its four nearest neighboring cells (designated north, south, east and west) and each edge cell being connected to a data input/output device. In this way, a mesh of processing elements is formed. Accordingly, the term “Mesh Connected Computer” (MCC) is often applied to this architecture.
In an MCC, each cell is connected as well to a master controller which coordinates operations on data throughout the array by providing appropriate instructions to the processing elements. Such an array proves useful, for example, in high resolution image processing. The image pixels comprise a data matrix which can be loaded into the array for quick and efficient processing.
Although SIMD machines may all be based upon the same generic concept of an array of cells all performing the same function in unison, parallel processors vary in details of cell design. For example, U.S. Pat. No. 4,215,401 to Holsztynski et al. discloses a cell which includes a random access memory (RAM), a single bit accumulator, and a simple logical gate. The disclosed cell is extremely simple and, hence, inexpensive and easily fabricated. A negative consequence of this simplicity, however, is that some computational algorithms are quite cumbersome so that it may require many instructions to perform a simple and often repeated task.
U.S. Pat. No. 4,739,474 to Holsztynski et al. represents a higher level of complexity, in which the logic gate is replaced by a full adder capable of performing both arithmetic and logical functions. This increase in the complexity of the cell's computational logic allows fewer cells to provide higher performance.
U.S. patent application Ser. No. 08/112,540, which was filed on Aug. 27, 1993, now U.S. Pat. No. 6,073,185, in the name of Meeker, and U.S. patent application Ser. No. 09/057,482, which was filed on Apr. 9, 1998, now U.S. Pat. No. 6,173,388, in the name of Andrew P. Abercrombie et al., each describe still further improvements in SIMD architecture computers.
As mentioned above, MCCs prove especially useful in applications such as high resolution image processing. Various types of sensors are capable of producing large quantities of data signals (henceforth referred to simply as “data”) that, when taken together, constitute an “image” of the sensed object or terrain. The term “image” is used broadly throughout this specification to refer not only to pictures produced by visible light, but also to any collection of data, from any type of sensor, that can be considered together to convey information about an object that has been sensed. In many applications, the object or terrain is sensed repeatedly, often at high speed, thereby creating many images constituting a voluminous amount of data. Very often, the image data needs to be processed in some way, in order to be useful for a particular application. While it is possible to perform this processing “off-line” (i.e., at a time after all of the data has been collected), the application that mandates the collection of image data may further require that the images be processed in “real-time”, that is, that the processing of the image data keep up with the rate at which it is collected from the sensor. Further complicating the image processing task is the fact that some applications require the sensing and real-time processing of images that are simultaneously collected from two or more sensors.
Examples of the need for high-speed image processing capability can be found in both military and civil applications. For example, future military weapon platforms will use diverse suites of high-data-rate infrared, imaging laser, television, and imaging radar sensors that require real-time automatic target detection, recognition, tracking, and automatic target handoff-to-weapons capabilities. Civil applications for form processing and optical character recognition, automatic fingerprint recognition, and geographic information systems are also being pursued by the government. Perhaps the greatest future use of real-time image processing will be in commercial applications like medical image enhancement and analysis, automated industrial inspection and assembly, video data compression, expansion, editing and processing, optical character reading, automated document processing, and many others.
Consequently, real-time image processing is becoming a commonplace requirement in commercial and civil government markets as well as in the traditional high-performance military applications. The challenge is to develop an affordable processor that can handle the tera-operations-per-second processing requirement needed for complex image processing algorithms and the very high data rates typical of video imagery.
One solution that has been applied to image processing applications with some success has been the use of high-performance digital signal processors (DSPs), such as the Intel i860 or the Texas Instruments (TI) TMS320C40, which have architectures inspired by high-performance military vector processing algorithms, such as linear filters and the fast Fourier transform. However, traditional DSP architectural characteristics, such as floating point precision and concurrent multiply-accumulate (vector) hardware components, are less appropriate for image processing applications since they process with full precision whether it is needed or not.
New hardware architectures created specifically for image processing applications are beginning to emerge from the military aerospace community to satisfy the demanding requirements of civil and commercial image processing applications. Beyond the high input data rates and complex algorithms, the most distinctive characteristics of image processing applications are the two-dimensional image structures and the relatively low precision required to represent and process video data. Sensor input data precision is usually only 8 to 12 bits per pixel. Shape analysis edge operations can be accomplished with a single bit of computational precision. While it is possible that some other operations may require more than 12 bits, the average precision required is often 8 bits or less. These characteristics can be exploited to create hardware architectures that are very efficient for image processing.
Both hard-wired (i.e., algorithm designed-in hardware) and programmable image processing architectures have been tried. Because of the immaturity of image processing algorithms, programmable image processing architectures (which, by definition, are more flexible than hard-wired approaches) are the most practical. These architectures include Single Instruction Single Data (SISD) uniprocessors, Multiple Instruction Multiple Data (MIMD) vector processors, and Single Instruction Multiple Data (SIMD) two-dimensional array processors.
Massively parallel SIMD operating architectures, having two-dimensional arrays of processing elements (PEs), each operating on a small number of pixels, have rapidly matured over the last 10 years to become the most efficient architecture for high-performance image processing applications. These architectures exploit image processing's unique algorithm and data structure characteristics, and are therefore capable of providing the necessary tera-operation-per-second support to image processing algorithms at the lowest possible hardware cost.
Where required by the algorithm suite, the SIMD bit serial PE is flexible enough to perform 1 bit or full precision floating point operations. In most cases, the highest possible implementation efficiencies are achieved because excess hardware in the SIMD architecture is seldom idle, in contrast to those solutions which employ DSP hardware for image processing. Two-dimensional SIMD image processing architectures also mirror the two-dimensional image data structures to achieve maximum interprocessor communication efficiency. These processors typically use direct nearest neighbor (i.e., north, south, east, and west) PE connections to form fine-grained, pixel-to-processor mapping between the computer architecture and the image data structure. The two-dimensional grid of interconnections provides two-dimensional SIMD architectures with inherent scalability. As the processing array is increased in size, the data bandwidth of the inter-PE bus (i.e., two-dimensional processor interconnect) increases naturally and linearly.
The fastest image processing time could be achieved by configuring the size of a PE array to exactly match the expected size of the largest image to be processed. In such a configuration, one would need only to load the entire image into the array, control the PE array to perform the image processing algorithm, and then read out the results. However, in order for a parallel processing system to be commercially feasible, the quantity of parallel processing elements in a system must be significantly smaller than the number of pixels in the incoming image. When this is the case, the incoming image must be broken down into smaller sub-images which are then separately processed and then reconstructed for output. For flexibility, the system should also support variable-sized input and output images, preferably by simply reprogramming the sub-image distribution scheme.
For example, consider the case in which an N×M PE array is embodied on a single integrated circuit (IC), with each of the interior PEs connected to its four nearest neighbors (NORTH, EAST, SOUTH, and WEST). A larger array, for example a 5N×5M array, can be constructed by configuring an array of these ICs (e.g., a 5×5 array of these ICs) on a circuit board (henceforth referred to simply as “board”). Still greater processing power can be arranged by designing a system that includes multiple boards.
Within any given IC, each of the PEs is coupled to its nearest neighbors, and is therefore capable of exchanging data with one or more of those neighbors as directed by the master controller. Similarly, the PEs arranged on any one board are often interconnected to enable the PEs along the perimeter of one IC's PE array to exchange data with a neighboring PE located along the perimeter of a neighboring IC's PE array. Usually, however, it is impractical to design a system that provides the ability for any PE located on one board to exchange data with any PE located on a different board within the same system.
The ability, or lack thereof, of a PE to exchange data with a neighboring PE has ramifications on how an image can best be processed by the array because many image processing algorithms require that, in order to process any given pixel, information about one or more of that pixel's neighboring pixels be available. For example, consider the exemplary image frame 100 depicted in FIG. 1. The image frame 100 comprises a 3M×2N array of pixels. Assume that a system for processing the image comprises six boards, each having an M×N array of PEs arranged thereon. One might then divide up the image frame 100 into six frame segments 101, 103, 105, 107, 109, 111, each consisting of a unique M×N section of the image frame 100. Each of the frame segments 101, 103, 105, 107, 109, 111 can then be supplied to a respective one of the six boards for processing. When processing is complete, the processed sections can then be collected from the individual boards and reconstructed to form a complete processed image frame.
The above-described processing strategy is likely to produce less than desirable results. First, if the system is designed in such a way that the PEs on one board are not capable of exchanging data with the PEs located on other boards, then the processing of pixels located along the borders between adjacent frame segments will suffer from “edge effects” due to interaction with “off-array” pixels instead of the actual neighboring pixels. For example, if the rows of the image frame 100 are numbered from 1 to 2N, starting from the top, and if the columns of the image frame 100 are numbered from 1 to 3M, starting from the left, then the processing of the pixel located at row 1, column M (denoted “p(M,1)”) should take into account the value of the neighboring pixel located at row 1, column (M+1) (denoted “p(M+1,1)”). However, because these pixels have been distributed to different boards, the processing algorithm applied to each of these pixels will use an erroneous pixel value in place of the actual horizontally neighboring pixel value. Similar edge effects will result at the borders between frame segments 101, 103, 105, 107, 109, 111 in the vertical direction as well.
Furthermore, the edge effect problem can occur in connection with pixels that are located entirely within the PE array of a single board if the size of the frame segment 101, 103, 105, 107, 109, 111 is larger than the size of a single board's PE array, thereby requiring that the frame segment 101, 103, 105, 107, 109, 111 be further subdivided into “subframes” that are sequentially processed by the PE array on the board. For example, suppose that an M×N frame segment 101 is to be processed by a board having only an M/2×N/2 PE array. This can be accomplished by subdividing the M×N frame segment 101 into four distinct subframes, each sized at M/2×N/2. Because the PE array will have to process each of these in sequence, the PEs that process pixels located along an edge of one subframe will not be able to utilize information about the value of a horizontally or vertically neighboring pixel located along an edge of a neighboring subframe. This will result in edge effect problems.
To avoid these edge effect problems, image frames can be divided into overlapping frame segments, whereby some pixels may be assigned to two or more frame segments. For example, consider the image frame 200 shown in FIG. 2. The exemplary image frame 200 consists of a 720×480 array of pixels. In order to permit the image frame 200 to be processed in a system having six boards, each board having its own PE array that does not exchange data with any other PE array, the image frame 200 can be divided into six frame segments (FSs) 207, each dimensioned as a 300×300 pixel array. As can be seen in FIG. 2, dimensioning the frame segments 207 in this manner means that there are areas of overlap between adjacent frame segments 207. In this example, we have the following situation:
the pixels located in the rightmost 90 columns of the frame segment 207 assigned to board 1 also make up the leftmost 90 columns of the frame segment 207 assigned to board 2;
the pixels located in the rightmost 90 columns of the frame segment 207 assigned to board 2 also make up the leftmost 90 columns of the frame segment 207 assigned to board 3;
the pixels located in the rightmost 90 columns of the frame segment 207 assigned to board 4 also make up the leftmost 90 columns of the frame segment 207 assigned to board 5;
the pixels located in the rightmost 90 columns of the frame segment 207 assigned to board 5 also make up the leftmost 90 columns of the frame segment 207 assigned to board 6;
the pixels located in the bottommost 120 rows of the frame segment 207 assigned to board 1 also make up the topmost 120 rows of the frame segment 207 assigned to board 4;
the pixels located in the bottommost 120 rows of the frame segment 207 assigned to board 2 also make up the topmost 120 rows of the frame segment 207 assigned to board 5; and
the pixels located in the bottommost 120 rows of the frame segment 207 assigned to board 3 also make up the topmost 120 rows of the frame segment 207 assigned to board 6.
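The overlap figures listed above follow from simple arithmetic, which the following Python sketch reproduces. It is only an illustration: the function names, the zero-based pixel coordinates, the row-by-row board numbering, and the assumption that the frame segments are evenly spaced are choices made here, not details taken from the described system.

```python
# Illustrative sketch: evenly spaced, overlapping frame segments (an assumption).
FRAME_W, FRAME_H = 720, 480     # image frame 200, in pixels
SEG_W, SEG_H = 300, 300         # one frame segment 207
BOARD_COLS, BOARD_ROWS = 3, 2   # six boards, numbered 1..6 row by row


def origins(frame_len, seg_len, count):
    """Start offsets of `count` evenly spaced segments spanning `frame_len`."""
    if count == 1:
        return [0]
    step = (frame_len - seg_len) // (count - 1)
    return [k * step for k in range(count)]


def overlap(frame_len, seg_len, count):
    """Number of pixels shared by two adjacent segments along one axis."""
    starts = origins(frame_len, seg_len, count)
    return 0 if count == 1 else starts[0] + seg_len - starts[1]


def boards_for_pixel(x, y):
    """Board numbers (1-based) whose frame segment contains pixel (x, y)."""
    cols = [c for c, x0 in enumerate(origins(FRAME_W, SEG_W, BOARD_COLS))
            if x0 <= x < x0 + SEG_W]
    rows = [r for r, y0 in enumerate(origins(FRAME_H, SEG_H, BOARD_ROWS))
            if y0 <= y < y0 + SEG_H]
    return [r * BOARD_COLS + c + 1 for r in rows for c in cols]


print(overlap(FRAME_W, SEG_W, BOARD_COLS))   # horizontal overlap, in columns
print(overlap(FRAME_H, SEG_H, BOARD_ROWS))   # vertical overlap, in rows
print(boards_for_pixel(250, 200))            # a pixel inside a four-way overlap
```

Under these assumptions the horizontal overlap is (3×300−720)/2 = 90 columns and the vertical overlap is 2×300−480 = 120 rows, matching the list above.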
Because there are varying degrees of both horizontal and vertical overlap, pixels may be assigned to one, two or four boards, depending on their location within the image frame 200. For example, some pixels, such as those located in region 201, are assigned to four boards. Pixels located in other border regions, such as region 203 and region 205, are assigned to only two boards. Pixels not located in any overlap region are assigned to just one board. This strategy provides a mechanism for eliminating edge effects, as will be illustrated by the following example. When board 1 processes its frame segment 207, edge effects will be produced for pixels lying in region 205, because the PEs on board 1 will not have access to the pixel values lying to the right of region 205. However, those pixels lying in region 203 do not suffer from this problem because the PEs on board 1 do have access to the pixel values lying to the right in region 205.
Similarly, when board 2 processes its frame segment 207, edge effects will be produced for pixels lying in region 203, because the PEs on board 2 will not have access to the pixel values lying to the left of region 203. However, those pixels lying in region 205 do not suffer from this problem because the PEs on board 2 do have access to the pixel values lying to the left in region 203.
After all of the boards have finished their processing, a complete processed image frame without edge effects is reconstructed by using board 1's results for those pixels lying in region 203, and board 2's results for those pixels lying in region 205.
A similar strategy is adopted for processing all other overlapping regions in image frame 200, both horizontal and vertical. The dotted lines in FIG. 2 illustrate from which board the processed results are taken to reconstruct a complete processed image.
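The effect of the overlapping strategy can be illustrated with a small one-dimensional Python sketch. The 3-tap neighbor sum, the treatment of off-array neighbors as zero, and the choice to cut each overlap at its midpoint are all assumptions made for illustration; the cut lines actually used are those shown by the dotted lines in FIG. 2.

```python
# Illustrative 1-D sketch; the filter and the midpoint cut are assumptions.

def box3(pixels):
    """3-tap sum of each pixel and its left/right neighbors.
    Off-array neighbors are read as 0, which is what causes edge effects."""
    n = len(pixels)
    return [(pixels[i - 1] if i > 0 else 0) + pixels[i]
            + (pixels[i + 1] if i + 1 < n else 0) for i in range(n)]


def process_in_segments(row, seg_len, starts):
    """Process each segment independently, then stitch the results."""
    # Cut between neighbors at the midpoint of the span they share
    # (for abutting segments the span is empty and the cut is the seam).
    cuts = [0]
    for a, b in zip(starts, starts[1:]):
        cuts.append((b + a + seg_len) // 2)
    cuts.append(len(row))
    out = [None] * len(row)
    for k, start in enumerate(starts):
        result = box3(row[start:start + seg_len])   # one "board"
        for x in range(cuts[k], cuts[k + 1]):
            out[x] = result[x - start]
    return out


row = [(7 * i) % 251 for i in range(720)]           # arbitrary test data
reference = box3(row)                               # whole row processed at once

abutting = process_in_segments(row, 240, [0, 240, 480])
overlapping = process_in_segments(row, 300, [0, 210, 420])

bad = [x for x in range(720) if abutting[x] != reference[x]]
print(bad)                         # positions that show edge effects
print(overlapping == reference)    # overlapping segments avoid them
```

With abutting 240-pixel segments, the mismatches fall on the pixels immediately on either side of each seam. With 300-pixel segments overlapping by 90, the corrupted edge pixels of each segment lie in the part of the overlap that is taken from the neighboring segment instead.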
This overlapping strategy can similarly be used within a single board, when the frame segment 207 needs to be further divided into subframes that will be sequentially processed by the PE array on that board.
It is possible to design and construct dedicated hardware that will perform the necessary input/output (I/O) to move pixels into and out of PE array boards when the size of the image frame, number of boards, and size of the PE array on a board are fixed. However, to make for a more commercially viable, flexible image processing architecture, capable of processing variable sized image frames and further capable of adapting to system configurations having a variable number of boards, it is desirable to provide techniques and apparatuses that simplify the process of inputting frame segments 207 into a plurality of boards, distributing possibly overlapping subframes to PE arrays on a given board, and reconstructing a processed image frame from the processed frame segments generated by the boards.
In accordance with one aspect of the present invention, the foregoing and other objects are achieved in methods and apparatuses for selectively distributing a plurality of data items to a plurality of hardware destinations that share a common bus. This involves, for each one of the data items, utilizing a distribution technique that includes determining which of the hardware destinations the data item should be distributed to, wherein at least one of the data items should be distributed to two or more hardware destinations. The data item is then supplied to the common bus; and for each of the hardware destinations to which the data item should be distributed, a corresponding hardware destination signal is generated that causes the data item to be received in the hardware destination from the common bus, wherein for each data item, the corresponding hardware destination signals are generated substantially simultaneously. In this manner, each data item to be distributed need be placed on the common bus only once, even if it is to be distributed to more than one hardware destination.
In another aspect of the invention, each of the hardware destinations may be a processor board in a multiprocessor system.
In yet another aspect of the invention, the hardware destination signal may be generated from one or more control words that are retrieved from respective one or more control memories.
In still another aspect of the invention, each bit in the one or more control words may uniquely correspond to one of the processor boards.
In yet another aspect of the invention, the hardware destination signal may be generated by logically ANDing two or more control words. For example, one control word may be associated with rows of processor boards, and another control word may be associated with columns of processor boards. If the same bit position in both the row and column control words has an asserted bit (e.g., a binary “1”), then that processor board will be one of the hardware destinations for the data item.
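The row/column ANDing described above can be sketched in Python as follows. This is a hedged illustration only: indexing one control memory by pixel row and the other by pixel column, the bit assignment (board n uses bit n−1), the 3×2 board grid, and every function name are assumptions introduced here.

```python
# Illustrative sketch; memory layout and bit assignment are assumptions.
BOARD_COLS, BOARD_ROWS = 3, 2          # boards 1..6; board n uses bit n-1


def bit(board_row, board_col):
    return 1 << (board_row * BOARD_COLS + board_col)


def control_memory(frame_len, seg_len, count, word_for):
    """One control word per pixel index along one axis."""
    step = (frame_len - seg_len) // (count - 1) if count > 1 else 0
    words = [0] * frame_len
    for k in range(count):
        for p in range(k * step, k * step + seg_len):
            words[p] |= word_for(k)
    return words


def row_word(r):      # asserts the bit of every board in board row r
    return sum(bit(r, c) for c in range(BOARD_COLS))


def col_word(c):      # asserts the bit of every board in board column c
    return sum(bit(r, c) for r in range(BOARD_ROWS))


def distribute(frame, row_mem, col_mem):
    """Place each pixel on the bus once; every strobed board latches it."""
    received = [[] for _ in range(BOARD_ROWS * BOARD_COLS)]
    bus_writes = 0
    for y, line in enumerate(frame):
        for x, pixel in enumerate(line):
            strobes = row_mem[y] & col_mem[x]   # logical AND of the two words
            bus_writes += 1                     # one bus transfer per pixel
            for b in range(len(received)):
                if strobes >> b & 1:
                    received[b].append(pixel)
    return received, bus_writes


row_mem = control_memory(480, 300, BOARD_ROWS, row_word)
col_mem = control_memory(720, 300, BOARD_COLS, col_word)
print(format(row_mem[0] & col_mem[250], "06b"))     # boards 1 and 2
print(format(row_mem[200] & col_mem[250], "06b"))   # boards 1, 2, 4 and 5
```

In this sketch the bus is written once per pixel regardless of how many boards latch the value, which is the property the distribution technique is meant to provide.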
In alternative embodiments of the invention, each of the hardware destinations may be one of a plurality of input memory devices that are commonly installed on a processor board.
In these embodiments as well, the hardware destination signal may be generated from a control word that is retrieved from a control memory. Furthermore, each bit in the control word may uniquely correspond to one of the input memory devices.
In another aspect of the invention, each of the input memory devices may be associated with a corresponding one of a plurality of channels on the processor board, and each of the channels may be associated with a corresponding one of a plurality of processing element arrays.
In still another aspect of the invention, the plurality of data items may form a frame segment that is partitioned into a plurality of overlapping subframes; each of the data items that should be distributed to two or more hardware destinations may be associated with an overlap region formed by at least two of the overlapping subframes; each of the input memory devices may be associated with a corresponding one of a plurality of channels on the processor board; and each of the channels may be associated with a corresponding one of a plurality of addressable storage devices. Furthermore, for each of the channels, data items are loaded into the corresponding addressable storage device from the corresponding input memory device.
In yet another aspect of the invention, the step of, for each of the channels, loading data items into the corresponding addressable storage device from the corresponding input memory device, may be performed such that, for each of the channels, each data item that is associated with an overlap region associated with vertically overlapping subframes is stored at only one location within the corresponding one of the plurality of addressable storage devices.
In still another aspect of the invention, each of the channels is associated with a corresponding one of a plurality of processing element arrays. Furthermore, for each of the channels, data items are loaded into the corresponding one of the processing element arrays from the corresponding addressable storage device. In each of the processing element arrays, a processed subframe is then formed, and the processed subframe is aligned so that at least one edge row of processing elements in the processing element array includes a selected row of processed data items, wherein the selected row of processed data items includes at least one processed data item that will be supplied as an output data item from the processor board.
In yet another aspect of the invention, in each of the processing element arrays, a processed subframe may be formed in which each processed data item is marked to indicate whether it is to be retained or discarded.
In still another aspect of the invention, for each of the channels, the processed subframe may be loaded from the corresponding processing element array into the corresponding addressable storage device.
In yet another aspect of the invention, each of the channels may be associated with a corresponding one of a plurality of output storage devices. Furthermore, for each of the channels, a data item is conditionally loaded from the corresponding addressable storage device into the corresponding output storage device only if the data item is marked for retention.
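As a rough illustration of marking and conditional loading, the Python sketch below pairs each processed data item with a keep flag and copies only the flagged items to an output device. The (value, keep) representation, the margin parameters, and the use of a deque as the output storage device are assumptions made for this example.

```python
# Illustrative sketch; the (value, keep) pairing and margins are assumptions.
from collections import deque


def mark_subframe(values, top, bottom, left, right):
    """Pair each processed value with a keep flag; margins are discarded."""
    h, w = len(values), len(values[0])
    return [[(values[r][c], top <= r < h - bottom and left <= c < w - right)
             for c in range(w)] for r in range(h)]


def drain_to_output(storage, output):
    """Copy only the items marked for retention into the output device."""
    for line in storage:
        for value, keep in line:
            if keep:
                output.append(value)


processed = [[10 * r + c for c in range(4)] for r in range(4)]
storage = mark_subframe(processed, top=0, bottom=1, left=0, right=1)
fifo = deque()
drain_to_output(storage, fifo)
print(list(fifo))   # the 3x3 retained block, in raster order
```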
In still another aspect of the invention, the plurality of data items forms an image frame that is partitioned into a plurality of overlapping frame segments; and each of the data items that should be distributed to two or more hardware destinations is associated with an overlap region formed by at least two of the overlapping frame segments.
In yet another aspect of the invention, the plurality of data items may form a frame segment that is partitioned into a plurality of overlapping subframes; and each of the data items that should be distributed to two or more hardware destinations is associated with an overlap region formed by at least two of the overlapping subframes.
The invention further involves methods and apparatuses for forming a sequence of data items by selectively collecting a plurality of data items from a plurality of processor boards in a multiprocessor system, wherein the processor boards share a common bus. This is done by, for each one of the data items in the sequence to be formed, performing a collection procedure that includes retrieving a board selection word from each of one or more control memories; generating a processor board selection signal from the retrieved one or more board selection words; using the processor board selection signal to selectively cause one of the processor boards to supply the data item to the common bus; and collecting the data item from the common bus, whereby the plurality of data items are collected from the plurality of processor boards in an order that is determined by an order in which the board selection words are retrieved from the one or more control memories.
In another aspect of the invention, each bit in the one or more board selection words uniquely corresponds to one of the processor boards.
In still another aspect of the invention, the step of retrieving the board selection word from each of one or more control memories includes retrieving a board selection word from each of two or more control memories; and the step of generating the processor board selection signal comprises generating the processor board selection signal by logically ANDing the retrieved two or more board selection words.
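The collection procedure can be sketched in Python as shown below. The sketch assumes that each board presents its output items through a first-in first-out queue, that the ANDed selection word must have exactly one asserted bit, and that four boards are arranged in a 2×2 grid; none of these details is stated in the description above.

```python
# Illustrative sketch; one-hot selection and FIFO outputs are assumptions.
from collections import deque
from functools import reduce
from operator import and_


def collect(board_outputs, control_memories):
    """Build the output sequence in the order the selection words dictate."""
    sequence = []
    for words in zip(*control_memories):       # one word per control memory
        select = reduce(and_, words)           # logical AND of the words
        if select == 0 or select & (select - 1):
            raise ValueError("exactly one board must be selected")
        board = select.bit_length() - 1        # index of the asserted bit
        sequence.append(board_outputs[board].popleft())
    return sequence


# Four boards in a 2x2 grid; the board at (row r, column c) uses bit 2*r + c.
outputs = [deque(["a0", "a1"]), deque(["b0", "b1"]),
           deque(["c0", "c1"]), deque(["d0", "d1"])]
row_mem = [0b0011] * 4 + [0b1100] * 4            # top boards, then bottom
col_mem = [0b0101, 0b0101, 0b1010, 0b1010] * 2   # left boards, then right
print(collect(outputs, [row_mem, col_mem]))
```

Changing only the contents of the control memories changes the order in which boards are polled, so the same loop can assemble output images of different sizes.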
In yet another aspect of the invention, each of the processor boards comprises a processor array.
In other aspects of the invention, methods and apparatuses are provided that process a subframe that comprises a plurality of data items. In accordance with one aspect, this is performed by loading the subframe into a processing element array that comprises a plurality of processing elements arranged in a rectangular array having four processing element array edges, each defined by a respective one of first and second processing element edge rows and first and second processing element edge columns. In the processing element array, a processed subframe is formed that comprises at least one non-retained edge portion and a remaining portion, wherein the non-retained edge portion alternatively comprises one or more contiguous rows, or one or more contiguous columns of processed data items that will not be retained. Then, in the processing element array, the processed subframe is aligned such that at least one of the processing element array edges stores an edge row or column of the remaining portion of the processed subframe.
In yet another aspect, the step of, in the processing element array, aligning the processed subframe includes shifting the processed subframe within the processing element array until a first processing element array edge stores the edge row or column of the remaining portion of the processed subframe. As a result, a first rectangular group of the processing elements is formed that has an edge that is opposite the first processing element array edge, and that stores data items that will not be retained, wherein the data items stored in the first rectangular group of the processing elements constitute a first rectangular group of non-retained data items.
In still another aspect, the shifted processed subframe is then moved from the processing element array to an addressable memory device, wherein the edge row or column of the remaining portion of the processed subframe overwrites a second rectangular group of non-retained data items that was previously moved from the processing element array to the addressable memory device. This is useful for assembling a larger processed image in the addressable memory device.
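The alignment and overwrite steps can be pictured with the following Python sketch, which handles rows only. Representing each PE array row as a single list element, filling vacated rows with `None`, and advancing the memory address by the number of retained rows are assumptions made for illustration.

```python
# Illustrative sketch; row-only shifting and the fill value are assumptions.

def align_to_top(array_rows, top_discard):
    """Shift the processed subframe up so the first retained row sits in the
    top edge row of the PE array; vacated bottom rows hold don't-care data."""
    return array_rows[top_discard:] + [None] * top_discard


def assemble(subframes, memory_rows):
    """subframes: list of (rows, top_discard, bottom_discard), top to bottom."""
    memory = [None] * memory_rows
    base = 0
    for rows, top, bottom in subframes:
        aligned = align_to_top(rows, top)
        # Retained rows land at `base`, overwriting the non-retained rows
        # left behind by the previously stored subframe.
        memory[base:base + len(aligned)] = aligned
        base += len(rows) - top - bottom       # advance by retained rows only
    return memory[:base]


# "A5" means image row 5 as processed in subframe A, and so on.
sub_a = ["A0", "A1", "A2", "A3", "A4", "A5"]   # bottom row has edge effects
sub_b = ["B4", "B5", "B6", "B7", "B8", "B9"]   # top row has edge effects
print(assemble([(sub_a, 0, 1), (sub_b, 1, 0)], 16))
```

Here row A5, which was stored although it is not retained, is overwritten by B5 when the second subframe is stored, so the memory ends up holding one contiguous, edge-effect-free block of rows.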
In other aspects, subframe processing is performed by loading the subframe into a processing element array, and forming a processed subframe in which each processed data item is marked to indicate whether the processed data item is to be retained or discarded.
In yet another aspect of the invention, one of the processed data items is then conditionally loaded into an output storage device only if the processed data item is marked for retention.
In still another aspect of the invention, the processed subframe is first loaded from the processing element array into an addressable memory. In these embodiments, one of the processed data items may be conditionally loaded from the addressable memory into the output storage device only if the processed data item is marked for retention.