A number of systems have been developed which employ a large array of simple bit-serial processors, each receiving the same instruction at any given time from a central controller. These types of systems are called "Single Instruction Multiple Data" (SIMD) parallel processors. There are various methods for communicating data from one processor to another. For example, the massively parallel processor described in K. E. Batcher, "Design of a Massively Parallel Processor," IEEE Transactions on Computers, September, 1980, pp. 836-840, contains an array of 128 × 128 processors where image processing is an important application. Data is communicated between neighboring processing elements when an instruction that requires a neighborhood operation is performed. Image data arrays with dimensions larger than 1024 × 1024 are not uncommon. Since processor arrays this large are not economically feasible, the data array must be broken into smaller subarrays with dimensions equal to those of the processor array. There are other types of SIMD processors, but they also generally experience the problem of data arrays larger than the processor array. Generally, for all these systems, the total memory associated with the processors is not large enough to hold the entire image along with extra memory capacity for intermediate computational results.
Thus, a large external memory is necessary, and mechanisms must be able to handle the input and output of small subarray segments at high speed to preserve computing efficiency. Even if enough memory were supplied to each processor, so that the total memory associated with the ensemble of processors could contain the entire large array of image data, there would still remain the problem of communicating data between the various subarrays when neighborhood operations are performed. During an instruction clock cycle, every processor receives the output of its associated memory, so that processors on the edge of the array cannot receive data from neighboring subarrays because all memories are already engaged in reading an entire subarray. Thus, multiple clock cycles would be needed to read the data when both a subarray and its neighboring subarrays are needed in the computation. Generally, SIMD processors are less efficient in handling global processes where large areas of the data matrix must be analyzed, such as histograms, feature extraction, and spatial transforms such as Hough transforms and Fourier analysis.
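The subarray problem described above can be made concrete with a small sketch. It assumes a 1024 × 1024 image, a 128 × 128 processor array, and a 3 × 3 neighborhood operation; the function names `tile_origins` and `needs_neighbor_tile` are hypothetical and are used only for illustration, not taken from any cited system.

```python
def tile_origins(image_size, array_size):
    """Top-left (row, col) of every subarray when a square image is cut
    into tiles the size of the processor array (sizes are assumptions)."""
    return [(r, c)
            for r in range(0, image_size, array_size)
            for c in range(0, image_size, array_size)]


def needs_neighbor_tile(row, col, array_size):
    """True if a 3x3 neighborhood centered on local position (row, col)
    reaches outside the tile, i.e. the processor sits on the array edge."""
    last = array_size - 1
    return row in (0, last) or col in (0, last)


IMAGE_SIZE = 1024   # assumed image dimension
ARRAY_SIZE = 128    # assumed processor array dimension

origins = tile_origins(IMAGE_SIZE, ARRAY_SIZE)
edge_cells = sum(needs_neighbor_tile(r, c, ARRAY_SIZE)
                 for r in range(ARRAY_SIZE)
                 for c in range(ARRAY_SIZE))

print(len(origins))  # 64 subarrays
print(edge_cells)    # 508 edge processors per subarray
```

Under these assumptions the image is processed as 64 subarrays, and in each of them the 508 processors on the array boundary need at least one value that lies in a neighboring subarray, which is the data the memories cannot supply in the same clock cycle.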
Indirect addressing is an important processing concept, but the difficulties with implementing it in a parallel processing environment have been recognized in the literature. See, for example: A. L. Fisher and P. T. Highnam, "Real Time Image Processing on Scan Line Array Processors," IEEE Workshop on Pattern Analysis and Image Database Management, Nov. 18-20, 1985, pp. 484-489; and P. E. Danielson and T. S. Ericsson, "LIPP-Proposals for the Design of an Image Processor Array," Chapter 11, pp. 157-178, COMPUTING STRUCTURES FOR IMAGE PROCESSING (Ed. M. J. B. Duff, Academic Press, 1983). Large amounts of memory are required for indirect addressing to be useful because the applications that can benefit from it, such as look-up tables or histograms, themselves require a large amount of memory. To usefully access such a large amount of memory, indirect addressing typically requires the use of at least byte-wide address words to address byte-wide data words; however, a separate byte-wide indirect addressing circuit at the site of each bit-serial processor would greatly complicate the parallel processing circuit. One solution to this problem is disclosed in U.S. Pat. No. 5,129,092, wherein eight bit-serial processing elements share the burden of providing the indirect addressing of byte-wide words. In that disclosure, bits of data words are read from memory external to the processing chip and are distributed to groups of eight processing units. These data words are then used as an address to the external memory.
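As a point of reference, the two applications named above can be sketched sequentially. In both, a byte-wide data word is itself used as a memory address, which is what indirect addressing means here. This is only an illustration of the concept, not the circuit of U.S. Pat. No. 5,129,092 or of any SIMD machine; the names `apply_lookup_table` and `histogram` and the 256-level assumption are hypothetical.

```python
def apply_lookup_table(pixels, table):
    """Indirect read: each pixel value is used as the address into the table."""
    return [table[p] for p in pixels]


def histogram(pixels, levels=256):
    """Indirect read-modify-write: each pixel value addresses the counter
    that is incremented (256 levels assumes byte-wide data words)."""
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    return counts


# Example table (an assumption): intensity inversion for byte-wide pixels.
invert = [255 - v for v in range(256)]
pixels = [0, 17, 17, 255]

print(apply_lookup_table(pixels, invert))  # [255, 238, 238, 0]
print(histogram(pixels)[17])               # 2
```

In a SIMD array every processor would have to perform such a data-dependent access with its own address in the same instruction cycle, which is why a shared, centrally generated address is not sufficient.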
Many of the highest performance microprocessors use internal cache memory as a means to effectively speed up memory references. Such a microprocessor is generally able to reference the internal cache memory much faster than it is able to reference external memory; therefore, the use of the internal cache memory allows the use of lower speed, lower cost, bulk memory. The lowest cost bulk memories are characterized by their ability to supply data to the microprocessor cache in bursts from consecutive sequences of addresses. A number of these so-called cache burst memories, such as, for example, video random access memories (VRAM), are now commercially available. SIMD processors, however, often require memory accesses in address patterns that are not in a consecutive sequence. Since the above-mentioned cache burst memories cannot meet this requirement, SIMD processors must forgo their use and rely on higher cost static RAM (SRAM).
State-of-the-art microprocessors generally contain a controller that is able to read instructions from memory, decode those instructions, and operate on data in accordance with the decoded instruction. When instructions and data are stored in the same memory, the microprocessor must read both of them through the same inputs, thereby reducing the effective throughput of both the instructions and data and further aggravating the memory bottleneck. Usage of cache memory internal to the microprocessor allows instructions and data to be stored in different caches and be simultaneously addressed, thereby improving the internal memory bandwidth. In a SIMD machine, the controller that reads and decodes instructions has not used the same external memory for storage of instructions and data because a SIMD system generally must change the instruction every clock cycle. If the instructions resided in the same external memory as that used for storing data, the processing speed would drop by half because of the need to fetch a new instruction before each data fetch. Thus, two independent external memory systems are generally used in SIMD systems: one to store array data and one to store instructions.
Therefore, a primary object of the present invention is to provide a simple method to allow a fixed array of processors to handle a large array of data while performing operations which require neighborhood and global processing of data.
Another object of the invention is to provide an effective method of indirect addressing of memory which operates independently for each SIMD processor in the processor array.
A further object of the invention is to provide a means of handling large arrays of data without resorting to memories and associated input and output mechanisms remote from the processing array.
Another object of this invention is to provide a means to handle contiguous high-speed bursts of data from consecutive addresses so that lower cost cache burst memories can be used.
Another object of this invention is to provide a controller means in a SIMD system that is capable of fetching both array data and instructions from the same external memory without suffering from a large loss in speed.