Two types of video processors which may access both system and local memory are well known in the prior art. The first is the multiple-instruction multiple-data (MIMD) system. In a multiple-instruction, multiple-data execution of an algorithm, each processor of the video signal processor may be assigned a different block of image data to transform. It is also known in the prior art to provide a single-instruction multiple-data (SIMD) architecture. Single-instruction, multiple-data is a restricted style of parallel processing lying somewhere between traditional sequential execution and the multiple-instruction, multiple-data architecture with its interconnected collections of independent processors. In the single-instruction, multiple-data model, each processing element, or datapath, of an array of such elements executes the same instruction in lock-step synchronism. Parallelism is obtained by having each datapath perform the same operation on a different set of data. In contrast to the multiple-instruction, multiple-data architecture, only one program need be developed and executed.
A conventional single-instruction multiple-data system may include a controller, a global memory and execution datapaths, although data transfers between the datapaths and system memory may be quite complex. A respective execution unit memory may be provided within each execution datapath. The single-instruction multiple-data architecture may thus serve as the basis for a family of video signal processors united by a single programming model.
Single-instruction multiple-data architecture may be scaled to an arbitrary number n of execution datapaths provided that all execution datapaths synchronously execute the same instructions in parallel. In the optimum case, the throughput of single-instruction multiple-data architecture may theoretically be n times the throughput of a single processor when the n execution datapaths operate synchronously with each other. Thus, in the optimum case, the execution time of an application may be reduced in direct proportion to the number n of execution datapaths provided within single-instruction multiple-data architecture. However, because of overhead in the use of execution datapaths, this optimum is never reached.
Single-instruction multiple-data architecture works best when executing an algorithm which repeats the same sequence of operations on several independent sets of highly parallel data. For example, for a typical image transform in the field of video image processing, there are no data dependencies among the various block transforms. Each block transform may be computed independently of the others.
Thus, the same sequence of instructions from instruction memory may be executed in each execution datapath. These same instructions may be applied to all execution datapaths by way of an instruction broadcast line and execution may be independent of the data processed in each execution datapath.
A problem with systems such as the prior art single-instruction multiple-data architecture is in the area of input/output processing. Even in a conventional single-processor architecture, a single block read instruction may take a long time to complete because, in video image processing applications, memory blocks may comprise a large amount of data. This problem is compounded when there is a block transfer for each enabled execution datapath of the architecture and the datapaths must compete for access to global memory. For example, arbitration overhead may be very time consuming. This is further complicated when there is communication between the execution datapaths and a number of devices in system memory space.
The alternative of providing each execution datapath with independent access to external memory is impractical for semiconductor implementation. Furthermore, this alternative restricts the programming model so that data is not shared between datapaths. Thus, further inefficiency results due to the suspension of processing of instructions until all the block reads are completed. This may be seen in the discrete cosine transform image kernel of Table I:
TABLE I
______________________________________
for (i = 0; i < NUMBEROFBLOCKS; i = i + 4) {
    k = i + THIS_DP_NUMBER;
    read_block(original_image[k], temp_block);
    DCT_block(temp_block);
    write_block(xform_image[k], temp_block);
};
______________________________________
The read_block and write_block routines of the instruction sequence of Table I must be suspensive; i.e., each routine must be completed before the next operation in the kernel is performed. For example, read_block fills temp_block in execution unit memory with all of its local values. These local values are then used by DCT_block to perform a discrete cosine transform upon the data in temp_block. Execution of the discrete cosine transform must wait for all of the reads of the read_block command of all execution datapaths to be completed. Only then can the DCT_block and write_block occur. Thus, by the ordering rules above, read_block must be completed before the write_block is processed or the DCT_block is executed.
The requirements imposed by the ordering rules within the single-instruction multiple-data architecture result in the sequentialization of memory transactions and processing. For example, a first memory read_block time segment of an execution datapath must be completed before processing of the DCT_block time segment may begin. Processing of the DCT_block time segment must be completed before the memory write_block time segment may begin. Only when the memory write_block time segment is complete can a second memory read_block time segment begin. Thus, execution and access by a second execution datapath is then sequentialized as described above for the first execution datapath.
Similar requirements occur in high performance disk input/output as well. In a typical disk input/output operation, an application may require a transfer from disk while continuing to process. When the data from disk are actually needed, the application may synchronize on the completion of the transfer. Often, such an application is designed as a multibuffered program. In a multibuffered program, data from one buffer is processed while the other buffer is being filled or emptied by a concurrent disk transfer. In a well designed system, the input/output time is completely hidden. If it is not, the execution core of the single-instruction multiple-data architecture is wait-stated until the data becomes available, further degrading the performance of the single-instruction multiple-data architecture.
A system addressing some of these problems is taught in "Architecture for Video Signal Processing", U.S. patent application Ser. No. 07/782,332, filed Oct. 24, 1991 by Sprague et al., now U.S. Pat. No. 5,361,370. In the system of Sprague et al., a single-instruction, multiple-data image processing system is provided for more efficiently using parallel datapaths when executing an instruction sequence having conditionals, and for greatly improving external memory access. Each datapath of the Sprague et al. image processing system has an execution unit and a local memory. Access between the execution unit and the local memory is by way of one port of a dual-ported local memory.
In this system, all transfers between the local memory and the system memory take place using the second port of the dual-ported local memory. The transfers between system and local memories are scheduled and controlled by a common unit called the block transfer controller. The block transfer controller, along with the dedicated port of the dual-ported local memory, permits each access to global memory by a datapath to be overlapped with its instruction processing. This is useful in preventing stalling of the processor. Thus the system of Sprague et al. solved several problems associated with the single-instruction, multiple-data architecture. However, it did not solve all of the problems related to transfer of data between the processor and both local and system memory, along with associated problems relating to interfaces and interrupts.