1) Field of the Invention
This invention relates to the field of video signal processing, and, in particular, to video signal processing using an architecture having a plurality of parallel execution units.
2) Background Art
It is well known in the prior art to use multiple-instruction multiple-data systems for video signal processing. In a multiple-instruction multiple-data execution of an algorithm, each processor of the video signal processor may be assigned a different block of image data to transform. Because each processor of a multiple-instruction multiple-data system executes its own instruction stream, it is often difficult to determine when individual processors have completed their assigned tasks. Therefore, a software synchronization barrier may be used to prevent any processor from proceeding until all processors in the system reach the same point. However, it is sometimes difficult to determine where synchronization barriers are required. If a necessary barrier is omitted by a user, the resulting code may be nondeterministic, and re-execution of the code on the same data may yield different results.
An alternate architecture known in the prior art is single-instruction multiple-data architecture. Single-instruction, multiple-data is a restricted style of parallel processing lying somewhere between traditional sequential execution and multiple-instruction multiple-data architecture having interconnected collections of independent processors. In the single-instruction, multiple-data model each of the processing elements, or datapaths, of an array of processing elements or datapaths executes the same instruction in lock-step synchronism. Parallelism is obtained by having each datapath perform the same operation on a different set of data. In contrast to the multiple-instruction, multiple-data architecture, only one program must be developed and executed.
Referring now to FIG. 1, there is shown prior art single-instruction multiple-data architecture 100. A conventional single-instruction multiple-data system, such as architecture 100, comprises a controller 112, a global memory 126 and execution datapaths 118a-n. A respective local memory 120a-n may be provided within each execution datapath 118a-n. Single-instruction multiple-data architecture 100 performs as a family of video signal processors 118a-n united by a single programming model.
Single-instruction multiple-data architecture 100 may be scaled to an arbitrary number n of execution datapaths 118a-n provided that all execution datapaths 118a-n synchronously execute the same instructions in parallel. In the optimum case, the throughput of single-instruction multiple-data architecture 100 may theoretically be n times the throughput of a uniprocessor when the n execution datapaths 118a-n operate synchronously with each other. Thus, in the optimum case, the execution time of an application may be reduced in direct proportion to the number n of execution datapaths 118a-n provided within single-instruction multiple-data architecture 100. However, because of overhead in the use of execution datapaths 118a-n, this optimum is never reached.
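By way of illustration only, the relationship between the optimum n-fold throughput and the overhead described above may be sketched as follows. The model is an assumption made for this example and is not taken from architecture 100: the ideal parallel time is the uniprocessor time divided by n, and a single fixed overhead term is added to it. The names simd_speedup, t_uni and overhead are hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical timing model (an assumption for illustration):
 *   t_simd  = t_uni / n + overhead
 *   speedup = t_uni / t_simd
 * With overhead == 0 the speedup equals n, the optimum case.
 */
double simd_speedup(double t_uni, int n, double overhead)
{
    assert(t_uni > 0.0 && n > 0 && overhead >= 0.0);

    double t_simd = t_uni / (double)n + overhead;
    return t_uni / t_simd;
}
```

Under this model, four datapaths with no overhead give a speedup of 4, while an overhead equal to the ideal parallel time halves the speedup to 2.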
Architecture such as single-instruction multiple-data architecture 100 works best when executing an algorithm which repeats the same sequence of operations on several independent sets of highly parallel data. For example, for a typical image transform in the field of video image processing, there are no data dependencies among the various block transforms. Each block transform may be computed independently of the others.
Thus the same sequence of instructions from instruction memory 124 may be executed in each execution datapath 118a-n. These same instructions are applied to all execution datapaths 118a-n by way of instruction broadcast line 116 and execution may be independent of the data processed in each execution datapath 118a-n. However, this is true only when there are no data-dependent branches in the sequence of instructions. When data-dependent branches occur, the data tested by the branch will, in general, have different values in each datapath. It will therefore be necessary for some datapaths 118a-n to execute the subsequent instruction and other datapaths 118a-n to not execute the subsequent instruction. For example, the program fragment of Table I clips a value v between a lower limit and an upper limit:
TABLE I
______________________________________
local v;
. . .
v = expression
if (v > UPPER_LIMIT) v = UPPER_LIMIT;
if (v < LOWER_LIMIT) v = LOWER_LIMIT;
______________________________________
The value being clipped, v, is local to each execution datapath 118a-n. Thus, in general, each execution datapath 118a-n of single-instruction multiple-data architecture 100 executing the program fragment of Table I may have a different value for v. In some execution datapaths 118a-n the value of v may exceed the upper limit, and in others v may be below the lower limit. Other execution datapaths 118a-n may have values that are within range. However, the execution model of single-instruction multiple-data architecture 100 requires that a single identical instruction sequence be executed in all execution datapaths 118a-n.
Thus some execution datapaths 118a-n may be required to idle while other execution datapaths 118a-n perform the conditional sequence of Table I. Furthermore, even if no execution datapaths 118a-n of single-instruction multiple-data architecture 100 are required to execute the conditional sequence of the program fragment of Table I, all execution datapaths 118a-n would be required to idle during the time of the conditional sequence. This results in further inefficiency in the use of execution datapaths 118a-n within architecture 100.
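The idling described above may be illustrated by a minimal sequential simulation of lock-step execution of the clipping fragment of Table I. This is a sketch under stated assumptions, not the mechanism of architecture 100: each conditional instruction is modelled as one pass over all datapaths, a datapath whose condition is false is counted as idle for that instruction slot, and the limit values 255 and 0 and the name simd_clip are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define UPPER_LIMIT 255 /* assumed value, for illustration only */
#define LOWER_LIMIT 0   /* assumed value, for illustration only */

/*
 * Clips v[0..n-1], one element per simulated datapath.
 * Every datapath steps through both conditional instructions;
 * a datapath whose test fails does nothing in that slot.
 * Returns the total number of idle datapath slots.
 */
size_t simd_clip(int *v, size_t n)
{
    size_t idle_slots = 0;
    assert(v != NULL || n == 0);

    /* Instruction slot 1: if (v > UPPER_LIMIT) v = UPPER_LIMIT; */
    for (size_t dp = 0; dp < n; dp++) {
        if (v[dp] > UPPER_LIMIT)
            v[dp] = UPPER_LIMIT;
        else
            idle_slots++; /* datapath disabled for this slot */
    }

    /* Instruction slot 2: if (v < LOWER_LIMIT) v = LOWER_LIMIT; */
    for (size_t dp = 0; dp < n; dp++) {
        if (v[dp] < LOWER_LIMIT)
            v[dp] = LOWER_LIMIT;
        else
            idle_slots++; /* datapath disabled for this slot */
    }

    return idle_slots;
}
```

When every value is already within range, the function reports 2n idle slots, which corresponds to the case in which all datapaths idle for the whole conditional sequence.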
Another problem with systems such as prior art single-instruction multiple-data architecture 100 is in the area of input/output processing. Even in conventional uniprocessor architecture a single block read instruction may take a long period of time to process because memory blocks may comprise a large amount of data in video image processing applications. However, this problem is compounded when there is a block transfer for each enabled execution datapath 118a-n of architecture 100 and datapaths 118a-n must compete for access to global memory 126. For example, arbitration overhead may be very time consuming.
The alternative of providing each execution datapath 118a-n with independent access to external memory 126 is impractical for semiconductor implementation. Furthermore, this alternative restricts the programming model so that data is not shared between datapaths 118a-n. Thus further inefficiency results due to the suspension of processing of instructions until all the block reads are completed. This may be seen in the discrete cosine transform image kernel of Table II:
TABLE II
______________________________________
for (i = 0; i < NUMBEROFBLOCKS; i = i + 4) {
    k = i + THIS_DP_NUMBER;
    read_block(original_image[k], temp_block);
    DCT_block(temp_block);
    write_block(xform_image[k], temp_block);
};
______________________________________
The read_block and write_block routines of the instruction sequence of Table II must be suspensive: each must be completed before the next operation in the kernel is performed. For example, read_block fills temp_block in local memory 120a-n with all of its local values. These local values are then used by DCT_block to perform a discrete cosine transform upon the data in temp_block. Execution of the discrete cosine transform must wait for all of the reads of the read_block command of all execution datapaths 118a-n to be completed. Only then can DCT_block and write_block occur. Thus, by the ordering rules above, read_block must be completed before DCT_block is executed or write_block is processed.
Referring now to FIG. 2, there is shown processing/memory time line 200. The requirements imposed by the ordering rules within single-instruction multiple-data architecture 100 result in the sequentialization of memory transactions and processing, as schematically illustrated by processing/memory time line 200. In time line 200, memory read_block time segment 202 of an execution datapath 118a-n must be completed before processing of DCT_block time segment 204 may begin. Processing of DCT_block time segment 204 must be completed before memory write_block time segment 206 may begin. Only when memory write_block time segment 206 of an execution datapath 118a-n is complete can memory read_block time segment 208 of an execution datapath 118a-n begin. Execution and access by a second execution datapath 118a-n are sequentialized as described for the first.
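The cost of this sequentialization may be estimated with a minimal sketch in which the read, transform and write segments of each iteration are simply added end to end. The structure block_cost, the function names and the use of abstract time units are assumptions made for illustration; no timing figures are given in the description above.

```c
#include <assert.h>

/* Assumed per-iteration segment durations, in arbitrary time units. */
typedef struct {
    unsigned read_time;  /* block read segment (cf. segment 202)  */
    unsigned dct_time;   /* transform segment (cf. segment 204)   */
    unsigned write_time; /* block write segment (cf. segment 206) */
} block_cost;

/* Total elapsed time when no segment may overlap any other. */
unsigned long serialized_time(block_cost c, unsigned iterations)
{
    unsigned long t = 0;

    for (unsigned i = 0; i < iterations; i++) {
        t += c.read_time;  /* datapaths wait on memory */
        t += c.dct_time;   /* memory waits on datapaths */
        t += c.write_time; /* datapaths wait on memory */
    }

    /* The loop is equivalent to the closed form below. */
    assert(t == (unsigned long)iterations *
                    (c.read_time + c.dct_time + c.write_time));
    return t;
}

/* Portion of that time during which the datapaths do no processing. */
unsigned long datapath_idle_time(block_cost c, unsigned iterations)
{
    return (unsigned long)iterations * (c.read_time + c.write_time);
}
```

For example, with segment durations of 3, 5 and 2 units and four iterations, the total is 40 units, of which the datapaths spend 20 units waiting on memory.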
This problem occurs in high performance disk input/output as well. In a typical disk input/output operation an application may request a transfer from disk while continuing to process. When the data from disk are actually needed, the application may synchronize on the completion of the transfer. Often, such an application is designed to be a multibuffered program. In this type of multibuffered program, data from one buffer are processed while the other buffer is being filled or emptied by a concurrent disk transfer. In a well designed system the input/output time is completely hidden. If it is not, the execution core of single-instruction multiple-data architecture 100 is wait-stated until the data become available. This further degrades the performance of single-instruction multiple-data architecture 100.
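The multibuffered program structure described above may be sketched as follows. This is a minimal illustration under stated assumptions: the helper start_fill is a hypothetical stand-in for an asynchronous disk transfer and, in this sketch, copies the data synchronously; the buffer length, the summing operation and all names are likewise assumptions. In a real system the fill of the next buffer would proceed concurrently with processing, and the program would synchronize on completion of the transfer before using a buffer.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_LEN 4 /* assumed block size, for illustration only */

/* Hypothetical stand-in for starting a disk transfer of one block.
 * Here it completes immediately; a real transfer would be asynchronous. */
static void start_fill(int *buf, const int *disk, size_t block)
{
    for (size_t i = 0; i < BUF_LEN; i++)
        buf[i] = disk[block * BUF_LEN + i];
}

/* Example processing step: sum the contents of one buffer. */
static long process(const int *buf)
{
    long sum = 0;
    for (size_t i = 0; i < BUF_LEN; i++)
        sum += buf[i];
    return sum;
}

/* Processes nblocks blocks using two alternating buffers. */
long multibuffered_sum(const int *disk, size_t nblocks)
{
    int buf[2][BUF_LEN];
    long total = 0;

    assert(disk != NULL || nblocks == 0);
    if (nblocks == 0)
        return 0;

    start_fill(buf[0], disk, 0); /* prime the first buffer */

    for (size_t b = 0; b < nblocks; b++) {
        size_t cur = b % 2;
        size_t next = (b + 1) % 2;

        /* Request the next block before processing the current one,
         * so that the transfer could overlap the processing. */
        if (b + 1 < nblocks)
            start_fill(buf[next], disk, b + 1);

        /* A real system would wait here for the transfer into
         * buf[cur] to complete before reading it. */
        total += process(buf[cur]);
    }

    return total;
}
```

The point of the structure is the ordering inside the loop: the request for block b + 1 is issued before block b is processed, which is what allows the input/output time to be hidden when the transfer is truly concurrent.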