1. Field of the Invention
The present invention relates to a digital signal processing apparatus which performs computational processes for digital signals.
2. Description of the Prior Art
FIG. 1 shows the multiprocessor system described in the article entitled "A Real Time Video Signal Processor Suitable for Motion Picture Coding Applications", IEEE, GLOBECOM '87, p. 453. In FIG. 1, input data 1 is received by a data transfer controller 3, and thereafter data 4 are transferred selectively to digital signal processors 2, i.e. DSP-1 through DSP-N, in block-1. After being processed by the respective DSPs in block-1, resultant data 5 is transferred to block-2 and processed by the respective DSPs for the next processing step.
FIG. 2(a) shows divided memory areas of the DSPs. For the simplicity of explanation, shown here is an example of parallel processing using three DSPs 2, to which process areas A, B and C are assigned evenly.
In the inter-frame image coding system and the like, it is a general convention to employ the conditional pixel supplementary process in which only portions having at least a certain difference between the present input frame and the previous frame are coded and previous frame data is used for the remaining portions. Accordingly, the volume of computation needed for the process differs depending on the valid pixel rate even though the number of pixels in the process area is constant. The volume of computation or computation time needed is proportional to the valid pixel rate.
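The dependence of workload on the valid pixel rate can be illustrated with a minimal sketch. It assumes a per-pixel comparison against a fixed threshold; the function names, the threshold and the sample values are hypothetical and are not taken from the cited article.

```python
def valid_pixel_mask(current, previous, threshold):
    """Mark pixels whose inter-frame difference is at least `threshold`.

    Assumption: validity is decided pixel by pixel on the absolute difference.
    """
    return [abs(c - p) >= threshold for c, p in zip(current, previous)]


def valid_pixel_rate(current, previous, threshold):
    """Fraction of pixels in the process area that must actually be coded."""
    mask = valid_pixel_mask(current, previous, threshold)
    return sum(mask) / len(mask)


# Hypothetical 4-pixel process area: only the third pixel changed noticeably.
current_frame = [10, 12, 50, 52]
previous_frame = [10, 11, 40, 52]
rate = valid_pixel_rate(current_frame, previous_frame, threshold=5)
print(rate)
```

Because the computation time is proportional to this rate, two areas with the same pixel count can require very different processing times.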
In the inter-frame image coding system or the like, assuming that the number of valid pixels is shared by all DSPs to have a distribution EA, EB and EC as shown in FIG. 2(b), the computation time needed for one block of parallel DSP configuration is determined from the process time of the DSP which processes data in the area B with the largest volume of process M, and the remaining DSPs which have finished processing the areas A and C earlier have idle time.
The conventional digital signal processing apparatus arranged as described above has its overall process time determined from the longest process time among DSPs when the density of information, such as the valid pixel rate, within a frame is uneven and the distribution of information varies with time, resulting in a degraded process efficiency per DSP unit.
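The effect of an uneven distribution on a fixed area assignment can be sketched as follows. The valid-pixel counts and the one-cycle-per-valid-pixel cost are hypothetical values chosen only for illustration.

```python
def block_time_and_idle(valid_pixels, cycles_per_valid_pixel=1.0):
    """Return (block_time, idle_times) for DSPs working on fixed areas.

    The block finishes only when the slowest DSP finishes, so the block
    time is the maximum of the individual process times.
    """
    times = [n * cycles_per_valid_pixel for n in valid_pixels]
    block_time = max(times)
    idle = [block_time - t for t in times]
    return block_time, idle


def efficiency(valid_pixels):
    """Fraction of the available DSP time that is spent on useful work."""
    block_time, _ = block_time_and_idle(valid_pixels)
    if block_time == 0:
        return 1.0
    return sum(valid_pixels) / (len(valid_pixels) * block_time)


# Hypothetical valid-pixel counts EA, EB, EC for areas A, B and C.
counts = [200, 600, 100]
block_time, idle = block_time_and_idle(counts)
print(block_time, idle, efficiency(counts))
```

With these assumed counts, the DSP for area B determines the block time while the DSPs for areas A and C sit idle for most of it.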
FIG. 3 is a diagram showing, as an example, the arrangement of another digital signal processing apparatus disclosed in an article entitled "Realtime Video Signal Processor Module", in the proceedings of ICASSP '87, pp. 1961-1964, April 1987, Dallas, U.S.A. In the figure, indicated by 1 is an input terminal, 4 is an input bus for distributing input data from the input terminal 1, 28a is a feedback bus for distributing the result of previous processing, and 20 are signal processing modules each including an input storage unit 21, a processing unit 22, an output storage unit 23 and a timing control unit 24. Indicated by 25 are wired-OR circuits through which feedback data on output ports 30 are placed on the feedback bus 28a, 26 are wired-OR circuits through which output data on output ports 29 are delivered to an output terminal 5 over the output bus 5a, 27 are input ports for the input data to the signal processing modules 20, and 28 are input ports for the feedback data to the signal processing modules 20.
FIG. 4 is a block diagram showing in more detail one of the signal processing modules in FIG. 3. In the figure, indicated by 221 is an address generator (AGU A), 211 is an input dual memory (MEM A) which receives data on the input port 27 over the input bus 4, 212 is an input dual memory (MEM B) which receives data on the feedback bus 28a by way of the input port 28, 222 is an address generator (AGU B), 223 is an X-bus, 224 is a Y-bus, and 225 is a pipeline arithmetic unit (PAU) having its input terminal EX1 connected to the X-bus 223 and another input terminal EX2 connected to the Y-bus 224. Indicated by 226 is a data memory [MEM P(Q)] having its output connected to the X-bus 223, 227 is an address generator [AGU P(Q)] having its output connected to the Y-bus 224 and data memory 226, 228 is a mode register (MDR) having its output connected to the X-bus 223 and Y-bus 224, and 241 is a Z-bus connected to the inputs of the address generators 221, 222 and 227, pipeline arithmetic unit 225 and data memory 226. Indicated by 242 is a sequencer (SEQ), 243 is an instruction memory (IRAM) connected to the output of the sequencer 242, and 245 is a decoder (DEC) connected to the output of the instruction memory 243, with the output of the decoder 245 being connected to the Z-bus 241 and output bus 231. The output bus 231 is connected to the input of the mode register 228 and the Z-bus 241. Indicated by 232 is a FIFO memory (MEM C) connected to the output bus 231, 233 is a FIFO memory (MEM D) connected to the output bus 231, 29 is an output port of the FIFO memory 232, and 30 is an output port of the FIFO memory 233.
FIG. 5 is a diagram showing, as an example, the system of a typical high-efficiency coder for a moving image. In the figure, indicated by 250 is an input terminal for the input video signal, 251 is an input frame buffer having at least a 1-frame capacity and having the simultaneous read-write ability, 252 is an inter-frame subtracter for evaluating the difference between adjacent frames, 253 is a block identifier, 254 is a coder, 255 is a coding parameter produced by the coder 254, 256 is a variable-length coder, 257 is a video multiplexer, 258 is a transmission buffer memory, and 259 is an output terminal for the coded data. Connected in cascade between the input terminal 250 and output terminal 259 are the above-mentioned functional blocks 251-254 and 256-258. Further indicated by 260 is a local decoder which receives the coding parameter 255, 261 is an inter-frame adder, 262 is an in-loop filter, 263 is a coding frame memory, 264 is previously coded frame data, 265 is a motion compensator, 266 is current frame data fed from the input frame buffer 251 to the motion compensator 265, 267 is motion vector data, 268 is compensated previous frame data fed from the motion compensator 265 to the inter-frame subtracter 252 and inter-frame adder 261, 269 is a feedback signal, and 270 is a coding controller which provides coding control information 271 for the video multiplexer 257, a feed-forward signal 272 to the input frame buffer 251, a block identification control signal 273 to the block identifier 253, and a coding control signal 274 to the variable-length coder 256.
Next, the operation of the conventional digital signal processing apparatus will be described in connection with FIG. 3. This apparatus is intended for moving image processing and is based on the division parallel processing system in which a frame is divided into small frames and a signal processing module 20 is assigned to each of the divided frame areas.
Initially, each signal processing module 20 operates autonomously, taking one video frame time to fetch the divided frame area assigned to it from the input data, which is transferred frame-wise in raster-scan order over the input bus 4, and to store the data in the input storage unit 21. At the same time, if the process result of the previous frame is needed for the current process, the module takes one video frame time to fetch the data of its assigned area from the feedback data arriving at the input port 28 over the feedback bus 28a, and stores the data in the input storage unit 21.
Upon expiration of one video frame time, the processing unit 22 performs the prescribed signal processing on the input data and feedback data stored in the input storage unit 21, and stores the result temporarily in the output storage unit 23. The feedback data output from the output storage unit 23 through the output port 30 is timed for synchronization with the other signal processing modules 20 and, after being merged with all other feedback data by the wired-OR circuit 25, is placed on the feedback bus 28a. Similarly, the output data output from the output storage unit 23 through the output port 29 is timed for synchronization with the other signal processing modules 20 and, after being merged with all other output data by the wired-OR circuit 26, is delivered to the output terminal 5 over the output bus 5a.
Divided frame areas processed individually by the signal processing modules 20 are recombined into a video frame. Therefore, parallel processing of the divided areas type is realized. For the reasons as described above, it is necessary for all signal processing modules 20 to have their process commencement in complete synchronism with one another. On this account, the timing control unit 24 provides all sections of the system with the timing of data input/output and process commencement in synchronism with the video frame timing which is the synchronization reference point.
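The recombination of divided areas through the wired-OR circuits can be pictured with the following sketch. It rests on an assumption not stated in the source: each module drives zero onto the bus outside the time slots of its own area, so that a bitwise OR of all module outputs reproduces the full frame. The function name and the sample data are hypothetical.

```python
def wired_or_merge(module_outputs):
    """Merge per-module output streams onto one bus by bitwise OR.

    Assumption: every stream has the same length and holds zero in the
    time slots that belong to other modules.
    """
    bus = [0] * len(module_outputs[0])
    for out in module_outputs:
        for t, word in enumerate(out):
            bus[t] |= word
    return bus


# Hypothetical 8-slot frame split between two modules.
module_a = [5, 6, 7, 8, 0, 0, 0, 0]  # drives slots 0-3 only
module_b = [0, 0, 0, 0, 1, 2, 3, 4]  # drives slots 4-7 only
frame = wired_or_merge([module_a, module_b])
print(frame)
```

The sketch also shows why complete synchronization matters: if two modules drove the same slot, their words would be ORed together and corrupted.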
Next, the operation of one signal processing module 20 will be described in connection with FIG. 4. From a video frame entered frame-wise through the input port 27 in synchronism with the video frame sync signal, data of the assigned area is stored in the input dual memory 211. At the same time, among the coded previous frame data entered through the input port 28, the portion of the assigned area and its peripheral data are stored in the input dual memory 212.
The input dual memories 211 and 212 are made up of a two-sided memory device in the same structure on both sides and operate such that while one side is writing data, the other side is connected to the X-bus 223 and Y-bus 224 for reading by the pipeline arithmetic unit 225 to conduct a coding process. The read/write sides of the input dual memories 211 and 212 are switched by the above-mentioned video frame sync signal so that input data of assigned areas on the input ports 27 and 28 are entered frame-wise uninterruptedly.
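The two-sided operation of the input dual memories can be sketched as a simple double buffer. The class name, method names and data values below are hypothetical; the sketch models only the side switching on the frame sync signal, not the bus connections.

```python
class DualMemory:
    """Two-sided memory: one side is written while the other is read."""

    def __init__(self, size):
        self.sides = [[0] * size, [0] * size]
        self.write_side = 0  # index of the side currently accepting input

    def write(self, addr, value):
        """Store incoming data on the write side."""
        self.sides[self.write_side][addr] = value

    def read(self, addr):
        """Read previously stored data from the opposite side."""
        return self.sides[1 - self.write_side][addr]

    def frame_sync(self):
        """Swap the read and write sides on the video frame sync signal."""
        self.write_side = 1 - self.write_side


mem = DualMemory(4)
for addr, value in enumerate([10, 11, 12, 13]):  # frame 1 arrives
    mem.write(addr, value)
mem.frame_sync()
for addr, value in enumerate([20, 21, 22, 23]):  # frame 2 arrives ...
    mem.write(addr, value)
frame1 = [mem.read(a) for a in range(4)]         # ... while frame 1 is read
mem.frame_sync()
frame2 = [mem.read(a) for a in range(4)]
print(frame1, frame2)
```

Writing frame 2 does not disturb the reading of frame 1, which is what allows input data to be entered frame-wise without interruption.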
The data read out to the X-bus 223 and Y-bus 224 are those stored at the data memory addresses indicated to the input dual memories 211 and 212 by the address generators 221 and 222. These address generators are controlled by signals that the decoder 245 produces by decoding an 80-bit horizontal-type microcode, which is read out of the instruction memory 243 at the address indicated by the sequencer 242. The data placed on the X-bus 223 and Y-bus 224 are entered in parallel to the pipeline arithmetic unit 225, which implements a series of signal processing steps including coding and local decoding and outputs the result to the Z-bus 241. Among the process outputs placed on the Z-bus, the coded output is stored in the FIFO memory 232 and the local decoded output is stored in the FIFO memory 233 by way of the output bus 231.
The FIFO memories 232 and 233 are buffer memories of FIFO configuration. Feedback data consisting of the output data and local decoded data are read out of the output ports 29 and 30 at the read control timing for the assigned area produced from the video frame signal, and an amount of video frame local decoded data and coded output data in compliance with the scanning order are produced.
The data memory 226 which is controlled by the output of the address generator 227 is used as a work memory which is necessary for the processing by the pipeline arithmetic unit 225 and a table which stores constants. The mode register 228 consists of a register file including registers for loading immediate values from the decoder 245.
This digital signal processing apparatus is principally based on the foregoing area division parallel processing, and is intended such that each signal processing module 20 deals with a divided frame area independently on a realtime basis. When the digital signal processing apparatus is intended for the achievement of a coder as shown in FIG. 5, only portions excluding the variable-length coder 256, video multiplexer 257, transmission buffer 258 and coding controller 270 can be realized. Namely, it is not suitable for a continuous process in one video frame, and is limited to the inter-frame coding loop process ranging from the input frame buffer 251 to the block identifier 253, coder 254, local decoder 260, coding frame memory 263, and to the motion compensator 265 useful for data completely divisible within a frame.
Since each signal processing module 20 implements the same process for each frame, the processing program stored in each instruction memory 243 can be a single program. When a frame is divided into M areas (M is an integer greater than or equal to 1), the number of process cycles Nc per pixel which can be dealt with on a realtime basis by one signal processing module 20 is given by the following calculation.
$$N_c = \frac{M_c \cdot T_f}{M_p \cdot N_p} \quad \text{(clocks/pixel)}$$
where Mc is the frequency of machine cycle (Hz), Tf is the frame period (sec), Mp is the number of horizontal pixels in the assigned area, and Np is the number of vertical pixels in the assigned area.
On this account, if a frame is divided into four areas, for example, each assigned to a signal processing module 20, the number of process cycles Nc is increased fourfold, and it becomes possible for the video signal processing, which is required to be very fast, to be dealt with on a realtime basis by an increased number of relatively slow signal processing modules 20.
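The cycle budget per pixel and its fourfold increase can be checked with a short calculation. The machine cycle frequency, frame rate and frame size used here are hypothetical figures, not values from the cited article.

```python
def cycles_per_pixel(machine_hz, frame_period_s, area_width, area_height):
    """Nc = (Mc * Tf) / (Mp * Np), in clocks per pixel.

    machine_hz     -- Mc, machine cycle frequency in Hz
    frame_period_s -- Tf, frame period in seconds
    area_width     -- Mp, horizontal pixels in the assigned area
    area_height    -- Np, vertical pixels in the assigned area
    """
    return machine_hz * frame_period_s / (area_width * area_height)


# Hypothetical figures: 10 MHz machine cycle, 30 frames/s, 352 x 288 frame.
MC_HZ = 10e6
TF_S = 1.0 / 30.0

whole_frame = cycles_per_pixel(MC_HZ, TF_S, 352, 288)
# Dividing the frame into four equal areas halves both dimensions.
quarter_frame = cycles_per_pixel(MC_HZ, TF_S, 176, 144)

print(whole_frame, quarter_frame, quarter_frame / whole_frame)
```

Under these assumptions a single module handling the whole frame has only a few clocks available per pixel, and each of four modules has four times as many.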
The conventional digital signal processing apparatus arranged as described above has the following problems in processing video signals.
(a) For the achievement of very fast processing, a frame must be divided into numerous small areas; however, certain signal process system configurations do not allow independent processes for areas below a certain minimal division size. Therefore, realtime processing cannot always be achieved by increasing the parallelism.
(b) Because of a fixed distribution of load to signal processing modules, the process time must be set to meet the longest one when each signal processing module has a different process time. Therefore, the system has an unnecessarily increased parallelism relative to the processing capacity.
(c) Data input and data processing each take one frame time, and data input and output each need a 1-frame buffer memory, resulting in a longer time lag and an increased memory capacity. Therefore, the system involves a significant loop delay in feedback control and the like, and it is difficult to realize the coding controller 270 in FIG. 5 for example.
(d) Since the system is intended for a complete parallel processing, it cannot perform such a process as scanning the entirety of a frame horizontally.
FIG. 6 is a block diagram of the conventional digital signal processing system disclosed in the proceedings (No. S10-1) of the 1986 annual convention of the communication department of The Institute of Electronics and Communication Engineers of Japan. In the figure, indicated by 31 is a dual-port internal data memory (hereinafter termed 2P-RAM) capable of reading and writing two sets of data simultaneously, 32 is an address generator which calculates the address of read data or write data, 33 is a data bus used for the internal transfer of data related to computation, 34 and 35 are selectors which select data in the 2P-RAM 31, 36 is a register which holds computation data selected by the selector 34, 37 is a register which holds computation data selected by the selector 35, 38 is a multiplier, 39 is a register which holds the output of the multiplier 38, 40 is a selector which selects the output of the register 36 or accumulators (ACC0-ACC3) 44, 41 is a selector which selects the output of the register 39 or 37, 42 is an arithmetic/logic unit which performs computations for the outputs of the selectors 40 and 41, and 43 is a selector which selects the output of the arithmetic/logic unit 42 or data in an external data register 46. The accumulators 44 are used to hold the output of the arithmetic/logic unit 42 for cumulative computations. The external data register 46 holds data from an external data memory 47. Indicated by 45 is an external address register which holds address data provided by the address generator 32 and transfers it to the external data memory 47.
Next, the operation will be described. This signal processing system based on a digital signal processor performs command fetching and decoding for the preset microprogram, data reading, computation, and computation result writing, in a parallel pipeline processing mode. The following describes the operation of 3-input-1-output computation.
The arithmetic/logic unit, multiplier, address generator, data memories and selectors are controlled in the microcommand mode.
Arithmetic operations for two inputs, including addition, subtraction, maximum evaluation, minimum evaluation, etc. are expressed generically by a ⊕ b, and a multiplication operation for two inputs is expressed generically by a × b, where a and b are independent data.
The arithmetic operations and multiplication are combined to form 3-input-1-output operations, and they are defined by the following expressions.
$$Z_i^1 = (a_i \oplus b_i) \times c_i \qquad (1)$$
$$Z_i^2 = (a_i \times b_i) \oplus c_i \qquad (2)$$
where i=1 to N, and ai, bi and ci are sets of independent data stored in the 2P-RAM 31.
FIG. 7 shows the sequence of process steps for implementing the 3-input operation of the form of expression (1) by the digital signal processing system, for example, shown in FIG. 6.
The data address generator 32 sets up the starting addresses for two data sets A and B, and selects the simple incremental mode. Then the two data sets A and B are loaded through the selectors 34 and 35 into the registers 36 and 37. The selectors 40 and 41 select the registers 36 and 37, respectively, so that the arithmetic/logic unit 42 implements the arithmetic operation ai ⊕ bi. The selector 43 selects the arithmetic/logic unit 42 to hold the operation result temporarily in one of the accumulators (ACC0-ACC3) 44, and the resultant data is sent over the data bus 33 and through the external data register 46 and stored in the external data memory 47, whose addressing mode is the simple incremental mode, linked to one of the addresses for the 2P-RAM 31 in the address generator 32.
In the subsequent step ST3, the data address generator 32 sets up the starting addresses of the data set C and the data set ai ⊕ bi, and ci data is read out of the 2P-RAM 31 to the register 36. The selector 35 selects the data bus 33 to load the data of ai ⊕ bi from the external data memory 47 into the register 37. In this case, in order to make the read timing of the data set C coincide with that of the data set ai ⊕ bi, step ST4 needs to expend two cycles of useless command reading for the external data memory in advance.
The two sets of data are multiplied by the multiplier 38 in step ST5, and the result is stored in the register 39. In the next cycle, the resultant data is passed through the arithmetic/logic unit 42 and, after being held temporarily in one of the accumulators (ACC0-ACC3) 44, transferred over the data bus 33 to the 2P-RAM 31.
These operations are carried out in parallel on the basis of the pipeline process, and the operations from the reading of 2P-RAM 31 until the storing of the process result in the external memory 47 for N data sets will take N+3 machine cycles in the case of an arithmetic operation.
The steps of operations are listed in the following Table 1 and Table 2. Table 1 lists the operation of ai ⊕ bi and the transfer of the result to the external data memory 47, and Table 2 lists the reading of the resultant ai ⊕ bi from the external data memory 47, the operation of (ai ⊕ bi) × ci, and the transfer of the result to the 2P-RAM 31. In both tables, symbol "x" represents an indefinite value. Storing in the external data register 46 is completed in machine cycle N+3 in both tables, and the external data register 46 is read uselessly in machine cycle 0 (two machine cycles) in Table 2.
TABLE 1
________________________________________________________________________________________________
Machine                                                                          External data
cycle     Register 36   Register 37   Register 39         accx                   register 46
________________________________________________________________________________________________
1         a1            b1            x                   x                      x
2         a2            b2            a1 × b1             a1 ⊕ b1                x
3         a3            b3            a2 × b2             a2 ⊕ b2                a1 ⊕ b1
4         a4            b4            a3 × b3             a3 ⊕ b3                a2 ⊕ b2
.         .             .             .                   .                      .
.         .             .             .                   .                      .
N         aN            bN            a(N-1) × b(N-1)     a(N-1) ⊕ b(N-1)        a(N-2) ⊕ b(N-2)
N + 1     x             x             aN × bN             aN ⊕ bN                a(N-1) ⊕ b(N-1)
N + 2     x             x             x                   x                      aN ⊕ bN
N + 3     x             x             x                   x                      x
________________________________________________________________________________________________
TABLE 2
________________________________________________________________________________________________
Machine                                                                          External data
cycle     Register 36   Register 37   Register 39                   accx         register 46
________________________________________________________________________________________________
0         x             x             x                             x            a1 ⊕ b1
1         a1 ⊕ b1       c1            x                             ##STR1##     a2 ⊕ b2
2         a2 ⊕ b2       c2            (a1 ⊕ b1) × c1                ##STR2##     a3 ⊕ b3
3         a3 ⊕ b3       c3            (a2 ⊕ b2) × c2                ##STR3##     a4 ⊕ b4
##STR4##  ##STR5##      ##STR6##      ##STR7##                      ##STR8##     ##STR9##
N         aN ⊕ bN       cN            (a(N-1) ⊕ b(N-1)) × c(N-1)                 x
N + 1     x             x             (aN ⊕ bN) × cN                ##STR10##    x
N + 2     x             x             x                             ##STR11##    x
N + 3     x             x             x                             x            x
________________________________________________________________________________________________
Next, after two useless reading cycles of the external memory 47 for timing purposes, multiplication is carried out for N data sets and the results are stored in the 2P-RAM 31. These operations take N+3 machine cycles, which are increased by two command cycles for address initialization, and a total of 2N+10 cycles are expended. An operation of expression (2) also takes 2N+10 cycles. Accordingly, it will be appreciated that if a 3-input-1-output operation is conducted for N data sets using a processor with the ability of 2-input operation at most, it will take about 2N machine cycles (provided that N is sufficiently large).
The following describes the cumulative operation for the results of the foregoing 3-input-1-output computation. ##EQU1## In the case of expression (3), the multiplication result for ai ⊕ bi and ci (output of register 39) and the intermediate cumulative value are entered to the arithmetic/logic unit 42, and the result of summation is entered back to the same accumulator 44 through the selector 43. Thereby, the process takes 2N+10 cycles unchanged.
In the case of expression (4), the data sets (ai × bi) ⊕ ci which have been stored temporarily in the 2P-RAM 31 are read out sequentially and summed by the arithmetic/logic unit 42, and therefore the process needs another N cycles, resulting in a total of 3N+10 cycles.
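The two-pass execution of a 3-input-1-output operation and the cycle counts quoted above can be summarized in a small sketch. It models only the data flow and the stated totals, not the pipeline itself; the function names are hypothetical, and the generic arithmetic operation defaults to addition purely as an example.

```python
import operator


def three_input_two_pass(a, b, c, arith=operator.add):
    """Emulate expression (1), (ai (+) bi) x ci, as two 2-input passes."""
    # Pass 1: arithmetic operation; results parked in a list standing in
    # for the external data memory 47.
    external = [arith(ai, bi) for ai, bi in zip(a, b)]
    # Pass 2: read the intermediate results back and multiply by ci;
    # the returned list stands in for the 2P-RAM 31.
    return [ei * ci for ei, ci in zip(external, c)]


def cycles_three_input(n):
    """Cycles for expression (1) or (2), per the text: 2N + 10."""
    per_pass = (n + 3) + 2  # N + 3 pipeline cycles plus 2 for address setup
    return 2 * per_pass


def cycles_accumulate_expr4(n):
    """Cycles for accumulating expression (2) results, per the text: 3N + 10."""
    return cycles_three_input(n) + n  # one extra N-cycle summation pass


result_add = three_input_two_pass([1, 2, 3], [4, 5, 6], [2, 2, 2])
result_max = three_input_two_pass([1, 2, 3], [4, 5, 6], [2, 2, 2], arith=max)
print(result_add, result_max, cycles_three_input(100), cycles_accumulate_expr4(100))
```

For large N the fixed overhead of 10 cycles becomes negligible, which is why the text approximates the cost as about 2N machine cycles.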
The conventional digital signal processing system is formed as described above, and therefore for a 3-input-1-output operation of three independent data sets, it performs two 2-input-1-output operations. In addition, the process time is further extended for address control, memory transfer and other processes.
FIG. 8 is a diagram showing in brief the image coding transmitter which implements the conventional motion compensatory operation method disclosed in an article entitled "Dynamic Multistage Vector Quantization for Images", journal of The Institute of Electronics and Communication Engineers of Japan, Vol. J68-B, No. 1, pp. 68-76, Jan. 1985. In the figure, indicated by 51 is an input signal of image data formed of a plurality of consecutive frames on the time axis, 52 is a motion compensator which produces a prediction signal on the basis of the resemblance computation of correlation between the current frame represented by the input signal 51 and the previous frame represented by a previous frame signal 53 which is the previous reduced signal 51, 54 is motion vector information provided by the motion compensator 52 indicative of the position of a prediction signal block, 55 is a prediction signal produced by the motion compensator 52, 56 is a coder which codes the difference between the input signal 51 and the prediction signal 55 to form a coded difference signal, 57 is a decoder which decodes the signal coded by the coder 56, and 58 is a frame memory which stores data reproduced through the summation of the difference signal from the decoder 57 and the prediction signal from the motion compensator 52.
The performance of the foregoing arrangement will be described in connection with FIG. 9. The motion compensation process calculates for the input signal 51 the amount of distortion between an 11-by-12 block located in a specific position in the current frame shown in FIG. 9(A) and M blocks in the search range S in the previous frame shown in FIG. 9(B) to evaluate the position of the block y providing a minimal distortion relative to the position of the input block, i.e., motion vector V, and to recognize the signal of the minimal distortion vector block as a prediction signal.
The number of motion vectors V under search within the search range S in the given frame is assumed to be M (an integer greater than 1). The amount of distortion of the position of a specific motion vector V between the previous frame blocks and the current input block is calculated as a sum of absolute values of differences as follows. ##EQU2##
where input vectors x={x1, x2, . . . , xK}, search object blocks yi={yi1, yi2, . . . , yiK}, i=1, 2, . . . , M, and M and K are fixed values. The motion vector V is evaluated as follows.
$$V = V_i \,\{\min d_i \mid i = 1, 2, \ldots, M\} \qquad (6)$$
FIG. 10 shows the sequence of operations for detecting the motion vector V. Step ST11 calculates a distortion di at each of K sampling points on the basis of expression (5), and the next step ST12 compares the di with the minimal distortion D at position I, and, if di < D, the variables are replaced to be D=di and I=i. These operations are repeated for the number of search vectors, i.e., the operational process of expression (6), to determine the final minimal distortion D and its position I.
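The search sequence of steps ST11 and ST12 can be sketched as follows. The distortion is computed as a sum of absolute differences, as the text describes; the function names, the flattened block representation and the sample data are hypothetical.

```python
def distortion(x, y):
    """Step ST11: sum of absolute differences over the K samples of a block."""
    return sum(abs(xj - yj) for xj, yj in zip(x, y))


def find_motion_vector(x, candidates):
    """Return (V, D): the vector with minimal distortion and that distortion.

    `candidates` is a list of (vector, block) pairs, one per search
    position i = 1..M; blocks are flattened to K samples.
    """
    if not candidates:
        raise ValueError("at least one search block is required")
    best_d = float("inf")  # minimal distortion D found so far
    best_i = None          # position I of that minimum
    for i, (_, y) in enumerate(candidates):
        d = distortion(x, y)
        if d < best_d:     # step ST12: compare and update D and I
            best_d, best_i = d, i
    return candidates[best_i][0], best_d


# Hypothetical input block (K = 4) and M = 3 search blocks.
input_block = [10, 20, 30, 40]
search_blocks = [
    ((0, 0), [12, 18, 33, 41]),
    ((1, 0), [10, 21, 30, 39]),
    ((0, 1), [50, 60, 70, 80]),
]
vector, min_distortion = find_motion_vector(input_block, search_blocks)
print(vector, min_distortion)
```

The inner distortion loop runs K times for each of the M candidates, so the work per input block grows as K times M, while the comparison and update run only M times.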
These operations must be completed within the period of each frame entered successively, and therefore a high-speed digital signal processor is required.
As an example, the digital signal processing system shown in FIG. 6 is used to carry out the motion compensation process. In this case, the multiplication-sum operation takes place K × M times for each input block, and the number of machine cycles is the total time expended by M times of processes including comparison and updating. Generally, the number of cycles for comparison and updating is small enough as compared with that of the multiplication-sum operation, and the volume of motion compensation operation for one block is virtually equal to K × M machine cycles.
However, since these operations are determined from the time corresponding to the period of frames entered successively, parallel processing will be needed for the mass multiplication-sum operations to be performed in a short time, depending on the operation process cycle time of a particular digital signal processor.
The conventional motion compensation scheme is implemented as described above, and in order to secure the operation time for an enormous volume of operations when carried out using digital signal processors, the processors must operate in parallel, resulting in an increased complexity and scale of hardware structure.