In recent years, a large number of digital signal processors (DSPs) for image CODECs have been proposed based on image compression/encoding and expansion/decoding standards such as CCITT Recommendation H.261, MPEG, and the like.
Among these DSPs, the present invention relates to a DSP of the "single instruction stream, multiple data stream (SIMD)" control system, which has a plurality of processing units, each comprising an arithmetic and logic unit, a multiplier, an accumulator, etc., wherein these processing units perform parallel processing on a plurality of data by a single instruction flow, as disclosed in Yamauchi et al., "Architecture and Implementation of a Highly Parallel Single-Chip Video DSP", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 2, NO. 2, JUNE 1992, pp. 207-220.
The configuration disclosed in this reference is shown in FIG. 1. The processing unit of this DSP can connect processors in a pipeline and also performs pipeline processing of computations.
First, the principle of the computation pipeline will be briefly explained.
FIG. 2 shows an example of the configuration of the computation pipeline. In this computation pipeline, two inputs X and Y are added at an arithmetic and logic unit (ALU) A1, the result of the addition is multiplied at a multiplier A2 by a coefficient from a coefficient memory A3, and the result of that multiplication is then accumulated at an accumulator A4. Continuous performance of a chain of such computations with respect to a plurality of data is called "computation pipeline processing".
FIG. 3 is a timing chart of the processing in the computation pipeline of FIG. 2. For simplification, it is assumed that each of the processors A1, A2, and A4 of the computation pipeline completes its computation in one clock cycle.
The "unit of processing" in FIG. 3 means a set (X, Y) of the data input to a two-input terminal.
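The behavior of the computation pipeline of FIG. 2 under the one-clock-cycle assumption can be sketched as a small simulation. This is only an illustrative model, not the circuit of the reference: the function name run_pipeline, the register names, and the sample data are assumptions introduced here.

```python
def run_pipeline(units, coefficients):
    """Simulate the three-stage pipeline of FIG. 2, one clock cycle per stage.

    units: list of (X, Y) input pairs (the "units of processing").
    coefficients: one multiplier coefficient per unit (hypothetical values).
    Returns (final accumulator value, per-cycle schedule of busy processors).
    """
    n = len(units)
    sum_reg = None    # value latched between ALU A1 and multiplier A2
    prod_reg = None   # value latched between multiplier A2 and accumulator A4
    acc = 0
    schedule = []
    for cycle in range(n + 2):  # n units plus 2 cycles to drain the pipeline
        busy = {}
        # Accumulator A4 consumes the product latched in the previous cycle.
        if prod_reg is not None:
            i, product = prod_reg
            acc += product
            busy["A4"] = i
        # Multiplier A2 consumes the sum latched in the previous cycle.
        new_prod = None
        if sum_reg is not None:
            i, total = sum_reg
            new_prod = (i, total * coefficients[i])
            busy["A2"] = i
        # ALU A1 takes a new unit of processing while any remain.
        new_sum = None
        if cycle < n:
            x, y = units[cycle]
            new_sum = (cycle, x + y)
            busy["A1"] = cycle
        sum_reg, prod_reg = new_sum, new_prod
        schedule.append(busy)
    return acc, schedule


# Hypothetical sample data: four units of processing and four coefficients.
acc, schedule = run_pipeline([(1, 2), (3, 4), (5, 6), (7, 8)], [2, 1, 3, 1])
```

With these sample values, the entry schedule[2] records that the accumulator A4, the multiplier A2, and the ALU A1 are working on units 0, 1, and 2 respectively in the same clock cycle, which is the overlap depicted in the timing chart of FIG. 3.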
As shown in FIG. 3, when looking for example at the i-th unit of processing,
    in the (k-1)-th clock cycle, the ALU (A1) performs addition processing;
    in the k-th clock cycle, the multiplier A2 performs multiplication processing;
    in the (k+1)-th clock cycle, the accumulator A4 performs accumulation processing.
Also, when looking at the k-th clock cycle,
    the (i-1)-th unit of processing after the addition processing and the multiplication processing is accumulated at the accumulator A4,
    the i-th unit of processing after the addition is multiplied at the multiplier A2, and
    the (i+1)-th unit of processing is added at the adder A1.
By repeatedly performing such an operation with respect to a plurality of units of processing, the computation pipeline processing can be realized.

    (1) discrete cosine transformation/inverse discrete cosine transformation (DCT/IDCT);
    (2) quantization/inverse quantization;
    (3) moving picture detection;
    (4) motion compensation (production of virtual pixel, production of predictive pixel);
    (5) filter (computation of inner product); and
    (6) addition of images, subtraction of images; and so on in image CODEC processing.

    said adaptive video signal processing apparatus characterized in that it is provided with a plurality of processing units provided in parallel, each of which having an extended arithmetic and logic unit performing addition, subtraction, various logical computations, comparisons of magnitude, computation of absolute values of differences, and butterfly addition and subtraction processing; a first internal pipeline memory provided at a stage after the extended arithmetic and logic unit, a multiplier unit provided at a stage after the first internal pipeline memory, a coefficient memory supplying a coefficient to the multiplier unit, a second internal pipeline memory provided at a stage after it in the multiplier unit, an accumulation processing unit provided at a stage after the second internal pipeline memory, and a third internal pipeline memory provided at a stage after it in the accumulation processing unit;
    mutually connected pipeline memories disposed so as to connect adjoining processing units among these plurality of parallel processing units; and
    data selectors which selectively apply the input data to the aforesaid plurality of processing units, wherein
    adjoining processing units are coupled via the aforesaid mutually connected pipeline memories and, the internal pipeline memories in the aforesaid processing units are selected to constitute a predetermined data flow path,
    thereby to perform a desired video signal processing.

    (a) the aforesaid discrete cosine transformation processing data is input to the multiplier units in all processing units and the results of multiplication are accumulated at the accumulation unit,
    (b) the output is input to the extended arithmetic and logic units in the plurality of processing units excluding the aforesaid initial stage processing unit and the results of processing in the extended arithmetic and logic units are output to the adjoining mutually connected pipeline memories described before.

    a positive/negative inverter which inverts the polarity of a first input data;
    a first data selector which is provided at a stage after the positive/negative inverter and selectively outputs the aforesaid first input data or the aforesaid polarity-inverted first data;
    an adder adding the selected output data of the first data selector and a second input data;
    a subtracter which subtracts the aforesaid second input data from the aforesaid first input data;
    a logical processor which performs the logical processing of the aforesaid first input data and the aforesaid second data such as a logical OR, logical AND, exclusive logical OR, negation, etc.;
    a positive/negative decision unit receiving as its input the output of the aforesaid adder and the aforesaid subtracter and performing the positive/negative decision;
    a second data selector receiving as its inputs the outputs of the aforesaid adder, the aforesaid subtracter, and the aforesaid positive/negative decision unit and selectively outputting the same; and
    a first output terminal connected to the second data selector;
    a second output terminal connected to the aforesaid subtracter, and
    any of addition, subtraction, various types of logical computations, comparisons of magnitude, computation of absolute values of differences, and butterfly addition and subtraction processing being carried out by combining the above-mentioned circuits.
Next, the prior art will be explained.
Here, a DSP of the "single instruction stream, multiple data stream (SIMD)" control system proposed in the above-mentioned reference, in which four sets of processing units perform parallel processing on a plurality of data by a single instruction flow, will be considered.
As a prerequisite, it is assumed that each processing unit is comprised of three types of processors, that is, an arithmetic and logic unit (ALU) performing the addition, subtraction, and the logical computation, a multiplier, and an accumulator. Also, for ease of the explanation, it is assumed that each processor completes the computation in one clock cycle. Accordingly, this DSP can execute 12 computations (for example, four addition, four multiplication, and four accumulation operations) at the maximum in one clock cycle. Further, it is assumed that this DSP has a data memory for supplying the data to the processors or storing the data from the processors inside a chip or outside the chip.
First, the configuration for realizing the computation pipeline having the highest degree of freedom will be explained based on the above-described prerequisites.
As shown in FIGS. 4A to 4D, which show the operation modes of four parallel processing units, the computation pipeline having the highest degree of freedom can be realized by regarding the data memory as a pipeline register and performing the computation pipeline processing by software (so-called software pipelining). In this case, the processing units are connected only via the data memory. Accordingly, the data memory must be able to supply arbitrary data to the inputs of all processors in every clock cycle and simultaneously store the outputs of all processors at arbitrary addresses.
As seen from FIGS. 4A to 4D, the data memory requires 16 ports for the inputs to the processors and 12 ports for the outputs from the processors. Accordingly, a multiport memory with a total of 28 ports is necessary. This number of ports is not realistic in light of current semiconductor circuit technology, and is difficult to realize in practice.
Therefore, a procedure of dividing the data memory into banks and reducing the number of ports per bank can be considered. However, even if the data memory in the above-described example is divided into, for example, four banks, a multiport memory of seven ports per bank would still be necessary.
Accordingly, it is possible to adopt an approach of restricting the degree of freedom of the computation pipeline to a certain extent in accordance with the application program and thus reducing the number of ports of the data memory. For example, as proposed in the above-mentioned reference, four computation pipelines comprising an ALU, multiplier, and accumulator are provided, and only the inputs and outputs of the computation pipelines are connected to the data memory. The number of ports required for the data memory in this case becomes eight ports for the input to the computation pipeline and four ports for the output from the computation pipeline.
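The port counts discussed above follow from simple arithmetic, which can be checked with a short sketch. The helper name memory_ports is hypothetical, and the per-unit figures (four inputs and three outputs per processing unit in the software-pipelined case, two inputs and one output per pipeline in the restricted case) are inferred here by dividing the stated totals by four units; the even division across banks is likewise an assumption.

```python
def memory_ports(units, inputs_per_unit, outputs_per_unit, banks=1):
    """Return the number of data-memory ports needed per bank.

    Assumes every input and output of every unit needs its own port and
    that the ports are spread evenly over the banks (rounded up).
    """
    total = units * (inputs_per_unit + outputs_per_unit)
    return -(-total // banks)  # ceiling division


# Software pipelining: every processor input and output goes through memory.
full_freedom = memory_ports(units=4, inputs_per_unit=4, outputs_per_unit=3)

# Same configuration with the data memory divided into four banks.
per_bank = memory_ports(units=4, inputs_per_unit=4, outputs_per_unit=3, banks=4)

# Restricted case: only the two inputs and one output of each of the
# four ALU-multiplier-accumulator pipelines are connected to memory.
restricted = memory_ports(units=4, inputs_per_unit=2, outputs_per_unit=1)
```

Under these assumptions the three results are 28, 7, and 12 ports, matching the 16 + 12 ports, the seven ports per bank, and the 8 + 4 ports given in the text.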
However, the computation pipeline configuration of the conventional DSP mentioned above restricts the degree of freedom of the computation pipeline. For example, processing in which a logical computation follows a multiplication cannot be carried out as a single pipeline computation. In this case, first the pipeline processing of multiplication is carried out with respect to all data using the multiplier, and then the pipeline processing of the logical computation is carried out with respect to all of the multiplied data using the ALU. Accordingly, the ALU is not used at the time of multiplication and the multiplier is not used at the time of the logical computation, so the efficiency of use of the processors is lowered and performance is reduced. Also, since the computation pipeline processing is divided into two steps, the initialization at the start of the computation pipeline must be performed twice.
Further, in the above-mentioned conventional DSP, it is necessary to store an intermediate result at the point of time of completion of the first computation pipeline processing, and therefore a larger capacity of the data memory is required.
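The cost of dividing the processing into two steps can be illustrated with a rough model. The sketch below is an assumption-laden simplification rather than a description of the conventional DSP: the function names, the bitwise AND used as the logical computation, the sample data, and the cycle counts (one item per clock cycle per processor, with initialization overhead ignored) are all introduced here for illustration.

```python
def two_pass(data, coeff, mask):
    """Multiply all data first, store the results, then apply the logic step."""
    intermediate = [x * coeff for x in data]     # pass 1: multiplier busy, ALU idle
    result = [p & mask for p in intermediate]    # pass 2: ALU busy, multiplier idle
    cycles = 2 * len(data)                       # one item per cycle in each pass
    stored_words = len(intermediate)             # intermediate results kept in data memory
    return result, cycles, stored_words


def fused(data, coeff, mask):
    """Hypothetical multiplier-to-ALU pipeline: both processors work concurrently."""
    result = [(x * coeff) & mask for x in data]
    cycles = len(data) + 1                       # one extra cycle to fill the second stage
    stored_words = 0                             # no intermediate result reaches data memory
    return result, cycles, stored_words


samples = [3, 5, 7, 9]
r2, c2, m2 = two_pass(samples, coeff=6, mask=0x0F)
r1, c1, m1 = fused(samples, coeff=6, mask=0x0F)
```

In this model both versions produce the same results, but the two-step version takes 8 cycles and stores 4 intermediate words, against 5 cycles and no intermediate storage for the fused version; the gap widens as the number of data items grows.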
In the element processing of image CODEC, other than the processing for the logical computation after multiplication as in the above-described example, processing for continuously performing multiplication, processing for adding the results of multiplication to each other, and so on become necessary. A problem similar to the above-mentioned problem occurs in each of these processing operations.
Also, the computation pipeline configuration of the above-mentioned conventional DSP cannot realize a computation pipeline combining the butterfly computation (addition and subtraction) with the multiplication and addition used in a high speed computation algorithm, as proposed in Japanese Patent Application No. 4-338183, "Two-dimensional 8×8 discrete cosine transformation circuit and two-dimensional 8×8 inverse discrete cosine transformation circuit", by the present assignee.
In this preceding patent application, when the two-dimensional 8×8 discrete cosine transformation or the two-dimensional 8×8 inverse discrete cosine transformation is carried out, matrix decomposition is applied and the computation processing is carried out. A detailed description will be given later referring to FIG. 9 and FIG. 10.
The reason why the computation pipeline cannot be constituted as described above is that the multiplication and addition cannot be carried out in parallel when performing the butterfly computation (two processing units are used in the conventional example) due to the limitation of the number of ports of the data memory. Accordingly, the butterfly computation and the multiplication and addition are sequentially executed, and therefore the performance is considerably lowered in comparison with an ideal computation pipeline configuration as proposed in the above-described patent application.
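The kind of computation at issue, a butterfly followed by multiplication and addition, can be illustrated with the well-known even/odd symmetry of the one-dimensional 8-point DCT. The sketch below is only an example of that general technique and is not claimed to be the matrix decomposition of the above-mentioned patent application; the function names are hypothetical and the transform is left unnormalized for brevity.

```python
import math

N = 8


def dct_direct(x):
    """Unnormalized 8-point DCT-II from the definition: 8 multiply-adds per output."""
    return [
        sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N))
        for k in range(N)
    ]


def dct_butterfly(x):
    """Same transform using a butterfly stage, then 4 multiply-adds per output."""
    half = N // 2
    # Butterfly computation: sums and differences of mirrored input pairs.
    sums = [x[n] + x[N - 1 - n] for n in range(half)]
    diffs = [x[n] - x[N - 1 - n] for n in range(half)]
    out = []
    for k in range(N):
        # Even-indexed outputs depend only on the sums, odd-indexed on the differences.
        src = sums if k % 2 == 0 else diffs
        out.append(
            sum(src[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(half))
        )
    return out


pixels = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical input row
direct = dct_direct(pixels)
fast = dct_butterfly(pixels)
```

The butterfly stage halves the number of multiplications per output. If, as in the conventional configuration, the butterfly and the multiplication and addition must be executed one after the other instead of overlapping in a pipeline, much of that saving is lost, which is the performance problem described above.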