1. Field of the Invention
The present invention relates to a digital signal processor (DSP) for executing arithmetic operations for obtaining sums of products at a high speed, and particularly to a DSP which can independently process picture-image signals.
2. Description of the Prior Art
Recently, digital signal processors (hereinafter, designated as DSP) for executing high speed arithmetic operations on sums of products frequently appearing in digital signal processes have been widely used. Moreover, also in the field of processing picture-image signals, DSP's for independently processing image signals have prevailed.
FIG. 1 shows a block diagram of a typical construction of a conventional DSP 371 for processing picture-image signals.
In the DSP 371, there are provided mainly an operation section 373 for carrying out data operations, address generation units (hereinafter, designated as AGU) 375a to 375c, a register file 381 for temporarily holding data, an instruction memory 377 for holding instructions in a readable state, and a control section 379.
In such construction, the control section 379 controls the AGU's 375a to 375c, the operation section 373 and the register file 381 in accordance with an instruction contained in the instruction memory 377. The AGU's 375a to 375c calculate addresses of external memories 383a to 383c in accordance with a process program, and then output them to these memories 383a to 383c, respectively. Then, the data stored at these calculated addresses in the external memories 383a to 383c are respectively read and temporarily held in the register file 381 through a data bus 384. Otherwise, these data are inputted to the operation section 373, where a calculation designated by the program is applied to them. Moreover, the data calculated in the operation section 373 are respectively written into the external memories 383a to 383c in accordance with the respective addresses generated by the AGU's 375a to 375c.
Incidentally, the address calculation carried out in the AGU's 375a to 375c and the arithmetic operation given to the data in the operation section 373 are executed in parallel or in a pipeline.
Next, the construction, operation and problems in the address generation units (AGU) in the above-mentioned conventional DSP for processing picture-image signals will be explained.
First, the construction of the AGU's 375a to 375c is explained with reference to FIG. 2. Moreover, in FIG. 3, there is schematically shown an instruction field in which is described the instruction on data operation or on address calculation.
Now consider an addressing mode, i.e., a two-dimensional raster scan mode in which a scanning operation is carried out from the left upper portion to the right lower portion of a screen, as shown in FIG. 4.
In such an addressing mode, increment of the X address by means of an adder 385 and increment of the Y address by means of an adder 387 are carried out simultaneously. Then, the increase caused by each address increment is inputted to a register 389 or 391.
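The two-dimensional raster scan mode of FIG. 4 can be illustrated by the following minimal Python sketch. The function name and the `width` and `height` parameters are assumptions introduced only for illustration, and the sketch steps the two addresses sequentially in software, whereas the adders 385 and 387 described above operate at the same time.

```python
def raster_scan(width, height):
    """Yield (x, y) addresses from the upper-left to the lower-right of a screen."""
    x, y = 0, 0
    while y < height:
        yield x, y
        x += 1              # X address increment (role of adder 385)
        if x == width:      # end of a line reached
            x = 0
            y += 1          # Y address increment (role of adder 387)
```

For a screen of three picture elements by two lines, the sketch produces (0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1) in that order.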
On the other hand, an addressing instruction is described in an AGU control field 791 in the instruction field shown in FIG. 3. In the AGU control field 791 are described, step by step, input instructions to the registers 389, 391, instructions on additive operations to the adders 385, 387, and instructions on transfer of the addition results. These instructions must be described so as to synchronize with the timings of the increments of the X and Y addresses and of the data operation by the operation section 373. Additionally, since the operation section 373 and the AGU's 375a to 375c are operated in parallel, the AGU control field 791 and an operation section control field 793 are set separately in the instruction field of FIG. 3.
As stated above, in the conventional DSP for processing picture-image signals, the instructions for controlling the AGU's 375a to 375c are described step by step in a condensed form in the AGU control field 791. Therefore, the bit number of these instructions is greatly increased, so that a large amount of hardware for reading, storing and decoding these instructions must be prepared in the DSP. Moreover, when the instruction bit number is increased, a large amount of instruction memory is required, and it also becomes necessary to divide these instructions into a plurality of processable parts for decoding. As a result, far more time is required for reading the instructions. Besides, it is very difficult to prepare a program which keeps the address calculation synchronized with the timings of the respective data operations, and even where this is possible, preparing such a program takes much time and trouble.
Next, the address production method in the address generation units (AGU) of the conventional DSP is explained.
Generally, the discrete Fourier transform technology is used in processing digital signals in picture-image processing or signal analysis. As a method for carrying out such a discrete Fourier transform, the so-called FFT (Fast Fourier Transform) method is known. Incidentally, the FFT method is disclosed in the literature, such as Alan V. Oppenheim et al., Digital Signal Processing, Prentice-Hall International Editions. Such an FFT method is realized by repetition of butterfly operations on two data.
For example, when two complex-number input data are A+iB and C+iD, the respective results of the butterfly operation are A'+iB' and C'+iD', and the coefficient is X+iY, the butterfly operation can be expressed as follows:

$$A' + iB' = (A + iB) + (X + iY)(C + iD)$$
$$C' + iD' = (A + iB) - (X + iY)(C + iD)$$

wherein A, C, A', C' and X are the real parts, and B, D, B', D' and Y are the imaginary parts, respectively.
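The butterfly operation can be sketched in Python as follows. Both function names are hypothetical; the first uses Python's built-in complex type, and the second expands the complex product into real and imaginary parts using (X+iY)(C+iD) = (XC-YD) + i(XD+YC), which is the form in which sums of products would be computed on real-valued hardware.

```python
def butterfly(p, q, w):
    """Return (p + w*q, p - w*q) for complex inputs p, q and coefficient w."""
    t = w * q
    return p + t, p - t


def butterfly_real(A, B, C, D, X, Y):
    """Same operation on separate real and imaginary parts.

    Returns (A', B', C', D') for inputs A+iB, C+iD and coefficient X+iY.
    """
    tr = X * C - Y * D   # real part of (X+iY)(C+iD)
    ti = X * D + Y * C   # imaginary part of (X+iY)(C+iD)
    return A + tr, B + ti, A - tr, B - ti
```

Each butterfly thus requires four real multiplications and six real additions or subtractions in this expanded form.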
Moreover, as typical FFT methods, there can be mentioned a time thinning-out (decimation-in-time) FFT method and a frequency thinning-out (decimation-in-frequency) FFT method.
Here, a calculation procedure of the time thinning-out FFT method in the conventional data address generation method is explained.
FIG. 5 is a flow chart to show a calculation procedure of an FFT method for thinning out time at eight points.
In the same drawing, numerals 0, 1, 2, . . . 7 respectively designate outputs or addresses generated from an address generation circuit. Moreover, x(0) to x(7) designate input data, X(0) to X(7) designate output data, and F0(0) to F2(7) are intermediate results, further W0 to W7 are coefficients, respectively.
On the 0th stage, addresses 0, 1, 2, 3, 4, 5, 6 and 7 are successively generated, and expressed in their respective binary forms, then each result is subjected to a bit reverse order operation. Namely,
0(000) is transformed to 0(000),
1(001) to 4(100),
2(010) to 2(010),
3(011) to 6(110),
4(100) to 1(001),
5(101) to 5(101),
6(110) to 3(011), and
7(111) to 7(111).
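The bit reverse order operation listed above can be illustrated by a short Python sketch. The function name and the `bits` parameter (the address width, three bits for eight points) are assumptions for illustration only.

```python
def bit_reverse(addr, bits):
    """Reverse the order of the lowest `bits` bits of `addr`."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (addr & 1)   # move the lowest bit of addr into out
        addr >>= 1
    return out
```

Applied to the addresses 0 to 7 with three bits, the sketch yields 0, 4, 2, 6, 1, 5, 3, 7, in agreement with the list above.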
Then, each data (x(i) in FIG. 5) is read from each resultant transformed address, thereafter a butterfly operation is carried out with respect to respective pairs of addresses, i.e., addresses 0 and 4, 2 and 6, 1 and 5, and 3 and 7. Then, the results (F0(i) in FIG. 5) are written at the addresses 0, 1, 2, 3, 4, 5, 6 and 7. For example, F0(0) = x(0) + W0 × x(4).
Accordingly, each address to be generated on the 0th stage can be expressed by $2^{1}k + j + 2^{0}i$, wherein i=0, 1, j=0, k=0, 1, 2, 3, and nesting is carried out in the order of i, j, k and is ended when i, j, k reach their respective maximum values.
Then, the butterfly operation is carried out with respect to every pair of the addresses generated in such a manner. However, the addresses generated here are not yet subjected to the bit reverse order operation as mentioned above.
Incidentally, the coefficients W are all equal on the 0th stage, so that a read address of the coefficient memory is 0.
Next, on the 1st stage, addresses 0, 2, 1, 3, 4, 6, 5 and 7 are generated successively, and each butterfly operation on F0(i) is given to respective address pairs of 0 and 2, 1 and 3, 4 and 6, and 5 and 7. Then, the results (F1(i) in FIG. 5) are respectively written in the addresses 0, 2, 1, 3, 4, 6, 5, 7.
At this time, each address to be generated can be expressed by $2^{2}k + j + 2^{1}i$, wherein i=0, 1, j=0, 1, k=0, 1.
Thereafter, each pair of the addresses generated in accordance with the above formula is subjected to a butterfly operation.
Incidentally, read addresses of the coefficient memory are 0 and 2, so that coefficients corresponding to these addresses are alternately used.
Moreover, on the 2nd stage, the addresses are generated in the order of 0, 4, 1, 5, 2, 6, 3 and 7, then each butterfly operation on F1(i) is given to respective address pairs of 0 and 4, 1 and 5, 2 and 6, and 3 and 7. Further, the results (F2(i) in FIG. 5) are respectively written in the addresses 0, 4, 1, 5, 2, 6, 3, 7.
In this case, each address to be generated can be expressed by $2^{3}k + j + 2^{2}i$, wherein i=0, 1, j=0, 1, 2, 3, k=0.
Thereafter, each pair of the generated addresses is subjected to a butterfly operation.
Incidentally, read addresses of the coefficient memory are 0, 1, 2, 3.
Generally, read addresses of the input data memory and write addresses of the output data memory respectively generated on the n-th stage of an FFT at $2^{m}$ points can be expressed by the following formula (1):

$$2^{n+1}k + j + 2^{n}i \qquad (1)$$

wherein i=0, 1, j=0, 1, 2, . . . , $2^{n}-1$, and k=0, 1, 2, . . . , $2^{m-n-1}-1$.
On the other hand, read addresses of the coefficient memory can be expressed by the following formula (2):

$$2^{m-n-1}h \qquad (2)$$

wherein h=0, 1, 2, . . . , $2^{n}-1$.
The generation period of h is twice as long as the generation period of i, and the sequence in which h runs from 0 to $2^{n}-1$ is repeated $2^{m-n-1}$ times.
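Formulas (1) and (2) can be checked against the eight-point example with the following Python sketch. The function names are hypothetical, the nesting order (i innermost, then j, then k) follows the description of the 0th stage, and the bit reverse order operation applied to the read addresses on the 0th stage is not included.

```python
def data_addresses(m, n):
    """Addresses 2^(n+1)*k + j + 2^n*i on stage n of a 2^m-point FFT."""
    return [(k << (n + 1)) + j + (i << n)
            for k in range(1 << (m - n - 1))
            for j in range(1 << n)
            for i in range(2)]


def coefficient_addresses(m, n):
    """Addresses 2^(m-n-1)*h, one per butterfly, with h cycling from 0 to 2^n - 1."""
    return [h << (m - n - 1)
            for _ in range(1 << (m - n - 1))
            for h in range(1 << n)]
```

With m=3, the sketch reproduces the address orders 0 to 7 on the 0th stage, 0, 2, 1, 3, 4, 6, 5, 7 on the 1st stage and 0, 4, 1, 5, 2, 6, 3, 7 on the 2nd stage, and the coefficient addresses 0 on the 0th stage, 0 and 2 alternately on the 1st stage, and 0, 1, 2, 3 on the 2nd stage.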
In order to realize a hardware system for generating such addresses, counters for successively generating i, j, k, h, shifters for multiplying each corresponding factor by $2^{n+1}$, $2^{n}$, $2^{m-n-1}$, adders for obtaining the sum of $2^{n+1}k$, j and $2^{n}i$, and a bit reverse order operation circuit for carrying out the bit reverse order operation on the 0th stage must be prepared respectively for the input data, output data and coefficients.
FIG. 6 shows a conventional example of address generating circuits for realizing the formula (1). In the same drawing, reference numerals 399, 392, 393 respectively designate counters, 394, 395 are shifters, 396, 397 are adders and 398 denotes a bit reverse order operation circuit.
Namely, as clearly seen from FIG. 6, in such an address generation circuit for realizing the address generation method based on the conventional FFT, the plurality of counters, shifters and adders, and the bit reverse order operation circuit are required, so that the generating circuit becomes relatively large in scale and complicated in construction. Moreover, since the connection between these counters and the setting of initial and final values of each counter are very complex, it is very difficult to control a circuit of such construction.
Next, another address generation method based on the affine transformation is explained.
Generally, in the picture-image signal process, affine transformation concerning reduction, expansion, movement or rotation of picture images is frequently used. Since the affine transformation is generally regarded as transformation concerning coordinates, in the case of processing digital signals of picture images, it is used for transformation of coordinates of the input and output picture images. Therefore, the address generation based on this method can be designated by the following formula (1):

$$\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} e \\ f \end{pmatrix} \qquad (1)$$

wherein X, Y are input picture-image coordinates, x, y are output picture-image coordinates, a, b, c, d are transformation coefficients, and e, f are offset coordinates, respectively.
Moreover, the input picture-image coordinates to be obtained by the calculation are respectively in the following forms (2):

$$X = ax + by + e, \qquad Y = cx + dy + f \qquad (2)$$

Namely, these input picture-image coordinates X, Y can be respectively obtained by additions based on a, b, e and on c, d, f while the output picture-image coordinates (x, y) are scanned in the general manner, that is, x is increased by one from the upper-left coordinates (0, 0) toward the lower-right coordinates (m, n), and when x reaches m, y is increased by one.
In this case, though a, b, c, d, e, f are real numbers respectively, the input and output picture-image coordinates should be coordinates of integral numbers. Therefore, it is necessary to obtain coordinates of integral numbers from coordinates of real numbers. For example, when both of the coordinates X, Y obtained by the formula (2) are real numbers, and their integral parts are $X_r$, $Y_r$, the four integral coordinates adjacent to (X, Y) are generally determined as follows:

$$(X_r, Y_r),\ (X_r+1, Y_r),\ (X_r, Y_r+1),\ (X_r+1, Y_r+1) \qquad (3)$$
Accordingly, the input and output picture-image coordinates are selected as those nearest to the real-number coordinates (X, Y) among these four coordinates. Therefore, it is necessary to calculate the respective distances from (X, Y) to these four coordinates.
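The selection of the nearest integral coordinates can be sketched in Python as follows. The function names are hypothetical, the integral part is taken with `math.floor` (an assumption that matches the usual meaning for non-negative coordinates), and squared distances are compared because they give the same ordering as the distances themselves.

```python
import math


def affine_source(x, y, a, b, c, d, e, f):
    """Formula (2): input coordinates (X, Y) for output coordinates (x, y)."""
    return a * x + b * y + e, c * x + d * y + f


def nearest_integer_coordinate(X, Y):
    """Pick the nearest of the four integral coordinates adjacent to (X, Y)."""
    Xr, Yr = math.floor(X), math.floor(Y)
    candidates = [(Xr, Yr), (Xr + 1, Yr), (Xr, Yr + 1), (Xr + 1, Yr + 1)]
    # Compare squared distances from (X, Y) to each candidate.
    return min(candidates, key=lambda p: (p[0] - X) ** 2 + (p[1] - Y) ** 2)
```

For instance, the real-number coordinates (2.3, 4.8) are mapped to (2, 5). Four distance evaluations per picture element are needed, which is the cost discussed in the following paragraphs.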
FIG. 7 shows a typical example of circuit construction for address generation. Namely, in the same drawing, the respective increased amounts of coordinates (X, Y) which are calculated by the operation section 373 shown in FIG. 1 are contained in registers 301, 302, and the respective initial values of the same coordinates are contained in registers 303, 304, 305, 306. Then, the coordinates X, Y are respectively counted up by ALU's 313, 314 respectively controlled by the control section 379. Thereafter, the coordinates X, Y newly obtained by the ALU's 313, 314 are respectively shifted by barrel shifters 307, 308 so as to match an address format given in advance to a picture-image memory. Then, the coordinates X, Y are combined together and transformed into an address on the picture-image memory by an ALU 309, and contained in an output register 312.
To carry out affine transformation by such an address generating circuit, the above-mentioned factor e calculated by the data processing section is contained in the register 304, likewise, f in 306, a in 301, and c in 302. Moreover, addition is repeated with respect to a, c based on e, f as the initial values. If the number of times of the addition reaches the maximum value with respect to the coordinate x, then b+e calculated by the data processing section is contained in the register 304, and d+f in 306. Thereafter, the calculation is continued in the same manner. As a result, the coordinates X, Y are transformed into an address on the picture-image memory by the barrel shifters 307, 308 and ALU 309, and contained in the output register 312. Since both of the coordinates contained in the register 312 are real numbers, these are transferred to the data processing section so as to obtain corresponding integral coordinates.
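The repeated-addition procedure described above can be sketched in Python as follows. The function name is hypothetical, and the sketch models only the arithmetic (initial values e, f, per-element increments a, c, and per-line reloading with b and d added), not the registers, barrel shifters or ALU's of FIG. 7.

```python
def affine_scan(m, n, a, b, c, d, e, f):
    """Yield (x, y, X, Y) for output coordinates (0, 0) to (m, n) using additions only."""
    line_X, line_Y = e, f            # initial values at the start of the first line
    for y in range(n + 1):
        X, Y = line_X, line_Y
        for x in range(m + 1):
            yield x, y, X, Y
            X += a                   # increment per step of x
            Y += c
        line_X += b                  # next line starts from b+e, d+f, and so on
        line_Y += d
```

Every yielded pair satisfies X = ax + by + e and Y = cx + dy + f, so no multiplication is needed during the scan; the conversion of the real-number results to integral coordinates remains a separate step.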
As stated above, in such an address generation circuit, since it is not possible to calculate the respective distances from real-number coordinates to four adjacent integral coordinates, an operation section is usually used in combination. Therefore, it takes much time to obtain an address based on the affine transformation. Moreover, in such construction, it is necessary to operate the ALU's 313, 314, selectors 310, 311 and barrel shifters 307, 308 at the same time. Accordingly, the program must be complicated, and an extremely large capacity is required for the program memory. Therefore, it is very difficult to realize with one LSI a digital signal processor capable of performing the affine transformation as well as other addressing modes.
Namely, in the conventional and generally known address generation circuit, it takes much time to obtain an address based on the affine transformation, moreover the program description must be extremely complicated.
Next, there is explained a conventional example of address generation circuits for carrying out eight neighborhood access.
On the other hand, the picture-image processing technology is also widely used in the field of factory automation or medical engineering. As the picture-image processing technology, there can be mentioned smoothing or edge detection based on two-dimensional space filters, orthogonal transformation represented by fast Fourier transformation (FFT) or discrete cosine transformation (DCT), and feature extraction represented by labelling, line thinning or boundary tracking.
For example, in the case of the two-dimensional filter technology, with respect to one picture element as an object, the process is given to a space of three picture-element square around the object. In this case, the process is continued in a scanning manner with shifting the object by one picture element over all of the screen.
In case of the labelling or line thinning, the process is given to four or eight points adjacent to one picture element as an object to be processed. The definition of these four or eight adjacent points is given in FIG. 8. Namely, in the same drawing, x1, x2, x3, x4, x5, x6, x7, x8 designate the eight adjacent points, while x1, x3, x5, x7 designate the four adjacent points. Moreover, at the lower portion in each section are designated the X address and Y address corresponding to the distance from the center x0 to the point in the section. Accordingly, the process given to eight adjacent points in the labelling or line-thinning process is the same in principle as that given to a space of three picture-element square. However, unlike the above-mentioned two-dimensional space filter process, the process in the labelling or line-thinning technology is given to eight points adjacent to a specified picture element.
Hereinafter, the process on such eight adjacent points in the labelling is explained in more detail.
First, the definition of the term "connection" is explained. With respect to a picture element designated by 1 in binary picture images, if there is a point designated by 1 among the eight adjacent points, these two picture elements are defined to be in the relation of "connection" to each other in the eight-adjacent-point system. Likewise, if there is a point designated by 1 among the four adjacent points, these two are defined to be in the relation of "connection" to each other in the four-adjacent-point system. Moreover, among collections of picture elements in such a connective relation in binary picture images, some are in the 0 state and the others in the 1 state. These classes of the 0 and 1 states are respectively called connective components. Namely, in the labelling of these connective components, the same label is allocated to all picture elements of the same connective component, and different labels are allocated to different components.
In this case, raster scan is successively carried out on the picture images so as to find a picture element in the 1 state (hereinafter designated by picture element 1), which is not yet given labels. Then, to this picture element is allocated a label not yet used. Next, picture elements 1 among eight picture elements which are adjacent to this picture element are given the same labels.
Moreover, the same labels are given to picture elements 1 among respective eight picture elements adjacent to these labelled picture elements. Then, the same process is repeated until no more picture element 1 to be labelled anew is found. As a result, the same labels are given to all of the same connective components. Furthermore, the same scanning process is repeated to find a picture element 1 not yet labelled so as to allocate a new label to it. Then, the same process is carried out. Thereafter, the algorithm is completed when the raster scan is finished over all of picture images. Accordingly, in such labelling, access of picture elements of the eight adjacent points is essential, thus the address generation for the access is very important.
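The labelling procedure described above can be sketched in Python as follows. The function name, the list-of-lists image representation and the use of a queue to hold labelled picture elements whose neighbours remain to be examined are assumptions for illustration; the order in which the eight adjacent points are visited is arbitrary here, as the labelling does not depend on it.

```python
from collections import deque

# Relative (dx, dy) distances to the eight adjacent points.
NEIGHBORS_8 = [(-1, -1), (0, -1), (1, -1), (-1, 0),
               (1, 0), (-1, 1), (0, 1), (1, 1)]


def label_components(image):
    """Label the 8-connected components of picture elements in the 1 state."""
    height, width = len(image), len(image[0])
    labels = [[0] * width for _ in range(height)]   # 0 means "not yet labelled"
    next_label = 0
    for y in range(height):                         # raster scan
        for x in range(width):
            if image[y][x] != 1 or labels[y][x] != 0:
                continue
            next_label += 1                         # a label not yet used
            labels[y][x] = next_label
            queue = deque([(x, y)])
            while queue:                            # spread the label to neighbours
                cx, cy = queue.popleft()
                for dx, dy in NEIGHBORS_8:
                    nx, ny = cx + dx, cy + dy
                    if (0 <= nx < width and 0 <= ny < height
                            and image[ny][nx] == 1 and labels[ny][nx] == 0):
                        labels[ny][nx] = next_label
                        queue.append((nx, ny))
    return labels
```

Every labelled picture element causes one access to each of its eight adjacent points, which is why the address generation for this access dominates the cost of the procedure.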
Though the access order to picture elements of the eight adjacent points is optional in the labelling, this order is one of very important factors in the line thinning or boundary tracking.
Next, the calculation of a connectivity number necessary for the line thinning is explained.
The connectivity number Nc is a number for expressing a feature of a picture element in a certain Figure, and is defined by the following formula:
(In case of four adjacent points) ##EQU2##
(In case of eight adjacent points) ##EQU3##
In these formulae, $f(x_i)$ is a value of a picture element at $x_i$ shown in FIG. 8, and $\bar{f}(x_i) = 1 - f(x_i)$, $x_9 = x_1$, $S = \{1, 3, 5, 7\}$.
Namely, Nc=0 means an isolated point or internal point, Nc=1 means an end point, Nc=2 means a serial point, Nc=3 means a branch point, and Nc=4 means a crossing point.
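The two formulae themselves are not reproduced in this text. As an illustration only, the following Python sketch assumes a widely used definition of the connectivity number, namely the sum over k in S of g(x_k) - g(x_k) g(x_{k+1}) g(x_{k+2}), with g = f for four adjacent points and g = 1 - f for eight adjacent points; this definition, the function name, and the requirement that the neighbour values be supplied in circular order x1 to x8 with x1, x3, x5, x7 as the four adjacent points are all assumptions and may differ from the formulae of the original.

```python
S = (1, 3, 5, 7)


def connectivity_number(neighbors, eight=True):
    """Connectivity number from the values f(x1)..f(x8), each 0 or 1.

    `neighbors` lists the eight adjacent points in circular order around x0.
    """
    g = list(neighbors)
    if eight:
        g = [1 - v for v in g]           # use the complemented values 1 - f(xi)
    g = [None] + g + [g[0]]              # 1-based indexing, with x9 = x1
    return sum(g[k] - g[k] * g[k + 1] * g[k + 2] for k in S)
```

Under this assumed definition, an isolated point gives 0, a point with a single neighbour at x1 gives 1, and a point with neighbours at x1 and x5 gives 2. Because each term uses three consecutive neighbours, access in circular order around x0 is the natural access pattern.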
As clearly seen from the above, the calculation of the connectivity number can be carried out with ease when the access is carried out in the anti-clockwise direction with respect to picture elements of the four or eight adjacent points. Next, the boundary tracking is explained. On extraction of the boundary of a Figure, it is usually convenient to take out the boundary as a series of picture elements arranged in a suitable order as well as a collection of boundary points. Moreover, an algorithm for successively tracking boundary points has been proposed for such purpose.
According to the algorithm, it is necessary to perform the boundary tracking by carrying out access in the anti-clockwise direction with respect to the eight adjacent points around the current picture element, starting from the previous picture element.
As a conventional example of address generation circuits for carrying out such eight neighborhood access, the circuit shown in FIG. 2 is known. Returning now to FIG. 2, in this case, reference numerals 401, 389, 403, 391 designate registers respectively capable of increment and decrement, and 385, 387, 405 are adders. Moreover, the XA register 401 and YA register 403 respectively generate two-dimensional logic addresses X, Y. Meanwhile, the XB register 389 and YB register 391 respectively generate relative distances ΔX, ΔY with respect to adjacent points. Namely, in case of generating addresses of a point, for example (1) in the eight adjacent points shown in FIG. 8, the addresses of the picture element at the center are set in the XA and YA registers, and the relative distances -1, 1 are respectively set in the XB and YB registers. Thereafter, the values of the XA and XB registers are added in the adder 385 to be a logic address of the coordinate X of the point (1). On the other hand, the values of the YA and YB registers are added by the adder 387 to be a logic address of the coordinate Y of the point (1). Subsequently, the logic address X is shifted by a shifter 407 so as not to overlap the logic address Y, then added to the logic address Y by the adder 405.
In case of accessing the eight adjacent points from the point (1), the values of the XA and YA registers are held at constant values respectively. However, the values of the XB and YB registers should be respectively changed as follows:
XB: -1, -1, -1, 0, 1, 1, 1, 0
YB: 1, 0, -1, -1, -1, 0, 1, 1
In the conventional address generating circuits, these numerical sequences are generated by loading them from a program provided inside or outside each circuit, and a suitable instruction for setting address registers is executed for each operational instruction. Therefore, the program must be very long and complicated, so that it is very difficult to develop such programs.
Accordingly, in the conventional address generation circuit based on the above-mentioned eight-adjacent-point access method, it is necessary to irregularly change the values to be set in the respective registers in the address generation circuit. Therefore, it is necessary to prepare a very complicated program for controlling this operation.
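The address generation for eight neighborhood access can be sketched in Python as follows. The function name and the `y_bits` parameter (the shift amount applied to the logic address X so that it does not overlap the logic address Y) are assumptions for illustration, and the first numerical sequence given above is taken as the relative distance in X and the second as that in Y, consistent with the values -1, 1 set for the point (1).

```python
# Relative distances for the eight adjacent points, starting from the point (1).
DX = (-1, -1, -1, 0, 1, 1, 1, 0)    # values loaded into the XB register
DY = (1, 0, -1, -1, -1, 0, 1, 1)    # values loaded into the YB register


def neighbor_addresses(xa, ya, y_bits):
    """Combined addresses of the eight points adjacent to the center (xa, ya)."""
    addresses = []
    for xb, yb in zip(DX, DY):
        x = xa + xb                          # role of adder 385
        y = ya + yb                          # role of adder 387
        addresses.append((x << y_bits) + y)  # roles of shifter 407 and adder 405
    return addresses
```

The two tables show the irregularity at issue: neither sequence can be produced by a simple increment or decrement, so a conventional program must reload the XB and YB registers explicitly for every access.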
Next, the construction, control operation and the problems thereof in the operation section of the conventional DSP for picture image signal process are explained.
As shown in FIG. 9, the parallel arithmetic operation section based on the conventional parallel execution method comprises a plurality of computing units 501 to 502 and memories corresponding to a plurality of banks so that these units 501 to 502 and memory banks are optionally connected to one another. Moreover, the connection between each operation carried out at these computing units 501 to 502 and each memory bank is controlled in parallel so as to execute a plurality of operations in parallel on the same cycle.
On the other hand, as one of the most important operations executed by the parallel arithmetic operation section, there is a condition branching operation to be carried out on executing a condition branching instruction. In such an operation, when a branching condition designated by the condition branching instruction and a flag register 503 are in a specific relation, the value contained in an instruction pointer 504 is rewritten into the address designated by the same condition branching instruction. Incidentally, the flag register 503 is rewritten by the output result of the computing unit 502, and is generally constructed with a plurality of bits. For example, the register 503 has a bit to be changed into logic 1 when the output of the computing unit 502 is 0, or a bit to be changed into logic 1 when the output of the computing unit 502 is negative.
Moreover, in the conventional parallel arithmetic operation section as shown in FIG. 9, though the plurality of computing units 501 to 502 are provided, there is only one flag register 503 belonging to the computing unit 502. Therefore, the setting of branching conditions on the condition branching is greatly limited.
For example, consider a case of operation shown by a flow chart in FIG. 10, where:
a process 1 is executed when a variable A is smaller than a constant C (A<C), and a variable B is equal to a constant D (B=D); and
a process 2 is executed in the other cases.
In such a case, the contents in an instruction memory are shown in FIG. 11.
As shown in FIG. 11, an instruction on B-D is set at the address following, by a one-step increment, the address at which an instruction on A-C is set. Moreover, an execution instruction on the process 1 is set at the address following, by a one-step increment, the address at which the instruction on B-D is set. Furthermore, an execution instruction on the process 2 is set at a branched address.
On the other hand, to realize such an operation, the instruction on A-C is executed by using the computing unit 502, and the bit of the flag register 503 is rewritten in accordance with the sign of the output of execution. Then, a control circuit 505 discriminates whether A<C is established from the bit value of the flag register 503 showing the result of A-C. If A<C is not established, the instruction pointer 504 is rewritten to the branched address at which the execution instruction on the process 2 is set, so that the condition branching is established. As a result, the process 2 is executed when the relation A<C is not established. On the contrary, if A<C is established, the instruction pointer 504 is subjected to increment by one step so that the condition branching is not established. In this case, since the instruction on B-D is set at the address to which the instruction pointer 504 is shifted by the one-step increment, this instruction is then executed by the computing unit 502.
Moreover, the bit of the flag register 503 is rewritten again in accordance with the execution result, and the control circuit 505 discriminates whether B=D is established from the bit value. If B=D is not established, the flow is branched to the branched address, and the process 2 is executed so that the condition branching is established. On the contrary, if B=D is established, the instruction pointer 504 is shifted by one step increment. As a result, the process 1 is executed so that the condition branching is not established.
Namely, by executing the above-described procedure, the process flow shown in FIG. 10 is realized.
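The sequential evaluation described above can be modelled by the following Python sketch. The function names, the dictionary used for the flag register and the returned count of compare instructions are assumptions for illustration, and the second comparison uses the constant D of the condition B=D stated for FIG. 10.

```python
def flags_of(result):
    """Model of flag register 503: a zero bit and a negative bit."""
    return {"zero": result == 0, "negative": result < 0}


def run_conventional(A, B, C, D):
    """Return (selected process, number of compare instructions executed)."""
    compares = 1
    flags = flags_of(A - C)          # first compare on computing unit 502
    if not flags["negative"]:        # A < C not established: branch taken
        return "process 2", compares
    compares += 1
    flags = flags_of(B - D)          # second compare, again on computing unit 502
    if not flags["zero"]:            # B = D not established: branch taken
        return "process 2", compares
    return "process 1", compares
```

Because only one flag register is available, the two conditions cannot be evaluated in the same cycle; the process 1 is reached only after two successive compare-and-branch steps.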
In such a manner, according to the conventional parallel arithmetic operation section, a plurality of condition branching instructions are successively executed by the one computing unit 502, and it is necessary to discriminate each time whether or not each condition is established. Therefore, while the computing unit 502 executes the condition branching instructions, it cannot be used for the original purpose of signal processing; moreover, the remaining computing units cannot execute anything in the meantime, either.
As explained above, in spite of including a plurality of computing units, the conventional parallel operation section requires an extremely long time to execute programs containing a plurality of condition branches.
By the way, as another feature of the image signal handled in the conventional image signal processing DSP, the image signal is regarded as a two-dimensional signal in the X and Y directions on the screen when the corresponding picture image is fixed, but the signal is regarded as a three-dimensional signal concerning the X, Y and time directions when the corresponding picture image is moved. Accordingly, with respect to the image signal, there are concepts on a signal at the edge of the screen and another at the other portions thereof.
Accordingly, in case of processing the image signal having such a feature, the process at the edge of the screen sometimes differs from that at an internal portion thereof. For example, in a certain picture image, the image signal process differs at respective positions, such as the first line, the first column, the final line, the final column, the left upper edge, the right upper edge, the left lower edge, the right lower edge, and the other internal portions.
Namely, in a typical processing method using the conventional DSP concerning, e.g., a screen having a size of n picture elements in the horizontal direction and m lines in the vertical direction as shown in FIG. 12, the image process is divided into:
a process A to be executed on the first line 9-1 except both edge columns;
a process B to be executed on the first column 9-2 except both edge lines;
a process C to be executed on the m-th line 9-3 except both edge columns;
a process D to be executed on the n-th column 9-4 except both edge lines;
a process E to be executed on the first line and the first column 9-5;
a process F to be executed on the first line and the n-th column 9-6;
a process G to be executed on the m-th line and the first column 9-7;
a process H to be executed on the m-th line and the n-th column 9-8; and
a process I to be executed on the internal part 9-9 except the above portions.
Namely, in this case, the image process over all of the screen is carried out by executing:
(1) the process E once;
(2) the process A (n-2) times;
(3) the process F once;
(4) a cycle, (m-2) times, of executing:
the process B once,
the process I (n-2) times, and
the process D once;
(5) the process G once;
(6) the process C (n-2) times; and
(7) the process H once.
Accordingly, in the conventional DSP, even in case of such a simple picture-image process, it is necessary to exactly count the number of times of execution and to correctly control the order.
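The counting involved can be illustrated by the following Python sketch, which builds the execution order for the screen of FIG. 12 (E, A (n-2) times, F; then (m-2) lines of B, I (n-2) times, D; then G, C (n-2) times, H) and, separately, classifies a picture element directly from its position. The function names and the zero-based coordinates are assumptions for illustration.

```python
def border_schedule(n, m):
    """Process letters in raster order for n columns and m lines (n, m >= 2)."""
    seq = ["E"] + ["A"] * (n - 2) + ["F"]              # first line
    for _ in range(m - 2):                             # internal lines
        seq += ["B"] + ["I"] * (n - 2) + ["D"]
    seq += ["G"] + ["C"] * (n - 2) + ["H"]             # m-th line
    return seq


def classify(x, y, n, m):
    """Process letter for the picture element at column x, line y (0-based)."""
    table = [["E", "A", "F"],
             ["B", "I", "D"],
             ["G", "C", "H"]]
    row = 0 if y == 0 else 2 if y == m - 1 else 1
    col = 0 if x == 0 else 2 if x == n - 1 else 1
    return table[row][col]
```

The two functions agree for every picture element, but the first requires the counts and the order to be fixed in advance, whereas the second needs only a comparison of the current address with the screen limits.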
Moreover, there is often another image process method in which the screen is divided as shown in FIG. 13, and the image process is changed in each divided area. In such a case, the process is usually divided into:
a process A to be executed on an area 10-1;
a process B to be executed on an area 10-2;
a process C to be executed on an area 10-3; and
a process D to be executed on an area 10-4.
In this case, the image process over all of the screen is carried out by executing:
a cycle, j times, of executing:
a cycle, (m-j) times, of executing:
In order to realize such a process, there is a known method in which a loop counter circuit is prepared, and the number of times of the process and its order are set in advance, so as to directly execute the above-described procedure.
However, in such a method, it is necessary to correctly obtain in advance the order of the processes and the number of times each is executed. Therefore, when the procedure is complex, it takes an extremely long time to prepare the procedure. Moreover, such a method requires a complicated loop counter circuit, so that the amount of hardware is greatly increased.
Moreover, as another method of realizing the above-mentioned procedure, there is a method using conditional jump technology in which the jump condition is defined as a comparison result between each address on the image memory generated from the DSP and a predetermined set value. Usually, the DSP for processing picture-image signals comprises an address generation unit (AGU) for calculating each address on an image memory in which picture-image signals are contained, and an execution unit (EU) for carrying out an arithmetic operation on each picture-image signal read out from the image memory. This construction is reported in the literature "A. Kanuma et al., `A 200 MHz 32 b Pipelined CMOS Image Processor`, Digest of ISSCC, pp. 102, Feb. 1986" and "T. Murakami et al., `A DSP Architectural Design for Low Bit-Rate Motion Video Codec`, IEEE, Transactions on CAS, Vol. 36, No. 10, Oct. 1989."
In the DSP having such construction for processing picture-image signals, the process method based on the conditional jump technology, in case of FIG. 13 for example, is realized by the steps of:
comparing set values i, j, m, n respectively set on the first stage with current addresses (x, y) generated from the AGU by a computing unit in the EU every time the addresses are generated from AGU, then
executing, based on the comparison result,
the process A when (x address ≤ j) and (y address ≤ i);
the process B when (x address ≤ j) and (y address > i);
the process C when (x address > j) and (y address ≤ i); or
the process D when (x address > j) and (y address > i).
However, in such a method, it is necessary to carry out the comparison between these set values and the addresses from the AGU on every address generation by the computing unit in the EU for executing the conditional jump. Therefore, in the period when the comparison process is executed, it is not possible to process data directly concerning the picture image, so that the process efficiency is drastically degraded.
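The conditional jump method and its cost can be sketched as follows. This is a minimal illustration, not the construction of any particular DSP: the names `select_process` and `process_screen` are hypothetical, addresses are assumed to be 1-based, and the charge of two comparisons per generated address (one for x, one for y) is an assumption made only to show that the comparison work grows with the number of picture elements.

```python
def select_process(x, y, i, j):
    """Choose the process for address (x, y) from the set values i and j."""
    if x <= j and y <= i:
        return "A"
    if x <= j and y > i:
        return "B"
    if x > j and y <= i:
        return "C"
    return "D"


def process_screen(n, m, i, j):
    """Scan an n-by-m screen, counting executions and comparisons."""
    counts = {"A": 0, "B": 0, "C": 0, "D": 0}
    comparisons = 0
    for y in range(1, m + 1):
        for x in range(1, n + 1):
            # Assumption: the EU spends one comparison on x and one on y
            # for every address generated by the AGU.
            comparisons += 2
            counts[select_process(x, y, i, j)] += 1
    return counts, comparisons


if __name__ == "__main__":
    print(process_screen(n=4, m=3, i=1, j=2))
```

In this model the comparison count is proportional to n times m, regardless of how few boundaries the screen actually contains.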
As explained above, in the DSP for processing picture-image signals, when executing an image process in which the process method is changed in accordance with positions on the screen, the procedure is very complicated, so that preparing the procedure requires a great deal of effort. Moreover, it takes much time to execute or control the procedure, so that the process efficiency of the original image process is drastically lowered.
Moreover, as still another feature of the conventional DSP for processing picture-image signals, the operation section is so constructed as to carry out arithmetic operations for obtaining sums of products.
FIG. 14 shows a typical example of a circuit for the arithmetic operation on sums of products. As shown in the same drawing, this circuit comprises a multiplier 601, an adder 602 and registers 603 to 606.
In case of carrying out the arithmetic operation for obtaining a sum of products, multiplications are carried out at the multiplier 601, then the multiplication results are accumulated at the adder 602 so as to obtain a sum of products Σ ai·bi. Incidentally, in FIG. 14, the register 605 serves as a pipeline between the multiplier 601 and the adder 602.
Generally, when the input bit width of the multiplier 601 is m, the bit number of the multiplication result becomes 2m. Accordingly, to keep the accuracy of the multiplier 601 as it is, the input bit width of the adder 602 is also set at 2m bits.
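The relation between the input width m and the product width 2m can be checked with a small sketch. It assumes unsigned m-bit inputs, and the function name `sum_of_products` is hypothetical. Python integers do not overflow, so growth of the accumulated value is not modeled here; the sketch only confirms that each individual product fits in 2m bits.

```python
def sum_of_products(a, b, m=8):
    """Accumulate ai*bi for unsigned m-bit inputs (multiplier, then adder)."""
    max_input = (1 << m) - 1
    accumulator = 0
    for ai, bi in zip(a, b):
        if not (0 <= ai <= max_input and 0 <= bi <= max_input):
            raise ValueError("inputs must fit in m unsigned bits")
        product = ai * bi                       # multiplier stage
        assert product.bit_length() <= 2 * m    # a product needs at most 2m bits
        accumulator += product                  # adder stage
    return accumulator


if __name__ == "__main__":
    print(sum_of_products([1, 2, 3], [4, 5, 6]))
    print((255 * 255).bit_length())  # widest 8-bit by 8-bit product
```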
FIG. 15 shows the construction of an arithmetic operation section which is generally used for the DSP. The operation section comprises a multiplier 621, an adder 622 and registers 623, 624. Moreover, the multiplier 621 and the adder 622 are independently connected to input buses 625, 626 and an output bus 627 so as to operate these two units independently. In this case, when the bit width of these buses 625 to 627 is m, the bit number of the multiplication result is 2m, as in the case shown in FIG. 14.
Incidentally, when the multiplier 621 and the adder 622 are connected in series to perform the calculation of sums of products, to keep the operational accuracy of the multiplier 621, the input bit width of the adder 622 is also set at 2m. However, when the adder 622 is operated independently, since the bit width of the buses 625 to 627 is m, it is enough to set the input bit width of the adder 622 at m. Accordingly, when the input bit width of the adder 622 is set at 2m on the assumption of the operation for calculating sums of products, the remaining m bits are unused in operations other than the calculation of sums of products.
On the other hand, in case of the image process, calculation of absolute values of respective differences is often carried out for the purpose of motion vector detection, pattern matching or the like. FIGS. 16a to 16c respectively show typical examples of methods for the calculation of absolute values of differences.
In these cases, assume that input values a, b are used to obtain |a-b|.
First, in a method shown in FIG. 16a, the subtraction a-b is calculated at a subtracter 641 in the first stage. If the subtraction result is negative, the sign of the subtraction result is inverted at a sign inversion circuit 642. Then, the inverted value is selected by and outputted from a selector 643.
Incidentally, since the sign inversion circuit 642 inverts all bits and then adds 1 to the inverted value, the process requires almost the same time as an addition. Accordingly, in such a method, it is very difficult to achieve a high-speed operation.
On the other hand, in a method shown in FIG. 16b, the subtraction a-b is calculated at a subtracter 644. If the subtraction result is negative, the subtrahend and the minuend respectively kept in registers 645, 646 are exchanged. Then, the difference b-a is calculated at the same subtracter 644. Namely, according to this method, 2 clocks are required when the first subtraction result is negative. Accordingly, the number of clocks required varies with the operational result. Therefore, in this method also, it is very difficult to raise the operational speed.
Moreover, in a method shown in FIG. 16c, the subtraction a-b and b-a are respectively calculated by two subtracters 647, 648 in parallel. Then, the positive one is selected from the subtraction results by a selector 649, thereafter the selected value is outputted. Accordingly, it becomes possible to realize a high-speed operation by this method. However, in such a method, the circuit scale must be twice as large as usual.
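The three methods of FIGS. 16a to 16c can be compared with a behavioral sketch. The function names are hypothetical, the inputs are assumed to be unsigned values of the given bit width, and the clock count returned for the second method is a simplified model of the description above rather than a timing claim about any real circuit.

```python
def absdiff_invert(a, b, bits=8):
    """FIG. 16a style: subtract, then invert all bits and add 1 if negative."""
    mask = (1 << (bits + 1)) - 1           # one extra bit carries the sign
    diff = (a - b) & mask
    negative = (diff >> bits) & 1
    if negative:
        diff = ((~diff) + 1) & mask        # sign inversion costs an extra add
    return diff


def absdiff_swap(a, b):
    """FIG. 16b style: reuse one subtracter; returns (result, clocks)."""
    diff, clocks = a - b, 1
    if diff < 0:
        a, b = b, a                        # exchange minuend and subtrahend
        diff, clocks = a - b, 2            # second pass through the subtracter
    return diff, clocks


def absdiff_parallel(a, b):
    """FIG. 16c style: two subtracters in parallel, selector picks one."""
    first, second = a - b, b - a
    return first if first >= 0 else second


if __name__ == "__main__":
    print(absdiff_invert(3, 10), absdiff_swap(3, 10), absdiff_parallel(3, 10))
```

All three give the same value; they differ in the extra addition, the data-dependent clock count, and the duplicated subtracter respectively.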
Accordingly, in the operation section of the conventional DSP, when the input bit width is set so as to keep the accuracy of the multiplier in the calculation of sums of products, about a half of the circuit construction remains unused in operations other than the calculation of sums of products. Likewise, when the adder is constructed so as to perform high-speed operations for obtaining absolute values of differences, about a half of the circuit remains unused in other calculations.
Next, an overflow process circuit for performing an overflow process peculiar to the picture-image process in the operation section of the conventional DSP for picture-image signal process is explained.
Generally, a conventional overflow process circuit can perform the overflow process on single additions and subtractions, on accumulative additions, and on accumulative subtractions in which the overflowing number is the minuend.
Overflow process methods used by the overflow process circuit include a method of extending the bit number by one bit in the higher direction, and another method of extending the bit number sufficiently. In the former method, an arithmetic operation is carried out after extending the bit number by one bit in the higher direction. Then, the overflow is detected to be positive when the highest two bits of the operational result are 01, negative when 10, and is not detected when 11 or 00. When the overflow is detected, the output is replaced by the maximum code of the same polarity as the original operation result.
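The one-bit extension method can be sketched for a single addition as follows. The sketch assumes n-bit two's complement operands supplied as ordinary Python integers within the n-bit range; the function name `saturating_add` is hypothetical.

```python
def saturating_add(a, b, n=8):
    """Add two n-bit signed values using one extra high-order bit."""
    extended_mask = (1 << (n + 1)) - 1
    total = (a + b) & extended_mask        # (n+1)-bit two's complement pattern
    top_two = total >> (n - 1)             # highest two bits of the result
    if top_two == 0b01:
        return (1 << (n - 1)) - 1          # positive overflow: positive maximum
    if top_two == 0b10:
        return -(1 << (n - 1))             # negative overflow: negative maximum
    low = total & ((1 << n) - 1)           # 00 or 11: no overflow, keep n bits
    return low - (1 << n) if low >> (n - 1) else low


if __name__ == "__main__":
    print(saturating_add(100, 100))    # clamps at the positive maximum
    print(saturating_add(-100, -100))  # clamps at the negative maximum
    print(saturating_add(-50, 20))     # no overflow
```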
However, in an accumulative operation, applying this correction at every step produces an error when the final result does not actually overflow. The reason is that an accumulative operation overflows only when the number of times the positive maximum value is exceeded differs from the number of times the negative maximum value is exceeded. Therefore, the correction is carried out only when an overflow has really occurred, that is, when these two counts differ.
FIG. 17 shows a circuit for performing the overflow process based on the extension by one bit in the higher direction. Moreover, a procedure of overflow process in an arithmetic operation on two n-bit values is explained with reference to the same drawing.
First, data a32 and a33 selected by a selector 660 are each extended by one bit in the higher direction at a bit extension circuit 661. Then, the extended data are processed by an adder 662, and the operation result and the result obtained by clamping the operation result at an overflow process circuit 664 (which is explained in detail hereinafter) are inputted to a selector 665. At the same time, the highest one bit of the operation result from the adder 662 is deleted at a bit deleting circuit 666, then the operation result is held in a register 667. Here, clamping means a process of replacing data by the maximum value or the minimum value which can be expressed by the bit length of the data. Incidentally, neither of the data a32, a33 has overflowed.
Then, in the second and subsequent operations, the same process is carried out by using data a31 and a33.
The necessity of the clamping is decided by a signal 663. When the clamping is necessary, the operation result clamped at the overflow process circuit 664 is selected by and outputted through the selector 665. While, when not necessary, the operation result from the adder 662 is selected by and outputted through the selector 665.
Next, the overflow process circuit 664 is explained in detail with reference to FIG. 18. Incidentally, the overflow process circuit 664 performs judgement of overflow from the highest two bits in the operation result.
In the same drawing, when an operation result is inputted to the overflow process circuit 664, an overflow judgement circuit 680 judges that the overflow is positive when the highest two bits of the inputted operation result are 01, negative when 10, or not existent when 00 or 11. When the overflow is positive, a register 681 in which 1 is held in advance is selected by a selector 684. On the other hand, when negative, a register 682 in which -1 is held in advance is selected. Moreover, when not existent, a register 683 in which 0 is held is selected. Then, the selected value and a value held in a register 685 are added at an adder 686. Incidentally, in case of an accumulative operation, the addition is further repeated.
By the repetition of the addition, the value in the register 685 is counted up in case of the positive overflow, and counted down in case of the negative overflow. Namely, the register 685 holds the difference between the number of times of the positive overflow and that of the negative overflow. Accordingly, whether the overflow of the accumulative operation is positive or negative can be judged from the sign of the content of the register 685. Hereinafter, the register 685 is called the overflow counter.
Moreover, the output of the adder 686 is inputted to a clamp judgement circuit 687. When the output is positive, a register 688 in which a positive maximum value is held is selected by and outputted from a selector 691. On the other hand, when negative, a register 689 in which a negative maximum value is held is selected and outputted. Moreover, when 0, an input value 690 of the overflow process circuit 664 is selected and outputted.
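The accumulative overflow process of FIGS. 17 and 18 can be modeled behaviorally as shown below. This is a sketch under assumptions: the function names are hypothetical, the inputs are taken to be n-bit two's complement values given as Python integers, and register transfers and timing are not modeled. The accumulator keeps only n bits, as after the bit deleting circuit, while a separate overflow counter records the net number of overflows and decides the clamping.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def accumulate_with_overflow_counter(values, n=8):
    """Accumulate n-bit values; return (outputs per step, final counter)."""
    positive_max = (1 << (n - 1)) - 1
    negative_max = -(1 << (n - 1))
    accumulator, counter, outputs = 0, 0, []
    for value in values:
        total = (accumulator + value) & ((1 << (n + 1)) - 1)  # one-bit extension
        top_two = total >> (n - 1)
        counter += {0b01: 1, 0b10: -1}.get(top_two, 0)        # +1, -1 or 0
        accumulator = to_signed(total, n)                     # drop the top bit
        if counter > 0:
            outputs.append(positive_max)    # net positive overflow: clamp high
        elif counter < 0:
            outputs.append(negative_max)    # net negative overflow: clamp low
        else:
            outputs.append(accumulator)     # no net overflow: pass through
    return outputs, counter


if __name__ == "__main__":
    print(accumulate_with_overflow_counter([100, 100, -100]))
```

In the example, the intermediate sum exceeds the positive maximum and is clamped, but the following negative input brings the counter back to 0, so the final output is the true sum.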
However, such an overflow process is premised on a succession of additions. Thus, in case of a subtraction using an overflowing value as a subtrahend, though the judgement on the sign of the overflow is completed, it is impossible to correctly carry out the process because the sign of the number of times of the overflow in the register 685 is not inverted.
Moreover, as the overflow counter value to be held in the register 685, it is impossible to select such a value that is always positive or 0 during the process. Therefore, it is not possible to correctly carry out an accumulative overflow process on absolute value operations.
However, in the preprocess such as filtering, affine transformation or fast Fourier transformation, or in the postprocess such as image recognition, histogram process or labelling, it is necessary to carry out an operation in which an overflowing value is used as a subtrahend, or an absolute value operation. Accordingly, such an overflow process circuit that can perform such operations has been desired so far.
As stated above, in the overflow process circuit for carrying out the process based on the one-bit extension in the higher direction, it is not possible to correctly carry out subtractions in which an overflowing value is used as a subtrahend, or absolute-value operations on an overflowing value.
On the other hand, in the conventional DSP for processing picture-image signals, a plurality of registers are incorporated to perform a high-speed process. Moreover, with the increase in LSI integration density, it has become possible to incorporate a register file containing several to several tens of registers. In such a register file, various addressing modes like those of a memory are provided as means of designating registers. Typical examples of such addressing modes are direct addressing and indirect addressing.
In the direct addressing mode, a specific field in an instruction directly designates an address of the register file. While, in the indirect addressing mode, a data value at an address designated in an instruction is used as an execution address. In a DSP adopting the Harvard architecture, since data memories and instruction memories are separately provided, an exclusive or general register not connected to a data bus is designated as an address register in an instruction, and the content designates an address of the register file.
Next, a method of selecting the addressing mode in the conventional DSP is explained with reference to FIG. 19.
The direct addressing mode or the indirect addressing mode is selected in accordance with an addressing designation field 943 designated in an instruction register 941. If the direct addressing mode is selected, an address signal 955 outputted from an address field 945 is selected in accordance with an addressing designation signal 953 outputted from the addressing designation field 943. If the indirect addressing mode is selected, an address signal 957 outputted from an exclusive or general address register 947 is selected. In accordance with the selected address signal, a selector 951 designates a register in a register file 949 corresponding to the address. Incidentally, the register file 949 exchanges data with a data bus 913.
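The selection between the two addressing modes can be sketched as follows. The dictionary keys `mode` and `address_field` and the function name `select_register` are hypothetical stand-ins for the addressing designation field, the address field and the selector; they do not represent an actual instruction format.

```python
def select_register(instruction, address_register, register_file):
    """Return the register contents chosen by the addressing mode."""
    if instruction["mode"] == "direct":
        address = instruction["address_field"]   # address taken from the instruction
    elif instruction["mode"] == "indirect":
        address = address_register               # address taken from the address register
    else:
        raise ValueError("only direct and indirect modes exist in this model")
    return register_file[address]


if __name__ == "__main__":
    register_file = [10, 20, 30, 40]
    print(select_register({"mode": "direct", "address_field": 2}, 0, register_file))
    print(select_register({"mode": "indirect", "address_field": 2}, 3, register_file))
```

Note that in neither branch is the address derived from a value arriving on the data bus, which is the limitation discussed in the text.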
Thus, in such a conventional DSP, the register file 949 has only two addressing modes, that is, the direct and indirect addressing modes. Therefore, it is not possible to designate contents of a specific address in the register file 949 memorizing data on the data bus 913. Even where it is possible, the execution time becomes markedly long.
Namely, in the direct addressing mode, an address value is designated in an instruction, and it is not possible to designate contents of a register having a designated address. Therefore, it is impossible to designate contents of a register as an address. Moreover, in the conventional DSP adopting the Harvard architecture, even in the indirect addressing mode, it is also impossible to designate the contents of data as an address when an exclusive register not connected to the data bus 913 is used as the address register 947. Also, in case that a general register is used as the address register 947, since data in the register must first be transferred to the address register 947, the execution time becomes extremely long.
On the other hand, as one application of the picture-image processing technology, there is a histogram process on the density distribution of the picture image. Next, this process is explained with reference to FIGS. 20 and 21.
In FIG. 20a, an address to be accessed (designated by an arrow of a solid line) is generated from an AGU (address generation unit) 923 and sent through an address bus 935 to an external memory (hereinafter called picture-image memory) 933 in which picture-image data to be processed are memorized. Then, data (designated by an arrow of a dotted line) read from the accessed picture-image memory 933 and showing the density of the picture image are inputted to the AGU 923 through a data bus 913. Thereafter, as shown in FIG. 20b, an address to be accessed is generated from the AGU 923 on the basis of the data and given to an external memory (hereinafter called histogram memory) 959 for preparing a histogram. Then, as shown in FIG. 21a, a value in the histogram memory 959 corresponding to the address given from the AGU 923 is transferred to an EU (execution unit) 961. Thereafter, as shown in FIG. 21b, the transferred histogram data is incremented by 1 in the EU 961, then the incremented value is written at the same address on the histogram memory 959.
Incidentally, in the respective processes shown in FIGS. 20b to 21b, since the same address of the histogram memory 959 is accessed, it is impossible to carry out the pipeline operation in the construction shown in the same drawings. Therefore, the number of steps required for the process is considerably increased. Moreover, since the histogram memory 959 is an external memory in the conventional histogram process, the access time is very restricted. Furthermore, it is necessary to carry out the increment of the histogram data by using the EU 961.
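The conventional histogram procedure of FIGS. 20a to 21b can be sketched as a strictly sequential loop. The function name `conventional_histogram` is hypothetical, the memories are modeled as Python lists, and counting one step per figure is an assumption made only to show that four non-overlapped steps are spent on every picture element.

```python
def conventional_histogram(image_memory, levels):
    """Build a density histogram; return (histogram memory, step count)."""
    histogram_memory = [0] * levels
    steps = 0
    for image_address in range(len(image_memory)):
        density = image_memory[image_address]        # FIG. 20a: read density
        steps += 1
        histogram_address = density                  # FIG. 20b: density becomes address
        steps += 1
        value = histogram_memory[histogram_address]  # FIG. 21a: transfer value to EU
        steps += 1
        histogram_memory[histogram_address] = value + 1  # FIG. 21b: increment, write back
        steps += 1
    return histogram_memory, steps


if __name__ == "__main__":
    print(conventional_histogram([0, 1, 1, 3], levels=4))
```

Because the read in FIG. 21a and the write in FIG. 21b target the same histogram address, the steps for one picture element cannot be overlapped with each other in this model.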
Accordingly, in the conventional register file, since it is not possible to designate data memorized in the external memory as an address, it takes considerable time to access the register file. Moreover, since it is not possible to carry out the pipeline operation in the histogram process, the number of process steps is increased, and since no internal memory can be used for the process, the access time is very restricted. Furthermore, it is necessary to carry out the data increment after the data is first transferred to the EU. Thus, the execution time of the histogram process becomes very long.