Demand for processing of video signals, audio signals, and the like by a digital signal processor (hereinafter referred to as DSP) is increasing, in particular because multimedia data such as video and audio is increasingly handled by computers, mobile terminals, digital audio equipment, and the like. Further, the volume of the data to be processed is increasing, and the content of the data processing is becoming more and more sophisticated. Furthermore, in order not to spoil the real-time property of the operations, such data processing must be executed in a short time.
Therefore, many recent DSPs are of a type that does not use exclusive-use hardware but instead uses an exclusive co-processor, connected to a mounted co-processor core, for executing such data processing.
Normally, a large number of processor cores are mounted on an exclusive LSI containing such a DSP, so that the circuit scale of each processor core greatly influences the entire circuit scale. Reductions in device size, power consumption, cost, and the like are required at all times. What is required, therefore, is a DSP that achieves sufficient processing capacity for signal processing use while the circuit scale of each processor core is reduced as much as possible.
Normally, in signal processing by a DSP, it is known that the time spent executing loop processing, in which the same command sequence is performed repeatedly, occupies an extremely large proportion of the entire execution time. However, the branch command required for loop processing incurs, under pipeline processing, a delay caused by invalidation of commands already in the pipeline, that is, the so-called branch penalty. Thus, the accumulated branch penalty becomes greater as the number of loop iterations increases, which deteriorates the processing speed.
This will be described in more detail. FIG. 7 is an explanatory chart showing the structure of a normal DSP 801. The DSP 801 has a structure (the so-called Harvard architecture) in which a command memory (program memory) and a data memory are separated, and is constituted with a program control circuit 810, a computing unit 820, a register 830, and a data memory 840. The program control circuit 810 includes a command memory address generation circuit 811, a command memory 812, and a command decoder 813.
A command code stored at the corresponding address on the command memory 812 is read out and sent to the command decoder 813 according to the command address generated by the command memory address generation circuit 811. The command decoder 813 decodes the command code, and generates a control signal for controlling the computing unit 820 and the register 830. The computing unit 820 and the register 830 process the data stored in the data memory 840 with the control signal according to the command.
FIG. 8 is an explanatory chart showing the processing executed in the signal processing by the DSP 801 shown in FIG. 7. A command sequence 900 in which the commands to be processed by the DSP 801 are written is stored in the command memory 812. The command sequence 900 is constituted with individual commands 900a to 900s. Among those, the command 900n is the “loop processing branch command” in which a branch condition is written, and the command 900f is the “branch destination command” to which the processing returns when the branch condition is satisfied.
The DSP 801 executes the command sequence 900 in order from the command 900a. At the command 900n, which is the loop processing branch command, the DSP 801 judges whether or not the written branch condition is satisfied. When the condition is judged to be unsatisfied, the processing proceeds to the command 900o. When the condition is judged to be satisfied, the processing returns to the command 900f. That is, the DSP 801 repeatedly executes the processing of the commands 900f to 900m as long as the branch condition is satisfied at the command 900n, and advances to the processing of the command 900o and thereafter when the condition becomes unsatisfied. In that sense, the commands 900f to 900n are generally referred to as the “loop command sequence”.
FIG. 9 is an explanatory chart showing in more detail the structure of each of the commands 900a to 900s contained in the command sequence 900 shown in FIG. 8. Each of the commands 900a to 900s is processed in a four-stage pipeline of a command fetch (IF), a command decode (DE), an operation (OP), and a write back (WB). When the branch condition is satisfied, the DSP 801 which processes the command sequence 900 invalidates the two cycles of commands that follow the command 900n, which is the loop processing branch command, i.e., the commands 900o and 900p. This is the branch penalty. The DSP 801 returns to the command 900f after the two-cycle branch penalty is generated, and continues the processing.
FIG. 10 is an explanatory chart showing the execution order of the command sequence 900 in a case where the branch penalty shown in FIG. 9 is generated in the DSP 801 shown in FIG. 7. As shown in FIG. 10, the state where the commands 900o and 900p are invalidated is repeated until the branch condition becomes unsatisfied. That is, the processing delay caused by the branch penalty increases with the number of loop iterations executed until the branch condition becomes unsatisfied. In FIG. 10, the branch penalty section is expressed as “nop (no operation)”.
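The growth of the delay with the number of loop iterations can be illustrated by a simple cycle count. The following is only a sketch under stated assumptions: one command is issued per cycle, pipeline fill and drain are ignored, and the function name `loop_cycles` and the iteration count of 100 are hypothetical values chosen for illustration, not figures taken from FIG. 9 or FIG. 10.

```python
def loop_cycles(body_len, iterations, penalty=2):
    """Return (useful_cycles, penalty_cycles) for a loop closed by a branch.

    body_len   -- commands per pass, including the loop processing branch command
    iterations -- number of passes through the loop command sequence
    penalty    -- commands invalidated each time the branch is taken
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    useful = body_len * iterations
    # The branch is taken on every pass except the last one, where the
    # branch condition becomes unsatisfied and execution falls through.
    wasted = penalty * (iterations - 1)
    return useful, wasted


# Commands 900f to 900n form a loop command sequence of 9 commands.
# The iteration count of 100 is an assumed example value.
useful, wasted = loop_cycles(body_len=9, iterations=100)
overhead = wasted / (useful + wasted)
```

Under these assumptions, 198 of 1,098 cycles, roughly 18 percent, are spent on invalidated commands, and the proportion becomes larger as the loop command sequence becomes shorter.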
The following technical documents are related thereto. Among those, Patent Document 1 discloses a processor which prevents generation of the branch penalty by a method called branch prediction that will be described later. Patent Document 2 discloses a processor which prevents generation of the branch penalty by delay branch that will be described later. Patent Document 3 discloses a processor which prevents generation of the branch penalty by using a “previously fetched address” and a “to-be-fetched address”.
Patent Document 4 discloses a processor which adds an inputted relative branch destination address to a program counter value, replaces it with an absolute branch destination address, and outputs the replaced branch command. Patent Document 5 discloses a processor which prevents generation of the branch penalty by using an executable condition that shows whether or not the branch command is to be executed.
Patent Document 6 also discloses a processor which prevents generation of the branch penalty by delay branch that will be described later as in the case of Patent Document 2. Patent Documents 7 and 8 disclose a processor which prevents generation of the branch penalty by a hardware loop that will be described later.
Patent Document 1: Japanese Unexamined Patent Publication 2002-259118
Patent Document 2: Japanese Unexamined Patent Publication 2004-013255
Patent Document 3: Japanese Unexamined Patent Publication 2004-030137
Patent Document 4: Japanese Unexamined Patent Publication 2008-165589
Patent Document 5: Japanese Unexamined Patent Publication 2009-053861
Patent Document 6: Japanese Unexamined Patent Publication Hei 01-256033
Patent Document 7: Japanese Patent No. 3656587
Patent Document 8: Japanese Patent No. 3739357
Various methods have been proposed for reducing the branch penalty in the DSP. Among those methods, “delay branch”, “hardware loop”, and “branch prediction” will each be described.
The delay branch is the method described in Patent Documents 2, 6, and the like mentioned above. This method substantially reduces the branch penalty by executing an operation command that is to be performed within the loop processing during a period called a delay slot, which lies between the execution of the branch command and the actual branching. However, with this method, the effect of reducing the branch penalty cannot be achieved in a case where no operation command can be suitably allotted to the delay slot.
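The allotment of operation commands to the delay slot can be sketched as follows. This is a simplified model and not the scheduling method of Patent Document 2 or 6: it assumes two delay slots, and it assumes that a command may be moved behind the branch command only if the branch condition does not depend on its result. The function name `fill_delay_slots` and the command labels are hypothetical.

```python
def fill_delay_slots(body, branch, depends_on, slots=2):
    """Reorder a loop so trailing independent commands sit in the delay slots.

    body       -- commands of the loop in program order, branch excluded
    branch     -- the loop processing branch command
    depends_on -- commands whose results the branch condition needs
    slots      -- number of delay slots after the branch command
    """
    body = list(body)
    moved = []
    # Take commands from the end of the body while they are independent of
    # the branch condition; stop at the first one the branch depends on.
    while body and len(moved) < slots and body[-1] not in depends_on:
        moved.insert(0, body.pop())
    # Slots that could not be filled remain as wasted cycles.
    padding = ["nop"] * (slots - len(moved))
    return body + [branch] + moved + padding


# Both slots filled: no cycle is wasted.
filled = fill_delay_slots(["f", "g", "h"], "br", depends_on={"f"})
# The last command feeds the branch condition: both slots stay empty.
unfilled = fill_delay_slots(["f", "g", "h"], "br", depends_on={"h"})
```

The second call shows the limitation noted above: when the command immediately preceding the branch determines the branch condition, nothing can be moved and the full penalty remains.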
The hardware loop is the method described in Patent Documents 7, 8, and the like mentioned above. This method performs the loop branch processing and the judging processing by means of hardware. In the processors described in Patent Documents 7 and 8, exclusive circuits for holding the loop start address and the loop end address and for counting the number of repeated executions of the loop processing are mounted, thereby achieving the loop processing by means of hardware.
FIG. 11 is an explanatory chart showing the structure of the command memory address generation circuit 811 in a case where the loop processing is performed by the hardware loop in the DSP 801 shown in FIG. 7. The command memory address generation circuit 811 includes a loop start address save circuit 811a and a loop end address save circuit 811b, which save the start address and the end address of the loop processing, respectively. The command memory address generation circuit 811 further includes a loop end detection circuit 811c, which detects whether or not the end of the loop processing has been reached, and a loop number counter circuit 811d, which counts the number of repeated executions of the loop processing.
By employing such a structure, the DSP 801 can perform the loop processing by means of hardware: the loop end detection circuit 811c compares the program counter 811e with the loop processing end address saved in the loop end address save circuit 811b, and when the two match, the loop processing start address saved in the loop start address save circuit 811a is outputted to the program counter 811e. The loop number counter circuit 811d counts the number of executions of the loop processing, and the processing advances to the following command when the count reaches the given number of times.
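The behavior of the command memory address generation circuit 811 under the hardware loop can be modeled in software as follows. This is only a behavioral sketch, not the circuit of FIG. 11: it assumes that command addresses advance by one per command, that the repeat count denotes the total number of passes through the loop, and the class name, method names, and addresses are hypothetical.

```python
class HardwareLoopAddressGenerator:
    """Behavioral model of next-address selection with a hardware loop."""

    def __init__(self, loop_start, loop_end, repeat_count):
        self.loop_start = loop_start   # loop start address save circuit 811a
        self.loop_end = loop_end       # loop end address save circuit 811b
        self.remaining = repeat_count  # loop number counter circuit 811d

    def next_address(self, pc):
        # Loop end detection circuit 811c: compare the program counter
        # with the saved loop end address.
        if pc == self.loop_end and self.remaining > 1:
            self.remaining -= 1
            return self.loop_start     # reload the program counter
        return pc + 1                  # ordinary sequential fetch


def trace(first, last, generator):
    """Return the sequence of fetched addresses from first through last."""
    pc, fetched = first, []
    while pc <= last:
        fetched.append(pc)
        pc = generator.next_address(pc)
    return fetched


# Loop over addresses 2 to 4, executed three times (assumed example values).
addresses = trace(0, 6, HardwareLoopAddressGenerator(2, 4, 3))
```

In this model the fetched address sequence contains no invalidated command between passes, because the return to the loop start address is selected at fetch time rather than after a branch command has been decoded and executed.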
With this method, it is not necessary to allot an operation command to the delay slot. Thus, the loop processing can always be performed without generating the branch penalty. However, the circuit scales of the loop start address save circuit 811a, the loop end address save circuit 811b, and the loop end detection circuit 811c, which are the exclusive circuits added for the loop processing, are relatively large, so that mounting them increases the circuit scale of the processor core.
For example, in a case where the address width of the command memory is 32 bits, a register capable of storing 32-bit data is required for each of the loop start address save circuit 811a, the loop end address save circuit 811b, and the loop end detection circuit 811c. Further, calculations on 32-bit data are also required. As a result, the circuit scale is increased.
Further, with the hardware loop circuit provided with the loop number counter circuit 811d, the repeat number is a fixed number given in advance so that it is difficult to change the repeat number flexibly according to the state of the data being executed.
Branch prediction is the method described in Patent Document 1 and the like mentioned above. This method saves the branch origin address and the branch destination address and, when the command address matches the branch origin address and it is predicted that the branch will be taken, uses the branch destination address as the address from which the next command code is read out. This method can be expected to improve the processing performance further than the above-described hardware loop in that it can reduce the branch penalty not only for the loop processing but also for various other kinds of branching.
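The table lookup underlying such branch prediction can be sketched as follows. This is a minimal illustration and not the mechanism of Patent Document 1: the prediction policy assumed here is simply that a branch behaves as it did on its previous execution, command addresses are assumed to advance by one, and the class name, method names, and addresses are hypothetical.

```python
class BranchTargetTable:
    """Maps a branch origin address to its destination and last outcome."""

    def __init__(self):
        # branch origin address -> (branch destination address, taken last time)
        self.entries = {}

    def predict(self, pc):
        """Return the address from which the next command code is read out."""
        entry = self.entries.get(pc)
        if entry is not None and entry[1]:
            return entry[0]            # predicted taken: fetch the destination
        return pc + 1                  # no entry or predicted not taken

    def update(self, pc, taken, destination):
        """Record the actual outcome once the branch has been resolved."""
        self.entries[pc] = (destination, taken)


table = BranchTargetTable()
first = table.predict(13)              # unknown branch: sequential fetch
table.update(13, taken=True, destination=5)
second = table.predict(13)             # now predicted taken
```

Even this reduced model needs storage for an origin address and a destination address per entry in addition to the prediction state, which indicates where the circuit scale of the method arises.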
However, for performing the branch prediction, a branch prediction mechanism and a table for storing the branch origin address and the branch destination address are required. Thus, the circuit scale is increased further than in the case of the hardware loop. Further, in a DSP, the branching executed is normally mainly that for the loop processing, and other branching is hardly executed. Therefore, it is considered that an effect commensurate with the increase in the circuit scale cannot be achieved, so that the branch prediction mechanism is rarely employed in actual DSPs.
Techniques capable of overcoming the issues regarding the branch penalty in the DSP described above and the issues regarding each of the above-described methods are not disclosed in the remaining Patent Documents 3 to 5, either.
The object of the present invention is to provide a digital signal processor, a program control method, and a control program capable of reducing the deterioration in processing performance caused by generation of the branch penalty in the loop processing while reducing the circuit scale.