(1) Field of the Invention
The present invention relates to a program execution control device which controls programs executed by one or more microprocessors.
(2) Description of the Related Art
In recent years, digital appliances such as digital TVs, video cassette recorders, and mobile phones have been required to execute digital processing such as speech processing, audio processing, video processing and coding processing, and Graphical User Interface (GUI) operation processing. In addition to such requirements, there are also requests, increasing in number and variety, for the digital appliances to support JAVA® and the like. To fulfill these requests, information processing devices are commonly used, examples of which include microprocessors (including microcomputers, microcontrollers and digital signal processors (DSPs)). To improve processing performance in response to the increasing application requests, the operating frequency of processors is being raised, and processors are becoming multithreaded and multi-core, for example. Along with this, the number of pipeline stages, the scale of circuitry, and the power consumption of the processors are steadily increasing.
In general, as the number of pipeline stages increases, more penalty cycles occur when a branch instruction is executed, lasting until the new instruction is fetched, and this is one of the causes of performance degradation. In order to improve efficiency in executing application programs, such performance degradation needs to be suppressed especially in loop processing, in which many processes are performed. In an attempt to suppress performance degradation, a method is known which: predicts that the branch instruction used in loop processing is always taken (static branch prediction); and stores the beginning instruction of a loop in a loop instruction buffer, thereby suppressing the penalty cycles occurring when a branch from the end of the loop to its beginning is taken (refer to Patent Reference 1: Japanese Patent No. 2987311, for example).
FIG. 1A and FIG. 1B are diagrams each showing an example of a program executed by a processor. FIG. 1A shows a program written in C language, and FIG. 1B shows an assembly program corresponding to the program shown in FIG. 1A.
For example, in the program shown in FIG. 1B, penalty cycles may occur both when the branch instruction at the end of the loop (the BRZ instruction) is taken (in this case, when branching to the L_HEAD label at the beginning of the loop) and when it is not taken (when the execution proceeds from the BRZ instruction to the following ST instruction).
Further, a method is known which suppresses the penalty cycles caused by a branch not taken in the last iteration of a loop in the loop processing, which cannot be prevented even by the above mentioned static branch prediction or dynamic branch prediction for which a branch history table (BHT) is used. This method predicts with high accuracy the last iteration of a loop in the loop processing using a loop counter, thereby suppressing the branch penalty occurring at the last iteration of the loop where the loop processing is terminated (refer to Patent Reference 2: Japanese Patent No. 3570855, for example).
Meanwhile, to suppress increasing power consumption, a method is known which also focuses attention on the loop processing and reduces power consumption by suspending resources that are not used during loop execution (refer to Patent Reference 3: Japanese Patent No. 1959871, for example).
For example, a processor, having a loop instruction buffer in which instructions to be executed in loops are stored, iterates execution of the stored instructions during the loop execution. Thus, it is unnecessary to fetch instructions from an instruction memory. As a result, it is possible to stop the instruction memory system which includes a cache system, thereby enabling power saving.
Further, a multithreaded processor is becoming effective in suppressing the performance degradation caused by an increase in the number of penalty cycles resulting from an increase in the number of pipeline stages (refer to Patent Reference 4: Japanese Patent No. 3716414, for example).
However, a prediction error at the branch that terminates the loop processing cannot be prevented, and a branch penalty results, either by static branch prediction, which fixes the prediction direction according to the type of the branch instruction (for example, a loop branch instruction is always predicted to be taken), or by dynamic branch prediction, which uses a branch history table or the like and predicts the next outcome based on the frequency with which past branches were taken.
In particular, the increase in the number of penalty cycles resulting from the recent increase in the number of pipeline stages has increasingly aggravated the performance degradation caused by branch prediction errors.
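The loop-exit misprediction described above can be illustrated with a minimal sketch in C. It is not taken from the cited references: the function names are hypothetical, and the dynamic predictor is assumed to be a single 2-bit saturating counter, which is only one common way of building a history-based predictor. The loop-closing branch is modeled as taken on every iteration except the last.

```c
#include <assert.h>
#include <stdbool.h>

/* Static prediction: the loop-closing branch is always predicted taken.
 * The branch is taken on iterations 1..iterations-1 and not taken on the last. */
static unsigned static_taken_mispredictions(unsigned iterations, unsigned loop_runs)
{
    assert(iterations > 0);
    unsigned misses = 0;
    for (unsigned run = 0; run < loop_runs; ++run) {
        for (unsigned i = 1; i <= iterations; ++i) {
            bool taken = (i < iterations);
            bool predicted = true;          /* fixed prediction direction */
            if (predicted != taken)
                ++misses;
        }
    }
    return misses;
}

/* Dynamic prediction (assumption): one 2-bit saturating counter, states 0..3,
 * predicting "taken" when the state is 2 or 3. It starts at "strongly taken". */
static unsigned two_bit_mispredictions(unsigned iterations, unsigned loop_runs)
{
    assert(iterations > 0);
    unsigned state = 3;
    unsigned misses = 0;
    for (unsigned run = 0; run < loop_runs; ++run) {
        for (unsigned i = 1; i <= iterations; ++i) {
            bool taken = (i < iterations);
            bool predicted = (state >= 2);
            if (predicted != taken)
                ++misses;
            /* Update the history toward the actual outcome. */
            if (taken && state < 3)
                ++state;
            else if (!taken && state > 0)
                --state;
        }
    }
    return misses;
}
```

Under these assumptions, a four-iteration loop executed ten times incurs one misprediction per execution with either predictor, namely at the branch that leaves the loop.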
Furthermore, application programs have the characteristic that the region processed at one time is becoming smaller despite an increase in the total amount of processing, as seen in the trend in video codec standards, for example. For example, processing is performed on 16×16 pixel data in the conventional video codec standard, whereas the new video codec standard introduces processing which is performed on 4×4 pixel data. This indicates a reduction in the number of processing cycles in a single loop.
In addition, the reduction in the number of processing cycles in a single loop is further achieved due to the trend that processors can execute an increasing number of instructions in parallel.
As described, while the change in the characteristics of application programs and the increase in the number of instructions that processors can execute in parallel lead to a reduction in the number of processing cycles in a single loop, the number of branch penalty cycles tends to increase, so that the branch penalty accounts for a growing share of the loop execution time.
For example, in the case where loop processing has four loop iterations and each of the iterations takes eight cycles for instruction execution, one pass of the loop processing takes 32 processing cycles. In this case, when the branch penalty takes four cycles, for example, execution performance degrades by over 10 per cent (4/32 = 12.5 per cent) every time the loop processing is executed.
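The arithmetic of this example can be written out as a short sketch in C. The helper names are hypothetical and are introduced only for illustration; the overhead is returned in tenths of a per cent so that integer arithmetic suffices.

```c
#include <assert.h>

/* Cycles of useful work in one pass of the loop processing. */
static unsigned loop_work_cycles(unsigned iterations, unsigned cycles_per_iteration)
{
    return iterations * cycles_per_iteration;
}

/* Branch penalty relative to the useful work, in tenths of a per cent
 * (125 means 12.5 per cent). The result is truncated toward zero. */
static unsigned penalty_overhead_permille(unsigned iterations,
                                          unsigned cycles_per_iteration,
                                          unsigned penalty_cycles)
{
    unsigned work = loop_work_cycles(iterations, cycles_per_iteration);
    assert(work > 0);
    return (penalty_cycles * 1000u) / work;
}
```

With four iterations of eight cycles and a four-cycle penalty, `loop_work_cycles(4, 8)` gives 32 and `penalty_overhead_permille(4, 8, 4)` gives 125, that is, 12.5 per cent.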
To prevent such execution performance degradation in loop processing, the method of predicting the last loop iteration using a loop counter, typified by Patent Reference 2 presented above as an example, has some advantage in being able to predict the last loop iteration with a relatively high frequency. However, this method entails a problem in terms of application targets, software productivity, and resource investment required for hardware implementation.
As for loops to which the last loop iteration prediction by the loop counter method can be applied, the loop counter needs to be incremented or decremented by either 1 or a number of steps fixed in advance. Such a restriction is essential for predicting, based on the current loop counter value, that the next iteration is the last loop iteration; in a counter decrementing method, for example, this means predicting that the counter value in the next loop iteration becomes equal to or smaller than 0.
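The dependence on a fixed step can be sketched in C as follows. This is an illustrative model only, not the circuit of Patent Reference 2: the function names are hypothetical, and the predictor is assumed to see the counter value before the decrement is executed and to know only the step fixed in advance.

```c
#include <assert.h>
#include <stdbool.h>

/* Counter decrementing method (assumed model): the loop ends when the counter
 * becomes equal to or smaller than 0. Using only the current counter value and
 * the step fixed in advance, predict whether the coming decrement ends the loop. */
static bool predict_last_iteration(int counter, int assumed_step)
{
    assert(assumed_step > 0);
    return counter - assumed_step <= 0;
}

/* Runs a loop whose counter is actually decremented by actual_step and counts
 * how often the prediction made with assumed_step disagrees with the outcome. */
static unsigned count_mispredictions(int initial_count, int assumed_step, int actual_step)
{
    assert(actual_step > 0);
    unsigned misses = 0;
    int counter = initial_count;
    while (counter > 0) {
        bool predicted_exit = predict_last_iteration(counter, assumed_step);
        counter -= actual_step;             /* the real update of the loop variable */
        bool actual_exit = (counter <= 0);
        if (predicted_exit != actual_exit)
            ++misses;
    }
    return misses;
}
```

When the assumed and actual steps agree, the sketch never mispredicts. When the loop variable changes by an amount the predictor does not know, for example an assumed step of 1 against an actual step of 3 starting from 5, the exit is missed.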
Consequently, depending on the type of loop, the method of predicting the last loop iteration using a loop counter is not applicable. Control on the prediction of the last loop iteration cannot be applied in the following cases, for example: as shown in FIG. 2A, the loop variable is not incremented or decremented by a value of 1; as shown in FIG. 2B, the loop variable is not incremented or decremented by a predetermined number of steps; as shown in FIG. 2C, the loop is a while loop in which the number of loop iterations is not predetermined; and as shown in FIG. 2D, there may be a jump (a break statement) from an inner loop to an outer loop.
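For orientation, the four loop shapes listed above can be sketched in C as follows. These functions are hypothetical illustrations written for this description and are not reproductions of FIG. 2A to FIG. 2D.

```c
#include <assert.h>
#include <stddef.h>

/* Shape 1: the loop variable advances by 2 rather than by 1. */
static int sum_even_indices(const int *data, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i += 2)
        sum += data[i];
    return sum;
}

/* Shape 2: the loop variable changes by an amount that is not a fixed step
 * (it is halved on each iteration). */
static unsigned bit_length(unsigned value)
{
    unsigned bits = 0;
    for (unsigned v = value; v != 0; v /= 2)
        ++bits;
    return bits;
}

/* Shape 3: a while loop whose iteration count depends on the data. */
static size_t string_length(const char *s)
{
    size_t len = 0;
    while (s[len] != '\0')
        ++len;
    return len;
}

/* Shape 4: a break statement jumps from the inner loop back to the outer loop. */
static unsigned rows_containing_zero(const int *cells, size_t rows, size_t cols)
{
    unsigned hits = 0;
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            if (cells[r * cols + c] == 0) {
                ++hits;
                break;                      /* leave the inner loop early */
            }
        }
    }
    return hits;
}
```

In each case the iteration at which the loop ends cannot be derived from a counter that is stepped by a value fixed in advance.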
With small-scale software as seen in the field of DSP application in the past, it has been possible to perform algorithm transformation on each loop into a for loop in which the loop variable increments by 1. However, in today's large-scale software application field, such individual algorithm tuning is not realistic from the viewpoint of software productivity. Further, there are cases where algorithm transformation is inherently impossible.
Moreover, although the prediction method using a loop counter can be implemented with small-scale circuitry in a single-threaded program execution environment when the method is applied only to the innermost for loop, such a prediction method requires increased circuit investment when applied to nested loops and multiple threads.
For example, when triple-nested for loops are to be implemented by the loop counter method, hardware storage is needed which holds and manages the loop counter value of each of the three loops. Alternatively, there may be just one physical counter register, with the loop counter value saved to a stack memory or the like according to the depth of the loop and later restored from the stack memory to the loop counter. However, in such a case, processing cycles are required for the save and restore operations, and thus the program execution performance degrades.
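The single-register alternative can be modeled with the following sketch in C. The structure, the function names, and the per-operation cycle costs are assumptions introduced for illustration; no particular hardware is implied.

```c
#include <assert.h>

#define MAX_DEPTH 8

/* Hypothetical model: one physical loop counter plus a stack memory. */
struct loop_hw {
    int counter;                /* the single physical loop counter register */
    int stack[MAX_DEPTH];       /* saved counter values of enclosing loops */
    int depth;                  /* number of saved values */
    unsigned overhead_cycles;   /* cycles spent on saving and restoring */
};

/* Entering a nested loop saves the enclosing loop's counter to the stack. */
static void enter_loop(struct loop_hw *hw, int count, unsigned save_cost)
{
    assert(hw->depth < MAX_DEPTH);
    hw->stack[hw->depth++] = hw->counter;
    hw->overhead_cycles += save_cost;
    hw->counter = count;
}

/* Leaving a nested loop restores the enclosing loop's counter. */
static void leave_loop(struct loop_hw *hw, unsigned restore_cost)
{
    assert(hw->depth > 0);
    hw->counter = hw->stack[--hw->depth];
    hw->overhead_cycles += restore_cost;
}

/* Runs a triple-nested loop with trip counts a, b and c on the model above
 * and returns the cycles spent only on saving and restoring the counter. */
static unsigned triple_loop_overhead(int a, int b, int c,
                                     unsigned save_cost, unsigned restore_cost)
{
    struct loop_hw hw = { 0, {0}, 0, 0 };
    hw.counter = a;             /* outermost loop: nothing to save yet */
    for (; hw.counter > 0; --hw.counter) {
        enter_loop(&hw, b, save_cost);
        for (; hw.counter > 0; --hw.counter) {
            enter_loop(&hw, c, save_cost);
            for (; hw.counter > 0; --hw.counter) {
                /* loop body would execute here */
            }
            leave_loop(&hw, restore_cost);
        }
        leave_loop(&hw, restore_cost);
    }
    return hw.overhead_cycles;
}
```

In this model the counter is saved and restored a + a×b times, so trip counts of 2, 3 and 4 with an assumed cost of one cycle per operation add 16 cycles that perform no useful work.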
Such an increase in resources is especially pronounced in multithreaded processors. This is because as many stack memories for the loop counters need to be provided as the number of threads that the processor can execute concurrently.
Furthermore, other than the hardware configuration in which stack memories are used, there is also a hardware configuration in which a table associated with addresses (program counter values) is used, as shown in Patent Reference 2. However, this configuration also results in large-scale circuitry since the table is needed.
As described above, control based on predicting the last loop iteration using a loop counter was effective in the traditional DSP field, but is not effective in today's high-performance processors, which presuppose large-scale software development, from the viewpoint of its applicable targets, software productivity, and hardware investment.