In recent years, the performance gains obtainable from a single processor core have reached their limits. One important option for continuing to improve performance has been the multi-core chip, in which a plurality of processor cores are integrated into one chip. However, in a typical multi-core chip, exchanging data between processor cores takes time, which constitutes an overhead. For this reason, even a multi-core chip equipped with N cores (N being a natural number) cannot achieve N times the performance of a single core. Therefore, in a typical multi-core chip, the performance per core deteriorates, leading to a decrease in areal efficiency.
On the other hand, as the trend toward multi-core chips proceeds further, it is no longer necessary for one processor core to handle every task as in the past. When various kinds of processor cores are mounted on a chip and each core is made to perform the processing at which it is good, the efficiency can be increased. When the chip is a heterogeneous multi-core chip incorporating a legacy core and an engine core, the areal efficiency can be improved even though the chip is of a multi-core type. Here, a legacy core is one which maintains compatibility with a conventional general-purpose processor core and keeps the continuity of software, etc. An engine core is one which abandons such compatibility and is specialized in the processing at which it is good, whereby its efficiency is increased.
A single processor core has reached the limits of performance improvement. One factor is that such a core attempts to process a single program flow at high speed. Even when the original algorithm has parallelism, once the algorithm is described in the form of a single flow, that parallelism can no longer be expressed explicitly. Under such a situation, an attempt to extract the parallelism to the absolute maximum by means of hardware requires a large amount of hardware, which leads to a reduction in efficiency. Further, even when area and electric power are devoted up to the physical limits of mounting, a performance improvement commensurate with that cost cannot be achieved.
For example, in the case of an out-of-order system, which is common at present as a system for a high-end processor, a large-capacity buffer is used to hold a single instruction flow, in which a single program counter manages the address of the instruction to be executed. Further, an out-of-order system performs the following actions: checking data dependences; executing instructions in the order in which all the input data for each instruction become available; and, after instruction execution, updating the state of the processor in the order of the original instruction flow. In this case, a large-capacity register file is prepared and the registers are renamed in order to eliminate the restrictions on instruction issue caused by anti-dependences and output dependences of register operands. The result obtained by executing an instruction in advance can be used by a subsequent instruction earlier than the originally intended time, which contributes to improvement of the performance. However, the update of the processor state which can be observed from the outside when program execution is stopped halfway cannot be made “out of order.” This is because, otherwise, a basic operation of a processor, i.e. stopping a program temporarily and resuming it later, could not be performed. Therefore, the result obtained by executing an instruction in advance is accumulated in a large-capacity reorder buffer, and then written back into e.g. a register file in the originally intended order. As described above, out-of-order execution of a single instruction flow is a method with low efficiency, which requires large-capacity buffers and complicated control. For example, in the cited reference, R. E. Kessler, ‘THE ALPHA 21264 MICROPROCESSOR,’ IEEE Micro, vol. 19, no. 2, pp. 24-36, March-April 1999, as shown in FIG. 2 on page 25, a twenty-entry integer issue queue, a fifteen-entry floating-point issue queue, two sets of an eighty-entry integer register file, and a seventy-two-entry floating-point register file are prepared, thereby enabling large-scale out-of-order issue.
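The register renaming and in-order write-back described above can be illustrated by the following minimal sketch. It is not taken from the cited reference; the function names `rename` and `commit_order`, the three-operand instruction format, and the register counts are assumptions made only for illustration. The sketch never returns physical registers to the free list, which an actual processor would do when instructions retire.

```python
from collections import deque


def rename(program, num_arch_regs, num_phys_regs):
    """Rename registers so that only true (read-after-write) dependences remain.

    program: list of (dest, src1, src2) architectural register indices
             (a hypothetical instruction format used only for this sketch).
    Returns the same instructions expressed with physical register indices.
    """
    # Initially, architectural register i is held in physical register i.
    mapping = list(range(num_arch_regs))
    free = deque(range(num_arch_regs, num_phys_regs))
    renamed = []
    for dest, src1, src2 in program:
        # Sources read the latest mapping, which preserves true dependences.
        p1, p2 = mapping[src1], mapping[src2]
        if not free:
            raise RuntimeError("no free physical register; a real core would stall")
        # Every write receives a fresh physical register, which removes
        # anti-dependences and output dependences between instructions.
        pd = free.popleft()
        mapping[dest] = pd
        renamed.append((pd, p1, p2))
    return renamed


def commit_order(completion_order, num_instructions):
    """Model a reorder buffer: results may complete in any order,
    but they are written back only in the original program order.

    Returns, for each completion event, the list of instructions retired.
    """
    done = [False] * num_instructions
    head = 0  # oldest instruction that has not yet been retired
    retired_per_event = []
    for index in completion_order:
        done[index] = True
        retired = []
        # Retire from the head only while the oldest instruction is finished.
        while head < num_instructions and done[head]:
            retired.append(head)
            head += 1
        retired_per_event.append(retired)
    return retired_per_event


# I0: r1 = r2 op r3
# I1: r2 = r4 op r5   (anti-dependence on r2 with respect to I0)
# I2: r1 = r2 op r4   (output dependence on r1 with respect to I0)
program = [(1, 2, 3), (2, 4, 5), (1, 2, 4)]
renamed = rename(program, num_arch_regs=8, num_phys_regs=16)
retired = commit_order([1, 0, 2], num_instructions=3)
```

In this example, every destination receives a distinct physical register, so I1 may finish before I0 without overwriting the value of r2 that I0 still needs, while I2 still reads the value produced by I1. When I1 completes first, nothing is retired until I0 also completes, which corresponds to the results being held in the reorder buffer. The sketch also suggests why the buffers must be large: the number of instructions that can be in flight is bounded by the number of free physical registers and reorder buffer entries.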