The limits of superscalar processors are no longer technological: they are the limits of the micro-parallelism of instructions. Indeed, data dependencies form an insuperable barrier within an execution framework in which tasks are started in order and terminated in order. The more intermediate out-of-order execution is permitted, the more necessary it becomes to put in place significant dependency computation logic. Yet the mean number of instructions executed per cycle (mean IPC) improves fairly little when the instruction window is enlarged beyond current practice.
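The barrier formed by data dependencies can be illustrated with a small calculation. The following is a minimal sketch under simplifying assumptions that are not taken from the text: every instruction has a latency of one cycle, and the machine has unlimited issue width and an unlimited instruction window. The function name `ideal_ipc` and the encoding of dependencies as lists of indices are hypothetical.

```python
def ideal_ipc(deps):
    """Upper bound on IPC imposed by data dependencies alone.

    deps[i] lists the indices of earlier instructions whose results
    instruction i consumes (assumed encoding, for illustration only).
    """
    depth = []
    for producers in deps:
        # An instruction can complete one cycle after its latest producer.
        depth.append(1 + max((depth[p] for p in producers), default=0))
    critical_path = max(depth, default=0)
    # Even ideal hardware needs as many cycles as the longest dependency chain.
    return len(deps) / critical_path if critical_path else 0.0


chain = [[], [0], [1], [2]]        # each instruction depends on the previous one
independent = [[], [], [], []]     # no dependencies at all
print(ideal_ipc(chain), ideal_ipc(independent))
```

Under these assumptions, a fully dependent chain is bounded at one instruction per cycle however large the window, whereas fully independent instructions are limited only by the issue width.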
Therefore, the architecture of computation systems is currently undergoing significant changes. Indeed, even though there is at present no fundamental doubt about the famous “Moore's Law”, which predicts the exponential growth of the number of transistors that can be implemented on a silicon chip at a given instant, the semiconductor industry nonetheless faces an admission of failure: there are no longer any credible routes to significantly increasing the performance of individual processors.
It has nonetheless been known since the foundational work on the subject in the 1960s that the ratio of the computational power to the efficiency of computation systems is potentially much higher for parallel systems than for sequential processors. This is why on-chip parallel systems are increasingly being deployed at all levels. In theory, they allow more efficient use of the additional transistors that can be integrated on one and the same chip thanks to the progress made in etching techniques.
Although it has long been known that parallel systems are more efficient than conventional sequential systems, one may wonder why they did not become commonplace sooner, especially in the field of embedded systems, which is heavily centered on the optimization of the various efficiencies. On the one hand, the technology did not allow the integration of massively parallel structures on one and the same component, with the exception of the easily programmable SIMD (“Single Instruction, Multiple Data”) structures. On the other hand, parallel systems are much more difficult to program and to develop in a general manner, especially symmetric systems based on the replication of the same processing element and possessing identical and homogeneous access and communication interfaces.
In the field of embedded systems, notably that of mobile telephony, “multicores” on a single chip have appeared, which may contain DSPs (“Digital Signal Processors”) for signal processing, GPPs (“General Purpose Processors”) for ordinary processing, as well as analog input/output blocks. In the field of personal stereos or multimedia players, decoding cores dedicated to audio (“MPEG Audio Layer”, “Dolby D”, “DTS”) or to video (“MPEG”, “H264”) have appeared in addition to the general-purpose processor.
Thus, the prior art now includes models of interaction between a general-purpose processor and coprocessors or, more generically, between a main processor and auxiliary processing units. For example, units for accelerating processing, in particular for mathematical computations, have existed since the 1970s. In a certain number of cases, these units, dubbed “coprocessors”, are distinct from the so-called “main” processor. This was the case with the processors for microcomputers and workstations until the end of the 1980s. It is still the case for embedded systems, whether to increase parallelism by potentially decoupling the two units or to reduce costs. Indeed, a low-cost generic processor is then employed in conjunction with a separate specific processing unit generally designed “in house”, as is the case for the “Fire” vector coprocessor from Thomson.
For example, U.S. Pat. No. 6,249,858 shows one of the most recent aspects relating to the capabilities for coupling between a standard processor and a coprocessor, by allowing parallel execution of the processing activities on the two entities. The coupling is fairly tight: the main processor dispatches the computation commands to the coprocessor by providing operands and a program address in ROM. However, this requires dedicated support software, since an interrupt must be taken on the main processor to appropriately manage the call to the functionalities of the coprocessor, and another interrupt is generated by the coprocessor at the end of the computation. It thus shows how to weakly couple the main processor and its computation accelerator. Nonetheless, the scheme cannot be generalized to a plurality of acceleration elements. Moreover, it does not make it possible to dispense with system support for controlling the coprocessor and for obtaining the results of its computations. Nor does it make it easy to ensure consistency in the dependencies of computations. These dependencies are a priori the responsibility of the programmer, which is generally difficult on a parallel system where the processing activities may be strongly heterogeneous. This also makes scaling up extremely difficult and reserves it for specialists in parallel programming.
As another example, the GPUs (“Graphics Processing Units”) of modern graphics cards described in U.S. Pat. No. 6,987,517 may be considered to be sets of auxiliary units specialized for single program, multiple data (SPMD) vector computation. In this case, there is a weak coupling between this multitude of units and the control processor, since the problem processed is massively parallel. Indeed, one and the same processing has to be performed on sets of distinct data, to compute pixels in a memory buffer. An error at a given moment is not significant, however: as long as the error rate remains low, the user is not inconvenienced. Moreover, there are no easily accessible means of synchronization, as the problem is intrinsically parallel. The only significant synchronization occurs at the end of the processing of an image, so as to add post-processing stages or simply to display the computed pixels on the screen.
As another example, the American patent application published under the number US2008140989 (A1) describes methods for distributing processing activities over auxiliary units. But the methods described in this patent application do not offer any simple means for managing parallelism at several levels automatically. Notably, they do not allow the coexistence of parallelism at the level of the tasks on the main processor, termed “coarse-grained” parallelism, and parallelism at the level of the “threads” on the auxiliary processors, termed “fine-grained” parallelism. Moreover, the determinism of execution relies on the programmer managing the parallelism, which is generally difficult for typical applications of embedded systems.
Generally, during the execution of a task, the aforementioned solutions of the prior art afford little autonomy in the management of the auxiliary processing units, since the system software often has to intervene for the execution of the task. Conversely, if a task is disabled, these solutions of the prior art do not permit the implementation of the system software for task switching, which limits the use of the various parallelisms. As a result, they confer only a global determinism of execution that is very far from that conferred by a conventional Von Neumann architecture.
Thus, it is clearly apparent that the architects of computation systems are in a relative technological impasse between a single-processor paradigm, which is showing its limits, and on-chip multiprocessors, better known by the acronym “MPSoC” standing for “MultiProcessor System on Chip” or by the acronym “CMP” standing for “Chip MultiProcessing”, which are difficult to program. Most current architectures utilize either parallelism of tasks termed “processing parallelism”, or parallelism of instructions termed “instruction micro-parallelism”, or else a combination of the two for MPSoCs.
Parallelism of processing activities is the parallelism of applications or tasks. Although some development systems make it possible to program such systems at the application level, it is difficult to use more than about ten processors for a standard application. It is of course possible to envisage a multi-application framework, but the problem then remains of effectively managing execution beyond 8 or 16 processors in the SMP (“Symmetric MultiProcessing”) configuration. Moreover, ordinary applications must be rewritten and partially redesigned to exploit this parallelism, for example by implementing “execution threads” in accordance with the POSIX (“Portable Operating System Interface”) standard.
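To illustrate the kind of redesign involved, the following minimal sketch contrasts a sequential computation with a threaded version of it. It uses Python's standard `threading` module as a stand-in for a POSIX-style thread interface. The function names `sum_squares_sequential` and `sum_squares_threaded`, the interleaved partitioning and the default worker count are assumptions made for illustration only.

```python
import threading


def sum_squares_sequential(values):
    # Original form: a single loop, with no partitioning or synchronization.
    return sum(v * v for v in values)


def sum_squares_threaded(values, workers=4):
    # Redesigned form: the data is partitioned, each thread writes to its
    # own result slot, and the main thread joins before merging.
    partial = [0] * workers

    def work(slot):
        # Thread `slot` handles elements slot, slot + workers, slot + 2 * workers, ...
        partial[slot] = sum(v * v for v in values[slot::workers])

    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # synchronization point: wait for every partial result
    return sum(partial)


data = list(range(100))
print(sum_squares_sequential(data) == sum_squares_threaded(data))
```

Even in this small case, the threaded version introduces partitioning, per-thread storage and an explicit join, none of which appear in the sequential code.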
Micro-parallelism of instructions is that which is used in superscalar processors to execute more than one instruction per cycle. But as explained above, the limits of this technology in terms of efficiency are being reached.
The limitations mentioned above may lead to the exploration of a new, intermediate level of parallelism, which could be called “mesoparallelism” or “medium-grained” parallelism. This level would lie between processing parallelism and instruction micro-parallelism. Various computation units would still cooperate to execute code sequences of one and the same task in parallel. But this time, the main program and the synchronization would be handled on a control processor, whereas the sections of intensive computation would be executed on specialized processors.
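The division of roles described above can be sketched as follows. This is a toy model only, not the architecture of any particular system: a thread pool stands in for the specialized processors, and the names `intensive_kernel` and `control_program`, as well as the parameter `units`, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def intensive_kernel(block):
    # Stand-in for a compute-intensive code sequence run on an auxiliary unit.
    return sum(x * x for x in block)


def control_program(blocks, units=4):
    # The control side runs the main program in order and owns synchronization.
    with ThreadPoolExecutor(max_workers=units) as pool:
        # Dispatch: code sequences of one and the same task are handed out in parallel.
        futures = [pool.submit(intensive_kernel, b) for b in blocks]
        # Synchronization: results are collected in program order.
        results = [f.result() for f in futures]
    return sum(results)


print(control_program([[1, 2], [3], [4, 5]]))
```

In this sketch the code of `control_program` still reads as a sequential program: the parallelism is confined to the dispatch of the intensive sections, and the only synchronization is the ordered collection of their results.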
But implementing such mesoparallelism raises numerous difficulties. On the one hand, to overcome the limitations of instruction micro-parallelism, it is necessary to circumvent in-order execution of the program as far as possible. On the other hand, the application code must remain as close as possible to code of the sequential type familiar to the programmer.