The semiconductor industry is facing a disconcerting situation: there are no longer any credible routes for significantly increasing the performance of processors, at least not at the level of the individual processor. Only systems using several processors operating in parallel still seem to offer a promising route for increasing the computational power of systems. Indeed, studies conducted in the 1960s showed that the ratio of computational power to efficiency of computational systems is potentially much higher for parallel systems than for sequential systems. The question then arises as to why parallel systems did not become prevalent sooner, especially in the field of embedded systems, which are fundamentally centered on optimization and efficiency. On the one hand, the technology did not allow the integration of massively parallel structures on one and the same component, with the exception of SIMD (“Single Instruction, Multiple Data”) structures, which are easily programmable when the application is suited to this type of parallelism. On the other hand, parallel systems are generally much more difficult to program and to develop. Such is notably the case for symmetric systems, also called homogeneous systems, which are based on the replication of the same processing element and possess identical and homogeneous access and communication interfaces. Such is less the case, however, for asymmetric systems, also called heterogeneous systems, which use several specialized processors for processing operations and particular interfaces. Asymmetric systems have been prevalent for a long time, for example in conventional peripherals such as video or network chips, but they nevertheless remain limited as regards the number of processors placed in parallel.
It should be noted that, generally, this prevalence has occurred in application fields that are not very complex at the processing control level, that is to say fields in which the heterogeneity of the resources limits not only the complexity but also the flexibility of the mapping of the processing operations. However, specialized multiprocessing systems have also appeared in embedded systems. In the field of mobile telephony, “multicores” on a single chip have appeared which can contain DSPs (“Digital Signal Processors”) for signal processing, GPPs (“General Purpose Processors”) for ordinary processing operations, as well as analog input/output blocks. In the field of personal stereos or multimedia players, decoding cores dedicated to audio (“MPEG Audio Layer”, “Dolby D”, “DTS”) or to video (“MPEG”, “H.264”) have appeared in addition to the general-purpose processor. Symmetric parallel systems are for their part less developed, notably because of the difficulty of programming them and because of the intractability of fine tuning the programs. Generally, these difficulties of programming and fine tuning are exacerbated by the ever-increasing complexity of the applications. In embedded systems, these difficulties are also exacerbated by the desire to integrate ever more functionalities and by the continual increase in the volumes of data to be processed. For example, mobile telephones combine telecommunication functions with multimedia functions, positioning functions, or games. Mobile telephones use video sensors of ever greater capacity and converters of ever higher throughput. Moreover, computation-intensive tasks run alongside control-dominated tasks, with very strong interactions between these various elements of the applications.
The invention relates more particularly to the field of embedded systems offering high computational power. New applications in fields such as multimedia, communication, or real-time processing systems demand ever more computational power within controlled surface areas and levels of power consumption. As explained above, since the processing power of an individual computational element can no longer be increased in isolation, the only realistic solution is to multiply the computational elements and to operate them in parallel. Within this framework, a new concept is currently emerging, that of the parallel system on chip. In theory, parallel systems on chip allow more efficient use to be made of the additional transistors that can be integrated on one and the same chip on account of advances in etching techniques. Even within the fairly specialized framework of processors for embedded systems, this trend toward increasing the number of execution cores on one and the same chip is very marked. In the medium term, this trend ought to lead to the introduction, or indeed the prevalence, of systems with several tens or indeed hundreds of execution elements. Among these systems may be cited multiprocessor systems on chip, usually designated by the acronym “MPSoC” standing for “Multi-Processor System on Chip”. MPSoCs are complete systems which integrate, as a minimum, computational elements able to operate in parallel and a complete on-chip communication architecture. The communication architecture of current MPSoCs reproduces the connection architecture of a system composed of several macroscopic elements.
It can comprise communication buses, dedicated networks on chip, usually designated by the acronym “NoC” standing for “Network on Chip”, dedicated interconnection switching systems, usually designated by the expression “crossbars”, input/output interfaces, random access memory, usually designated by the acronym “RAM”, local memories, cache memories or “scratchpads”. Most of the time, however, the communication architecture of an MPSoC comprises a combination of these elements. The essential problem with on-chip communication architectures mimicking macroscopic architectures is that macroscopic architectures are designed for very regular processing operations, whether these be massively parallel computational processing operations, stream processing operations or server tasks. Yet applications on embedded systems are increasingly tending toward much less regular and much less predictable processing operations. The communication architecture of MPSoCs must therefore be rethought. Indeed, the implementation of efficient, high-performance parallel systems on chip such as MPSoCs makes it necessary to operate tens or indeed hundreds of computational cores or processing elements in unison. If this is not the case, then the use of parallelism is not optimal. This implies that several tens or indeed several hundreds of processing elements are not used correctly, that is to say their rate of use is not sufficiently high. Hereinafter, the processing elements will be designated by the acronym “PE” standing for “Processing Element”. But to exploit parallelism in an optimal manner, the difficulties are manifold. At the software level, one difficulty is that of providing the programmer with simple and accessible tools for expressing in code the whole of the potential parallelism of an application. Another difficulty at the software level is that of deriving the greatest benefit from this parallelism when compiling the code.
But these very complex software problems are not the subject of this patent application.
To efficiently exploit a parallel architecture, it is necessary to tackle the problem under the threefold aspect of the control of the indeterminism, of the control of the communications and of the control of the checks. Indeed, once a potential parallelism has been extracted from an application and expressed in a program, it must still be possible to actually implement this parallelism in a given hardware architecture. In an MPSoC for example, in order to derive the greatest benefit from the work of extracting the application parallelism done by the programmer, numerous processing sequences must be successfully distributed over all the resources of the chip, these sequences being interrelated by data dependencies or execution control dependencies. Hereinafter, these sequences will be called execution tasks. An execution task therefore relates to the execution of a processing operation on a PE. It is generally called a “thread” by software specialists. By default in the remainder of the present patent application, the term “task” alone refers to an execution task. Without any consideration pertaining on the one hand to the way of choosing the PEs and on the other hand to the way of operating them together, it is very improbable that the architecture can actually implement the whole of the parallelism expressed in the program. In some sense, in the same way as the program expresses the potential for parallelism of the application, it is necessary to find a means of expressing the potential for parallelism of the architecture through appropriate control of the tasks. The consideration must take into account all the situations which may be detrimental to good use of the potential parallelism of the architecture. This involves firstly the risks of being limited by the access to an essential shared resource such as the central memory, a network, a communication bus or a task manager.
It also involves the risks of not being able to manage the interdependencies between the tasks in a sufficiently precise manner, or of not being able to adapt their management to the particularly dynamic character of certain applications. Finally, it involves the risks of not being able to control the indeterminisms of the parallel execution, which makes it complex and tricky to fine tune the programs. The consideration must culminate in an execution model which defines the way of choosing the PEs and the way of operating them together. Making several tens or indeed several hundreds of PEs operate together in an efficient manner within one and the same chip is currently one of the major challenges which the microelectronics industry has to meet. At the present time, techniques for programming parallel applications are markedly more difficult to implement than techniques for programming sequential applications, both from the standpoint of the design and from that of the fine tuning of the programs. In order to advance parallel programming models toward better accessibility to the programmer, it is necessary for the execution model of the underlying parallel architecture to be properly tailored to this. This must however be done without thereby sacrificing the efficiency of implementation on current silicon technologies. This is one of the technical challenges which the present invention proposes to address.
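The interplay described above between task dependencies and the rate of use of the PEs can be illustrated by a minimal Python sketch. It assumes a hypothetical set of four tasks with arbitrary durations and a simple greedy placement policy; the names Task, schedule and EXAMPLE_TASKS are introduced solely for this illustration and do not describe the execution model of the invention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical execution task: a name, a duration and its dependencies."""
    name: str
    duration: int
    deps: tuple = ()


def schedule(tasks, num_pes):
    """Greedy placement of dependent tasks on num_pes processing elements.

    Returns (placement, makespan, utilization), where placement maps each
    task name to (PE index, start time). This is an illustrative policy
    only, not an optimal scheduler.
    """
    finish = {}                 # task name -> finish time
    pe_free = [0] * num_pes     # time at which each PE becomes free
    placement = {}
    pending = list(tasks)
    while pending:
        # A task is ready once all of its dependencies have been placed.
        ready = [t for t in pending if all(d in finish for d in t.deps)]
        if not ready:
            raise ValueError("cyclic or unknown dependency")
        for t in ready:
            # The task cannot start before its input data is available.
            data_ready = max((finish[d] for d in t.deps), default=0)
            # Choose the PE that becomes free first.
            pe = min(range(num_pes), key=lambda i: pe_free[i])
            start = max(pe_free[pe], data_ready)
            finish[t.name] = start + t.duration
            pe_free[pe] = finish[t.name]
            placement[t.name] = (pe, start)
            pending.remove(t)
    makespan = max(finish.values(), default=0)
    busy = sum(t.duration for t in tasks)
    utilization = busy / (makespan * num_pes) if makespan else 0.0
    return placement, makespan, utilization


# Assumed example: C needs the results of A and B, and D needs C.
EXAMPLE_TASKS = [
    Task("A", 2),
    Task("B", 3),
    Task("C", 2, deps=("A", "B")),
    Task("D", 1, deps=("C",)),
]

if __name__ == "__main__":
    for n in (1, 2):
        _, makespan, utilization = schedule(EXAMPLE_TASKS, n)
        print(f"{n} PE(s): makespan={makespan}, utilization={utilization:.2f}")
```

With these assumed tasks, going from one PE to two shortens the total execution time from 8 to 6 time units, but the rate of use of the PEs falls from 100% to about 67%, because task C must wait for both A and B before it can start.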
For historical reasons, the exploitation of parallelism has hitherto endeavored to propose solutions making it possible to profit from parallelism at the application task level. Indeed, despite intense research around the definition of architectures capable of efficiently managing a high degree of parallelism at the instruction level, these approaches have rapidly shown their limits. At the same time, the complexity of embedded systems makes it extremely difficult or inefficient to model them in the form of a single control flow. Thus, users and architecture designers concur in favoring parallelism at the task level. Consequently, a strong trend currently observed in the field of embedded systems is the integration on one and the same silicon substrate of several processor cores allowing the execution of tasks in parallel on one and the same circuit. Several solutions have already been proposed for exploiting the parallelism of such architectures on one and the same silicon substrate. The best known models are the “SMT” model according to the acronym standing for “Simultaneous MultiThreading”, the “CMP” model according to the acronym standing for “Chip MultiProcessing” and the “CMT” model according to the acronym standing for “Chip MultiThreading”. Hereinafter, the processing units capable of managing the execution of a set of instructions will be distinguished from the computational units capable only of executing one instruction.
But the SMT, CMP and CMT models only partially address the problem of embedded systems. Notably, they exhibit numerous drawbacks. Indeed, as will be detailed subsequently, these models do not make any distinction between the various processing classes that can coexist within an application. Constructed on non-optimized computational primitives, these systems are often unsuited to the application requirements in regard to power consumption, cost/performance ratio and operating dependability. These are major drawbacks.
Solutions of CMP type lead to a distinction being made between regular processing operations and irregular processing operations. This involves solutions implemented on architectures which integrate computational units dedicated to intensive processing operations, the irregular processing operations being handled with the system software on a general-purpose processor. But as will be detailed subsequently, the use of system buses gives rise to lower reactivity of the architecture and an inability of the system software to optimize the use of the computational units.
In an attempt to minimize these drawbacks, American patent publication US2005/0149937A1, entitled “Accelerator for multiprocessing system and method”, proposes that the mechanisms for synchronization between the computational units be handled by way of a dedicated structure. It does not, however, offer any solution to the problem of data transfer between the tasks.
American patent publication US2004/0088519A1, entitled “Hyperprocessor”, proposes for its part a solution to the management of task parallelism in the context of high performance processors. It does not however apply to embedded systems, notably for determinism and cost reasons.