The invention aims to propose a processing solution for systems having the following properties:
High calculation power: the complexity of embedded applications is increasing all the time. This is explained in particular by the drive to integrate ever more functions into embedded systems (combination of multimedia, gaming, telecommunications, and positioning functions in a mobile telephone) and the increasing volumes of data to be processed (capacities of video sensors, high-speed converters, etc.). Embedded systems must further be able to “digest” concurrently multiple streams of information. It is therefore indispensable to collect, distribute and process efficiently all information from the distributed units in the system. This necessity for concurrent processing of a number of information streams combined with the open nature of the systems is also reflected in multitask execution environments.
Flexibility: the target systems are required to be open. Thus any user of the system is free to use it as they wish. The architecture must therefore be sufficiently flexible to support very different utilization scenarios. This openness prevents overall offline optimization of the architecture as the application contents cannot be fully determined at the design stage. Moreover, although some classes of algorithms favor static division of processes with simple control of parallelism (defined offline), others require a dynamic control stream, and this trend is likely to increase with the increasing complexity of embedded applications.
Deep integration into the environment: the systems being developed must also be deeply integrated into their environment. This integration is reflected in severe real time, power consumption, cost, and operating reliability constraints.
Heterogeneous processing: because of the diversity of the applications and the complication of the control streams in embedded systems, a very wide variety of types of processing must cohabit within embedded architectures. Thus intensive calculation tasks run alongside tasks where it is the control aspect that dominates, with very strong interactions between these different elements of applications.
To summarize, the target embedded systems have high capacities for processing heterogeneous data streams, strong possibilities for dynamic adaptation to the environment, and good communications capacities adapted to demand. They are also strongly constrained by the external environment (power consumption, real time, etc.) and are required to be open, meaning that the same product can be intended for more than one use. This includes in particular multi-application systems within which tasks can be created, suspended, destroyed, etc. dynamically (i.e. during execution).
In such systems, offline optimization of the architecture is problematic, as the impossibility of accurately determining the scenarios of use leads to underuse of resources. It is instead preferable to concentrate on online optimization of the calculation structure, eliminating the necessity to predict all utilization scenarios. However, the impossibility of optimizing the architecture offline makes it necessary to provide control mechanisms that are very costly in performance terms. The object of the present invention is to propose a calculation structure in which dynamic control solutions are integrated without detriment to performance.
In the context of the race for performance, the use of parallelism is historically linked to solutions providing the benefits of parallelism at the level of operations or instructions within applications. Despite intense research into defining architectures capable of managing efficiently a high degree of parallelism at the instruction level, the limits of these approaches are all too apparent. At the same time, the complexity of embedded applications makes modeling them in the form of a single control stream extremely difficult or ineffective. Thus users and architecture designers are agreed on emphasizing parallelism at the task level. Consequently, a strong trend at present is the integration onto the same silicon substrate of a number of processor cores, enabling parallel execution of tasks in the same circuit.
In this race for performance, a number of solutions are envisaged, classified by the method employed to exploit parallelism. The main models are Simultaneous MultiThreading (SMT), Chip MultiProcessing (CMP), and Chip MultiThreading (CMT).
For example, the SMT technique is used in the latest generations of Intel, IBM and HP Alpha processors. It uses a plurality of program counters in order to supply the calculation units with instructions from a number of streams of instructions. The interdependency of tasks being limited, instruction level parallelism (ILP) as seen by the processor is increased and processor performance is consequently also increased. Implementing these solutions is difficult, however, and the complexity of the stages for reading and distributing instructions is very high in these solutions. Consequently, these architectures lead to very large circuits, incompatible with the constraints of embedded systems, in particular in terms of cost and power consumption.
FIG. 1A is a block diagram showing the theory of an SMT architecture. Calculation units or functional units FU are fed processing by a single control resource CP associated with a task assigner TD. In each cycle, the control block CP associated with the task assigner TD assigns instructions to the functional units FU concurrently as a function of the availability of data and any execution hazards. The functional units cooperate with a shared memory space SMS.
FIG. 1B shows an example of operation of a structure having four functional units FU. In this figure, each square 1 represents an instruction and the vertical black lines 2 represent the instruction assignment and control tasks.
The squares 3 marked with a cross correspond to time slots that are not used by the functional units because of the dependencies of data or resources.
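The issue behavior described for FIG. 1B can be illustrated by a minimal sketch. It assumes a simplified model that is not part of the original disclosure: each instruction stream is a list of integers giving the earliest cycle at which each instruction's data is available, instructions issue in order within a stream, and the function name `smt_schedule` is purely illustrative. Slots left empty in a cycle correspond to the crossed squares 3.

```python
def smt_schedule(streams, num_units):
    """Issue instructions from several streams onto shared functional units.

    streams[s][i] is the first cycle at which instruction i of stream s has
    its data available (assumed model). Returns, per cycle, the list of
    (stream, instruction index) pairs issued in that cycle.
    """
    next_idx = [0] * len(streams)  # one "program counter" per stream
    schedule = []
    cycle = 0
    while any(next_idx[s] < len(streams[s]) for s in range(len(streams))):
        issued = []
        progress = True
        # Keep sweeping over the streams until the units are full or
        # no stream has a ready instruction left for this cycle.
        while len(issued) < num_units and progress:
            progress = False
            for s, stream in enumerate(streams):
                if len(issued) == num_units:
                    break
                i = next_idx[s]
                if i < len(stream) and stream[i] <= cycle:
                    issued.append((s, i))
                    next_idx[s] += 1
                    progress = True
        schedule.append(issued)
        cycle += 1
    return schedule


def unused_slots(schedule, num_units):
    """Count issue slots left empty because of data dependencies."""
    return sum(num_units - len(issued) for issued in schedule)
```

With two streams `[[0, 0, 2], [0, 1]]` and two functional units, the first two cycles are fully used and the third cycle leaves one slot empty, since only one instruction remains.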
The CMP solution is generally preferred in embedded systems because of its relatively simple implementation.
The theory of this solution is to distribute tasks concurrently to calculation resources according to their availability. Each calculation resource then executes the tasks assigned to it one after the other. These architectures are divided into two families, homogeneous structures and heterogeneous structures:
Heterogeneous structures: these structures integrate calculation units that are heterogeneous and optimized for a given application domain, the distribution of tasks to these resources being identified beforehand at compilation time. The software partitioning effected at compilation time simplifies the mechanisms for distributing tasks (dynamically) at run time. These application-oriented solutions include in particular the OMAP, VIPER, PNX and Nomadik platforms.
Homogeneous structures: these structures are based on integrating homogeneous calculation units, which can be generalist, as in the IBM Cell platform or the ARM MPCore platform, or optimized for a given application domain, like the CT3400 from Cradle Technologies, optimized for MPEG4-AVC coding/decoding. The former solutions target very wide ranges of problems, whereas the latter solution is optimized for a clearly identified application domain.
FIG. 2A is a block diagram showing the theory of a CMP architecture. The calculation units (functional units) FU that cooperate with a shared memory space SMS are fed processing by a single control resource CP associated with a task assigner TD. The control unit CP associated with the task assigner TD is responsible for determining the tasks ready to be executed. As soon as a calculation resource is released, it is assigned a task that is processed as soon as the data is loaded. The corresponding data-loading periods are shown as cross-hatched areas 4 in FIG. 2B, which shows an example of operation for a structure with four functional units FU, with squares 1 representing instructions and vertical black lines 2 representing instruction assignment and control tasks.
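The CMP assignment principle can be sketched as follows, under assumptions that are not taken from the original text: each task is described by a hypothetical pair (load cycles, execution cycles), tasks are handed out in list order, and the function name `cmp_schedule` is illustrative only. Each task goes to whichever functional unit is released first; the interval between assignment and start of execution plays the role of the cross-hatched loading areas 4.

```python
import heapq


def cmp_schedule(tasks, num_units):
    """Assign tasks to the earliest-released functional unit.

    tasks is a list of (load_cycles, run_cycles) pairs (assumed model).
    Returns one tuple per task:
    (unit, assignment_time, execution_start, execution_end).
    """
    # Min-heap of (time at which the unit is released, unit index).
    free_at = [(0, unit) for unit in range(num_units)]
    heapq.heapify(free_at)
    timeline = []
    for load, run in tasks:
        released, unit = heapq.heappop(free_at)
        start_exec = released + load  # data must be loaded first
        end = start_exec + run
        timeline.append((unit, released, start_exec, end))
        heapq.heappush(free_at, (end, unit))  # unit runs tasks one after the other
    return timeline
```

For three tasks `[(1, 3), (1, 2), (2, 2)]` on two units, the third task waits until unit 1 is released at cycle 3, loads its data for two cycles, and executes from cycle 5 to cycle 7.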
Multiprocess and CMT architectures are a combination of the previous two models. The CMP concept is extended to authorize execution of multiple tasks on the calculation primitives.
This technology is envisaged essentially in the context of server-type solutions.
FIG. 3A shows a generic CMT architecture model. Calculation units (functional units) FU are fed processing by a single control resource CP associated with a task assigner TD. The functional units FU cooperate with a shared memory space SMS.
FIG. 3B shows one example of the operation of a functional unit FU.
The control unit CP associated with the task assigner TD is responsible for determining the tasks ready to be executed. As soon as a calculation resource is released, it is assigned a task that is processed as soon as the data is loaded. This is represented by the cross-hatched areas 4 in FIG. 3B, whereas the squares 1 represent instructions and the vertical black lines 2 represent instruction assignment and control tasks.
Each calculation resource can manage a number of tasks concurrently. As soon as a task is blocked, for example because of a cache miss, the functional unit FU replaces it with a new one. Under such circumstances, task switching within the functional unit does not incur context loading penalties.
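The behavior of a single CMT functional unit can be illustrated with a minimal sketch. The model is an assumption made for illustration: each task is a list of hypothetical (compute cycles, stall cycles) bursts, a stall standing for any blocking event such as a wait on memory, and switching between resident tasks is taken to cost zero cycles, as the text states. The name `cmt_trace` is not from the original.

```python
def cmt_trace(tasks):
    """Trace which resident task a functional unit executes in each cycle.

    tasks[t] is a list of (compute_cycles, stall_cycles) bursts (assumed
    model, compute_cycles >= 1). Returns a list with one entry per cycle:
    the index of the task executed, or None when every task is blocked.
    """
    n = len(tasks)
    burst = [0] * n                                   # current burst per task
    left = [t[0][0] if t else 0 for t in tasks]       # compute cycles left
    wake = [0] * n                                    # cycle at which task is ready
    trace = []
    current = None
    cycle = 0
    while any(burst[t] < len(tasks[t]) for t in range(n)):
        ready = [t for t in range(n)
                 if burst[t] < len(tasks[t]) and wake[t] <= cycle]
        # Keep the running task while it is ready; otherwise switch at no cost.
        if current not in ready:
            current = ready[0] if ready else None
        trace.append(current)
        if current is not None:
            left[current] -= 1
            if left[current] <= 0:
                # Burst finished: the task blocks for its stall duration.
                wake[current] = cycle + 1 + tasks[current][burst[current]][1]
                burst[current] += 1
                if burst[current] < len(tasks[current]):
                    left[current] = tasks[current][burst[current]][0]
        cycle += 1
    return trace
```

With `[[(2, 3), (1, 0)], [(2, 0)]]`, the unit runs task 0 for two cycles, switches to task 1 while task 0 is blocked, idles for one cycle once task 1 has finished, and then resumes task 0.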
Although these architectures use the parallelism of instruction streams (threads) to enhance performance, whether of SMT, CMP or CMT type they address only partially the problems of embedded systems. The main cause of this state of affairs is the lack of distinction between the different processing classes cohabiting in an application. Thus processes in which control is strongly dominant are handled in an equivalent manner, on the same processing resource, as regular processing that is critical from the execution time point of view. Since the calculation resources must support regular processing just as much as highly irregular processing, the resulting systems are based on non-optimized calculation primitives and are therefore ill-matched to the application requirements from the three-fold point of view of electrical power consumption, cost/performance trade-off, and reliable operation.
However, a few CMP-type solutions make a distinction between regular and irregular processing. These architectures then integrate calculation resources dedicated to implementing intensive processing. Irregular processing then uses the system software on a generalist processor. Although the integration of calculation resources dedicated to intensive processing allows optimization that improves the performance or energy efficiency of these architectures, the inefficiency of communication between processing tasks and between processing tasks and the system software or control processing loses the benefit of such optimization at system level. Communications between the various elements of the architecture use system buses, attracting high penalties at the latency and bandwidth levels. Because of this, these systems are penalized by the latency accompanying the transmission of control information and by the bit rate, disturbing the transfer of data. These penalties are reflected in a less responsive architecture and by the inability of the system software to optimize the use of the calculation resources.
To minimize this overhead, according to the document US2005/0149937A1, the mechanisms of synchronization between the calculation resources are the responsibility of a dedicated structure, but no solution is provided for the problem of transferring data between those tasks. The document US2004/0088519A1 proposes a solution employing management of task parallelism in the context of high-performance processors, but the proposed solution cannot be applied to embedded systems, in particular for reasons of cost and determinism.
The solutions currently being developed to exploit parallelism at task level therefore cannot address all of the constraints referred to above. SMT-type solutions, for example, are typically based on standard generalist processors onto which an additional control stage has been grafted. However, these solutions do not solve the problems of power consumption and determinism inherent to current generalist processors and, in addition, they increase complexity in order to manage a number of threads concurrently.
Despite the variety of implementations of CMP-type architectures, it is equally difficult to adopt a solution addressing the problems of embedded systems. Firstly, application-oriented solutions do not offer sufficient flexibility and, secondly, more generalist architectures do not offer calculation solutions and continue to be based on costly solutions developed for generalist processors. Similarly, CMT solutions, although extending the parallelism of the architectures, still do not address the power consumption requirements and continue to be confronted by problems of managing the consistency of the data and of communication in the circuit.