With the development of complex and diversified applications executed on multi-core mobile platforms, the need for parallelism in computation and memory access is growing steadily. Indeed, to satisfy the demanding execution constraints of certain platforms, such as real-time systems, the execution of an application is split into several processing blocks called tasks. By mobilizing several computation resources (cores or processors, for example) in parallel, this splitting makes it possible to reduce the computation time required for the execution of the application.
Nonetheless, memory sharing and the necessary coordination of the execution of the tasks require means for mutual synchronization of the tasks. To ensure effective parallelism in the execution of several tasks on one and the same platform, efficient synchronization means are necessary, since synchronization generally constitutes the main bottleneck of shared-memory systems. Indeed, when the number of applications executed in parallel and the quantity of shared memory increase, the time spent on synchronization management may increase to the point of being detrimental to the general performance of the system, notably because of polling effects, in other words active standby, caused for example by the permanent polling of one or more shared variables.
An example of a difficulty occurring in the known systems is given hereinafter with regard to FIG. 1, for the implementation of a synchronization barrier, a procedure often used to create a synchronization point. For example, in an onboard system, a complex computation can be split into several less complex, parallel tasks 101, 102, 103, 104, 105. The partial results 110 obtained by the various tasks can be collected so as to carry out another computation through other tasks. The collection point 120 for the various intermediate results represents a synchronization barrier. Synchronization barriers are among the procedures most used in parallel architectures for implementing this type of synchronization point. A task may finish its computation before the others, but it must then wait until all the tasks that participate in the barrier have finished. Once all the tasks have reached the barrier 120, they are notified and their execution continues.
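The barrier scenario of FIG. 1 can be sketched in a few lines. The example below is a minimal illustration only, using the standard `threading.Barrier` class of Python; the workload (summing slices of integers) and the names `task`, `run_barrier_example`, `partial_results` and `totals` are hypothetical and do not come from the systems described here.

```python
import threading

NUM_TASKS = 5  # mirrors tasks 101 to 105 of FIG. 1; the workload is hypothetical

partial_results = [None] * NUM_TASKS
totals = [None] * NUM_TASKS
barrier = threading.Barrier(NUM_TASKS)  # plays the role of collection point 120


def task(index):
    # Phase 1: each task computes its own partial result (here, a slice sum).
    start = index * 100
    partial_results[index] = sum(range(start, start + 100))
    # Synchronization point: a task that finishes early blocks here
    # until every participating task has arrived.
    barrier.wait()
    # Phase 2: all partial results are now guaranteed to be available.
    totals[index] = sum(partial_results)


def run_barrier_example():
    threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(totals)
```

Without the call to `barrier.wait()`, a task that finishes early could read entries of `partial_results` that have not yet been written.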
The difficulty of implementing such synchronization resides notably in the manner of managing the barrier initialization phase, the standby phase for tasks that finish their computation before the others, and especially the notification of the end of standby. Most solutions use shared variables polled concurrently by the tasks which participate in the associated barrier. As soon as the number of tasks increases, performance degradation is observed. These problems are generally related to network overload and to the competition among the tasks, through polling, for access to the variables associated with this kind of synchronization point. Moreover, in a dynamic operating mode, tasks which participate in a barrier may be preempted by the system for a given reason (e.g. priority). This kind of scenario can be frequent in today's onboard systems, which integrate more and more computational parallelism for increasingly complex applications. The expression "preemption of a task" is intended to mean the action of the system which consists in suspending the execution of a task and dissociating it from its computation unit. When the execution of the task is resumed, the task is again associated with a computation unit (not necessarily the one with which it was associated before the preemption). The suspension of a task may also be caused by the task itself, and in general one speaks of preemption only when the task is suspended by the system. Hereinafter, however, the expression "preempted task" designates any task which is no longer currently executing and which is no longer associated with a computation unit.
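The shared-variable polling mentioned above can be illustrated as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular system: the class `PollingBarrier`, the function `run_polling_barrier` and the `polls` counter are hypothetical names, the barrier is single-use, and a `threading.Lock` stands in for the atomic memory accesses that a real platform would provide.

```python
import threading
import time


class PollingBarrier:
    """Single-use barrier built on one shared counter that every waiter polls."""

    def __init__(self, parties):
        self.parties = parties
        self.arrived = 0                 # shared variable polled by all tasks
        self.polls = 0                   # instrumentation only
        self._guard = threading.Lock()   # stands in for atomic memory accesses

    def wait(self):
        with self._guard:
            self.arrived += 1
        # Active standby: re-read the shared counter until everyone has arrived.
        while True:
            with self._guard:
                self.polls += 1
                if self.arrived == self.parties:
                    return
            time.sleep(0)  # yield so that the other threads can make progress


def run_polling_barrier(parties=4):
    barrier = PollingBarrier(parties)
    passed = []

    def task(index):
        time.sleep(0.01 * index)  # hypothetical uneven workloads
        barrier.wait()
        passed.append(index)

    threads = [threading.Thread(target=task, args=(i,)) for i in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(passed), barrier.polls
```

Every read counted in `polls` targets the same shared variable, which is the source of the contention and network overload described above. The sketch also shows why preemption is awkward for this scheme: a waiting task only notices the end of standby while it is actually running its polling loop.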
Several known techniques make it possible to manage the concurrent access of several tasks to shared resources and to create synchronization points between several tasks (for example a synchronization barrier).
A first approach encompasses the software procedures. These ensure the synchronization of several tasks and mutually exclusive access to resources by software means alone. Such a procedure may simply consist of a shared boolean variable indicating the status of a token: free or taken. A test loop on this variable is introduced into the code of the concurrent tasks, and each task must execute this loop repeatedly until it obtains the token, before being able to access a shared resource. The token becomes free when the task which possesses it changes its state from “taken” to “free”. This type of procedure, sometimes called “Test-And-Set”, is rather inefficient, notably because of the contention phenomena that it causes. Indeed, all the tasks which attempt to obtain the token dispatch access requests at the same time to one and the same memory location. These simultaneous requests give rise to rapid degradation in performance, not only at the level of the memory in which the token is stored, but also at the level of the connection network which ensures communication between the various hardware modules of the system. To reduce the contention due to active standby, several procedures have been proposed: the introduction of a test prior to the execution of the “Test-And-Set” instruction so as to avoid the systematic execution of the latter (a procedure sometimes designated by the name “Test-And-Test-And-Set”); the insertion of a delay between two successive access requests dispatched by each task; and the use of tickets and/or of specific standby queues for each resource accessible concurrently. These procedures make it possible to limit the execution-latency penalty introduced by certain basic synchronization mechanisms, but they reach their limits as soon as the number of processors increases appreciably.
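The difference between “Test-And-Set” and “Test-And-Test-And-Set” with a delay between attempts can be sketched as follows. This is an illustrative emulation only: Python exposes no hardware test-and-set instruction, so a `threading.Lock` inside the hypothetical `Token` class stands in for the atomicity of that instruction, and the names `acquire_tas`, `acquire_ttas` and `run_workers`, as well as the delay values, are assumptions made for the example.

```python
import threading
import time


class Token:
    """Shared boolean token; _guard emulates the atomicity of test-and-set."""

    def __init__(self):
        self.taken = False
        self.atomic_ops = 0              # number of emulated test-and-set calls
        self._guard = threading.Lock()

    def test_and_set(self):
        # Atomically mark the token as taken and return its previous state.
        with self._guard:
            self.atomic_ops += 1
            previous = self.taken
            self.taken = True
            return previous

    def release(self):
        with self._guard:
            self.taken = False


def acquire_tas(token):
    # Test-And-Set: every attempt is an atomic request on the shared location.
    while token.test_and_set():
        time.sleep(0)


def acquire_ttas(token, max_delay=0.001):
    # Test-And-Test-And-Set: poll with a plain read, wait with a growing
    # delay, and issue the atomic request only when the token looks free.
    delay = 0.00001
    while True:
        while token.taken:
            time.sleep(delay)
            delay = min(delay * 2, max_delay)
        if not token.test_and_set():
            return


def run_workers(acquire, workers=4, rounds=100):
    token = Token()
    counter = [0]

    def worker():
        for _ in range(rounds):
            acquire(token)
            counter[0] += 1   # critical section protected by the token
            token.release()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0], token.atomic_ops
```

Calling `run_workers(acquire_tas)` and `run_workers(acquire_ttas)` returns the final counter together with the number of emulated atomic requests. The counts vary from run to run and depend on thread scheduling, so no particular figures are claimed here.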
Moreover, in a dynamic operating context where the tasks can be placed on standby, preempted or migrated at any moment, for example while waiting to obtain a token, synchronization becomes more complicated, since no physical resource is on active standby awaiting the event.
A second approach encompasses the hardware procedures. In order to avoid the execution latencies introduced by the software approaches, certain techniques propose to implement the synchronization mechanisms in hardware. For example, according to a first technique, a network of connections dedicated solely to the synchronization operations is provided in addition to the customary networks for data transport and control. According to a second technique, certain synchronization mechanisms (such as barriers or mutexes) are implemented with logical operators (AND, OR, etc.) connected together directly by physical connections. However, the hardware infrastructure used to implement the synchronization functions is generally expensive in terms of circuit area and rather inflexible, and is therefore poorly suited to the dynamic aspect of a system, notably to the distribution of the computation load over the various computation units.
A third approach encompasses the mixed software and hardware procedures. Notably, the AMO (“Active Memory Operation”) technique moves certain computation operations, generally implemented processor side, to the memory side so as to reduce the cost of interaction between these two elements. For example, the Cray T3E provides a set of atomic operations implemented in hardware, such as “fetch&increment” and “compare&swap”, as described in the article by S. L. Scott “Synchronization and communication in the T3E multiprocessor”, published in ASPLOS-VII in September 1996. A second mixed technique, called TM for “Tagged Memory”, adds a bit to each data word stored in memory as a latching flag for the synchronization. This technique allows fine-grain synchronization management, but it gives rise to a non-negligible cost overhead in terms of silicon area. A third mixed technique, called LCU (“Lock Control Unit”), makes it possible to support preemption and migration of tasks for certain synchronization mechanisms. However, this technique requires a specific module per computation unit, which may entail a non-negligible cost overhead in terms of circuit area, for example in a hierarchical system with several multi-core clusters. Moreover, this LCU technique relies on the transmission of synchronization messages between these modules, which may give rise to contention phenomena, in particular when the number of tasks and/or computation cores increases in dynamic operating mode, that is to say in an operating mode in which the tasks can be preempted and/or migrated.
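The semantics of the “fetch&increment” and “compare&swap” operations named above can be sketched as follows. This is an emulation for illustration, not the Cray T3E interface: the class `MemoryCell`, its method names and the helper `hand_out_tickets` are hypothetical, and a `threading.Lock` stands in for the atomicity that the hardware provides on the memory side.

```python
import threading


class MemoryCell:
    """One memory word with memory-side atomic operations (lock-emulated)."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        with self._guard:
            return self._value

    def fetch_and_increment(self):
        # Return the current value and add one to it in a single atomic step.
        with self._guard:
            old = self._value
            self._value = old + 1
            return old

    def compare_and_swap(self, expected, new):
        # Write `new` only if the cell still holds `expected`.
        # Returns the value found; the swap succeeded if it equals `expected`.
        with self._guard:
            old = self._value
            if old == expected:
                self._value = new
            return old


def hand_out_tickets(workers=4, per_worker=50):
    # Each task draws tickets from one shared cell; atomicity guarantees
    # that no ticket number is handed out twice.
    cell = MemoryCell()
    tickets = []

    def worker():
        for _ in range(per_worker):
            tickets.append(cell.fetch_and_increment())

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(tickets)
```

Because the read, the modification and the write happen in one step at the memory, a task needs a single request per operation instead of a read followed by a separate write, which is the interaction cost that the memory-side approach aims to reduce.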