1. Field of the Invention
The invention relates to parallel processing and was developed with specific attention paid to the possible application to embedded systems and multi-core System-on-Chips.
Throughout this description reference will be made to acronyms that are of common usage in the art of embedded systems and related areas. A glossary of the most common acronyms used in this description is reported below.
Glossary
API: Application Programmers' Interface
CM: Cluster Manager
HA: High Availability
HP: High Performance
MAC: Medium Access Control
MPI: Message Passing Interface
OS: Operating System
PM: Power Management
PVM: Parallel Virtual Machine
SMP: Symmetric Multi Processing
SMT: Symmetric Multi Threading
SoC: System on Chip
SSI: Single System Image
VLIW: Very Long Instruction Word
2. Description of the Related Art
Clusters of workstations are being used nowadays as a cost-effective replacement for mainframes in scientific applications (high-performance clusters, HP). Each node in a cluster may be a single processor or a symmetric multiprocessor (SMP). Usually the connection among cluster nodes is a dedicated high-speed link, but clusters may also be formed by connecting hosts on the Internet. Another domain where clusters are used is high-availability (HA) servers, where Single System Image (SSI) middleware gives the cluster application programmer the illusion of working on a single workstation.
A key factor for cluster efficiency is inter-processor communication, which, in turn, depends strongly on application partitioning. In order to take advantage of the computational power that is available by clustering several workstations together, applications usually need to be re-written. In HP clusters, tasks on different processors communicate through libraries such as MPI, PVM, and P4, so applications need to use the APIs defined by those libraries. In HA clusters, the main problem to solve is load-balancing, so a middleware layer (that can also be implemented in the OS) takes care of moving processes among cluster nodes in order to guarantee that nodes are equally loaded (from a CPU and memory point of view). Notable examples are the openMosix and Beowulf projects.
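By way of illustration only, the load-balancing behavior described above may be sketched as a greedy loop that repeatedly moves one process from the most loaded node to the least loaded node. The function name `rebalance`, the representation of a node as a list of per-process loads, and the selection rule are assumptions made for this sketch; they are not taken from openMosix, Beowulf, or any other cited system. At least one node and non-negative loads are assumed.

```python
def rebalance(nodes):
    """Greedily even out node loads (hypothetical sketch).

    nodes: dict mapping a node name to a list of per-process loads
           (arbitrary units). The dict is modified in place.
    Returns the list of migrations as (load, source, destination).
    """
    migrations = []
    while True:
        totals = {name: sum(procs) for name, procs in nodes.items()}
        src = max(totals, key=totals.get)  # most loaded node
        dst = min(totals, key=totals.get)  # least loaded node
        gap = totals[src] - totals[dst]
        # Moving a process of load p changes the gap to |gap - 2p|,
        # so only processes with 0 < p < gap reduce the imbalance.
        candidates = [p for p in nodes[src] if 0 < p < gap]
        if not candidates:
            return migrations
        # Pick the process that leaves the two nodes closest to equal.
        proc = min(candidates, key=lambda p: abs(gap - 2 * p))
        nodes[src].remove(proc)
        nodes[dst].append(proc)
        migrations.append((proc, src, dst))
```

For example, starting from `{"a": [4, 3, 2], "b": [1]}`, a single migration of the process with load 4 from node "a" to node "b" leaves both nodes with a total load of 5. An actual middleware layer would also weigh memory usage and migration cost, which this sketch ignores.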
With slight differences, in both solutions an application only needs to fork and the middleware layer can move the child process to a different node depending on its priority in the cluster and its current load. Processes use shared memory to communicate with each other, while the middleware layer re-routes system calls to processes that have been migrated.
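The fork-and-shared-memory pattern referred to above may be illustrated, on a single Unix host, by the following minimal sketch. It shows only the application side: a parent forks a child and reads the child's result from a shared mapping. The function name `sum_in_child` is hypothetical, and nothing here reproduces the process migration or system-call re-routing performed by the middleware layer.

```python
import mmap
import os
import struct


def sum_in_child(values):
    """Fork a child that sums `values` and hands the result back
    through shared memory (Unix only, illustrative sketch)."""
    # Anonymous mapping; on Unix it is MAP_SHARED by default, so the
    # same 8 bytes are visible to parent and child after fork().
    shared = mmap.mmap(-1, 8)
    pid = os.fork()
    if pid == 0:
        # Child process: publish the result, then exit immediately
        # without running any of the parent's cleanup code.
        try:
            shared[:8] = struct.pack("q", sum(values))
        finally:
            os._exit(0)
    # Parent process: wait for the child, then read its result.
    os.waitpid(pid, 0)
    (result,) = struct.unpack("q", shared[:8])
    shared.close()
    return result
```

Calling `sum_in_child([1, 2, 3])` returns 6. In a cluster of the kind described, the child could be moved to another node after the fork, with the application code unchanged.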
More generally, present-day embedded systems are required to support applications with growing complexity, and computational power demand increases proportionally. To satisfy this requirement, multi-processor solutions are currently being investigated. However, in order to fully exploit the available computational power, applications should properly support parallelism.
The field of parallel processing or multiprocessing in general has been extensively investigated in the last twenty years. Solutions have ranged from transputers to clusters of workstations, with specific focus on a number of key issues, namely: 1) an efficient communication bus to interconnect processing nodes, 2) cache coherency in non-uniform memory architectures and 3) message passing libraries to enable communication among process tasks in different nodes. Links to background material can be found, for example, at the Internet address http://www.classiccmp.org/transputer.
In U.S. Pat. No. 6,564,302, hardware arrangements are described that enable a cluster of processing nodes to synchronize hierarchical data caches in order to efficiently exchange data and access external shared memory. The method described requires dedicated hardware support to implement cache coherency.
In U.S. Pat. No. 6,134,619, a hardware-aided method to accomplish effective pass-on of messages between two or more processors is described, while US-A-2003/0217134 discloses a method for flexible management of heterogeneous clusters, such as those that can typically be found in web search engine systems, where three different clusters are in charge of web-spidering, data storage and data mining. Such an arrangement accomplishes efficient communication between clusters by using data gathering services to send data operating information.
In US-A-2003/0130833, a solution is proposed for the quick deployment and reconfiguration of computing systems having virtualized communication networks and storage. This document does not address the problem of running distributed applications among multiple processors but proposes a solution that has a marked impact on computer interconnection structure and storage area design. It targets multi-processing enterprise systems, emphasizing network load balancing and failover features without taking into account any power consumption issues.
In US-A-2003/0050992 the problem of discovering service processors within a multi-node computing system (such as a server system) is addressed. The related arrangement claims to free the OS and management consoles from having to know where different hardware services are located within a network of heterogeneous and function-dedicated nodes.
US-A-2002/0112231 discloses a method of automatically loading different software modules onto different hardware platforms by means of a database that uniquely maps a hardware card to a software module. The corresponding solution is essentially static and is meant to free operators from the burden of manually uploading software modules into the relevant hardware modules. Also, no power efficiency problems are addressed.
EP-A-1 239 368 proposes a method of distributing complex tasks among multiple low-powered devices via a wireless interface. This prior art document does not take into account the possibility of executing different jobs on dedicated nodes either, and, again, power consumption issues are neglected.
Still another document related to the same subject matter is US-A-2002/156932, which again does not optimize overall system power consumption and does not take into account processor performance tuning according to application requirements.
Additionally, U.S. Pat. No. 5,590,284 discloses a dynamically configurable communication bus among transputer nodes separated into a serial path for real-time control commands and a fast parallel bus for large data transfers. Dedicated hardware is needed in each communication node to manage high-speed data transfer. The concept of master and slave nodes is also introduced, the master role being time shared among nodes. The communication bus is designed to support dynamic topology reconfiguration, task redistribution among nodes and maximum data transfer rates. This prior art document addresses the problem of dynamic reconfiguration of communication resources, which is overly complicated for usual embedded systems, where the master node is fixed.
Both US-A-2002/188877 and US-A-2002/147932 address the problem of power consumption in multiprocessing systems. Specifically, US-A-2002/188877 refers to an SMP system with a Java virtual machine where a dedicated application moves threads of execution to different CPUs and at the same time controls their low-power modes. The system tries to determine the minimum number of CPUs required to perform a specific task, distributes threads accordingly and puts the unnecessary CPUs into a low-power mode. This approach requires SMP hardware and has a rather coarse-grained power control. The arrangement described in US-A-2002/147932 is a multiprocessing system with fine-grained power control on individual CPUs, based on feedback received from temperature and noise sensors.
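Purely as an illustration of the coarse-grained approach summarized above for US-A-2002/188877, the following sketch packs threads onto as few CPUs as possible and reports the remaining CPUs as candidates for a low-power mode. It uses a first-fit-decreasing heuristic, which is an assumption of this sketch and not the method of the cited document; the heuristic does not guarantee the true minimum number of CPUs. The name `plan_cpus`, the load units, and the `capacity` parameter are hypothetical, and each thread load is assumed not to exceed the capacity of one CPU.

```python
def plan_cpus(thread_loads, n_cpus, capacity=100):
    """Pack threads onto few CPUs (first-fit decreasing, heuristic).

    thread_loads: per-thread load, e.g. percent of one CPU.
    Returns (assignment, low_power) where assignment maps an active
    CPU index to a list of thread indices, and low_power lists the
    CPU indices that received no threads.
    """
    active = []  # entries are [remaining capacity, [thread indices]]
    # Place the heaviest threads first.
    order = sorted(range(len(thread_loads)),
                   key=lambda i: -thread_loads[i])
    for tid in order:
        load = thread_loads[tid]
        for cpu in active:
            if cpu[0] >= load:  # first active CPU with enough room
                cpu[0] -= load
                cpu[1].append(tid)
                break
        else:
            # No active CPU has room: wake up one more, if available.
            if len(active) == n_cpus:
                raise ValueError("not enough CPU capacity")
            active.append([capacity - load, [tid]])
    assignment = {i: threads for i, (_, threads) in enumerate(active)}
    low_power = list(range(len(active), n_cpus))
    return assignment, low_power
```

With thread loads of 60, 30, 50 and 40 percent on a four-CPU system, the sketch places threads 0 and 3 on CPU 0 and threads 2 and 1 on CPU 1, leaving CPUs 2 and 3 free to enter a low-power mode.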