It is known that in the field of application of super-calculus processing, supercomputers are used, comprising a plurality of processing units, or calculus nodes, made on specific electronic cards and grouped together in different containing racks. The overall processing required by the application is distributed in a coordinated manner to the individual calculus nodes, each of which therefore executes a sub-set of the operations required by the application. Each processing unit effects a predefined set of calculus and processing operations. The processing units are connected to each other by means of a communication apparatus so that they are able to receive input data to be processed and can transmit, to one or more processing units, the output data relating to the results of the processing.
This makes it possible to distribute the processing among several processing units, to effect different processing tasks substantially simultaneously and to share the corresponding results among several processing units, thereby optimizing the use of the processing capacity of the supercomputer.
Modern parallel calculus systems are evolving toward an overall calculus power in the range of the PetaFLOPS (10^15 operations per second), using an extremely high number of individual calculus units (1-10 k) per installation. Systems of this size entail problems of various types in managing the whole system, in particular:
efficient synchronization of all the tasks relating to an application performed in parallel on various calculus nodes; as the installation size grows, performance is reduced, mainly because of the scheduling and I/O activities of the Operating System (OS jitter);
debugging of the applications written specifically for these systems: the extensive use of parallelism distributed over the various nodes requires tools able to interact quickly with the system to identify and isolate possible sources of error;
efficient and flexible management of the system calculus power: simultaneous allocation of calculus power for different applications in the same machine so as to maximize exploitation of the installation;
possibility of configuring the interconnections between the nodes of the system in different topologies: some scientific problems are managed with greater efficiency using regular topologies of the two- or three-dimensional toroidal type (problems represented by regular patterns of points-nodes with characteristics of locality), whereas other models require less regular interconnection structures (problems related to patterns of points that are more irregular and more distant from each other, for example in problems of Molecular Dynamics), where star type or tree type topologies are more efficient.
HPC calculus systems are known that can be partitioned, by means of external cabling or by reconfiguring internal connections, into calculus sub-systems that have equivalent performance and are isolated from each other, in order to execute different applications.
In particular, to improve the performance and overall calculus speed, supercomputers have a communication apparatus that directly connects the processing units to each other, by means of connection cables of a predetermined length. By connecting the calculus nodes directly, the transfer latency, that is, the time needed for a data packet to be transferred from one calculus node to another, is reduced to a minimum.
In this case, given that there can be hundreds or even thousands of processing units in a supercomputer, and that it is not physically possible to make a direct connection between all the processing units, particular connection topologies are made in which each of the processing units is physically connected to the “first nearest” nodes, that is, to another four or six adjacent processing units or nodes, in the first case obtaining a two-dimensional connection topology whereas in the second case the connection topology is the three-dimensional type (FIG. 1).
The processing units at the external extremes of the topology, that is, those that do not have a processing unit physically adjacent to close the complete connection, are in turn reciprocally connected, so as to close the respective connections on “first nearest” and achieve, for example in the case of a three-dimensional topology, a “3D toroidal” network architecture (FIG. 2).
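By way of illustration only, the “first nearest” connections with toroidal closure described above can be sketched as follows. The function name `torus_neighbors` and the representation of a node as a tuple of integer coordinates are assumptions made for this sketch and are not taken from any known apparatus.

```python
from typing import List, Tuple

Coord = Tuple[int, ...]


def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the first-nearest nodes of `node` in an n-dimensional torus.

    `dims` gives the number of nodes along each axis (hypothetical layout).
    The modulo operation models the links that close the topology between
    the nodes at the external extremes.
    """
    neighbors: List[Coord] = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            shifted = list(node)
            shifted[axis] = (node[axis] + step) % size  # wrap-around link
            candidate = tuple(shifted)
            # Skip duplicates that arise when an axis holds only 1 or 2 nodes.
            if candidate != node and candidate not in neighbors:
                neighbors.append(candidate)
    return neighbors


# A corner node of a 4 x 4 x 4 torus still has six first-nearest nodes.
corner_3d = torus_neighbors((0, 0, 0), (4, 4, 4))
# In a 4 x 4 two-dimensional torus every node has four first-nearest nodes.
corner_2d = torus_neighbors((0, 0), (4, 4))
```

In this sketch the node at coordinate 0 along an axis is linked to the node at coordinate 3 on the same axis, which corresponds to the closure between the extremes of the topology.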
Compared with a bus connection between the different processing units, these connection topologies to the first nearest make it possible to obtain excellent processing performance, especially in some specific fields of calculus application such as quantum chromodynamics, fluid-dynamic simulations or others.
The connections between one processing unit or calculus node and an adjacent processing unit, or first nearest, normally provide an output connection or link and an input connection or link, both unidirectional, or the use of a bidirectional link.
These connection topologies to the first nearest also provide auxiliary connections or links to different processing units so as to increase for each processing unit the possible number of first nearest nodes, usable as alternatives. The auxiliary connections are activated selectively, excluding a corresponding main connection, according to specific commands received from a management entity of the supercomputer, so as to partition dynamically the number of processing units collaborating to effect one or more specific calculus tasks. The partitioning can be defined before the applications are executed, or in real time as a function of the needs of the algorithm.
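The selective activation of auxiliary links can be illustrated with a simplified one-dimensional sketch, in which a ring of nodes is split into smaller rings, each closed on itself. The functions `ring_next` and `uses_auxiliary_link`, the numbering of the nodes and the assumption of equally sized partitions are hypothetical and serve only to show the principle.

```python
def ring_next(node: int, total: int, part_size: int) -> int:
    """Successor of `node` when a ring of `total` nodes is split into
    closed sub-rings of `part_size` consecutive nodes each."""
    if part_size < 1 or total % part_size != 0:
        raise ValueError("part_size must divide total")
    base = (node // part_size) * part_size  # first node of the partition
    return base + (node - base + 1) % part_size


def uses_auxiliary_link(node: int, total: int, part_size: int) -> bool:
    """True if the successor differs from the one reached over the main
    link of the unpartitioned ring, i.e. an auxiliary link is needed."""
    return ring_next(node, total, part_size) != (node + 1) % total


# Ring of 8 nodes split into two closed rings of 4 nodes each.
successors = [ring_next(i, 8, 4) for i in range(8)]
aux_nodes = [i for i in range(8) if uses_auxiliary_link(i, 8, 4)]
```

In this sketch only the nodes at the boundary of each partition (nodes 3 and 7) exclude their main link in favor of an auxiliary one; all other nodes keep their main connection.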
It is known to use a network, physically distinct from the node-to-node network that communicates calculus data and information, to manage specific synchronism information between the various parallel sub-tasks of an application.
In particular, global synchronization networks for HPC systems are known, used to distribute synchronization information with minimum latency to the whole system or partition.
This is the case, for example, of signals to support collective synchronization operations such as barriers and allwait, which require all the processes of an application to be put on standby until all the other processes reach a given processing point. The support to accelerate the management and transmission of this information, or for the management of global fault signals, consists in the capacity of each node in the network to execute AND type operations on the synchronization signals in transit.
Other global synchronization signals can carry information on global events: in this case each node of the network has to perform OR type operations on the signals in transit.
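The AND type and OR type operations on signals in transit can be sketched as follows, assuming for illustration that the synchronization network is a tree described by a dictionary mapping each node to its children. The name `combine_up` and the data layout are assumptions of this sketch.

```python
from typing import Callable, Dict, List

Tree = Dict[int, List[int]]  # hypothetical layout: node id -> child node ids


def combine_up(tree: Tree, local: Dict[int, bool], node: int,
               op: Callable[[bool, bool], bool]) -> bool:
    """Combine the local signal of `node` with the signals arriving from
    its children, as each node would do on the signals in transit."""
    result = local[node]
    for child in tree.get(node, []):
        result = op(result, combine_up(tree, local, child, op))
    return result


tree: Tree = {0: [1, 2], 1: [3, 4]}

# Barrier: released only when every node has reached the barrier (AND).
reached = {0: True, 1: True, 2: True, 3: False, 4: True}
barrier_open = combine_up(tree, reached, 0, lambda a, b: a and b)

# Global event: raised as soon as any node signals it (OR).
event = {0: False, 1: False, 2: False, 3: False, 4: True}
event_seen = combine_up(tree, event, 0, lambda a, b: a or b)
```

Here `barrier_open` is false because node 3 has not yet reached the barrier, while `event_seen` is true because node 4 has raised the event.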
It is possible to use this network to distribute signals used as global clocks by the system. Operations for debugging applications can also use this network to manage high priority global signals.
In particular, the coordination of the specific calculus tasks of the processing units of one or more partitions is managed by the synchronization network, which makes it possible to coordinate with each other the processing units or nodes that belong to the same topological grid.
The synchronization network may include a daisy chain topology, that is, from node to node, a tree or star topology, or mixed, or a connection topology of the first nearest type, for example toroidal. It is important that in the synchronization network the latency is guaranteed constant, within a certain range, for every partitioning made.
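As a rough illustration of why the latency depends on the topology chosen and on the size of the partition, the following sketch compares the worst-case number of hops from a source node in a daisy chain and in a complete tree. The function names, and the use of the hop count as a stand-in for latency, are simplifying assumptions of this sketch.

```python
def daisy_chain_hops(n_nodes: int) -> int:
    """Worst-case hops from the first to the last node of a daisy chain."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be positive")
    return n_nodes - 1


def tree_hops(n_nodes: int, fanout: int = 2) -> int:
    """Worst-case hops from the root to a leaf of the shallowest complete
    tree with the given fanout that can hold `n_nodes` nodes."""
    if n_nodes < 1 or fanout < 2:
        raise ValueError("n_nodes must be positive and fanout at least 2")
    depth, capacity, level_width = 0, 1, 1
    while capacity < n_nodes:
        level_width *= fanout      # nodes on the next level
        capacity += level_width    # nodes in the tree so far
        depth += 1
    return depth


chain_1024 = daisy_chain_hops(1024)
binary_tree_1024 = tree_hops(1024)
quad_tree_1024 = tree_hops(1024, fanout=4)
```

Under these assumptions, the hop count of a daisy chain grows linearly with the number of nodes in the partition, whereas that of a tree grows only logarithmically.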
The interconnections discussed are made in hardware, therefore by means of conductor tracks inside the printed circuits, connectors, cables etc.
Given that in a supercomputer there are hundreds or thousands of nodes, it is unthinkable to modify the interconnections as above in order, for example, to introduce a different topology, or to make two smaller sub-topologies (or partitions), each closed toroidally. However, in many cases this is desirable because it provides several supercomputers with a lower calculus capacity (independent and interconnected, for example, two- or three-dimensionally with toroidal closure) that are able to exploit the overall calculus power to the maximum.
Networks known as “collective networks” are also known, specifically made to manage “collective” operations that require coordination of all the nodes involved in a specific processing. For example, in the case of so-called global reduction operations (Allreduce function according to the MPI standard, Message Passing Interface), which require the result of an operation made in parallel on a multitude of nodes to be returned to a specific node, the collective network manages the transit of all the partial results of the processing and, where possible, processes the results (typically summing them), reducing the overall traffic on the network and avoiding congestion of the node that receives all the messages from the transmitting processing nodes. Another operation, symmetrical to the reduction operation, is the broadcast operation, with which one datum is distributed to several nodes through parallel processing.
Optimizing the communication of this information in the network requires hardware to support the execution of simple arithmetical and logical operations by every node in the network (for example AND, OR, XOR, integer sum, subtraction, max, min, . . . ) on the messages in transit.
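The effect of combining partial results in transit can be sketched as follows, assuming a collective network with a tree topology in which the result is gathered at the root. The names `tree_reduce` and `messages_into_root`, and the dictionary-based description of the tree, are assumptions made for this sketch and do not reproduce any MPI interface.

```python
import operator
from typing import Callable, Dict, List

Tree = Dict[int, List[int]]  # hypothetical layout: node id -> child node ids


def subtree_size(children: Tree, node: int) -> int:
    """Number of nodes in the sub-tree rooted at `node`."""
    return 1 + sum(subtree_size(children, c) for c in children.get(node, []))


def tree_reduce(children: Tree, values: Dict[int, int], node: int,
                op: Callable[[int, int], int]) -> int:
    """Each node combines its own value with the partial results received
    from its children and forwards a single partial result upwards."""
    partial = values[node]
    for child in children.get(node, []):
        partial = op(partial, tree_reduce(children, values, child, op))
    return partial


def messages_into_root(children: Tree, root: int,
                       combine_in_transit: bool) -> int:
    """Messages delivered to the root: one per child when partial results
    are combined in transit, otherwise one per other node."""
    if combine_in_transit:
        return len(children.get(root, []))
    return subtree_size(children, root) - 1


children: Tree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
values = {n: n + 1 for n in range(7)}  # node n contributes the value n + 1

total = tree_reduce(children, values, 0, operator.add)
largest = tree_reduce(children, values, 0, max)
```

In this seven-node example the root receives two messages when the intermediate nodes sum the partial results, against six if every node sent its value directly to the root.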
U.S. Pat. No. 7,761,687 (U.S. Pat. No. '687) is known, which describes an IBM supercomputer architecture called BlueGene, comprising the following physical networks of internal connectivity between the nodes of the system:
a single node connection network for data communication between the nodes and a toroid topology with an arbitrary number of other nodes, interconnecting all the nodes;
a collective network to manage operations of a collective type between the nodes of the toroidal network pertaining to a processing, such as global reduction operations, broadcast, point-to-point message service or sub-tree for input/output, work load programming, system management, monitoring of parallel tasks and debug, so that the “services” or input/output nodes are isolated from the individual node communication network of the n-toroidal type and do not interfere with the parallel computation;
a “Global Asynchronous network” to supply all the nodes of the system with global synchronism information such as barriers and interrupt distribution (or other notification info).
Application US-A-2009/0259713 (U.S. Pat. No. '713) discloses an analogous supercomputer architecture, again IBM, that comprises:
a single node communication network of the n-toroidal type;
a global tree network, with the same functions as the collective network in U.S. Pat. No. '687;
a “Global Asynchronous network” to supply all the nodes of the system with global synchronism information.
Consequently, both U.S. Pat. No. '687 and U.S. Pat. No. '713 use a single node communication network of the n-toroidal type for communicating data and information between the calculus nodes. However, this solution makes these architectures rigid with regard to the topological interconnection configurations between the nodes that can be achieved.
The international project QPACE (http://en.wikipedia.org/wiki/QPACE) is also known, which describes an architecture formed by:
a Network Processor (NWP) made with an FPGA component that implements communication between the nodes in a 3D-toroidal connection;
use of a dedicated global synchronization network (“Global Signal Network”) with a tree topology for the rapid distribution of synchronism information to the whole system, and synchronization of the nodes.
The purpose of the present invention is to obtain a communication apparatus for an HPC system that defines a scalable network architecture with an extremely high number of calculus units and performance in the range of PetaFLOPS, and that is:
flexibly configurable in different topologies (n-toroidal, fat tree, hybrid) and partitions of these;
scalable with constant latency in the range of a few microseconds in node-to-node communication for every configuration chosen;
able to reduce the use of the processing units for synchronization operations and other activities not directly connected to the processing (I/O, monitoring, debug, . . . );
able to obtain specific synchronization conditions depending on the algorithm executed or the execution step;
able to decide synchronization conditions for one processing node as a function of all the synchronism information pertaining to the node.
The Applicant has devised, tested and embodied the present invention to overcome the shortcomings of the state of the art and to achieve these and other purposes and advantages.
Unless otherwise defined, all the technical and scientific terms used here and hereafter have the same meaning as commonly understood by a person with ordinary experience in the field of the art to which the present invention belongs. In the event of conflict, the present application shall prevail, including its definitions.