This invention relates to a network of parallel processors tolerant to the faults thereof, and a reconfiguration method applicable to such network.
The field of the invention is that of parallel computers for all kinds of applications. Two sample applications are thus given in the document referenced as [1] at the end of the description.
The increasing possibilities of micro-electronic technology, as well as the evolution of multiprocessor architectures, are leading to computers that are more and more complex both in terms of elements composing them (electronic gates, memories, registers, processors, . . . ) and in terms of complexity of the software used.
The designers of such computers having a high integration parallel or extensively parallel structure must take into account two conflicting requirements:
1. Machines having a parallel or extensively parallel structure are prone to faults owing to the very large number of processors and their complexity, which leads to poor manufacturing yield and to serious faults in normal operation.
2. With highly advanced technologies and high integration systems, more and more processors can be incorporated into an application specific integrated circuit (ASIC), a multichip module (MCM) or a card. In such systems, the main disadvantage is limited bandwidth, i.e. the amount of information that can be put through.
In order to meet the first of these requirements, one solution of known art consists in replacing faulty processors with spare processors which are identical to the others from an operational point of view. Such a solution, enabling “structural fault tolerance”, then tries to ensure proper operation, and in particular network consistency, so as not to penalize the architecture. It implies a reconfiguration consisting in replacing faulty elements with spare elements made available through interconnection elements and intercommunication elements.
In a 2D (or bidimensional) type of network, the solutions proposed for providing fault tolerance are:
adding as many processor lines to the system as faults are to be tolerated. This solution is very simple and requires few spare interconnections, reconfiguration being performed by simply bypassing the lines where there is a faulty processor. Performance loss is then limited. On the other hand, the spare processors are very poorly used, as one line is required to tolerate one fault, and in case of a faulty bypass, the whole system is down.
or adding switches, spare processors and connections to the standard network.
As described in the document referenced as [2], a network corresponding to the latter type of solution and called “m-Track, n-Spare” is composed of processors 10, switches and spare connections. Two kinds of switches are used: switches 11 coupling processors with connections (PT=Processor-to-Track) and switches 12 coupling connections with each other (TT=Track-to-Track). All network links are bi-directional, i.e. communications can come and go in each connection. Spare processors 13 (sp) are positioned at the network borders. For the reconfiguration method to be effective, these processors must be positioned at least in one line and one column of the network.
FIG. 1 illustrates a sample network of the “2-Track, 2-Spare” type. Spare processors 13 (sp) are positioned all around the network and are used to reconfigure the network in case the useful processors 10 are faulty. Switches 11, 12 are used to enable reconfiguration. Here, the network has 200% of spare connections in comparison with the so-called operational connections.
Those skilled in the art can then use a reconfiguration method, based on error correcting codes, which can be broken down into two phases:
the first one consists in finding, for each faulty processor, a compensation track leading from the faulty processor to a spare processor;
the second one, carried out if the first phase is successful, consists in replacing each processor along the compensation track with its nearest neighbour, thus reaching, through cascading changes, a spare processor. The operational grid is thus maintained.
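The cascading replacement of the second phase can be illustrated with a minimal sketch. It assumes, purely for illustration, a straight compensation track running along a single line towards one spare processor placed at the right-hand border; the function name `remap_row` and this one-line geometry are hypothetical and are not taken from document [2].

```python
def remap_row(width, faulty_col):
    """Map each logical column of one line to a physical column.

    Assumption: the line holds `width` useful processors in physical
    columns 0..width-1 and one spare in physical column `width`.
    The compensation track runs from the faulty column to that spare.
    """
    if not 0 <= faulty_col < width:
        raise ValueError("faulty_col must designate a useful processor")
    # Processors before the fault keep their place; every processor from
    # the fault onwards is replaced by its right-hand neighbour, so the
    # last logical column ends up on the spare.
    return [col if col < faulty_col else col + 1 for col in range(width)]


mapping = remap_row(4, 1)
```

With four useful processors and a fault in column 1, logical columns 1, 2 and 3 move to physical columns 2, 3 and 4, the last one being the spare.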
Such a network has many disadvantages:
Bi-directionality of links offers many possibilities for interprocessor routing, but has two major disadvantages in comparison with unidirectional links:
communication time is much longer, owing on the one hand to programming the link direction and, on the other hand, to passing through the circuits required for providing such bi-directional communications;
complexity is increased, because interprocessor communications must be handled in order to determine the routing direction;
The number of added connections in comparison with “useful” links, which is a minimum of 100%, makes such a solution inadequate for high integration parallel computers where the bandwidth of certain levels, i.e. the number of connections, is very limited;
Having to add a substantial number of spare processors can lead to problems, in particular for small networks, comprising about a hundred processors, where spare processors can be blamed for 40% of possible faults.
The reconfiguration method considered above, in turn, has two major disadvantages:
it is not suitable for unidirectional links; indeed, in this case, two connection buses, one coming and one going, are required for connecting the considered processor to each of its neighbours.
the number of switching elements passed between two logically neighbouring processors is not deterministic, which makes the method ineffective for dealing with the case of synchronous interprocessor communications.
In order to overcome these disadvantages, it is an object of the network according to the invention to solve the problem of fault tolerance in an extensively parallel architecture with significant processor coupling, by proposing a solution meeting the following constraints:
obtaining a fault tolerant network with connections that may be unidirectional;
highly limiting inoperative communication media of the network;
limiting communication time between processors by limiting the number of reconfiguration switches passed between two processors;
allowing greater flexibility for choosing the number of spare processors;
having a solution capable of supporting different processor topologies, in particular matrix, line or hypercube topologies.
The invention relates to a network of parallel elementary processors tolerant to the faults of these processors comprising said elementary processors, spare elementary processors, elements interconnecting these processors, and a control unit, characterized in that it comprises alternately a series of interconnecting element lines and processor lines, each processor being surrounded by four interconnecting elements, with the processor lines being elementary processor lines, the last processor line being a line of spare elementary processors, the edge elements of the network being interconnecting elements, and in that the control unit, connected to the processors and interconnecting elements, sends instructions to the processors, controls the interconnecting elements, and checks the integrity of these processors. Each processor is connected to four interconnecting elements, two of these diametrically opposed elements being connected to the two processor inputs, the other two elements, also diametrically opposed, being connected to the two processor outputs, these interconnecting elements being connected together through vertical or horizontal links.
Advantageously, the interconnecting elements inside the network have a complexity of six inputs and six outputs, four inputs and four outputs being connected to the neighbouring interconnecting elements inside the network, and two inputs and two outputs being connected to the processors neighbouring the interconnecting element considered.
An interconnecting element has at least one unidirectional output and one unidirectional input connected to one input and one output of at least one South/West, North/East, North/West, or South/East processor and at least two unidirectional inputs and two unidirectional outputs connected to at least two outputs and two inputs of the interconnecting elements located North, East, South, or West.
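The alternating arrangement described above can be visualised with a short sketch. It is one reading of the text, not a reproduction of the figures: processors are assumed to sit at the centre of the square formed by their four interconnecting elements, and the function names and the characters "+", "P" and "S" are hypothetical.

```python
def build_layout(processor_lines, processors_per_line):
    """Return the network as text lines: '+' is an interconnecting
    element, 'P' an elementary processor, 'S' a spare processor.

    Assumption: `processor_lines` counts the spare line, which is the
    last processor line, and every edge line holds only '+' elements.
    """
    height = 2 * processor_lines + 1
    width = 2 * processors_per_line + 1
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            if y % 2 == 0 and x % 2 == 0:
                row.append("+")          # interconnecting element line
            elif y % 2 == 1 and x % 2 == 1:
                # the last processor line is the spare line
                row.append("S" if y == height - 2 else "P")
            else:
                row.append(" ")          # room for a link
        grid.append("".join(row))
    return grid


def surrounding_elements(line, column):
    """Coordinates (y, x) of the four interconnecting elements located
    North/West, North/East, South/West and South/East of a processor."""
    y, x = 2 * line + 1, 2 * column + 1
    return [(y - 1, x - 1), (y - 1, x + 1), (y + 1, x - 1), (y + 1, x + 1)]


layout = build_layout(2, 2)
```

Which diagonal pair of elements feeds the processor inputs and which pair receives its outputs is left open here, since the text only requires each pair to be diametrically opposed.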
In one embodiment, each processor is a computing element integrating an arithmetic and logic unit, a set of work registers and a test logic allowing the automatic testing thereof to be performed. An interconnecting element is composed of several data multiplexers with n inputs to one output, each of these multiplexers being controlled by wires selecting the output channel in order to allow each of the outputs of the interconnecting element to be connected to any input, with multiplexer selections being stored in two registers inside the interconnecting element.
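The multiplexer-based interconnecting element of this embodiment can be modelled in a few lines. The sketch below is a behavioural illustration only: the class and method names are hypothetical, and the multiplexer selections are held in a single list, whereas the embodiment stores them in two registers.

```python
class InterconnectingElement:
    """Behavioural model: one n-to-1 multiplexer per output."""

    def __init__(self, n_inputs=6, n_outputs=6):
        self.n_inputs = n_inputs
        # select[k] is the index of the input forwarded to output k
        # (stands in for the selection registers of the element).
        self.select = [0] * n_outputs

    def program(self, output, source_input):
        """Connect one output to any one input."""
        if not 0 <= output < len(self.select):
            raise ValueError("no such output")
        if not 0 <= source_input < self.n_inputs:
            raise ValueError("no such input")
        self.select[output] = source_input

    def forward(self, inputs):
        """Return the value present on each output for given inputs."""
        if len(inputs) != self.n_inputs:
            raise ValueError("one value per input is expected")
        return [inputs[source] for source in self.select]


element = InterconnectingElement()
element.program(0, 3)
element.program(5, 3)
outputs = element.forward(list("abcdef"))
```

Because each output has its own multiplexer, the same input can be broadcast to several outputs, as outputs 0 and 5 show here.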
Advantageously, the processors and interconnecting elements can be integrated in an application specific integrated circuit. The control unit can be integrated in reconfigurable logic components.
This invention also relates to a method of reconfiguring the processor network, comprising:
a step of positioning operational processors of the logic network;
a step of routing consisting in programming interconnecting elements on the physical network, by choosing the maximum number of these interconnecting elements that can be passed between two neighbouring processors using an algorithm for searching the shortest track.
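The routing step can be sketched as a breadth-first search, which is one standard way of finding a shortest track; the text does not say which shortest-track algorithm is used. The grid model, the `used_links` set of already programmed unidirectional links and the function name are assumptions made for this illustration.

```python
from collections import deque


def shortest_route(rows, cols, src, dst, used_links, max_elements):
    """Shortest track between two interconnecting elements.

    Elements are (row, col) nodes of a rows x cols grid joined by
    vertical and horizontal links. `used_links` holds the directed
    links (from_node, to_node) that are no longer available.
    Returns the list of elements passed, or None when no track uses
    at most `max_elements` interconnecting elements.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        r, c = node
        for nxt in ((r - 1, c), (r, c + 1), (r + 1, c), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue                      # outside the network
            if nxt in parent or (node, nxt) in used_links:
                continue                      # visited or link taken
            parent[nxt] = node
            queue.append(nxt)
    if dst not in parent:
        return None
    path = []
    node = dst
    while node is not None:                   # walk back to the source
        path.append(node)
        node = parent[node]
    path.reverse()
    return path if len(path) <= max_elements else None


direct = shortest_route(3, 3, (0, 0), (0, 2), set(), 3)
detour = shortest_route(3, 3, (0, 0), (0, 2), {((0, 0), (0, 1))}, 5)
```

Bounding the number of elements passed keeps the delay between two logically neighbouring processors under control, which matters for synchronous communications.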
In the method of the invention:
a sequence is determined for positioning the network processors that is composed of a starting processor and a series of processors including all processors;
each of the processors is tentatively positioned, starting with its logical position and then, in case of failure, in each of the positions located at a distance 1, distance 2, . . . from the logical position of this processor, a restriction being that one and only one spare position must be used with respect to the possible positions of the previously positioned processors, stopping when S+1 positions have been tested, S being the number of spare processors;
if S+1 positions have been tested without success, returning to the previous processor in the positioning sequence and proceeding with the next position for this processor;
optionally, when all processors have been positioned, it is checked for each network dimension that the logical sequence is followed for each pair of processors and, if not, the positions of these processors are inverted.
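The positioning steps above can be sketched as a backtracking search. This is a simplified illustration under stated assumptions: distance is taken as the Manhattan distance, ties are broken towards the spare line, and both the restriction on spare positions and the final order check are left out. All names are hypothetical.

```python
def candidate_positions(logical, phys_rows, cols, limit):
    """Physical positions ordered by distance from the logical one,
    truncated to `limit` entries (S+1 in the method)."""
    r0, c0 = logical
    cells = [(r, c) for r in range(phys_rows) for c in range(cols)]
    # Assumed tie-break: at equal distance, prefer the spare side.
    cells.sort(key=lambda p: (abs(p[0] - r0) + abs(p[1] - c0), -p[0], p[1]))
    return cells[:limit]


def place(order, phys_rows, cols, faulty, n_spares):
    """Assign a fault-free physical position to every logical
    processor of `order`, or return None if the search fails."""
    assignment, taken = {}, set()

    def solve(i):
        if i == len(order):
            return True
        logical = order[i]
        for cell in candidate_positions(logical, phys_rows, cols,
                                        n_spares + 1):
            if cell in faulty or cell in taken:
                continue
            assignment[logical] = cell
            taken.add(cell)
            if solve(i + 1):
                return True
            # Failure further on: free the cell and try the next
            # position for this processor.
            taken.discard(cell)
            del assignment[logical]
        return False

    return assignment if solve(0) else None


order = [(0, 0), (0, 1), (1, 0), (1, 1)]
result = place(order, 3, 2, {(0, 0)}, 2)
```

In this 2 x 2 example with one spare line, a fault at physical position (0, 0) shifts the first column down by one line, so that logical processor (1, 0) lands on the spare at (2, 0).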
In one embodiment, the positioning sequence is defined like this: the starting processor is the top left processor, the next processors are the processors to the right and below the starting processor, and so on, following a diagonal.
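A minimal sketch of such a diagonal positioning sequence is given below. It assumes that, on each diagonal, the processor further to the right is taken before the one further down; the text does not fix this order, and the function name is hypothetical.

```python
def diagonal_order(rows, cols):
    """List the logical positions (row, col) diagonal by diagonal,
    starting with the top left processor."""
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    # r + c identifies the diagonal; within one diagonal the smaller
    # row index (the processor further to the right) comes first.
    return sorted(cells, key=lambda p: (p[0] + p[1], p[0]))


sequence = diagonal_order(2, 3)
```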
It is also possible to divide the network into blocks and define a block positioning sequence starting with a starting block and going through all the blocks from one neighbouring block to the next, with the positions for the processors of one block not including any logical position of the processors of the previously positioned blocks.
Advantageously, this inventive method can be implemented either statically, or dynamically during operation.
The invention has the following advantages:
The proposed network is applied to all kinds of parallel computers in order to make them tolerant to the faults of the processors and their interconnections. Redundancy is obtained through a very low additional electronic volume (addition of a few processors identical to the others).
The proposed processor interconnection structure allows different topologies of parallel computer architectures to be supported, e.g. processor grid, torus or ring topologies.
The five constraints defined above for the problem to be solved are met by this network and the associated reconfiguration method.
The proposed structure can be used as a complete network or as a network element, therefore with global or local reconfiguration. The structure, when used as a network element, is particularly suitable for high integration structures with very limited bandwidth. The structure, when used as a complete network, is suitable for integrated structures requiring high fault tolerance.
By controlling, via logic gates, whether or not clocking is distributed to each processor, the consumption as well as the heat dissipation of the network can be improved and optimised. This improvement results in a reduction of the operating temperature of the circuits of the structure, and their reliability is thus improved.
The invention, when adapted to a high integration structure, allows the yield of the silicon foundry used for making the circuit to be virtually increased. Indeed, for a conventional component, if a transistor is defective, the circuit is considered defective, whereas with the proposed invention, the circuit remains operational even with very many faulty transistors, as long as these faulty transistors are distributed inside the same processor or among processors that are not grouped physically, the only limit being the number of spare processors.
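The order of magnitude of this yield gain can be estimated with a simple binomial model. The model is an assumption added for illustration and is not part of the description: processors are taken to fail independently with the same probability, and a circuit is counted as usable whenever the number of faulty processors does not exceed the number of spares, which ignores the placement and routing limits. The function name and the figures are hypothetical.

```python
from math import comb


def yield_with_spares(n_processors, n_spares, p_good):
    """Probability that at most `n_spares` processors are faulty
    among the n_processors + n_spares processors of one circuit."""
    total = n_processors + n_spares
    p_bad = 1.0 - p_good
    return sum(comb(total, k) * p_bad ** k * p_good ** (total - k)
               for k in range(n_spares + 1))


without_spares = yield_with_spares(16, 0, 0.95)
with_spares = yield_with_spares(16, 4, 0.95)
```

Under these assumptions the result is an upper bound on the usable fraction of circuits, since faults grouped physically may defeat the reconfiguration.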