This invention relates to a general purpose electronic numeric parallel computer with multiple processors, MIMD (Multiple Instruction stream, Multiple Data stream) in Flynn's classification model and oriented to latency reduction, and also relates to its component processors.
Replication of regularly interconnected and co-operating processing elements or nodes can improve the performance, reliability and cost of computers.
A MIMD machine with multiple processors, also called MULTI, consists of a collection of processors that, through an interconnection structure, either share a global memory or only communicate without memory sharing. The former are also called multiprocessors, the latter multicomputers.
Beyond advantages, current MULTI still have inconveniences and disadvantages.
To communicate among parallel processes, multiprocessors adopt the same mechanism used for processor-memory communication, which is flexible and suitable for any computation, but the shared memory becomes a "bottleneck" as the number of processors accessing it increases.
The complexity and the costs of multi-ported memory can be bounded only at the expense of increased memory latency, or reducing memory traffic by using local cache memories, which add further complexity and costs to manage their coherency protocols.
Within multicomputers each processor has a private memory that allows lower latency and greater scalability, but existing communication mechanisms do not allow efficient communication among parallel processes. The communication of a message requires an input/output operation. Even when a message is associated with a high-priority interrupt, its average latency remains greater than that of an access to shared memory.
The interconnection structures used determine the topology, the node or connection degree, and several performance characteristics of MULTI. Any direct connection among nodes also requires a node interface. Since the node degree increases as the number of nodes grows, the interconnection costs rapidly come to dominate the machine costs. With current technologies, the node degree must necessarily be kept low, even though this increases the probability of congestion and conflicts, makes the communication latency variable, and makes performance dependent on the space-time distribution of the traffic and on the application itself. In order to have flexible and accessible networks at acceptable costs, optimal topologies are used, as well as switching and message combining elements, buffers, and routing and flow control techniques, all of which make the current interconnection structures hard to manufacture, and still too expensive and inefficient from the performance point of view.
The degree of parallelism matches the number of processors, but the total computing power also depends upon the power of the single processors. Current implementations have constraints by which these two power factors are not independent. The parallel processes communicate through globally shared resources with limited capacity, and this generates congestion and/or access conflicts which degrade the expected performance both as the number of processors grows and as the power of the single processor grows.
Within MULTI the difficulty of synchronising the parallel processes strongly reduces the number of applications that can take advantage of parallel execution. The problem does not lie in distributing a common iso-frequential timing signal to all processors, as is ordinarily done within SIMD too, but mainly in the impossibility of predicting the exact execution time of a process. Each processor has its own autonomous sequence control and, as time passes, parallel processes become unrelated in time to one another, in a way that cannot be controlled by the programmer.
Synchronisation is achieved indirectly through communication. Current methods are based on message passing in multicomputers and on access control to shared memory variables within multiprocessors. These operations, performed mostly at software level with many instructions of the ordinary repertoire and a few specialised instructions (test and set, fetch and add, etc.), are still too slow and penalise the communication time. Moreover, they generate messages that increase traffic congestion. Therefore most of the MULTI built so far are unsuitable for synchronising a large number of small processes, and for strongly reducing the execution time (latency) of a single task.
Within MULTI there is also the load balancing problem, which aims to optimise the use of resources by uniformly distributing the load among processors. Migration, that is, moving an allocation to other resources after the initial decision, has been considered as a solution to the dynamic load balancing problem, and it has also been found useful for reducing the network load by bringing communication partners closer together. With multicomputers process migration is more burdensome because it also requires copying memory, therefore the migration of simpler entities is used instead. The convenience of run-time migration is doubtful because the transfer overhead is hardly offset by performance gains, therefore process migration from processor to processor is seldom used in highly parallel computers.
MULTI usually employ normal microprocessors available on the market and also used in SISD machines. Otherwise, they employ dedicated processors with special mechanisms for fast interrupt handling and fast context switching, with several register banks or windows, or they integrate communication/routing message interfaces and units. However, the processors used are equipped with full and autonomous fetch/execution capability and are configured as memory bus masters that, once activated, continuously fetch and execute instructions, but normally do not allow access to their own internal registers from outside, except for debug purposes. Computing nodes in multicomputers are usually multiprocessors with an application processor and one or more communication and switching routine processors, to overlap communication time with processing time, even if that increases the cost of parallelism.
The aim of the invention is to find an optimal combination of processor replication and interconnection, as well as modalities of process execution and co-operation, and to devise the appropriate structural and functional processor modifications, so as to achieve a parallel processor or MULTI without said inconveniences, having an optimised and high-performance interconnection structure that allows efficient communication and synchronisation among parallel processes, and that readily reduces the execution and completion time (latency) of a single task. The technical problem posed is large and hard because of the high number of possible choices at both the physical and the logical level, concerning several aspects of parallelism that have been investigated for a long time but are difficult to understand and to resolve.
The proposed solution, as per claim 1, consists in the direct pairing between processors of separate memory buses, in such a way that two tightly coupled processors can reciprocally synchronise themselves and share their internal register files, allowing easy communication and synchronisation between the two adjacent parallel processes of the pair, and in adopting process migration among redundantly replicated processors on the same memory bus, to allow each process to communicate and synchronise with several adjacent parallel processes.
The pairing is accomplished through mutual extension of internal buses from one processor to the functional units of the other one. The single processor thus also becomes a pair communication unit, normally connected to memory and peripherals, but mainly connected to another processor. Several processors are connected on the same memory bus, accessing the same instructions and data in the shared memory equally rather than concurrently. Each memory bus is managed as a single-master bus, wherein processors co-operate in the execution of a single sequential migrating process. Beyond the memory bus, processors also share a process migration structure that allows process control and the "context" contained within the state registers to be transferred from one processor to another one on the bus. Thus run-time process migration among processors is achieved easily, preserving the identity and continuity of each process.
Processors are modified to eliminate concurrent access conflicts to the shared memory. They are formed to be, on the memory bus, either master-active like a traditional processor, or slave-inactive like a peripheral which does not perform processing activity, but which allows its internal registers to be accessed and loaded from outside. A slave processor remains inactive for an indefinite period of time, waiting to receive control and to resume processing activity starting from the received context. Processors of the same bus are individually paired with a processor belonging to a separate memory bus so as to form pairs between distinct memory buses.
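The master-active and slave-inactive roles described above can be illustrated with a small software model. The following is only a minimal sketch under stated assumptions: the names `Processor`, `MemoryBus` and `migrate`, and the register names `pc` and `acc`, are hypothetical and do not come from the invention, which realises this behaviour in hardware.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Hypothetical model of one processor attached to a memory bus."""
    name: str
    active: bool = False                           # True: master-active, False: slave-inactive
    registers: dict = field(default_factory=dict)  # state registers holding the context


class MemoryBus:
    """Single-master bus: exactly one attached processor is active at a time."""

    def __init__(self, names):
        self.processors = {n: Processor(n) for n in names}
        first = self.processors[names[0]]
        first.active = True
        first.registers = {"pc": 0, "acc": 0}      # assumed, illustrative context

    def master(self):
        masters = [p for p in self.processors.values() if p.active]
        assert len(masters) == 1, "bus must have exactly one master"
        return masters[0]

    def migrate(self, target_name):
        """Move control and context from the current master to a slave."""
        source = self.master()
        target = self.processors[target_name]
        if target is source:
            return target
        # The slave allows its internal registers to be loaded from outside.
        target.registers = dict(source.registers)
        source.active = False                      # former master becomes slave-inactive
        target.active = True                       # slave resumes from the received context
        return target


bus = MemoryBus(["P0", "P1", "P2"])
bus.master().registers["acc"] = 42
bus.migrate("P2")
print(bus.master().name, bus.master().registers)
```

In this sketch the process keeps its identity across the migration: after the call, P2 is the only master and holds the same register contents that P0 held, while P0 and P1 wait as slaves.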
The resulting processor architecture offers new instructions in two categories:
migration or intra-bus communication instructions, for handling the (sequential) interaction among processors on the same memory bus and allowing the run-time process migration;
pair communication or inter-bus communication instructions, for handling the (parallel) interaction within the pair, and allowing communication and synchronisation among parallel processes.
The parallelism comes out of the plurality of processes which simultaneously run on as many memory buses, and migrate on their own bus among paired processors to communicate/interact and synchronise themselves.
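The interplay of the two instruction categories can be sketched in software as follows. This is a simplified illustration, not the invention's instruction set: the class and method names (`Processor`, `Bus`, `migrate`, `pair_write`, `pair_read`), the register name `r1`, and the split between `context` and `shared` registers are all assumptions made for the example.

```python
class Processor:
    """Hypothetical processor: context registers plus registers visible to its pair partner."""

    def __init__(self, name):
        self.name = name
        self.context = {}     # state registers carried along by a migrating process
        self.shared = {}      # registers the paired processor may also access
        self.partner = None   # paired processor on a separate memory bus


def pair(a, b):
    """Tightly couple two processors that sit on different memory buses."""
    a.partner, b.partner = b, a


class Bus:
    """One memory bus running one sequential process on its current master."""

    def __init__(self, processors):
        self.processors = processors
        self.master = processors[0]

    def migrate(self, target):
        # Intra-bus (migration) instruction: hand context and control to `target`.
        target.context = dict(self.master.context)
        self.master = target

    def pair_write(self, reg, value):
        # Inter-bus (pair communication) instruction: write a partner register.
        self.master.partner.shared[reg] = value

    def pair_read(self, reg):
        # Read a register that the partner wrote into this processor.
        return self.master.shared.get(reg)


# Bus A holds two processors; A0 is paired with bus B, A1 with bus C.
a0, a1, b0, c0 = (Processor(n) for n in ("A0", "A1", "B0", "C0"))
pair(a0, b0)
pair(a1, c0)
bus_a, bus_b, bus_c = Bus([a0, a1]), Bus([b0]), Bus([c0])

bus_a.pair_write("r1", 7)   # process on bus A, running on A0, talks to bus B
bus_a.migrate(a1)           # the same process moves to A1 to reach bus C
bus_a.pair_write("r1", 9)

print(bus_b.pair_read("r1"), bus_c.pair_read("r1"))
```

Running the sketch prints `7 9`: the single process on bus A reaches two different neighbouring buses by migrating between its paired processors, without any shared global resource being involved.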
A multicomputer/multiprocessor formed in accordance with the invention has many advantages.
Congestion and access conflicts on global shared resources, which become "bottlenecks" as parallelism increases, are substantially eliminated.
Parallel processes communicate through high performance local dedicated buses, which do not require interface controllers and input/output operations. The communication among processors is based on local registers sharing, normally and efficiently achieved by hardware, and easily controlled by special pair communication and process migration instructions.
The dual access to the processor registers by both units of a pair allows a variable-time interaction among adjacent parallel processes, and also allows the synchronisation points to be programmed. Processors have direct access to the sequence control of the adjacent processors, and this allows the programmer to control the mutual progress and the time relations among all adjacent processes in parallel. Communication time between adjacent processes is mainly influenced by the process migration operation, which requires a definite constant time. Therefore, also owing to the lack of conflicts and congestion, communication latency among adjacent processes is constant and can be on average lower than that in a multiprocessor. Additional devices to mask communication latency, or to overlap communication and computation times, are no longer needed. Synchronisation is possible without generating global traffic and even without explicit communication. It is possible to program synchronising barriers and to achieve, within the short execution time of some new specialised instructions, the explicit synchronisation of many small processes of only a few instructions each, while also preserving asynchronous, efficient process execution and the other implicit synchronisation modalities. Thanks to these capabilities, machines formed in accordance with the invention can efficiently execute parallel and even synchronous algorithms. The interconnection structure, composed of inexpensive, low-latency, wide-bandwidth buses of ordinary make, is optimised in complexity, cost and performance. It does not impose a topology or a machine connection degree; on the contrary, it allows different topologies with a high connection degree to be obtained, without the need for switching or buffering components.
Within regular machines, the connection degree is given by the number of processors per bus, which is limited only by the physical parameters that constrain the bus and processor dimensions; bus bandwidth no longer constitutes the main obstacle to increasing the number of processors attachable to a bus.
The degree of parallelism matches the number of memory buses, and it can be freely increased, with a proportional increase of the total power, independently of the power of the single processor. The reducible dimensions of the pairing connections and the opportunities offered by microelectronic (VLSI) technology allow a single logical biprocessor unit to be designed and built, whose integration leads to further advantages in terms of modularity, resource sharing and different part numbers.
In summary, the invention maintains nearly all of the advantages, but fixes most of the disadvantages, of current MULTI in both categories, with low, medium and high degrees of parallelism.
The achieved process migration constitutes a "context/process switching" wherein several processors share the single process that controls the switching. There is no formatted data packet, and no maximum time interval exists within which a processor will surely lose or receive control.
No computer has ever adopted a relay of processors executing a single sequential process. The expensive functional inefficiency given by the redundancy of inactive processors is justifiable only by the gain obtained through parallelism. On each memory bus the situation is structurally similar to that of shared-bus multiprocessors, but functionally very different. Apart from the pair connections, in the invention the processors are neither standard nor all simultaneously active masters, and they neither compete for resources nor come into conflict in a random and asynchronous way. Migrations take place in an orderly manner under software control.