The invention relates to a multiprocessor system of the type comprising a central memory, treatment processors and cache memories associated with treatment processors. It also relates to a process for the exchange of information between central memory and treatment processors via the cache memory associated with each of these processors. It also provides a new integrated circuit component, capable of equipping the multiprocessor system.
It is known that, in the most common known multiprocessor systems, all the information (data, address) is relayed by a common parallel communication bus between the central memory and the various treatment processors, which constitutes a bottleneck: its transfer rate is in effect insufficient to feed all the processors for full efficiency, from a common central memory.
For increasing the information transfer rate, a first solution consists in associating with each treatment processor a cache memory which, by the locality of the information, permits reducing the demands on the central memory. However, in the case in which the volume of data shared between processors is substantial, the maintenance of coherence of the data between memories generates additional information traffic on the communication bus which prevents a significant reduction of the overall flow on this bus, and therefore removes a large part of the interest in this solution.
Another solution consists in providing the communication bus in the form of a grid network designed as a “crossbar”, which permits a direct communication between each treatment processor and each subassembly of the central memory (memory bank). However, this solution is very heavy and very costly to achieve because of the very great number of interconnections, and it becomes completely unrealistic beyond about ten treatment processors. Moreover, in the case of multiple demands of several processors on the same memory bank, such a solution implies access conflicts, a source of slowing up the exchanges.
Another more common solution, by reason of its architectural simplicity, consists in associating a local memory with each treatment processor for storing specific data therein, and storing the transferred data in the common central memory. However, the great deficiency of this architecture is its non-transparency, that is, the need for the programmer to organize the detail of the allocation of data in the various memories, such that this solution is of a very constrained usefulness. Moreover, in the case of high volume of transferred data, it may lead as before to a saturation of the access bus in the central memory.
A solution which has been called “aquarius architecture” has been proposed by the University of Berkeley and consists in improving the aforementioned crossbar solution by combining with the crossbar network, for the non-shared data, cache memories which are connected to the crossbar network, and for the shared data, distinct cache memories which are connected to a common synchronization bus. This solution contributes a gain in speed of exchange but remains very heavy and very costly to achieve.
The present invention seeks to provide a new solution, permitting considerably increasing the flow rate of information exchange, while retaining an architecture which is transparent for the user, much simpler than the crossbar architecture.
An object of the invention is thus to permit notably increasing the number of treatment processors of the system, while benefitting from a high efficiency for each processor.
Another object is to provide a structure of an integrated circuit component, permitting a very simple realization of the architecture of this new multiprocessor system.
To this end, the multiprocessor system provided by the invention is of the type comprising a central memory (RAM) organized in blocks of information (bi), treatment processors (CPU1 . . . CPUj . . . CPUn), a cache memory (MCj) connected to each treatment processor (CPUj) and organized in blocks of information (bi) of the same size as those of the central memory, a directory (RGj) and its management processor (PGj) associated with each cache memory (MCj), and means for communication of addresses of blocks between the processors (CPUj) and the central memory (RAM). According to the present invention, the multiprocessor system is provided with:
an assembly of shift registers, termed memory shift registers (RDM1 . . . RDMj . . . RDMn), each register (RDMj) of this assembly being connected to the central memory (RAM) in such a manner as to permit, in one cycle of this memory, a parallel transfer in read or write of a block of information (bi) between said register and said central memory;
shift registers, termed processor shift registers (RDP1 . . . RDPj . . . RDPn), each processor shift register (RDPj) being connected to the cache memory (MCj) of a processor (CPUj) in such a manner as to permit a parallel transfer in reading or writing of a block of information (bi) between said shift register (RDPj) and said cache memory (MCj);
an assembly of series links (LS1 . . . LSj . . . LSn), each connecting a memory shift register (RDMj) and a processor shift register (RDPj) and adapted to permit the transfer of blocks of information (bi) between the two registers considered (RDMj, RDPj).
Thus, in the multiprocessor system according to the invention, the exchanges between the cache memories and the associated processors are carried out as in the conventional systems provided with cache memories. By contrast, the exchanges between the central memory and the cache memories are carried out in an entirely original manner.
Each transfer of an information block (bi) from the central memory (RAM) to the cache memory (MCj) of a given processor (CPUj) consists of:
transferring, in a cycle of the central memory, the block (bi) of said central memory (RAM) to the memory shift register (RDMj) (of the size of one block) which is directly connected to the central memory and which corresponds to the processor (CPUj) considered,
transferring on the corresponding series link (LSj) the contents of this memory shift register (RDMj) to the processor shift register (RDPj) (of the same capacity) which is associated with the cache memory (MCj) of the processor considered (CPUj),
transferring the contents of said processor shift register (RDPj) to the cache memory (MCj).
In the opposite direction, each transfer of information blocks (bi) from the cache memory (MCj) of a given processor (CPUj) to the central memory (RAM) consists of:
transferring the block (bi) of said cache memory considered (MCj) to the processor shift register (RDPj) which is associated with said cache memory (MCj),
transferring on the corresponding series link (LSj) the contents of the processor shift register (RDPj) to the memory shift register (RDMj) allocated to the processor considered (among the assembly of shift registers (RDM1 . . . RDMj . . . RDMn) connected to the central memory (RAM)),
transferring in a cycle of the central memory, the contents of the memory shift register (RDMj) to said central memory (RAM).
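The two transfer sequences above can be sketched in a few lines of Python. This is only a minimal model under stated assumptions, not the circuit of the invention: the class name `ShiftRegister`, the functions `serial_transfer`, `ram_to_cache` and `cache_to_ram`, the dictionary model of the memories, and the 8-bit block size are all hypothetical. Only the three steps in each direction (parallel load, bit-by-bit shift over the series link, parallel store) are taken from the description.

```python
from collections import deque

BLOCK_BITS = 8  # assumed toy block size; the worked example in the text uses 64 octets


class ShiftRegister:
    """Hypothetical model of a block-sized shift register (RDMj or RDPj)."""

    def __init__(self, size=BLOCK_BITS):
        self.bits = deque([0] * size, maxlen=size)

    def load_parallel(self, block):
        # Parallel transfer of a whole block into the register.
        assert len(block) == self.bits.maxlen
        self.bits = deque(block, maxlen=self.bits.maxlen)

    def read_parallel(self):
        # Parallel read of the whole register contents.
        return list(self.bits)

    def shift_out(self):
        # Emit the head bit; appending to a full deque drops the head.
        bit = self.bits[0]
        self.bits.append(0)
        return bit

    def shift_in(self, bit):
        # Accept one bit at the tail; the oldest bit falls off the head.
        self.bits.append(bit)


def serial_transfer(src, dst):
    """Move one block, bit by bit, over the series link LSj."""
    for _ in range(src.bits.maxlen):
        dst.shift_in(src.shift_out())


def ram_to_cache(ram, cache, addr, rdm, rdp):
    rdm.load_parallel(ram[addr])       # RAM -> RDMj, one memory cycle
    serial_transfer(rdm, rdp)          # RDMj -> RDPj over LSj
    cache[addr] = rdp.read_parallel()  # RDPj -> MCj


def cache_to_ram(ram, cache, addr, rdm, rdp):
    rdp.load_parallel(cache[addr])     # MCj -> RDPj
    serial_transfer(rdp, rdm)          # RDPj -> RDMj over LSj
    ram[addr] = rdm.read_parallel()    # RDMj -> RAM, one memory cycle
```

In this model the central memory is touched only by the single parallel load or store; the bit-by-bit loop involves the two registers alone.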
In these conditions, the transfer of each block of information (bi) is carried out, no longer through a parallel bus as is the case in the known systems, but by the series links of high flow rate. These series links permit obtaining transfer times for each block (bi) which are comparable to, and even lower than, the transfer times in known parallel bus systems. The comparative example given hereinbelow, with parameter values typical of current technology, clearly illustrates this fact, which seems paradoxical.
It is assumed that each block of information (bi) is of a size equal to 64 octets.
In the system of the invention, the transfer time between the central memory and a cache memory breaks down into:
a central memory transfer time (RAM)/memory shift register (RDMj): 100 nanoseconds (performance of a central random access memory of known type),
a series transfer time on the corresponding series link: 64 × 8 × 1/(500 × 10^6) seconds, i.e. 1024 nanoseconds, assuming a transfer frequency of 500 megahertz (not exceptional with current technology which permits attaining frequencies of 3000 megahertz),
a processor shift register transfer time (RDPj)/cache memory (MCj): 50 nanoseconds (cache memory of the very current type).
The total time of transfer of a block is therefore on the order of 1200 nanoseconds (while integrating the chaining delays of the second order).
In known systems with cache memories in which the exchanges of information are carried out directly in parallel by words of 4 octets (the most current systems leading to busses of the conventional type of 32 data lines), the transfer time for one block is equal to the transfer time of the 16 words of 4 octets which comprise this block, that is: 16 × 100 = 1600 nanoseconds.
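The arithmetic of this comparison can be checked with a short Python sketch. The constant names are hypothetical; the values are those of the example above. The sum of the three stated components is 1174 nanoseconds, which the text rounds to about 1200 nanoseconds once second-order chaining delays are included.

```python
# Values taken from the worked example; names are illustrative only.
BLOCK_OCTETS = 64
WORD_OCTETS = 4
RAM_CYCLE_NS = 100          # central memory <-> memory shift register
CACHE_TRANSFER_NS = 50      # processor shift register <-> cache memory
LINK_HZ = 500_000_000       # assumed series link frequency (500 MHz)

# Series link: one bit per clock period, computed in integer nanoseconds.
serial_ns = BLOCK_OCTETS * 8 * 1_000_000_000 // LINK_HZ

# Series architecture: memory cycle + serial shift + cache transfer.
series_total_ns = RAM_CYCLE_NS + serial_ns + CACHE_TRANSFER_NS

# Conventional 32-bit parallel bus: one memory cycle per 4-octet word.
parallel_total_ns = (BLOCK_OCTETS // WORD_OCTETS) * RAM_CYCLE_NS

print(serial_ns, series_total_ns, parallel_total_ns)
```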
Thus, it is seen that, with the average hypotheses in the two solutions, these times are comparable. But, if one compares the architecture of the system according to the invention with that of a parallel bus common with cache memories (first solution mentioned previously), it will be realized that:
in the conventional solution (common parallel bus), the central memory and the common bus are occupied at 100% during the transfer, since the information circulates between the two for the entire transfer time,
in the system according to the invention, the series link is occupied 100% during the transfer, but the central memory is occupied less than 10% of the transfer time (time of memory reading and loading of the memory shift register (RDMj)), such that the central memory may serve 10 times more processors than in the preceding case (the use of the series link being without significance since it is private and directed to the processor).
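The occupancy argument can be made concrete with the figures of the example. The sketch below is a rough estimate only: the function names are hypothetical, and it ignores address traffic and access conflicts, which the text does not quantify here.

```python
def memory_occupancy(memory_busy_ns, transfer_ns):
    """Fraction of one block transfer during which the central memory is busy."""
    return memory_busy_ns / transfer_ns


def transfers_in_flight(memory_busy_ns, transfer_ns):
    """Upper bound on block transfers the memory can overlap (integer estimate)."""
    return transfer_ns // memory_busy_ns


# Series links: memory busy for one 100 ns cycle out of roughly 1200 ns.
series_occupancy = memory_occupancy(100, 1200)
series_overlap = transfers_in_flight(100, 1200)

# Common parallel bus: memory busy for the whole 1600 ns transfer.
parallel_occupancy = memory_occupancy(1600, 1600)
parallel_overlap = transfers_in_flight(1600, 1600)
```

With these numbers the memory is busy for about 8% of a series transfer, which is consistent with the "less than 10%" and "10 times more processors" orders of magnitude given above.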
It is important to emphasize that in this system of the invention, each series connection which connects each processor in an individual manner to the central memory is a simple connection (of one or two data leads), such that the series network thus constituted is not comparable in the overall plan to the complexity with, for example, a crossbar network of which each connection is a parallel connection with multiplicity of leads (32 leads of data in the comparative example above), with all of the necessary switches.
Further, as will be seen below on the comparative curves, the system according to the invention has greatly improved performance with respect to the traditional common bus systems and permits in practice operating a much higher number of processors (from several tens to a hundred processors). This performance is comparable with that of a crossbar system, but the system according to the invention is of a much greater architectural simplicity.
In the system of the invention, each series link may in practice be achieved either by means of two unidirectional series links for bit by bit transfer, or by means of a single bidirectional series link.
In the first case, each memory shift register (RDMj) and each processor shift register (RDPj) are divided into two registers, one specialized for the transfer in one direction, the other for the transfer in the other direction. The two unidirectional series links are then connected to the divided memory shift register (RDMj) and to the corresponding divided processor shift register (RDPj), in such a manner as to permit, for one, a transfer in one direction, and for the other, a transfer in the other direction.
This embodiment with two unidirectional links presents the advantage of not requiring any transfer management on the link, but the inconvenience of doubling the necessary resources (link, registers).
In the second case, a validation logic of the transfer direction is associated with the bidirectional link such as to permit an alternate transfer in the two directions on said link. This logic may be integrated in the management processor (PGj) associated with the cache memory (MCj) to which said bidirectional link is connected.
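A validation logic of this kind can be pictured as a small state holder that grants the link to one direction at a time. The following Python sketch is purely illustrative: the class `BidirectionalLink`, its state names and its methods are assumptions, and the description does not specify how the logic is actually implemented in the management processor (PGj).

```python
class BidirectionalLink:
    """Hypothetical direction-validation logic for one bidirectional series link."""

    IDLE = "idle"
    TO_CACHE = "to_cache"      # central memory -> cache memory
    TO_MEMORY = "to_memory"    # cache memory -> central memory

    def __init__(self):
        self.direction = self.IDLE

    def request(self, direction):
        """Validate a transfer direction; refuse while a transfer is in progress."""
        if self.direction != self.IDLE:
            return False
        self.direction = direction
        return True

    def release(self):
        """End of block transfer: the link may be granted in either direction."""
        self.direction = self.IDLE
```

The two-unidirectional-link embodiment needs no such arbitration, which is the trade-off stated above: no transfer management, at the cost of doubled links and registers.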
It will be understood that each series link may ultimately be provided with a higher number of series links.
In the multiprocessor system according to the invention, the address communication means may take essentially two forms of embodiment: in the first case, it may consist of a parallel address communication bus for blocks (BUSA), common to all of the processors (CPUj) and connecting the latter and the central memory (RAM) in a conventional manner with a bus arbitrator (AB) adapted to manage access conflicts on said bus. It is necessary to note that this address bus is only utilized for communication of addresses of blocks: in terms of structure, this bus is identical to the parallel address communication bus of known systems, for which no problems of saturation arise, since it is freed right after transfer of the block address.
However, another embodiment of this address communication means may be considered in the multiprocessor system of the invention, consisting in operating the series links for transfer of blocks of information (bi) to transfer the addresses of these blocks.
In this case, a complementary shift register (RDCj) is connected to each series link (LSj) in parallel with the corresponding memory shift register (RDMj). The addresses transmitted by said series link are thus loaded into each of these complementary registers (RDCj). An access management arbitrator connected to said registers (RDCj) and to the central memory (RAM) is thus provided for selecting the addresses contained in said registers and for managing the conflicts of access to the central memory (RAM). Such an arbitrator is known in itself, this type of access conflict having been resolved for a number of years. In this embodiment, the presence of a parallel address communication bus is avoided, but the management resources are made heavier.
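The role of the access management arbitrator can be illustrated as follows. The description does not state which arbitration policy is used; the sketch below assumes a simple round-robin scan over the complementary registers, and the class `AddressArbiter` with its methods `post` and `select` is hypothetical.

```python
class AddressArbiter:
    """Hypothetical round-robin arbiter over the complementary registers RDCj."""

    def __init__(self, n_links):
        # rdc[j] holds the block address received on link LSj, or None if empty.
        self.rdc = [None] * n_links
        self.next_start = 0  # where the next scan begins (round-robin fairness)

    def post(self, j, block_address):
        """An address shifted in over series link LSj lands in RDCj."""
        self.rdc[j] = block_address

    def select(self):
        """Pick one pending address for the central memory, or None if idle."""
        n = len(self.rdc)
        for k in range(n):
            j = (self.next_start + k) % n
            if self.rdc[j] is not None:
                address = self.rdc[j]
                self.rdc[j] = None
                self.next_start = (j + 1) % n
                return j, address
        return None
```

For example, if links 2 and 0 both hold pending addresses, successive calls to `select` serve link 0 and then link 2, one memory access at a time.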
Further, the multiprocessor system according to the invention is particularly well suited for managing in an efficient manner the problems of coherence of the data shared between treatment processors. In effect, the conventional solutions for managing these shared data find their limits in the known systems from the fact of the bottleneck at the level of the communication of information, but become, on the contrary, perfectly satisfactory and efficient in the system of the invention where such a bottleneck no longer exists, such that this system may be equipped with shared data management means of an analogous concept to that of known systems.
For example, one traditional solution of shared data management consists in avoiding the relay of shared data by the cache memories: in a conventional manner, a partition logic (LPj) is associated with each treatment processor (CPUj) in order to differentiate the addresses of the shared data and those of the non-shared data so as to direct the first directly toward the central memory (RAM) and the second toward the corresponding cache memory (MCj).
In a first version of the architecture according to the invention, the system comprises:
a special bus for parallel communication of words (BUSD) connecting the processors (CPUj) and the central memory (RAM),
a partition logic (LPj) associated with each processor (CPUj) and adapted to differentiate the addresses of the shared data and those of the non-shared data in such a manner as to transmit these addresses on the address communication means with their identification,
a decoding logic (DEC) associated with the central memory (RAM) and adapted to receive the addresses with their identification and to direct the data into the memory output either to the corresponding memory shift register (RDMj) for the non-shared data, or to the special word communication bus (BUSD) for the shared data.
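The cooperation between the partition logic (LPj) and the decoding logic (DEC) can be sketched as an address tag followed by a routing decision. The description does not say how shared addresses are recognized; the sketch assumes, purely for illustration, that shared data occupy the addresses at or above a hypothetical `SHARED_BASE`. The function names are likewise assumptions.

```python
SHARED_BASE = 0x8000  # assumed boundary: addresses at or above it hold shared data


def partition_logic(address):
    """LPj: return the address together with its shared / non-shared identification."""
    return address, address >= SHARED_BASE


def decoding_logic(address, is_shared, j):
    """DEC: choose the memory output path for the request of processor CPUj."""
    if is_shared:
        return "BUSD"      # shared word goes out on the special word bus
    return f"RDM{j}"       # non-shared block goes to the memory shift register


# Example: processor 2 issues one non-shared and one shared access.
route_private = decoding_logic(*partition_logic(0x0010), j=2)
route_shared = decoding_logic(*partition_logic(0x9000), j=2)
```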
This solution presents the advantage of being very simple in the architectural plan. The presence of the special parallel communication bus (BUSD) leads to better performances with respect to a solution which consists in utilizing the series connections for transferring not only the blocks of non-shared data but also the words of shared data. It should be noted that this latter solution may, in some cases, be provided in case of low flow of shared data.
In another version, the system is provided with a special bus for parallel communication of words and a special common bus for communication of addresses of words (BUSAM) in order to transfer the data by the special word bus (BUSD), and direct the non-shared data to the address communication means (which may comprise a parallel communication bus where the communication is carried out by the series links).
The presence of a special bus for communication of addresses of words permits, in this version, to move back the saturation limit of the address communication means, in case of high demand for shared data.
Another version which will be preferred in practice in the case in which the address communication means comprises a parallel address communication bus (BUSA) consists in providing the system with a memory management processor (PGM) associated with the memory (RAM) and a bus snooper processor (PEj) associated with each treatment processor (CPUj) and with the corresponding management directory (RGj). The memory management processor (PGM) and each snooper processor (PEj), of structures known in themselves, are connected to the address communication bus (BUSA) in order respectively to oversee and to treat the addresses of blocks transmitted on said bus in such a manner as to permit an updating of the central memory (RAM) and of the associated cache memory (MCj) in case of detection of an address of a block present in the associated directory (RGj).
The memory management processor (PGM) and each snooper processor (PEj) associate status bits with each block of information, keep them up to date as a function of the nature (read or write) of the requests for the block which transit on the bus (BUSA), and assure the coherence of the shared data by using these status bits, which permit them to force, or not, the writing of a block into the central memory at the moment of the requests on the bus (BUSA).
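The use of status bits by a snooper can be pictured with a generic write-invalidate scheme. The description does not define the status bits or the protocol, so the sketch below is an assumption throughout: it uses a single hypothetical "modified" bit per cached block, and the class `Snooper`, its methods and the returned action names are illustrative only.

```python
class Snooper:
    """Hypothetical bus snooper PEj holding one 'modified' bit per cached block."""

    def __init__(self):
        self.directory = {}  # block address -> {"modified": bool}, mirrors RGj

    def local_read(self, addr):
        # The block enters the cache clean unless it is already present.
        self.directory.setdefault(addr, {"modified": False})

    def local_write(self, addr):
        # The local processor wrote the block: the cached copy is now newer.
        self.directory[addr] = {"modified": True}

    def observe(self, addr, op):
        """React to a block address seen on BUSA, issued by another processor."""
        entry = self.directory.get(addr)
        if entry is None:
            return "ignore"                       # block not cached here
        if op == "write":
            # Another processor takes the block: drop the local copy.
            action = "write_back_and_invalidate" if entry["modified"] else "invalidate"
            del self.directory[addr]
            return action
        if entry["modified"]:
            entry["modified"] = False             # central memory is brought up to date
            return "write_back"
        return "ignore"                           # clean copy, remote read is harmless
```

Here the "modified" bit is what decides whether a write of the block into the central memory is forced when a request for that block appears on the bus.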
In the case referred to previously where the communications of addresses are made by the series connections, the management of shared data may also be assured in a centralized manner, by a memory management processor (PGM) associated with the central memory (RAM) and a processor for maintaining the coherence of the shared data (PMCj) associated with each treatment processor (CPUj) and with the corresponding management directory (RGj), each coherence maintenance processor being connected to a synchronization bus (SYNCHRO) controlled by the memory management processor (PGM), in such a manner as to permit an updating of the central memory (RAM) and of the associated cache memories (MCj) in case of detection of an address block, an updating of the central memory (RAM) and the cache memories (MCj) at each address selection in the complementary shift registers (RDCj).
As before this operation is assured due to the status bits associated with each block of information by the processor (PGM).
It should be noted that a synchronization bus of the type defined hereinabove may, in some cases, be provided in the preceding architecture where the addresses of blocks move on a common address bus BUSA. In this case, the snooper processors (PEj) are solicited by the memory management processor (PGM) via the synchronization bus, and this only when they are concerned by the transfer. Thus, non-useful access to the cache memories is avoided. The snooper processors then become passive (since driven by the processor PGM) and they are designated by the more appropriate expression “coherence maintenance processor” according to the terminology utilized hereinabove.
Another solution consists in reserving the parallel address communication bus (BUSA) for the transfer of addresses of blocks of shared data and using the series links for the transfer of blocks of non-shared data.
Further, the multiprocessor system according to the invention lends itself to the regroupings of treatment processors on a same series link, in such a manner as to limit the series links and the corresponding memory shift registers (RDMj) necessary.
The number of memory shift registers (RDMj) may correspond to the number of series links (LSj), in which case each memory shift register (RDMj) is connected in a static manner to a series link (LSj) specifically appropriated to said register.
The number of memory shift registers (RDMj) may also be different from that of the series connections (LSj) and in particular less, in which case these registers are connected in a dynamic manner to the series links (LSj) through an interconnection network.
As in conventional systems, the central memory (RAM) may be divided into ‘m’ memory banks (RAM1 . . . RAMp . . . RAMm) arranged in parallel. Each memory shift register (RDMj) is then comprised of m elementary registers (RDMj1 . . . RDMjp . . . RDMjm) connected in parallel to the corresponding series link (LSj). However, a level of supplementary parallelism and a better electrical or optical adaptation of the connection is obtained in a variation in which each memory bank RAMp is connected to each processor CPUj by a series link from point to point LSjp.
In order to provide transfer performance at least equal to those of conventional systems with a parallel bus, the system according to the invention is preferably synchronized by a clock of a frequency F at least equal to 100 megahertz. The memory shift registers (RDMj) and processor shift registers (RDPj) may very simply be of a type adapted to present a shift frequency at least equal to F.
In the case of very high frequencies (particularly greater than 500 megahertz with current technology), the registers may be divided into sub-registers of a lower shift frequency, and then multiplexed.
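One way to picture this division is bit interleaving: each of k sub-registers holds every k-th bit and shifts at F/k, and a multiplexer takes one bit from each sub-register in turn to rebuild the stream at frequency F. The text only states that the registers are divided and multiplexed; the interleaving scheme and the function names below are assumptions made for illustration.

```python
def split_into_subregisters(bits, k):
    """Distribute a block over k sub-registers, bit i going to sub-register i mod k."""
    assert len(bits) % k == 0, "block size assumed to be a multiple of k"
    return [bits[i::k] for i in range(k)]


def multiplex(subregisters):
    """Rebuild the full-rate stream by taking one bit from each sub-register in turn."""
    stream = []
    for step in range(len(subregisters[0])):   # each step is one slow clock period
        for sub in subregisters:                # k bits leave per slow clock period
            stream.append(sub[step])
    return stream


block = [1, 0, 1, 1, 0, 0, 1, 0]
subs = split_into_subregisters(block, 4)   # four sub-registers, each shifting at F/4
rebuilt = multiplex(subs)
```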
The invention also relates to a multiport series memory component, capable of equipping the multiprocessor system previously described, in order to simplify the fabrication. This component, which may have different applications, is constituted by an integrated circuit comprising a random access memory (RAM) of a pre-determined width corresponding to a block of information (bi), an assembly of shift registers (RDM1 . . . RDMj . . . RDMn), each of a capacity corresponding to the width of the memory, an internal parallel bus (BUSI) connecting the access of the memory and the shift registers, a selection logic of a shift register (LSR) adapted to validate the connection on the internal bus between the memory and a predetermined shift register, and an assembly of external input/output pins for the input of addresses to the memory (RAM), for the input of addresses to the selection logic (LSR), for the input and the validation of transfer commands in read or write of a block of information (bi) between the memory (RAM) and the shift registers (RDMj), for the input of a clock signal to each shift register (RDMj), for the input bit by bit of a block of information (bi) to each shift register (RDMj) and for the output bit by bit of a block of information from each shift register (RDMj).
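The external behaviour of such a component can be modelled roughly as follows. This is a behavioural sketch under assumptions, not a description of the integrated circuit: the class `MultiportSerialMemory` and its method names are hypothetical, each method standing in for a group of pins named above (memory address, register selection, read or write command, per-register clock with serial input and output).

```python
class MultiportSerialMemory:
    """Hypothetical behavioural model of the multiport series memory component."""

    def __init__(self, n_blocks, block_bits, n_registers):
        self.ram = [[0] * block_bits for _ in range(n_blocks)]      # RAM, one block wide
        self.rdm = [[0] * block_bits for _ in range(n_registers)]   # RDM1 .. RDMn
        self.selected = 0                                           # state of LSR

    def select_register(self, j):
        """LSR: validate the BUSI connection between the memory and register j."""
        self.selected = j

    def read_block(self, address):
        """Read command: RAM block -> selected shift register, in one parallel step."""
        self.rdm[self.selected] = list(self.ram[address])

    def write_block(self, address):
        """Write command: selected shift register -> RAM block, in one parallel step."""
        self.ram[address] = list(self.rdm[self.selected])

    def clock(self, j, bit_in=0):
        """One clock pulse on register j: output the head bit, shift in bit_in."""
        register = self.rdm[j]
        bit_out = register.pop(0)
        register.append(bit_in)
        return bit_out
```

A block is written by clocking its bits into a register and issuing one write command, and read back by one read command followed by clocking the bits out, so the memory array itself is busy for only one parallel step per block.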
This component may be made configurable by the addition of configuration registers (RC1, RC2, . . . ) permitting particularly a choice of sizes of blocks of information (bi) and of diverse modes of operation of the shift registers.