The present invention relates to a parallel computer system of shared memory type which is used for information processors, especially personal computers (PCs), workstations (WSs), server machines, etc., and more particularly to a control method for a main memory.
In recent years, the architecture of a multiprocessor of the shared memory type (SMP) has spread to high-end models of PCs and WSs, server machines, etc. Sharing a main memory among a large number of processors, for example 20 to 30, has become an important feature of this architecture for the enhancement of performance.
Extensively used as a method of constructing a shared memory multiprocessor is a shared bus scheme. With the bus scheme, however, the throughput of the bus becomes a bottleneck, and hence, the number of connectable processors is limited to about 8 at most. Accordingly, the bus scheme is not suitable as a method of connecting a large number of processors.
Conventional methods of constructing shared memory multiprocessors each having a large number of processors connected therein are broadly classified into two schemes.
One of them is a crossbar switch architecture, and it is disclosed in, for example, “Evolved System Architecture” (Sun World, January 1996, pp. 29-32). With this scheme, boards, each of which has a processor and a main memory, are connected by a high speed crossbar switch so as to maintain the cache coherency among the processors. This scheme has the merit that the cache coherency can be rapidly maintained.
The scheme, however, has the demerit that, since a transaction for maintaining the cache coherency is broadcast to all of the processors, the traffic on the crossbar switch is very high and causes a bottleneck in performance. Another demerit is that, since the high speed switch is required, a high cost is incurred. Further, since the transaction for maintaining the cache coherency must be broadcast, it is difficult to realize a system having a very large number of processors, and the number of processors is limited to ten to twenty.
In the ensuing description, this scheme shall be called the switch type SMP (Symmetrical MultiProcessor).
The other scheme provides a multiprocessor employing a directory based protocol, and it is disclosed in, for example, “The Stanford FLASH Multiprocessor” (The 21st Annual International Symposium on COMPUTER ARCHITECTURE, Apr. 18-21, 1994, Chicago, Ill., pp. 302-313). With this scheme, a directory, which is a bitmap indicating the processor caches in which a data line is cached, is provided for every data line of the main memory, whereby a transaction for maintaining the cache coherency among the processors is sent only to the pertinent processors. Thus, the traffic on the switch can be noticeably reduced, and the hardware cost of the switch can be curtailed.
Since, however, the contents of the directory placed in the main memory must always be checked before the transaction for maintaining cache coherency is issued, the scheme has the demerit that the access latency is lengthened. A further demerit is the additional cost of the memory that holds the directory.
As stated above, the switch type SMP and the directory based protocol each have merits and demerits. In general, with the switch type SMP, the hardware scale becomes larger, and the scalability for an increased number of processors is inferior, but a higher performance can be achieved. Accordingly, a system such as a PC or a server machine in which the number of processors is not very large (up to about 30) is preferably realized by using the switch type SMP.
Another problem involved in constructing a shared memory multiprocessor is the problem of reliability. Each of the shared memory multiprocessors in the prior art has a single OS (Operating System) as the whole system. This method can manage all the processors in the system with the single OS, and therefore has the advantage that a flexible system operation (such as load balancing) can be achieved. In the case of connecting a large number of processors by the shared-memory multiprocessor architecture, however, this method has the disadvantage that the reliability of the system degrades.
In a cluster-system server wherein a plurality of processors are connected by a network, or in MPPs (Massively Parallel Processors), individual nodes have different OSs, so that even when a system crash occurs on one node because of, for example, an OS bug, the system goes down only at the corresponding node. In contrast, in the case of controlling the whole shared-memory multiprocessor system by the single OS, when a system crash occurs on a certain processor because of a system bug or the like, the OS itself goes down, and hence, all the other processors are affected.
A method wherein a plurality of OSs are run in the shared memory multiprocessor for the purpose of avoiding the above problem is disclosed in “Hive: Fault Containment for Shared-Memory Multiprocessors” (15th ACM Symposium on Operating Systems Principles, Dec. 3-6, 1995, Copper Mountain Resort, Colo., pp. 12-25).
With this method, the shared memory multiprocessor conforming to the directory based protocol is endowed with the following two facilities:
(1) The whole system is divided into a plurality of cells (partitions), and independent OSs are run in the respective partitions. The system has a single address space, and the respective OSs take charge of different address ranges.
(2) A bitmap which expresses the write accessible processors is provided for every page of the main memory, and write access is allowed only for the processors each having a value of “1” in the bitmap.
More specifically, in a case where data is to be written into the main memory of each processor (in a case where the data is to be cached in compliance with a “Fetch and Invalidate” request, or in a case where a “Write Back” request has arrived), the contents of the bitmap are checked, and only the access from the processor having the value of “1” in the bitmap is allowed.
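The write check of facility (2) can be illustrated with a minimal sketch. It is not the cited system's implementation: the class name `WriteFirewall`, its methods, and the 4 KB page size used to map an address to a page are assumptions made for illustration only.

```python
PAGE_SIZE = 4096  # assumed page size, used only to map an address to a page


class WriteFirewall:
    """Hypothetical model of a per-page bitmap of write-accessible processors."""

    def __init__(self, num_pages: int) -> None:
        # One integer per page; bit p equal to 1 means processor p may write.
        self.bitmaps = [0] * num_pages

    def grant(self, page: int, processor: int) -> None:
        """Set the bit of `processor` in the bitmap of `page`."""
        self.bitmaps[page] |= 1 << processor

    def may_write(self, address: int, processor: int) -> bool:
        """Check the bitmap before a write-type request is accepted."""
        page = address // PAGE_SIZE
        return bool((self.bitmaps[page] >> processor) & 1)


firewall = WriteFirewall(num_pages=4)
firewall.grant(page=1, processor=3)
print(firewall.may_write(0x1000, processor=3))  # processor 3 has its bit set
print(firewall.may_write(0x1000, processor=5))  # processor 5 does not
```

In this sketch a request from a processor whose bit is “0” is simply reported as not allowed; how the refusal is signalled is left open.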
Owing to the above facility (1), even when the OS of any partition has crashed, it is possible to avoid the other partitions going down. Further, owing to the provision of the facility (2), the processor of the partition having crashed due to a bug can be prevented from destroying data which the other partitions use.
As thus far explained, the reliability of the system can be sharply enhanced by dividing the interior of the shared memory multiprocessor into the plurality of partitions.
In the case of constructing a switch type SMP and further dividing the interior of the SMP into partitions, as stated in the Prior Art, there are three problems to be mentioned below.
(A) Slow Access to Local Main Memory
In a case where the processor accesses the main memory included in the same board, ideally it ought to be accessible at high speed without passing through the crossbar switch.
In actuality, however, the transaction for maintaining the cache coherency must be submitted to the other processors so as to check the caches of the other processors (hereinbelow, this processing shall be called the “CCC: Cache Coherent Check”). This is because there is a possibility that a copy of the accessed data has been buffered in the cache of another processor.
In the case where the data has actually been buffered in the cache of any other processor, the CCC is required. However, in a case where the accessed data is local data that has never been accessed from any other processor, there is no possibility that the corresponding data has been buffered in the cache of any other processor, so the CCC could be omitted.
Therefore, the wasteful CCC incurs not only the drawback that the access latency is prolonged, but also the drawback that the traffic in the switch is increased.
In the directory based protocol, on the other hand, the wasteful CCC does not occur because the directory makes it possible to tell which processors have a copy of a data line in their caches. As stated before, however, the directory based protocol has not only the drawback that the amount of hardware for the directory is large, but also the drawback that the overhead for managing the directory is very large.
By way of example, the directory of a system with 16 processors, a 4 GB main memory and 64 B per line requires a main memory capacity which is as large as:
4 GB / 64 B × 16 bits = 128 MB
Accordingly, a sharp reduction in the amount of hardware is necessitated.
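The directory capacity given above can be reproduced with a short calculation. The sketch below assumes one directory bit per processor per data line and binary units (1 GB = 2^30 bytes); the function name `directory_bytes` is hypothetical.

```python
GIB = 1 << 30
MIB = 1 << 20


def directory_bytes(memory_bytes: int, line_bytes: int, num_processors: int) -> int:
    """Directory size in bytes, assuming one bit per processor per data line."""
    num_lines = memory_bytes // line_bytes
    total_bits = num_lines * num_processors
    return total_bits // 8


size = directory_bytes(memory_bytes=4 * GIB, line_bytes=64, num_processors=16)
print(size // MIB)  # 128
```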
(B) Addresses of Partition not Beginning at Address “0”
With the partition management mechanism in the prior art, the whole system forms a unitary address space. Accordingly, the address spaces of the partitions do not all begin at address “0”.
Assuming by way of example that the number of the partitions is 2 and that the main memory capacity of each partition is 1 MB, the partition “0” has an address space of the address “0” to address “1 M−1”, whereas the partition “1” must have an address space of the address “1 M” to address “2 M−1”.
The existing OSs are premised on the fact that the main memory is installed with its addresses beginning at the address “0”, so the above limitation is a serious obstacle in the case of using the OSs in the prior art.
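The address assignment in the two-partition example can be written out as a small sketch. It assumes equally sized partitions laid out back to back in the unitary address space; the function name `partition_range` is hypothetical.

```python
MB = 1 << 20


def partition_range(index: int, size: int) -> tuple:
    """First and last address of partition `index` in a unitary address space."""
    start = index * size
    return start, start + size - 1


print(partition_range(0, MB))  # (0, 1048575)
print(partition_range(1, MB))  # (1048576, 2097151)
```

Only partition 0 begins at address 0 under this layout; every other partition begins at a multiple of the partition size.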
(C) Large Amount of Hardware for Partition Management
In the case of employing the partition management mechanism of the prior art example, a bitmap indicating whether or not the individual processors are allowed to access the corresponding page is stored for every 4 KB page. Accordingly, there is the problem that the hardware amount of the bitmap is very large.
Assuming by way of example that the number of the processors is 16 and that the main memory capacity of the system is 4 GB, a memory whose capacity is as large as:
4 GB / 4 KB × 16 bits = 16 Mbit (2 MB)
is required for the partition management, and an increase in cost is incurred.
Accordingly, the first object of the present invention is to realize, with a small hardware overhead, a shared memory multiprocessor in which local data never accessed from any other processor can be accessed rapidly without executing the CCCs to other nodes.
Another object of the present invention is to construct a shared memory multiprocessor which, when divided into partitions, permits the local main memory of each partition to have an independent address space, so that the addresses of the local main memory begin at address “0”, and also permits the necessary areas of a main memory to be shared.
A further object of the present invention is to realize the above partition management with a small amount of hardware.
In order to accomplish the objects, the present invention consists in a shared memory multiprocessor having a plurality of nodes and a network for connecting the nodes, each of the nodes including at least one CPU and cache and a main memory, a cache coherent control being performed among the nodes by the use of the network; wherein each of said nodes comprises a table in which, in correspondence with each page of the main memory of a particular node, a first bit is stored for indicating if the corresponding page has been accessed from any other node, and in which the first bit is reset at initialization of the system of the multiprocessor and is set by hardware when the corresponding page of the main memory has been accessed from other nodes; and means operating when the CPU of the particular node accesses the main memory of the same particular node, for checking the first bit of the table as corresponds to the page to be accessed, so as to perform the cache coherent control for the other nodes in a case where the first bit is set and to inhibit the cache coherent control for the other nodes in a case where the first bit is not set.
Further, when system software allocates a page of the main memory, the bit of the table corresponding to the page to be allocated is reset by the system software.
In addition, one bit is allocated to the table as a second bit that is stored in correspondence with each page of the main memory to indicate that the cache coherent control for the corresponding page is unnecessary; and when the CPU of the particular node accesses the main memory of that particular node, the means checks the second bit so as to judge the necessity for the cache coherent control for the other nodes in accordance with a value of the first bit in a case where the second bit is not set, and to inhibit the cache coherent control for the other nodes in a case where the second bit is set.
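The per-page table with its first and second bits can be sketched as follows. This is an illustrative software model only, not the claimed hardware: the class and method names are hypothetical, and the way the second bit gets set (a plain attribute write here) is an assumption, since the text above does not state it.

```python
class PageStatusTable:
    """Hypothetical model of the per-page table kept in each node."""

    def __init__(self, num_pages: int) -> None:
        # First bit: page has been accessed from another node (reset at init).
        self.remote_accessed = [False] * num_pages
        # Second bit: cache coherent control is unnecessary for this page.
        self.ccc_unnecessary = [False] * num_pages

    def on_remote_access(self, page: int) -> None:
        """Models the hardware setting the first bit on access from another node."""
        self.remote_accessed[page] = True

    def on_allocate(self, page: int) -> None:
        """Models system software resetting the first bit when it allocates a page."""
        self.remote_accessed[page] = False

    def needs_ccc(self, page: int) -> bool:
        """Decide whether a local access must run the coherency check on other nodes."""
        if self.ccc_unnecessary[page]:
            return False  # second bit set: check inhibited
        return self.remote_accessed[page]  # otherwise the first bit decides


table = PageStatusTable(num_pages=8)
print(table.needs_ccc(2))   # False: page 2 was never accessed remotely
table.on_remote_access(2)
print(table.needs_ccc(2))   # True: first bit is now set
```

The second bit is checked first, so a page marked as not needing coherency control skips the check regardless of the first bit.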
The present invention also consists in a shared memory multiprocessor having a plurality of nodes and a network for connecting the nodes, each of the nodes including at least one CPU and cache and a main memory, a cache coherent control being performed among the nodes by the use of the network, the nodes to share the main memory being permitted to be divided into a plurality of partitions each including at least one node; wherein the main memory of each of the nodes is divided into a shared area which is accessible from all of the nodes, and a local area which is accessible only from within the corresponding partition, and wherein separate start addresses are designated for the respective areas.
Further, each of the nodes comprises means for deciding whether an accessed address is of the local area or of the shared area, and means for deciding which of the nodes are included in the partitions; and when a command for the cache coherent control is to be issued to the other nodes, the command is broadcast to all of the nodes within the multiprocessor system when the access is to the shared area, and is multicast only to the nodes within the corresponding partition when the access is to the local area. In addition, the addresses of the local area of each of the partitions begin at address “0”.
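The choice between broadcast and multicast can be sketched as below. The layout is an assumption made for illustration: local addresses start at 0, and everything at or above a hypothetical `SHARED_START` belongs to the shared area. The function names and the exclusion of the issuing node from the destination list are likewise assumptions.

```python
# Hypothetical boundary between the local area (from address 0) and the
# shared area; the value is chosen only for this example.
SHARED_START = 0x8000_0000


def is_shared(address: int) -> bool:
    """Decide whether an address lies in the shared area (assumed layout)."""
    return address >= SHARED_START


def ccc_targets(address: int, issuer: int, all_nodes: set, partition_nodes: set) -> list:
    """Nodes that receive the coherency command for an access to `address`."""
    # Shared area: broadcast to the whole system.
    # Local area: multicast within the issuer's partition only.
    scope = all_nodes if is_shared(address) else partition_nodes
    return sorted(node for node in scope if node != issuer)


all_nodes = {0, 1, 2, 3}
partition = {0, 1}  # the issuing node 0 shares its partition with node 1
print(ccc_targets(0x0000_1000, 0, all_nodes, partition))  # [1]
print(ccc_targets(0x8000_1000, 0, all_nodes, partition))  # [1, 2, 3]
```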
Further, the multiprocessor comprises means for deciding whether the access address is of the local area or of the shared area when a cache coherent command has arrived from any other node; and means for deciding whether the node of an access source lies inside the corresponding partition or outside the corresponding partition; whereby, in case of the decision that the command has arrived at the local area from a node lying outside the corresponding partition, the access is inhibited, and an error is reported.
Also, each of the nodes comprises a register for storing configuration information of the shared area, which contains a start address of the shared area and the size of the shared area which each of the processors takes over. Additionally, each node has the configuration information of the local area of each node in the partition, which contains a set consisting of a start address and an end address of the local area.
Further, each of the nodes comprises means for storing distribution of the nodes within the corresponding partition in terms of a bitmap, as means for storing configuration information of the corresponding partition.
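The configuration information and the check applied to an incoming command can be modelled together in a minimal sketch. All names are hypothetical, and two points are assumptions rather than statements of the text: the shared area is taken to be one contiguous region of `num_nodes` equal slices, and the error is reported by raising an exception.

```python
from dataclasses import dataclass


class PartitionAccessError(Exception):
    """Raised when a node outside the partition targets a local area."""


@dataclass
class PartitionConfig:
    shared_start: int          # start address of the shared area
    shared_size_per_node: int  # size of the shared area each node takes over
    num_nodes: int             # number of nodes in the whole system
    local_ranges: dict         # node -> (start, end) of its local area, inclusive
    partition_bitmap: int      # bit n set: node n belongs to this partition

    def in_partition(self, node: int) -> bool:
        return bool((self.partition_bitmap >> node) & 1)

    def is_shared(self, address: int) -> bool:
        end = self.shared_start + self.num_nodes * self.shared_size_per_node
        return self.shared_start <= address < end

    def is_local(self, address: int) -> bool:
        return any(lo <= address <= hi for lo, hi in self.local_ranges.values())

    def check_incoming(self, address: int, source_node: int) -> None:
        """Inhibit a command that reaches a local area from outside the partition."""
        if self.is_local(address) and not self.in_partition(source_node):
            raise PartitionAccessError(
                f"node {source_node} is outside the partition; "
                f"access to local address {address:#x} inhibited"
            )


config = PartitionConfig(
    shared_start=0x8000_0000,
    shared_size_per_node=0x10_0000,
    num_nodes=4,
    local_ranges={0: (0x0, 0xF_FFFF), 1: (0x10_0000, 0x1F_FFFF)},
    partition_bitmap=0b0011,  # nodes 0 and 1 form this partition
)
config.check_incoming(0x1000, source_node=1)       # inside the partition: accepted
config.check_incoming(0x8000_0000, source_node=2)  # shared area: accepted
```

A command from node 2 to address 0x1000 would raise `PartitionAccessError` in this model, because node 2 has no bit set in the partition bitmap.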