Thanks to progress in device technology, the operating frequencies of CPUs are rapidly increasing. Meanwhile, the memory access latency from a CPU to the main storage is improving only slowly in terms of absolute time, because it is restricted by the physical distance from the CPU to the main storage and by the characteristics of the main storage elements. As a result, the access latency is growing when measured in units of the CPU's unit operation time (= 1/operating frequency), and it tends to become a bottleneck in performance improvement.
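As a rough illustration of this relative slowdown, the following sketch converts a fixed absolute latency into CPU cycles at several operating frequencies. The 100 ns latency and the frequency values are hypothetical figures chosen only for illustration; they do not come from any particular system.

```python
def latency_in_cycles(latency_seconds: float, frequency_hz: float) -> float:
    """Express an absolute latency in units of the CPU's unit operation time."""
    unit_operation_time = 1.0 / frequency_hz  # seconds per CPU cycle
    return latency_seconds / unit_operation_time


# Hypothetical main storage latency, held constant in absolute time.
MEMORY_LATENCY_S = 100e-9

# Hypothetical operating frequencies: 100 MHz, 1 GHz, 3 GHz.
for frequency_hz in (100e6, 1e9, 3e9):
    cycles = latency_in_cycles(MEMORY_LATENCY_S, frequency_hz)
    print(f"{frequency_hz / 1e6:.0f} MHz: {cycles:.0f} cycles")
```

Although the absolute latency is unchanged, the number of cycles the CPU waits grows in proportion to the operating frequency.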
The cache memory is the means of compensating for this relative decrease in performance caused by the deterioration of the memory access latency. It reduces the effective memory access latency by providing a high-speed, small-capacity buffer close to the CPU and registering in it copies of frequently used data.
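The effect of such a buffer can be sketched as follows. This is a minimal model, not a description of any actual cache: the least-recently-used replacement policy, the class name `SmallCache`, and the cycle counts for hits and misses are all assumptions made for illustration.

```python
from collections import OrderedDict


class SmallCache:
    """Toy small-capacity buffer holding copies of main storage data."""

    def __init__(self, capacity, main_storage):
        self.capacity = capacity
        self.main_storage = main_storage
        self.lines = OrderedDict()  # address -> copied value, oldest first
        self.hits = 0
        self.misses = 0

    def load(self, address):
        if address in self.lines:
            self.hits += 1
            self.lines.move_to_end(address)  # mark as most recently used
            return self.lines[address]
        self.misses += 1
        value = self.main_storage[address]  # slow access to main storage
        self.lines[address] = value
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict least recently used (assumed policy)
        return value


def effective_latency(cache, hit_cycles, miss_cycles):
    """Average cycles per access given the observed hits and misses."""
    total = cache.hits + cache.misses
    return (cache.hits * hit_cycles + cache.misses * miss_cycles) / total


main_storage = {address: address * 10 for address in range(16)}
cache = SmallCache(capacity=2, main_storage=main_storage)
for address in [0, 1, 0, 0, 1, 0]:  # a few frequently reused addresses
    cache.load(address)

# Hypothetical costs: 2 cycles on a hit, 100 cycles on a miss.
print(cache.hits, cache.misses, effective_latency(cache, 2, 100))
```

With frequently reused addresses, most accesses are served from the buffer, so the average latency falls well below the main storage latency.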
Modern computer systems usually employ either a shared-memory-type parallel computer arrangement, in which multiple CPUs each equipped with the above-mentioned cache memory share the main storage among all or some of them, or a clustered arrangement of such shared-memory-type parallel computers. Multiple CPUs are mounted for two purposes: (1) performance improvement, and (2) availability improvement (preventing the system as a whole from failing even if a failure occurs in one CPU). A computer used as a so-called “server” in computer services essentially needs to take a shared-memory-type parallel computer arrangement with two or more CPUs.
When multiple CPUs each having a cache memory share a main storage in this way, coherence control of the cache memories, so-called cache coherence control, becomes a problem. More specifically, when data registered in the cache memory of a CPU (A) is updated with a “store” instruction by another CPU (B), the update results need to be reflected in the cache memory of CPU (A). In other words, the copy of the data within the cache memory of CPU (A) needs to be updated or invalidated (nullified).
Such cache coherence control is typically conducted through a bus. It is realized by combining two mechanisms: one in which data updates by a CPU are broadcast to all CPUs through the bus, and one in which each CPU constantly snoops the bus and incorporates the broadcast update information into the data registered in its cache memory.
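The combination of broadcasting and snooping can be sketched as follows. This is a simplified model under stated assumptions: the class names `Bus` and `SnoopingCache` are hypothetical, stores are written through to the main storage, and a snooping cache invalidates its stale copy rather than updating it in place.

```python
class Bus:
    """Shared bus that broadcasts every store to all attached caches."""

    def __init__(self):
        self.caches = []
        self.broadcasts = 0

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_store(self, source, address):
        self.broadcasts += 1
        for cache in self.caches:
            if cache is not source:
                cache.snoop(address)


class SnoopingCache:
    """Per-CPU cache that watches the bus for stores by other CPUs."""

    def __init__(self, bus, memory):
        self.bus = bus
        self.memory = memory
        self.lines = {}  # address -> cached copy
        bus.attach(self)

    def load(self, address):
        if address not in self.lines:
            self.lines[address] = self.memory[address]
        return self.lines[address]

    def store(self, address, value):
        self.memory[address] = value  # write-through (assumption)
        self.lines[address] = value
        self.bus.broadcast_store(self, address)

    def snoop(self, address):
        self.lines.pop(address, None)  # invalidate the stale copy, if any


memory = {0x10: 1}
bus = Bus()
cpu_a = SnoopingCache(bus, memory)
cpu_b = SnoopingCache(bus, memory)

cpu_a.load(0x10)       # CPU (A) registers a copy of the data
cpu_b.store(0x10, 2)   # CPU (B) updates the same data with a store
print(cpu_a.load(0x10))
```

Because CPU (A) drops its copy when it observes the broadcast, its next load fetches the value written by CPU (B) instead of the stale one.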
With bus-based cache coherence control as described above, cache coherence control requests are likely to become congested on the bus. Therefore, if a large-scale shared-memory-type multiprocessor with many CPUs is realized by bus connection alone, the performance of each CPU decreases. The frequency of congestion can be reduced relative to the bus scheme by providing a directory that records, for the data held in the cache memories, which processors have registered that data, and by transmitting a cache coherence request only to the necessary processors in accordance with the information registered in the directory. This scheme is employed in so-called NUMA (Non-Uniform Memory Access) multiprocessors. For NUMA-type multiprocessors, reference should be made to “The Stanford Dash Multiprocessor”, IEEE Computer, Vol. 25, No. 3, pp. 63–79 (March 1992), written by Lenoski, D. et al.
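The difference in coherence traffic between the two schemes can be sketched as follows. This is a minimal model, not the protocol of any specific NUMA machine: the `Directory` class, its method names, and the 8-CPU configuration are hypothetical.

```python
class Directory:
    """Records which CPUs have registered each address in their caches."""

    def __init__(self):
        self.sharers = {}  # address -> set of CPU ids holding a copy

    def record_load(self, cpu_id, address):
        self.sharers.setdefault(address, set()).add(cpu_id)

    def targets_for_store(self, cpu_id, address):
        """Return the CPUs that need a coherence request for this store."""
        targets = self.sharers.get(address, set()) - {cpu_id}
        self.sharers[address] = {cpu_id}  # the writer is now the only holder
        return targets


def bus_targets(cpu_id, num_cpus):
    """On a bus, a store is broadcast to every other CPU."""
    return set(range(num_cpus)) - {cpu_id}


NUM_CPUS = 8  # hypothetical system size
directory = Directory()
directory.record_load(1, 0x40)
directory.record_load(5, 0x40)

directed = directory.targets_for_store(1, 0x40)  # only CPUs holding a copy
broadcast = bus_targets(1, NUM_CPUS)             # every other CPU
print(len(directed), len(broadcast))
```

In this sketch only the CPU that actually holds a copy receives a request, whereas the bus scheme disturbs all seven other CPUs.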
Because it has no section on which requests from all CPUs concentrate, such as an inter-CPU bus, the NUMA type has the advantage that performance scales with the number of CPUs. In bus-type multiprocessors, however, coherence control is executed with low latency immediately after a request is sent to the bus. In NUMA, by contrast, a coherence control request is first routed through a circuit that judges whether coherence control must be performed on other CPUs, and is only then transferred from this circuit to the intended CPU. NUMA therefore generally has the disadvantage that, because of this long coherence control delay, small-scale NUMA systems are inferior in performance to bus-type multiprocessors.
U.S. Pat. No. 6,088,770 discloses a technology for constructing a system in which bus-type multiprocessors are connected in a NUMA format, with each such multiprocessor as a unit. It also discloses a technology that reduces NUMA control when the NUMA system is split into partitions. More specifically, the coherence control overhead can be reduced as follows. The main storage is split into areas used only within a partition and areas used both within and between partitions. Accesses to the areas used only within a partition are then broadcast only to the bus-type multiprocessors located within that partition; such restricted broadcasting is referred to as multicasting. For example, if a partition is set to cover only one range in which the CPUs are connected via a bus, operations on the areas used only within the partition are processed at high speed by the bus, and operations on the areas used both within and between partitions are controlled by NUMA. Scalability and high-speed processing can therefore be achieved at the same time.
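The routing decision described above can be sketched as follows. This is only an illustrative model and not the mechanism disclosed in the cited patent: the `Partition` type, the function `route_coherence_request`, the node numbering, and the address ranges are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Partition:
    name: str
    nodes: frozenset    # bus-type multiprocessor nodes in this partition
    local_range: range  # addresses used only within this partition


def route_coherence_request(address, source_node, partitions):
    """Return (mode, targets) for a coherence request.

    "multicast": broadcast only to the other nodes of the source partition.
    "directory": handled by NUMA control; targets are decided elsewhere.
    """
    for partition in partitions:
        if source_node in partition.nodes and address in partition.local_range:
            return "multicast", partition.nodes - {source_node}
    # Anything else is treated as a shared area in this sketch (assumption).
    targets: Optional[frozenset] = None
    return "directory", targets


# Hypothetical layout: two partitions of two nodes each; addresses from
# 0x2000 upward are used both within and between partitions.
partitions = [
    Partition("P0", frozenset({0, 1}), range(0x0000, 0x1000)),
    Partition("P1", frozenset({2, 3}), range(0x1000, 0x2000)),
]

local = route_coherence_request(0x0800, 0, partitions)
shared = route_coherence_request(0x2400, 0, partitions)
print(local, shared)
```

In the example given above, where a partition covers a single bus-connected range, the multicast target set is empty and the request is completed on the local bus alone.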
In an actual system, however, there can be a demand to set the partitions and other elements defined in U.S. Pat. No. 6,088,770 dynamically, for example to modify them for each program or to modify them when an arithmetic process migrates during program execution. U.S. Pat. No. 6,088,770 does not disclose any technique for dynamically modifying the partitions.
The present invention is intended to realize a multiprocessor system that allows dynamic setting of partitions and simultaneous achievement of bus-type high-speed processing and NUMA scalability.