The present invention relates to a multiprocessor system having distributed shared memory, and in particular, to a multiprocessor system having distributed shared memory capable of executing instructions and conducting instruction scheduling so that a program is executed efficiently.
Recently, the operation speed of instruction processors has been remarkably increased as a result of development of semiconductor processes and logical processing methods.
In contrast, the main storage is required to have an ever larger storage capacity, and it is difficult to increase its operation speed. Main storage access efficiency is therefore a bottleneck for system efficiency.
This problem can be solved by, for example, a method in which a high-speed, small-capacity data cache is disposed near an instruction processor to copy part of the main storage onto the cache.
When the instruction processor processes a load instruction, the processor reads data from the main storage to write the data in a register and registers the data to the data cache at the same time.
When a load instruction is later issued for data thus registered in the cache, the instruction processor can execute the load from the data cache without accessing the main storage.
In general, in consideration of the locality of access patterns to the main storage, the system registers not only the actually required data, i.e., the critical data, but also other associated data. Specifically, data is registered in block units (lines), each block including several tens of bytes of data at contiguous addresses.
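The line-unit registration described above can be illustrated with a minimal sketch. The 64-byte line size, the class name DataCache, and the helper line_base are assumptions chosen for illustration only; the toy cache has no capacity limit and models only which lines are registered.

```python
LINE_SIZE = 64  # bytes per line; an assumed value ("several tens of bytes")


def line_base(addr: int, line_size: int = LINE_SIZE) -> int:
    """Return the first address of the line that contains addr."""
    return addr - (addr % line_size)


class DataCache:
    """Toy data cache that only remembers which lines are registered."""

    def __init__(self, line_size: int = LINE_SIZE):
        self.line_size = line_size
        self.lines = set()

    def load(self, addr: int) -> str:
        """Return 'hit' or 'miss'; a miss registers the whole line."""
        base = line_base(addr, self.line_size)
        if base in self.lines:
            return "hit"
        self.lines.add(base)  # critical data and its neighbors arrive together
        return "miss"


cache = DataCache()
results = [cache.load(a) for a in (0x1008, 0x1010, 0x103F, 0x1040)]
print(results)  # ['miss', 'hit', 'hit', 'miss']
```

The first load misses and registers the line starting at 0x1000; the next two loads fall in the same line and hit, while 0x1040 begins a new line and misses.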
When the data to be used by a load instruction can be predicted, the method of JP-A-10-283192, published on Oct. 23, 1998, can be used, in which a prefetch instruction is employed to register the data in a data cache beforehand.
If the prefetch instruction can be issued sufficiently in advance for the registration of the associated data in the data cache to complete before the load instruction is processed, the instruction processor can obtain the necessary data from the data cache for the load instruction.
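The timing condition above is commonly expressed as a prefetch distance, i.e., the number of loop iterations by which the prefetch precedes the load. The following is a minimal sketch of that calculation; the function names and the cycle counts are hypothetical and are not taken from the cited publications.

```python
def prefetch_distance(latency_cycles: int, cycles_per_iteration: int) -> int:
    """Smallest number of iterations ahead that covers the memory latency."""
    if latency_cycles < 0 or cycles_per_iteration <= 0:
        raise ValueError("latency must be >= 0 and iteration time must be > 0")
    # Ceiling division: the prefetch must be issued at least this early.
    return -(-latency_cycles // cycles_per_iteration)


def prefetch_schedule(n_iterations: int, distance: int) -> list:
    """For iteration i, the index to prefetch, or None past the end of the data."""
    return [
        i + distance if i + distance < n_iterations else None
        for i in range(n_iterations)
    ]


# Assumed numbers: 200-cycle memory latency, 25 cycles of work per iteration.
distance = prefetch_distance(200, 25)
print(distance)                 # 8
print(prefetch_schedule(5, 2))  # [2, 3, 4, None, None]
```

With these assumed numbers, a prefetch issued eight iterations before the corresponding load would complete just in time for the load to be served from the data cache.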
As described in JP-A-2000-339157 laid-open on Dec. 8, 2000, recent computer systems include data caches configured in a hierarchic order such that a prefetch instruction can be issued with specification of a cache at a desired level for the data registration.
On the other hand, shared memory systems include a uniform memory access (UMA) system, in which the memory access latency is fixed regardless of the physical address to be accessed, as shown in FIG. 2, and a non-uniform memory access (NUMA) system, in which the memory access latency varies depending on the physical address to be accessed, as shown in FIG. 3.
The UMA system includes a plurality of processors, a main storage controller, and a main storage.
The NUMA system is a composite system including nodes coupled with each other by an inter-node interface, each node including a plurality of processors, a main storage controller, and a main storage.
In the NUMA system, when the memory address of a request issued from a processor belongs to the node to which the processor belongs, the request is processed with short latency, i.e., local access latency, and when the memory address belongs to a node other than the node to which the processor belongs, the request is processed with long latency, i.e., remote access latency.
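The distinction between local and remote access latency can be sketched as follows. The address-to-node mapping (each node owning one contiguous 1 GiB range) and the two latency values are assumptions made only for this illustration; an actual NUMA system may map addresses to nodes differently.

```python
NODE_MEM_BYTES = 1 << 30  # assumed: each node owns a contiguous 1 GiB range
LOCAL_LATENCY = 100       # cycles, hypothetical local access latency
REMOTE_LATENCY = 400      # cycles, hypothetical remote access latency


def home_node(addr: int) -> int:
    """Node whose main storage holds the given physical address."""
    return addr // NODE_MEM_BYTES


def access_latency(requesting_node: int, addr: int) -> int:
    """Latency seen by a processor on requesting_node for the given address."""
    if home_node(addr) == requesting_node:
        return LOCAL_LATENCY
    return REMOTE_LATENCY


addr_on_node1 = NODE_MEM_BYTES + 0x1000
print(access_latency(0, 0x1000))         # 100 (local to node 0)
print(access_latency(0, addr_on_node1))  # 400 (remote for node 0)
print(access_latency(1, addr_on_node1))  # 100 (local to node 1)
```

The same physical address therefore yields a different latency depending on which node issues the request, which is the property that a compiler cannot generally resolve at compile time.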
The prefetch technique is particularly effective in the UMA system. Because the memory access latency is fixed, when a program using the prefetch technique is compiled by a compiler, the compiler can relatively easily schedule prefetch instructions in the program.
In the NUMA system, it is difficult for the compiler to assume or to estimate the memory access latency in the program compilation, and hence it is also difficult to schedule the prefetch instructions.
For example, the compiler can schedule all prefetch instructions on the assumption of the remote access latency.
However, two problems arise in this case as well.
Problem 1: Resources that control the prefetches being processed in the processor are kept occupied for a longer period of time (corresponding to the remote access latency) than in the UMA system.
Problem 2: When the data is registered to the cache of the processor too early, another data item at another address may overwrite the data before the data is used.
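Both problems can be illustrated with a rough sketch under simplifying assumptions. All names and numbers below are hypothetical: one prefetch is issued per loop iteration, the toy cache replaces lines in first-in first-out order, a prologue prefetches the first lines before the loop starts, prefetches complete instantly, and a demand miss does not register its line, so that the count isolates the fate of the prefetched lines.

```python
from collections import OrderedDict


def outstanding_prefetches(latency_cycles: int, issue_interval_cycles: int) -> int:
    """Problem 1: prefetches in flight when one is issued every interval."""
    return -(-latency_cycles // issue_interval_cycles)  # ceiling division


def hits_with_distance(n_lines: int, distance: int, capacity: int) -> int:
    """Problem 2: count loads that still find their prefetched line cached."""
    cache = OrderedDict()  # keys are line numbers, oldest registration first

    def register(line: int) -> None:
        if line in cache:
            return
        if len(cache) >= capacity:
            cache.popitem(last=False)  # the oldest line is overwritten
        cache[line] = True

    for line in range(min(distance, n_lines)):  # prologue prefetches
        register(line)

    hits = 0
    for i in range(n_lines):
        if i + distance < n_lines:
            register(i + distance)  # prefetch for a later iteration
        if i in cache:
            hits += 1  # the load is served from the cache
    return hits


# Assumed numbers: 25 cycles per iteration, local 100 / remote 400 cycles.
print(outstanding_prefetches(100, 25))  # 4 control resources occupied
print(outstanding_prefetches(400, 25))  # 16 control resources occupied

# Assumed 8-line cache: a short distance keeps every line until it is used,
# while a distance sized for remote latency overwrites lines before use.
print(hits_with_distance(100, 4, 8))
print(hits_with_distance(100, 16, 8))
```

In this toy model, scheduling for the remote access latency quadruples the number of prefetches in flight, and a prefetch distance that exceeds the cache capacity causes most prefetched lines to be overwritten before the corresponding load is processed.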
For the NUMA multiprocessor system of the prior art, it is difficult to efficiently schedule prefetch instructions.