1. Field of the Invention
The present invention relates to a multiprocessor system that has a plurality of system boards, each including a CPU and a memory, connected with each other via a global address crossbar, and that symmetrically assigns processing to all the CPUs by inputting addresses to the global address crossbar. More particularly, the present invention relates to a multiprocessor system that reduces the latency of reads from a memory.
2. Description of the Related Art
A symmetric multiprocessor (SMP) has, as shown in FIG. 9, a plurality of system boards (SBs) each of which includes a CPU and a memory, and a plurality of IO units (IOUs) connected with the SBs via a global address crossbar and a global data crossbar. The symmetric multiprocessor adopts a parallel processing system having a characteristic that processing is symmetrically and equally assigned to all the CPUs.
When a CPU requests data, the symmetric multiprocessor collects information (the information that is the object of the local cast described later) using the global address crossbar. The information indicates in which memory the data is present, in the cache of which CPU the data is present, whether the data in the cache of the CPU has been rewritten by the CPU, whether resources (queues and so on) necessary for transmitting the data to the CPU at the request source are exhausted, whether the data is to be rewritten by a preceding command, and whether accesses to the data compete against each other. Based on the information, the symmetric multiprocessor indicates to the respective system boards what kind of processing each is to perform, so that the CPU at the data request source can acquire the requested data.
A multiprocessor system including the SMP inputs addresses of data requested by the respective CPUs to the global address crossbar and arbitrates the addresses using the global address crossbar to determine a system board that processes the data at the addresses. The multiprocessor system notifies the respective system boards of a result of the determination to symmetrically and equally assign processing to all the CPUs.
Memory read processing executed in each of the plurality of system boards included in the SMP will be explained with reference to FIG. 10.
When a CPU 1 issues a read request designating an address of data present in a memory 2, the address is inputted to a global address crossbar 8 via a master address queue 4. The address is notified from the global address crossbar 8, as a snoop address, to a pipeline 3 included in each of the plurality of system boards. Therefore, the address issued by the CPU 1 is also returned to the pipeline 3 included in the system board mounted with the CPU 1 at the read request source.
In response to the notification, the pipeline 3 included in the system board having the CPU 1 at the read request source speculatively executes a read command designating the snoop address notified from the global address crossbar 8 as a memory read address.
In response to the speculative execution of the read command, the memory read address is queued in a slave memory read address queue 5. According to the queuing, data is read from the memory 2. The data is queued in a slave memory read data queue 6 and waits for an instruction from the pipeline 3.
On the other hand, the pipeline 3 included in the system board having the CPU 1 at the read request source collects, following the speculative execution of the read command, the information described above to be an object of local cast and local-casts the information to the global address crossbar 8.
In response to the local cast, the global address crossbar 8 collects the information described above from each of the system boards. The global address crossbar 8 performs checks such as a CPU cache check, an address busy check, and a resource exhaustion check for the system as a whole to determine whether the speculative execution of the read command performed by the pipeline 3 included in the system board having the CPU 1 at the data request source is to be adopted and whether it is necessary to retry the read command. The global address crossbar 8 global-casts a result of the determination to the pipelines 3 included in all the system boards.
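The system-wide determination described above can be illustrated with a minimal sketch. The text names the checks but does not state how their results are combined, so the policy below is an assumption made only for illustration: the speculative read is not adopted when any board reports rewritten data in a CPU cache, and a retry is requested when any board reports a busy address or exhausted resources. The names `LocalCast` and `global_determination` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class LocalCast:
    """Hypothetical per-board status that a pipeline local-casts to the crossbar."""
    cache_holds_modified: bool = False  # a CPU cache on this board holds rewritten data
    address_busy: bool = False          # a preceding command targets the same address
    resources_exhausted: bool = False   # queues needed for the transfer are full


def global_determination(casts: Iterable[LocalCast]) -> Tuple[bool, bool]:
    """Return (adopt, retry) for one speculative read.

    Assumed policy, not taken from the text: any busy address or exhausted
    resource forces a retry; rewritten data in any cache means the data read
    from memory must not be adopted.
    """
    casts = list(casts)
    retry = any(c.address_busy or c.resources_exhausted for c in casts)
    cached = any(c.cache_holds_modified for c in casts)
    adopt = not retry and not cached
    return adopt, retry
```

The returned pair corresponds to the result that is global-cast to the pipelines of all the system boards.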
In response to the global cast, the pipeline 3 included in the system board having the CPU 1 at the data request source acts on the notification from the global address crossbar 8. When the speculative execution is adopted, the pipeline 3 instructs the slave memory read data queue 6 to transmit the queued data to the CPU 1, causing the slave memory read data queue 6 to queue the data to a master read data queue 7. Alternatively, when the speculative execution is not adopted, the pipeline 3 instructs the slave memory read data queue 6 to discard the queued data. Moreover, in instructing the slave memory read data queue 6 to discard the data, the pipeline 3 instructs the master address queue 4 to retry the read command.
In this way, the multiprocessor system including the SMP inputs addresses of data requested by the respective CPUs to the global address crossbar and arbitrates the addresses using the global address crossbar to determine a system board that processes the data at the addresses. The multiprocessor system notifies a result of the determination to the respective system boards to symmetrically and uniformly assign processing to all the CPUs.
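The memory read processing of FIG. 10 described above can be sketched as a small single-board model. This is only an illustration under simplifying assumptions: the round trip through the global address crossbar 8 is collapsed into one method call, the result of the global cast is passed in as a boolean, and the class and method names (`SystemBoard`, `issue_read`, `snoop`, `global_cast`) are hypothetical. The queue attributes are named after the elements 4 to 7 in the text.

```python
from collections import deque


class SystemBoard:
    """Toy model of the queues on one system board (reference numerals in comments)."""

    def __init__(self, memory):
        self.memory = memory                         # memory 2, modeled as a dict
        self.master_address_queue = deque()          # master address queue 4
        self.slave_read_address_queue = deque()      # slave memory read address queue 5
        self.slave_read_data_queue = deque()         # slave memory read data queue 6
        self.master_read_data_queue = deque()        # master read data queue 7

    def issue_read(self, address):
        """CPU 1 issues a read request; the address enters the master address queue."""
        self.master_address_queue.append(address)

    def snoop(self):
        """The address comes back as a snoop address and is read speculatively."""
        address = self.master_address_queue.popleft()
        self.slave_read_address_queue.append(address)
        read_address = self.slave_read_address_queue.popleft()
        # The data waits in the slave read data queue for the pipeline's instruction.
        self.slave_read_data_queue.append((read_address, self.memory[read_address]))

    def global_cast(self, adopt):
        """Apply the adoption/non-adoption result to the waiting data."""
        address, data = self.slave_read_data_queue.popleft()
        if adopt:
            self.master_read_data_queue.append(data)   # transmit toward CPU 1
        else:
            self.master_address_queue.append(address)  # discard the data and retry
```

For example, a read that is first rejected and then adopted passes through `issue_read`, `snoop`, `global_cast(adopt=False)`, `snoop`, and `global_cast(adopt=True)`, after which the data sits in `master_read_data_queue`.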
In the present invention, as explained below, the multiprocessor system including the SMP realizes reduction in latency of read from a memory to realize improvement of processing performance of the system. As a conventional technique related to the present invention, there is, for example, an invention described in Japanese Patent Application Laid-Open No. 2001-184321.
In a system including a CPU having a large-scale cache, latency of read from a memory substantially affects processing performance of the system. When latency is short, processing performance of the system is improved.
The multiprocessor system including the SMP has a characteristic that data can be read from a memory by the respective nodes with equal latency. However, the latency of a read from a memory physically close to the requesting CPU is worse than that in a small-sized multiprocessor system of a non-SMP structure.
Through optimization of a program, the CPU of each of the nodes can preferentially use the memory of its own node. However, this advantage is not obtained in a large-scale multiprocessor system of an SMP structure.
Therefore, in the multiprocessor system including the SMP, it can be expected that processing performance of the system is improved simply by reducing latency of read from the memory of the own node.
To reduce the latency of reads from a memory, it is important to reduce the latency of the respective modules. The latency can also be reduced by reducing the queuing time of processing, that is, the time one piece of processing spends waiting for another. This is because, when processing must wait in this way, the latency of the entire system is determined by the longer of the latencies involved.
A queuing time caused by memory read processing executed by each of the system boards included in the SMP will be explained with reference to FIG. 10.
In the memory read processing, as shown in FIG. 10, the slave memory read data queue 6 performs processing for queuing read-out of data from the memory 2 and adoption/non-adoption notification for speculative execution of a read command from the pipeline 3 (notification indicating whether data is transmitted or discarded).
In this case, when the delay due to the latency of the global address, which arises from the processing of the global address crossbar 8, is large, the memory data arrives at the slave memory read data queue 6 earlier, and the adoption/non-adoption notification of speculative execution of the read command arrives at the slave memory read data queue 6 later. Conversely, when the delay due to the data read latency of the memory 2 is large, the adoption/non-adoption notification of speculative execution of the read command arrives at the slave memory read data queue 6 earlier, and the memory data arrives at the slave memory read data queue 6 later.
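The two cases above can be expressed as a short calculation. The following sketch assumes that both latencies are measured from the same starting point and in the same arbitrary time unit; the function name `read_completion` and the numbers in the usage note are hypothetical and are not taken from any measurement.

```python
def read_completion(memory_latency, global_address_latency):
    """Return (completion_time, queuing_time, waiting_item) at the slave read data queue.

    The data can leave the queue only when both the memory data and the
    adoption/non-adoption notification have arrived, so the later arrival
    determines the completion time.
    """
    completion_time = max(memory_latency, global_address_latency)
    queuing_time = abs(memory_latency - global_address_latency)
    if memory_latency < global_address_latency:
        waiting_item = "data"            # the data waits for the notification
    elif memory_latency > global_address_latency:
        waiting_item = "notification"    # the notification waits for the data
    else:
        waiting_item = None              # both arrive together; nothing waits
    return completion_time, queuing_time, waiting_item
```

With a hypothetical memory latency of 40 and a global address latency of 100, the data waits 60 time units in the queue; with latencies of 120 and 100, the notification waits 20 time units. In both cases the overall latency equals the longer of the two.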
As seen from the above, in the conventional technique, in the memory read processing executed by each of the system boards included in the SMP, a queuing time is inevitably caused in the slave memory read data queue 6. Therefore, there is a problem in that it is impossible to improve the processing performance of the system.