1. Field of the Invention
The present invention relates to multi-processor computer systems and, more particularly, to the processing of memory access requests within a multi-processor computer system.
2. Description of the Related Art
Although computation speeds of conventional processors have increased dramatically, there is still a need for even faster computing. Large computational problems such as weather forecasting, fusion modeling, and aircraft simulation demand substantial computing power, far in excess of what can currently be supplied. While processor speed is improving as device speeds increase, the achieved performance levels are still inadequate to handle computationally complex problems.
To achieve high performance computing, a plurality of individual processors have been interconnected to form a multiprocessor computer system capable of providing parallel processing.
In a multiprocessor computing system, there are two sources of delay in satisfying processor memory requests. The first source of delay is the access time to the main memory, and the second is the communication delay imposed by the interconnection network that connects the various processors. If the bandwidth of the interconnection network is inadequate, the communication delays are greatly increased due to contention for the bandwidth.
One suggested solution to both the bandwidth and access time limitations of interconnection networks is the use of private cache memories at the individual processors. By properly selecting cache parameters, both the transfer ratios (the ratio of memory requests passed on to the main memory from the cache to initial requests made of the cache) and effective access times can be reduced. Unfortunately, private caches introduce a stale data problem (or multicache coherency problem) due to the multiple copies of main memory locations which may be present.
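By way of illustration only, the transfer ratio defined above can be computed as in the following minimal Python sketch. The function names, the example figures, and the simple timing model (every request pays the cache access time, and the fraction passed on to main memory additionally pays the main memory access time) are assumptions made for this example and are not taken from any particular system.

```python
def transfer_ratio(initial_requests, requests_passed_to_memory):
    """Ratio of requests the cache passes on to main memory
    to the initial requests made of the cache."""
    if initial_requests <= 0:
        raise ValueError("initial_requests must be positive")
    return requests_passed_to_memory / initial_requests


def effective_access_time(ratio, cache_time, memory_time):
    """Assumed model: each request costs cache_time, and the fraction
    `ratio` that reaches main memory costs memory_time in addition."""
    return cache_time + ratio * memory_time


if __name__ == "__main__":
    # Hypothetical figures: 1000 requests, 50 of which reach main memory.
    ratio = transfer_ratio(1000, 50)
    print("transfer ratio:", ratio)
    print("effective access time:", effective_access_time(ratio, 1.0, 20.0))
```

Under this assumed model, a lower transfer ratio reduces both the traffic presented to the interconnection network and the effective access time seen by the processor.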
Another suggested solution involves the use of coherency directories. Coherency directories are generally large, separate blocks of memory which keep track of which processor in the multiprocessor computer system owns which lines of memory. Unfortunately, coherency directories can be expensive since additional memory is required and slow since coherency directories are typically structured in a table lookup format. Coherency directories can also severely degrade overall system performance since a memory call must be initiated for every address request.
More recently, shared memory multiprocessing systems have interconnected processors (or groups of processors) by a single bus (e.g., an address bus). Unfortunately, as processor speeds increase, the feasible number of processors that can be connected through a single bus decreases. One problem with using a bus is that performance degrades as more devices are added to the bus. This means that the bandwidth of a bus available to each processor actually shrinks as more processors are added to the bus.
FIG. 1A is a block diagram of a portion of a conventional multiprocessor computer system 100 illustrating typical snoop result paths between various processor groups. Computer system 100 includes a first processor group 110, a second processor group 120, a third processor group 130, an address interconnect 150, and a data interconnect 160. It should be noted that although only three (3) processor groups are shown in FIG. 1A, multiprocessor computer system 100 typically includes any suitable number of processor groups. Communication between processor groups 110, 120, and 130 is provided by way of bidirectional buses 140 and 142. Each of processor groups 110, 120, and 130 includes a snoop results distributor and an address repeater. The address repeaters are used to communicate with address interconnect 150 by way of bidirectional buses 140 and 142. Generally, address interconnect 150 broadcasts address requests to every address repeater within computer system 100, whereas data interconnect 160 operates as a point-to-point router.
In operation, processor groups 110, 120 and 130 transmit their respective memory address requests directly to address interconnect 150. Address interconnect 150 will arbitrate any conflicting address requests and will simultaneously broadcast the chosen address request back to all groups of processors (including the original requester group) within system 100. Once the address request is received, each processor group will generate and store a group snoop result in its own snoop results distributor. Each group's snoop results distributor will then broadcast its group snoop result to the snoop results distributors of all other processor groups in system 100. In this manner, every processor group within computer system 100 obtains the group snoop results of every other processor group. Thereafter, the processor group initiating the address request is directed to the appropriate memory location within the computer system 100. A conventional multiprocessor system utilizing a snoop system having such a snoop results distributor is exemplified by the STARFIRE system manufactured by Sun Microsystems, Inc. of Mountain View, Calif.
FIG. 1B is a flowchart illustrating a typical memory address request transaction in the conventional multiprocessor computer system 100 shown in FIG. 1A.
The conventional multiprocessor computer system memory address request transaction process 150 begins with an individual processor sending 10 an address request to the associated address repeater. As is known to those skilled in the art, at least one processor in a processor group will typically generate an address request to seek a specific block of memory. An address request typically will be associated with a specific memory command indicative of the purpose for which the block of memory is being requested by the processor. The address repeater will forward 12 the received address request to the address interconnect associated with conventional multiprocessor computer system 100. The address interconnect, after appropriate conflict arbitration, will broadcast 14 the chosen address request to all address repeaters included within conventional multiprocessor computer system 100, including the address repeater associated with the original requester group of processors. Each associated address repeater will broadcast 16 the received address request to each of its associated individual processors. Each individual processor will in turn query 18 its respective memory cache to determine whether it holds an owned or shared copy of the requested memory address. Based on the determining 18, each processor will generate an individual snoop result which is subsequently forwarded 20 to the snoop results distributor associated with the group of processors. The snoop results distributor then combines 22 all individual snoop results received from individual processors to form a group snoop result. The snoop results distributor then broadcasts 24 the group snoop result to all other snoop results distributors within computer system 100, since each snoop results distributor is capable of broadcasting its group snoop result to, and receiving the group snoop results from, all other groups of processors within system 100.
Each snoop results distributor will combine 26 the group snoop results received from all other snoop results distributors within computer system 100 to form a global snoop result. The global snoop result contains all information relating to the ownership of the page of memory associated with the requested memory address for all groups of processors within system 100. Each snoop results distributor will forward 28 the global snoop result to all individual processors within its associated group of processors. Upon receipt of the global snoop result, the original requester processor will obtain 30 the requested page of memory.
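By way of illustration only, the combining of individual snoop results into group snoop results, and of group snoop results into a global snoop result, can be sketched in Python as follows. The three-state encoding (NONE, SHARED, OWNED), the rule that the strongest state wins when results are combined, the dictionary used to stand in for a cache, and all names below are assumptions made for this example; the conventional system described above does not specify them.

```python
from enum import IntEnum


class Snoop(IntEnum):
    """Hypothetical snoop result encoding; a higher value is a stronger claim."""
    NONE = 0
    SHARED = 1
    OWNED = 2


def individual_snoop(cache, address):
    # A processor queries its own cache for the requested address.
    return cache.get(address, Snoop.NONE)


def combine(results):
    # Assumed combining rule: the strongest state among the inputs wins.
    return max(results, default=Snoop.NONE)


def snoop_transaction(groups, address):
    """Return (group snoop results, global snoop result) for one address request.

    `groups` is a list of processor groups; each group is a list of caches,
    and each cache is a dict mapping an address to a Snoop state.
    """
    group_results = [
        combine(individual_snoop(cache, address) for cache in group)
        for group in groups
    ]
    # Every distributor sees every group result, so all of them
    # arrive at the same global snoop result.
    return group_results, combine(group_results)


def distributor_messages(num_groups):
    # Each distributor broadcasts its group result to every other distributor.
    return num_groups * (num_groups - 1)


# Hypothetical three-group system with one cached line at address 0x40.
GROUPS = [
    [{0x40: Snoop.SHARED}, {}],
    [{0x40: Snoop.OWNED}],
    [{}, {}],
]

if __name__ == "__main__":
    per_group, overall = snoop_transaction(GROUPS, 0x40)
    print([r.name for r in per_group], overall.name)
    print("distributor messages per request:", distributor_messages(len(GROUPS)))
```

In this sketch the number of group snoop result messages exchanged per address request grows with the square of the number of processor groups, since every distributor sends its result to every other distributor.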
As the number of processors added to the computer system increases, the amount of irrelevant data on the address bus degrades overall system performance. By way of example, as more processors are added to the computer system, a point is reached at which the maximum address bandwidth is exhausted and no additional performance is gained by adding more processors.
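By way of illustration only, the saturation of the address bandwidth can be sketched with the following simplified Python model. The model assumes that every processor issues address requests at the same fixed rate and that the address bus delivers at most a fixed number of requests per unit time; the function names and figures are hypothetical and chosen only for this example.

```python
def delivered_requests(num_processors, requests_per_processor, max_address_bandwidth):
    """Address requests actually serviced per unit time under the assumed model."""
    if num_processors <= 0:
        raise ValueError("num_processors must be positive")
    offered = num_processors * requests_per_processor
    # The bus cannot deliver more than its maximum address bandwidth.
    return min(offered, max_address_bandwidth)


def per_processor_share(num_processors, requests_per_processor, max_address_bandwidth):
    """Serviced requests per unit time available to each processor."""
    total = delivered_requests(
        num_processors, requests_per_processor, max_address_bandwidth
    )
    return total / num_processors


if __name__ == "__main__":
    # Hypothetical figures: 10 requests per processor, bus limit of 100 requests.
    for n in (4, 10, 20, 40):
        print(n, delivered_requests(n, 10, 100), per_processor_share(n, 10, 100))
```

Under these assumptions, total delivered requests stop growing once the offered load reaches the bus limit, and the share available to each processor falls as further processors are added.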
Thus, there is a need for techniques to reduce transmission of address requests between various processors in a multiprocessor computer system.