1. Field of the Present Invention
The present invention generally relates to the field of microprocessor based computers and more specifically to memory subsystem micro architecture in a multiprocessor system.
2. History of Related Art
Typical multiprocessor computer systems, until recently, have been designed using a set of discrete, separately packaged microprocessors. The set of microprocessors were typically interconnected via a shared or bi-directional bus commonly referred to as a host bus or local bus. The shared host bus architecture had the advantage of freeing up more pins for other signals in pin-limited microprocessor designs. In addition, the shared bus architecture implied a single active address in any given cycle that simplified arbitration and coherency management. Unfortunately, the shared bus, multiprocessor architecture requires a complex protocol for requesting and granting the system bus, retrying operations, and so forth. The complexity and handshaking inherent in the bus protocols implied by shared bus systems significantly hampers the ability to pipeline processor operations that require use of the local bus (i.e., any operation that accessed memory below the L1 cache level of the system). As fabrication technology has progressed to the point that single chip, multiprocessor devices have become a reality, little attention has been devoted to the possible architectural advancements afforded by the elimination of pin count considerations that constrained multi-chip designs. Accordingly, much of the potential for improved performance offered by single chip devices has gone unfulfilled.
The problems identified above are in large part addressed by a multiprocessor system implemented with unidirectional address and data busses between the set of processors and a memory subsystem driven by a single arbiter and a unified pipeline through which all memory subsystem operations are passed. By using a single point of arbitration, the invention greatly simplifies the micro-architecture of the memory subsystem. This simplification in architecture enables a high degree of memory subsystem operation pipelining that can greatly improve system performance.
Broadly speaking, a first embodiment of the invention emphasizing a single point of coherency arbitration and coherency enforcement includes a memory subsystem for use with a multiprocessor computer system. The memory subsystem includes an operation block adapted for queuing an operation that misses in an L1 cache of a multiprocessor. The multiprocessor is comprised of a set of processors, preferably fabricated on a single semiconductor substrate and packaged in a single device package. The memory subsystem further includes an arbiter that is configured to receive external snoop operations from a bus interface unit and a queued operation from the operation block. The arbiter is configured to select and initiate one of received operations. Coherency is maintained by forwarding the address associated with the operation selected by the arbiter to each of a plurality of coherency units. In this manner, external and internal snoop addresses are arbitrated at a single point to produce a single subsystem snoop address that is propagated to each coherency unit. Preferably, the operation block includes a load miss block suitable for queuing load type operations and a store miss block suitable for queuing store type operations. In one embodiment, the subsystem includes a unidirectional local interconnect suitable for connecting the memory subsystem and the set of processors. The coherency units preferably include the L1 cache units of the set of processors, the operation block queues, and each stage of a memory subsystem pipeline.
The first embodiment of the invention further contemplates a method of maintaining coherency in a multiprocessor computer system in which an external snoop operation is received via a system bus and an internal operation is received from the operation block. An arbitration takes place between the external and internal operations. The arbitration selects and initiates one of the operations and thereby generates a single snoop address. This single snoop address is the broadcast to each of the coherency units to generate a plurality of snoop responses. Preferably the arbitration of the operations is resolved according to a fairness algorithm such as a round robin algorithm. In one embodiment, the plurality of snoop responses are forwarded to a snoop control block unit that is adapted to monitor and modify operations queued in the operation block.
A second embodiment of the invention emphasizing resources for managing queued operations to eliminate retry mechanisms contemplates a multiprocessor computer system including a set of processors. Each processor in the set includes an execution unit for issuing operations and a processor queue suitable for queuing previously issued and still pending operations. The multiprocessor further includes means for forwarding operations issued by the processor to the processor queue and to an operation block queue of a memory subsystem that is connected to the multiprocessor. The depth of (i.e., the number of entries in) the operation block queue matches the depth of the processor queue. The processor queue, when full, inhibits the processor from issuing additional operations. In this manner, an operation issued by the processor is guaranteed an available entry in the operation block queue of the memory subsystem thereby eliminating the need for operation retry circuitry and protocols such as handshaking. Preferably, each processor queue includes a processor load queue and a processor store queue and the operation block queue includes a load queue and a store queue. In this embodiment, the depth of each of the processor load and store queues matches the depth of the operation block load and store queues respectively. In the preferred embodiment, the operation block is comprised of a load miss block that includes the operation block load queue and a store miss block that includes the operation block store queue. Still further preferably, the operation block store queue includes a set of store queues corresponding to the set of processors and the operation block load queue includes a set of load queues corresponding to the set of processors. Each queue entry preferably includes state information indicative of the status of the corresponding entry.
The second embodiment of the invention further contemplates a method of managing operation queue resources in a multiprocessor computer system. The method includes queuing an operation in a processor queue and in an operation block queue of a memory subsystem and detecting when the processor queue lacks an available entry (i.e., the queue is full). In response to detecting a processor full condition, the processor is then prevented from issuing additional operations thereby assuring that issued operations are guaranteed an entry in the operation block queue. Preferably, the step of queuing includes queuing load operations and store operations separately and queuing operations from each processor separately. In one embodiment, the step of detecting the lack of an available entry includes interpreting status bits associated with each entry in the processor queue. Preferably, the status of an operation in the processor queue is the same as the status of a corresponding operation in the operation block queue.
A third embodiment of the invention emphasizing efficient critical word forwarding contemplates a multiprocessor computer system including a multiprocessor device preferably comprised of a set of processors, each including a respective L1 cache. The multiprocessor is preferably fabricated as a single device. The computer system includes a memory subsystem comprised of a load miss block adapted for queuing a load operation issued by a first processor that misses in an L1 cache of the first processor and a store miss block adapted for queuing store type operations. An arbiter of the memory subsystem is configured to receive queued operations from the load and store miss blocks and further configured to select and initiate one of the received operations. The subsystem further includes means for forwarding the address associated with the load miss operation to a lower level cache and means for receiving a hit/miss response from the lower level cache. In the preferred embodiment, the load miss block is adapted to detect the response from lower level cache and to request a bus interface unit to fetch data via a system bus if the lower level cache responds with a miss. The bus interface unit is configured to signal the load miss block when a first portion of the fetched data is available. In response thereto, the load miss block is configured to initiate a forwarding operation that returns the first portion of the data to the requesting processor if the forwarding operation can be initiated without displacing a valid load miss operation. The store and load miss block preferably each include separate store miss queues for each processor of the multiprocessor. The bus interface unit is preferably further configured to signal the load miss block when the entire granule (i.e., cache line) of requested data is available. The forwarding operation is preferably initiated if a first stage of a load miss block pipeline is invalid at some point after the first portion of data is available, but before the entire requested data is available.
The third embodiment of the invention still further contemplates a method of fetching data from a bus interface unit for reloading a cache. Initially, a bus interface unit is requested to fetch data via a system bus. A critical data signal is received by a load miss block from the bus interface unit indicating that a critical portion of the fetched data is available. The load miss block then determines if a forwarding operation may be initiated without displacing a valid operation. Next, depending upon the result of determining whether the forwarding operation may be initiated, the forwarding operation is either initiated or retried. In one embodiment, the bus interface unit is requested to fetch data in response to receiving a miss response from an L2 or lower level cache. Preferably, the method further includes successfully arbitrating the forwarding operation and sending the critical data to a requesting processor. After the entire line of fetched data has been forwarded to the bus interface unit, the entire line is reloaded into the L1 cache.
A fourth embodiment of the invention emphasizing efficient handling of local interventions (cache-to-cache transfers) contemplates a multiprocessor computer system including a set of processors connected to a memory subsystem via a local interconnect. The memory subsystem includes a load miss block suitable for queuing a first processor load operation that misses in an L1 cache of the first processor and a store miss block suitable for queuing store type operations. The subsystem further includes an arbiter suitable for receiving queued operations from the load and store miss blocks. The arbiter is further configured for selecting one of the received operations and initiating the selected operation. The subsystem further includes means for snooping the address associated with the first processor load operation when the first processor load operation is selected and initiated by the arbiter. The subsystem further includes a snoop control block adapted to receive a snoop response from a second processor associated with the memory subsystem. The snoop control block is further adapted to queue a store type operation in the store miss block if the snoop response from the second processor is modified. The subsystem is configured to link the store type operation with the first load operation when the store type operation is initiated. When the linked operations complete (together), the data associated with the store type operation, which is preferably written to an L2 or lower level cache, will also satisfy the first load operation. The local interconnect is preferably comprised of a unidirectional bus. In the preferred embodiment, the load and store blocks each include control pipelines with corresponding stages wherein each stage has its own validity information. In this embodiment the corresponding stages of the load miss and store miss blocks are linked by simultaneously validating a first stage of the load miss block when forwarding operation is initiated (i.e., when the forwarding operation wins arbitration by the arbiter). The output of the arbiter is preferably connected to a first stage of a memory subsystem pipeline. The snoop access and L2 access are preferably initiated when the operation enters the first stage of the pipeline. In the preferred embodiment, the depth of the pipeline is sufficient to determine the snoop response and L2 access response (i.e., hit or miss) by the time an operation has reached a last stage of the pipeline.
The fourth embodiment of the invention further contemplates a method of completing a load operation in a multiprocessor system in which, responsive to a first processor load operation that misses in an L1 cache of the first processor, the load operation address is snooped. When a modified snoop response from an L1 cache of a second processor is detected, a store type operation associated with the second processor is queued and forwarded to an arbiter. The store type operation is linked to the first processor load operation when the store type operation is selected and initiated by the arbiter. The data portion of the store type operation satisfies the first processor load operation when the store type operation completes. The step of linking the store type operation and the load operation preferably comprises validating the load operation in a first stage of the load miss block""s pipeline when the store type operation is initiated. The store type operation preferably reloads a lower level cache with the data in the modified entry of the L1 cache of the second processor and the load operation is preferably satisfied as the lower level cache is reloaded.
A fifth embodiment of the invention emphasizing data source arbitration contemplates a multiprocessor system that includes a set of processors connected to a memory subsystem via a local interconnect. The memory subsystem includes a load miss block adapted for queuing load type operations, a store miss block adapted for queuing store type operations, an arbiter configured to receive and arbitrate queued operations from the load and store miss blocks as well as operations directly from the set of processors, and means for reloading an L1 cache. The means for reloading the L1 cache reload the cache with data from a first data source via a reload data bus upon completion of a first operation arbitrated through the arbiter and means for reloading the L1 cache with data from a second data source via the reload data bus upon completion of a second operation arbitrated through the arbiter. In this manner, operations requiring a reload of L1 cache are arbitrated through a common arbiter regardless of the source of data required to complete the load request. Moreover, the data is reloaded via a common data bus regardless of the source of data thereby eliminating backend data arbitration. Preferably, the means for reloading the L1 cache are connected to an L2 cache and configured to reload the L2 cache with the reload data while the L1 cache is being reloaded such that the L2 data reload is synchronized with the L1 data reload. The source of data may be another L1 cache associated with the set of processors or a bus interface unit adapted for retrieving data from a system bus. In the preferred embodiment, the local interconnect comprises a unidirectional address bus connecting the set of processors to the memory subsystem. In one embodiment, the memory subsystem includes a memory subsystem pipeline connected to the output of the arbiter wherein an arbitrated operation completes when it reaches the last stage of the pipeline.
The fifth embodiment of the invention further contemplates a method of reloading an L1 cache in a multiprocessor device. A first operation that requires data from a first data source and a second operation that requires data from a second data source are forwarded to an arbiter. In response to the first operation being selected and initiated by the arbiter, the first operation is completed and the L1 cache reloaded from the first data source via a reload data bus. In response to the second operation being selected and initiated by the arbiter, the second operation is completed and the L1 cache reloaded with the data from the second data source via the data bus. Preferably, completing the first and second operations includes forwarding the operations to a memory subsystem pipeline where the first operation is completed and the reloading of the L1 cache occurs when the first operation reaches the last stage of the pipeline. The method may further include reloading an L2 cache with the reload data when the reload data completes such that the reload of the L1 cache and the reload of the L2 cache occurs concurrently.
A sixth embodiment of the invention emphasizing managing the ordering of multiple pending bus or global interventions (i.e., cache-to-cache transfers that traverse the system bus) contemplates a computer system including a first multiprocessor system connected to a system bus and adapted to forward first and second load requests to the system bus where the first load request precedes the second load request. The system further includes a second multiprocessor system connected to the system bus. The second multiprocessor system includes a memory subsystem comprised of first and second cache levels arranged such that an operation that retrieves data from the first cache level is arbitrated through the second cache level before the data becomes available to the system bus (i.e., the first cache level is a higher cache level than the second cache level). A snoop control state machine of the second multiprocessor system is adapted to stall arbitration of a second operation initiated in the second cache level responsive to the second load request until a first operation initiated in the first cache level responsive to the first load request has been arbitrated through the second cache level. In other words, new operations to a lower cache level are stalled until older operations pass the common arbitration point. Preferably, the first cache level includes a first operation queue for storing operations awaiting arbitration in the first cache level. Operations arbitrated in the first cache level are routed to a second store queue. In one embodiment, the memory subsystem further includes a second arbiter and a third cache level. In this embodiment, operations are stored in the second store queue pending arbitration in a second arbiter. In one embodiment, a first external snoop associated with the first load request hits to a modified cache line in the first cache and a second external snoop associated with the second load request hits to a modified cache line in the second cache level. The second multiprocessor is preferably adapted to send a data ready signal to the first multiprocessor when data associated with the first load request is available for transmission over the system bus. In the preferred embodiment, the data ready signal conveys no address information. The system is preferably configured to transfer the data associated with the first load request with a data-only bus transaction following the data ready signal.
The sixth embodiment of the invention further contemplates a method of managing interventions in a computer system. A first load request is initiated and forwarded to a system bus. A second load request is initiated after the first load request and forwarded to the system bus. The first operation generates a first operation in a first cache level of a multiprocessor and the second operation generates a second operation in the second cache level of the multiprocessor where the first cache level is higher than the second cache level. The second operation is stalled until the first operation arbitrates through the second cache level. The method preferably further includes generating a data ready signal when the data associated with the first load request is available to the system bus and transferring the data associated with the first load request via the system bus using a data only bus transaction.