1. Field of the Invention
This invention relates to directory-based, shared-memory, scaleable multiprocessor computer systems and, more particularly, to methods and apparatus for avoiding transaction deadlock at any node, even if transaction flow control between nodes is not implemented.
2. Description of Related Art
Computers have internal clock circuits which drive computing cycles. The faster the clock, the faster the computer can complete assigned tasks. In the early 1980s, the average clock speed of readily-available microprocessors was about 4 megahertz. In 1996, microprocessors having clock speeds over 200 megahertz are common. Clock speed increases generally follow increases in transistor density. In the past, the number of transistors per unit area has doubled every 18 months. However, processor clock speed increases attributable to increases in transistor density are expected to slow. Increased transistor density requires more effective cooling to counteract the heat generated by increased power dissipation. In addition, the need to densely pack components in order to avoid long wire lengths and associated transmission delays only exacerbates the heat problem.
Given the fundamental power dissipation problem posed by ultra-high clock speeds, scaleable, parallel computer architectures which utilize multiple processors are becoming increasingly attractive. By the term "scaleable", it is meant that multiprocessor systems may be initially constructed from a few processors, and then expanded at a later date into powerful systems containing dozens, hundreds, or even thousands of processors. Massively-parallel computers constructed from relatively-inexpensive, high-volume microprocessors are being manufactured that are capable of providing supercomputer performance. In fact, for certain applications such as data base management, multiple-processor computer systems are capable of providing performance that is vastly superior to that provided by systems constructed with a single powerful processor, in spite of the increased overhead associated with the parallel systems.
As more efficient system software is written and as parallel system architectures mature, the power and usefulness of massively-parallel computers will increase dramatically. In order to reduce the bottlenecks associated with main memory access, massively parallel systems are being manufactured and designed that distribute main memory among individual processors, or among system nodes having multiple processors. In order to speed memory accesses, each processor within a parallel system is typically equipped with a cache. It is generally conceded that the larger the cache associated with each processor, the better the system performance.
Multi-processor, multi-cache computer systems with cache-coherent memories can be based on several cache architectures such as Non-Uniform Memory Architecture (NUMA) or Cache-Only Memory Architecture (COMA). For both types of architecture, cache-coherence protocols are required for the maintenance of coherence between the contents of the various caches. For the sake of clarification, the term "cache" shall mean only a second-level cache directly associated with a processor. The term "cache memory", on the other hand, shall apply only to the main memory within a node of a COMA-type system that functions as a cache memory, to which all processors within that node have equal access, and that is coupled directly to the local interconnect.
FIG. 1 is a block architectural diagram of a parallel computer system having NUMA architecture. Computer system 100 includes a plurality of subsystems (also known as nodes) 110, 120, . . . 180, intercoupled via a global interconnect 190. Each node is assigned a unique network node address. Each subsystem includes at least one processor, a corresponding number of memory management units (MMUs) and caches, a main memory, a global interface (GI) and a local-node interconnect (LI). For example, node 110 includes processors 111a, 111b . . . 111i, MMUs 112a, 112b, . . . 112i, caches 113a, 113b, . . . 113i, main memory 114, global interface 115, and local-node interconnect 119.
For NUMA architecture, the total physical address space of the system is distributed among the main memories of the various nodes. Thus, partitioning of the global address (GA) space is static and is determined at system boot-up (i.e., before the execution of application software). Accordingly, the first time node 110 needs to read or write to an address location outside its pre-assigned portion of the global address space, the data has to be fetched from a global address in one of the other subsystems. The global interface 115 is responsible for tracking the status of data associated with the address space of main memory 114. The status information of each memory location is stored as a memory tag (M-TAG). The M-TAGs may be stored within any memory dedicated for that use. For example, the M-TAGs may be stored as a two-bit data portion of each addressable memory location within the main memory 114, within a separate S-RAM memory (not shown), or within directory 116. Data from main memories 114, 124, . . . 184 may be stored in one or more of caches 113a, . . . 113i, 123a, . . . 123i, and 183a, . . . 183i. In order to support a conventional directory-based cache coherency scheme, nodes 110, 120, . . . 180 also include directories 116, 126, . . . 186 coupled to global interfaces 115, 125, . . . 185, respectively.
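By way of illustration only, the static partitioning described above may be sketched as follows. The class name StaticPartition, the equal-sized share per node, and the particular share size are assumptions made for this sketch and are not taken from the figures; an actual system may partition the global address space differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticPartition:
    """Boot-time split of the global address space (hypothetical layout)."""
    node_ids: tuple       # node reference numerals, e.g. (110, 120, 180)
    bytes_per_node: int   # assumption: every node owns an equal share

    def home_node(self, global_address: int) -> int:
        """Return the node whose main memory backs this global address."""
        index = global_address // self.bytes_per_node
        if not 0 <= index < len(self.node_ids):
            raise ValueError("address outside the global address space")
        return self.node_ids[index]

    def is_local(self, node_id: int, global_address: int) -> bool:
        """True if the address lies in the node's pre-assigned portion."""
        return self.home_node(global_address) == node_id


# The partition is fixed before any application software runs.
partition = StaticPartition(node_ids=(110, 120, 180),
                            bytes_per_node=0x1000_0000)
```

Under these assumptions, an access by node 110 to an address for which is_local returns False is the case in which the data has to be fetched from another subsystem.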
Since global interface 115 is also responsible for maintaining global cache coherency, global interface 115 includes a hardware and/or software implemented cache-coherency mechanism for maintaining coherency between the respective caches and main memories of nodes 110, 120, . . . 180. Cache coherency is essential in order for the system 100 to execute shared-memory programs correctly.
The description of a COMA-type computer system will be made with reference to FIG. 2. The architecture of a Cache-Only Memory Architecture (COMA) parallel computer system is similar in many respects to that of a NUMA system. However, what were referred to as main memories 114, 124, . . . 184 for NUMA architecture will be referred to as cache memories 214, 224, . . . 284 for COMA architecture. For a COMA system, responsibility for tracking the status of total addressable space is distributed among the respective M-TAGs and directories of the various nodes (e.g., 210, 220, . . . 280). Partitioning of the cache memories (e.g., 214, 224, . . . 284) of the COMA-type computer system 200 is dynamic. That is to say that these cache memories function as attraction memory wherein cache memory space is allocated in page-sized portions during execution of software as the need arises. Nevertheless, cache lines within each allocated page are individually accessible.
Thus, by allocating memory space in entire pages in cache memories 214, 224, . . . 284, a COMA computer system avoids capacity and associativity problems that are associated with caching large data structures in NUMA systems. In other words, by simply replacing the main memories of the NUMA system with similarly-sized page-oriented cache memories, large data structures can now be cached in their entirety.
For COMA systems, the global interface 215 has a two-fold responsibility. As in the NUMA system, it is responsible for participating in the maintenance of global coherency between second-level caches (e.g., 213a, . . . 213i, 223a, . . . 223i, and 283a, . . . 283i). In addition, it is responsible for tracking the status of data stored in cache memory 214 of node 210, with the status information stored as memory tags (M-TAGs). Address translator 217 is responsible for translating local physical addresses (LPAs) into global addresses (GAs) for outbound data accesses and GAs to LPAs for incoming data accesses.
In this implementation, the first time a node (e.g., node 210) accesses a particular page, address translator 217 is unable to provide a valid translation from a virtual address (VA) to an LPA for node 210, resulting in a software trap. A trap handler (not shown) of node 210 selects an unused page in cache memory 214 to hold data lines of the page. M-TAGs of directory 216 associated with the page are initialized to an "invalid" state, and address translator 217 is also initialized to provide translations to/from this page's local LPA from/to the unique GA which is used to refer to this page throughout the system 200.
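The first-access sequence described above may be sketched, in simplified form, as follows. The class ComaNode, the constant LINES_PER_PAGE, and the use of page numbers in place of full addresses are assumptions made for this sketch. The virtual-address step is not modeled, and the sketch does not model what happens when no unused page remains.

```python
LINES_PER_PAGE = 4  # assumption: kept small for illustration


class ComaNode:
    """Simplified model of page allocation in one COMA node."""

    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))  # unused local pages
        self.ga_to_lpa = {}  # translations for incoming accesses
        self.lpa_to_ga = {}  # translations for outbound accesses
        self.mtags = {}      # (local page, line index) -> state letter

    def translate(self, ga_page: int) -> int:
        """Return the local page for a global page, allocating on first use."""
        if ga_page not in self.ga_to_lpa:
            # No valid translation yet: stands in for the software trap.
            self._trap_allocate(ga_page)
        return self.ga_to_lpa[ga_page]

    def _trap_allocate(self, ga_page: int) -> None:
        if not self.free_pages:
            raise MemoryError("no unused page; replacement is not modeled")
        lpa_page = self.free_pages.pop(0)  # select an unused page
        for line in range(LINES_PER_PAGE):
            self.mtags[(lpa_page, line)] = "I"  # every line starts invalid
        self.ga_to_lpa[ga_page] = lpa_page
        self.lpa_to_ga[lpa_page] = ga_page
```

In this sketch a whole page is reserved on first access, while each line within the page keeps its own tag, which reflects the statement that cache lines within an allocated page are individually accessible.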
Although a COMA system is more efficient at caching larger data structures than a cache-coherent NUMA system, allocating entire pages of cache memory at a time in order to be able to accommodate large data structures is not a cost effective solution for all access patterns. This is because caching entire pages is inefficient when the data structures are sparse or when only a few elements of the structure are actually accessed.
In order to provide a better understanding of the operation and architecture of the global interface for both NUMA-type and COMA-type systems, a description of a conventional global interface will be provided with reference to FIG. 3. When structures of FIG. 1 are referred to, the reference also applies to the corresponding structures of FIG. 2. Each global interface (e.g., GI 115 of FIG. 1 or GI 215 of FIG. 2) includes a slave agent (SA), a request agent (RA), and a directory agent (DA). Examples of such agents are SA 315a, RA 315b, and DA 315c. Each DA is responsible for maintaining its associated directory.
The status of cached copies from nodes 110, 120, . . . and 180 is recorded in directories 116, 126, . . . and 186, respectively. As previously explained, each copy is identified as having one of four status conditions: shared (S), owned (O), modified (M) or invalid (I). A shared state indicates that there are other copies in other nodes, that no write-back is required upon replacement, and that only read operations can be made to the location. An owned state indicates that there may be other copies in other nodes, that a write-back is required upon replacement, and that only read operations can be made to the location. A modified state indicates that there are no shared copies in other nodes and that the location can be read from or written to without consequences elsewhere. An invalid state indicates that the copy in the location is now invalid and that the required data will have to be procured from a node having a valid copy.
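The four status conditions described above may be summarized in code as follows. The names LineState, can_read, can_write and needs_writeback are assumptions made for this sketch. The write-back entry for the modified state is likewise an assumption, since the text above states the write-back requirement only for the shared and owned states.

```python
from enum import Enum


class LineState(Enum):
    SHARED = "S"
    OWNED = "O"
    MODIFIED = "M"
    INVALID = "I"


# state -> (readable, writable, write-back required upon replacement)
_PROPERTIES = {
    LineState.SHARED:   (True,  False, False),
    LineState.OWNED:    (True,  False, True),
    # Assumption: a modified copy is the only valid one, so it is
    # treated here as requiring write-back.
    LineState.MODIFIED: (True,  True,  True),
    # An invalid copy must first be procured from a node with a valid copy.
    LineState.INVALID:  (False, False, False),
}


def can_read(state: LineState) -> bool:
    return _PROPERTIES[state][0]


def can_write(state: LineState) -> bool:
    return _PROPERTIES[state][1]


def needs_writeback(state: LineState) -> bool:
    return _PROPERTIES[state][2]
```

Because only four states are distinguished, each state fits in the two-bit tag mentioned earlier for the M-TAGs.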
An RA provides a node with a mechanism for sending read and write requests to the other subsystems. An SA is responsible for responding to requests from the DA of another node.
Requests for data and responses to those requests are exchanged by the respective agents between nodes 110, 120, . . . and 180 in the form of data/control packets, thereby enabling each node to keep track of the status of all data cached therein. The status information regarding cache lines in caches 113a . . . 113i, 123a . . . 123i, and 183a . . . 183i is stored in directories 116, 126, . . . and 186, respectively. The data/control packets are transmitted between nodes via the global interconnect 190. Transmissions of data/control packets are managed through a conventional networking protocol, such as the carrier sense multiple access (CSMA) protocol, under which nodes 110, 120, . . . and 180 are loosely coupled to one another at the network level of the protocol. Thus, while the end-to-end arrival of packets is guaranteed, arrival of packets in the proper order may not be. Cases of out-of-order packet arrival at nodes 110, 120, . . . and 180 may result in what are termed "corner cases". A corner case occurs when an earlier-issued but later-received request must be resolved before a later-issued but earlier-received request is resolved. If such a case is not detected and resolved in proper sequence, cache coherency may be disrupted.
Another problem related to the transmission of read and write requests is preventing system deadlock caused by more requests arriving at a node than the node can simultaneously process. Let us assume that any node acting in its capacity as a home node can process y number of home-directed requests simultaneously, and any node acting in its capacity as a slave node can process z number of slave-directed requests simultaneously. When y number of home-directed requests are being processed by a node, that node has reached its capacity for handling home-directed requests. Likewise, when z number of slave-directed requests are being processed by a node, that node has reached its capacity for handling slave-directed requests. In other words, that node cannot begin processing other like requests until at least one of those undergoing processing is complete. If a flow control protocol were implemented which signaled the system to stop issuing transaction requests due to a destination node having reached its request processing capacity, then the global interconnect may become so overloaded with protocol transmissions that the system may reach a state where it is incapable of making any further progress. Such a state is known as system deadlock. If no flow control were implemented, protocol errors would most likely result as requests were simply dropped.
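The capacity limits y and z discussed above may be modeled, for purposes of illustration only, as follows. The class NodeCapacity and its method names are assumptions made for this sketch. The sketch shows only the case in which no flow control is implemented, so that a request arriving at a node already at capacity is simply dropped.

```python
class NodeCapacity:
    """Tracks in-flight requests against the per-node limits y and z."""

    def __init__(self, y: int, z: int):
        self.capacity = {"home": y, "slave": z}
        self.in_flight = {"home": 0, "slave": 0}
        self.dropped = 0  # requests lost because the node was full

    def accept(self, kind: str) -> bool:
        """Start processing a request, or drop it if at capacity."""
        if self.in_flight[kind] < self.capacity[kind]:
            self.in_flight[kind] += 1
            return True
        self.dropped += 1  # no flow control: the request is lost
        return False

    def complete(self, kind: str) -> None:
        """Finish one request, freeing a slot for a like request."""
        if self.in_flight[kind] == 0:
            raise ValueError("no request of this kind is in flight")
        self.in_flight[kind] -= 1
```

For example, with y equal to 2, a third home-directed request is dropped unless one of the first two has completed. Home-directed and slave-directed capacities are tracked separately, so a node that is full in one capacity may still accept requests in the other.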
In order to manage the ongoing traffic of issued requests and responses to those requests in a parallel computing system in a manner that does not precipitate a condition of system deadlock caused by issuance of too many requests to a single node, system designers have heretofore relied on complex flow control protocols to manage transaction flow. Such a solution has several drawbacks. The first is the sheer complexity of designing a flawless transaction control system. The second is that a transaction control system requires overhead. Such overhead might be additional communication channels, additional memory dedicated to storing the control system software, and additional processor utilization to execute the control system software. In addition to adding to system overhead, implementation of a software-controlled traffic control system will invariably result in slower processing speeds as the system processes the traffic control parameters and implements the traffic control protocol.
What is needed is a more efficient way to manage read and write request traffic flow in a parallel computer system which does not require additional system operational overhead, and which will not impede information flow on the global interconnect.