Massively parallel processing (“MPP”) systems may have tens of thousands and even hundreds of thousands of nodes connected via a communications mechanism. Each node may include one or more processors (e.g., an AMD Opteron processor), memory (e.g., between 1-32 gigabytes), and a communications interface (e.g., HyperTransport technology) connected via a network interface controller (“NIC”) to a router with router ports. Each router may be connected via its router ports to some number of other routers, and through them to other nodes, to form a routing topology (e.g., a torus, hypercube, or fat tree) that is the primary system network interconnect. Each router may include routing tables specifying how to route incoming packets from a source node to a destination node. The nodes may be organized into modules (e.g., a board) with a certain number (e.g., 4) of nodes and routers each, and the modules may be organized into cabinets with multiple (e.g., 24) modules in each cabinet. Such systems may be considered scalable when an increase in the number of nodes results in a proportional increase in their computational capacity. An example network interconnect for an MPP system is described in Alverson, R., Roweth, D., and Kaplan, L., “The Gemini System Interconnect,” 2010 IEEE Annual Symposium on High Performance Interconnects, pp. 83-87, Mountain View, Calif., Aug. 18-20, 2010, which is hereby incorporated by reference.
The nodes of an MPP system may be designated as service nodes or compute nodes. Compute nodes are primarily used to perform computations. A service node may be dedicated to providing operating system and programming environment services (e.g., file system services, external Input/Output (“I/O”), compilation, editing, etc.) to application programs executing on the compute nodes and to users logged in to the service nodes. The operating system services may include I/O services (e.g., access to mass storage), processor allocation services, program launch services, login capabilities, and so on. The service nodes and compute nodes may employ different operating systems, each customized to support the processing performed by the node.
An MPP system may include a supervisory system comprising a hierarchy of controllers for monitoring components of the MPP system as described in U.S. Patent Application No. 2008/0134213, entitled “Event Notifications Relating to System Failures in Scalable Systems,” filed on Sep. 18, 2007, which is hereby incorporated by reference. At the lowest level of the hierarchy, the supervisory system may include a controller associated with each node, implemented as software that may execute on the node or on special-purpose controller hardware. At the next level of the hierarchy, the supervisory system may include a controller for each module, which may be implemented as software that executes on special-purpose controller hardware. At the next higher level of the hierarchy, the supervisory system may include a controller for each cabinet, which also may be implemented as software that executes on special-purpose controller hardware. The supervisory system may optionally include other levels of controllers for groups of cabinets. At the top of the hierarchy is a controller designated as the supervisory controller or system management workstation, which provides a view of the overall status of the components of the multiprocessor system. The hierarchy of controllers forms a tree, with the supervisory controller as the root and the controllers of the nodes as the leaf controllers. Each controller communicates with its parent and child controllers using a supervisory communication network that is independent of (or out of band from) the primary system network interconnect. For example, the supervisory communication network may be a high-speed Ethernet network.
The controllers monitor the status of the nodes, network interface controllers, and routers. A leaf controller (or node controller) may monitor the status of the hardware components of the node and the system services executing on the node. The next higher level controller (module controller or L0 controller) may monitor the status of the leaf controllers of the nodes of the module, power to the module, and so on. The next higher level controller (cabinet controller or L1 controller) may monitor the status of the next lower level controllers, power to the cabinet, cooling of the cabinet, and so on.
FIG. 1 is a block diagram that illustrates an example controller hierarchy of a supervisory system. The controller hierarchy 100 includes a root or supervisory controller 101. The supervisory controller (also referred to as a system management workstation) is the parent controller for cabinet controllers 104. A cabinet physically contains the modules. Each cabinet controller is a parent controller of the module controllers 105 within the cabinet. A module is a physical grouping of a number (e.g., four) of nodes. Each module controller is a parent controller of the node controllers 106 on a module. The lines between the controllers represent the logical communications paths between the controllers, which may be implemented as a supervisory communications network that is out of band from the primary system network interconnect (not shown in FIG. 1).
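For illustration only, the following is a minimal sketch in C of how such a controller hierarchy might be represented in software. The structure names and the fan-out constants (24 modules per cabinet, 4 nodes per module) are assumptions taken from the examples above, not a definitive implementation of any particular supervisory system.

/* Illustrative controller hierarchy: root -> cabinets -> modules -> nodes.
 * Fan-out constants mirror the examples above and are assumptions. */
#include <stdlib.h>

#define MODULES_PER_CABINET 24
#define NODES_PER_MODULE 4

enum controller_level { SUPERVISORY, CABINET, MODULE, NODE };

struct controller {
    enum controller_level level;
    int id;
    struct controller *parent;    /* NULL for the supervisory controller (root) */
    struct controller **children; /* NULL for node controllers (leaves) */
    int num_children;
};

static struct controller *make_controller(enum controller_level level, int id,
                                          struct controller *parent, int num_children)
{
    struct controller *c = calloc(1, sizeof *c);
    c->level = level;
    c->id = id;
    c->parent = parent;
    c->num_children = num_children;
    if (num_children > 0)
        c->children = calloc(num_children, sizeof *c->children);
    return c;
}

/* Build the tree: the supervisory controller is the root and the node
 * controllers are the leaves, matching the hierarchy of FIG. 1. */
struct controller *build_hierarchy(int num_cabinets)
{
    struct controller *root = make_controller(SUPERVISORY, 0, NULL, num_cabinets);
    for (int i = 0; i < num_cabinets; i++) {
        struct controller *cab = make_controller(CABINET, i, root, MODULES_PER_CABINET);
        root->children[i] = cab;
        for (int j = 0; j < MODULES_PER_CABINET; j++) {
            struct controller *mod = make_controller(MODULE, j, cab, NODES_PER_MODULE);
            cab->children[j] = mod;
            for (int k = 0; k < NODES_PER_MODULE; k++)
                mod->children[k] = make_controller(NODE, k, mod, 0);
        }
    }
    return root;
}

In such a representation, each controller would communicate with its parent and children over the out-of-band supervisory network, so that status observed at a leaf controller can be propagated up the tree to the supervisory controller.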
FIG. 2 is a block diagram that illustrates an example network interface and routing device of a network interconnect. A network device 200 includes two network interface controllers (“NICs”) 210 and 211. Each network interface controller is connected via a HyperTransport connection 220 or 221 to a node (not shown). The network interface controllers are connected to a router 230 via a netlink 260. The network device also includes a supervisory component 240 with a connection to a local controller 250. Packets from the network interface controllers are routed via the netlink to the router over a router input selected for load-balancing purposes. The router routes the packets to one of 40 network connections. Each packet may comprise a variable number of fixed-size flow control units, referred to as “flits.” Requests for services are generally sent as request packets, and each request packet generally has a corresponding response (or reply) packet.
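A minimal sketch of how a packet might be represented as a variable number of fixed-size flits appears below. The flit size, field names, and layout are assumptions for illustration; they are not the actual packet format of the device described above.

/* Illustrative packet/flit layout; the 16-byte flit size and the header
 * fields are assumptions, not the actual format. */
#include <stdint.h>

#define FLIT_BYTES 16              /* assumed fixed flit size */

enum packet_type { PKT_REQUEST, PKT_RESPONSE };

struct flit {
    uint8_t payload[FLIT_BYTES];
};

struct packet {
    enum packet_type type;         /* each request generally has a matching response */
    uint32_t source;               /* source node identifier */
    uint32_t destination;          /* destination node identifier */
    uint32_t num_flits;            /* packets comprise a variable number of flits */
    struct flit *flits;
};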
FIG. 3 is a block diagram that illustrates the connections of an example network device. The network device 300 includes 40 router ports 301 for connection to other routers in the network interconnect. The network device includes four links of four ports each in the x and z directions and two links of four ports each in the y direction, for a total of (4 + 4 + 2) × 4 = 40 ports.
FIG. 4 is a block diagram that illustrates the layout of an example router. The router 400 comprises 48 tiles arranged into a matrix of six rows and eight columns. The router provides 40 connections to the network and 8 connections to the network interface controllers via the netlink. Each tile 410 includes an input buffer 411, routing logic 412, a row bus 413, row buffers 414, an 8×6 switch 415, a column bus 416, output buffers 417, and an output multiplexor 418. Packets are received at a tile via the router port connected to the input buffer and are processed on a flit-by-flit basis by the routing logic. During each cycle of the tile, the routing logic takes the flit (if any) at the head of the input buffer and routes it via a line of the row bus to one of the row buffers of a tile in the same row. If that row buffer is full, the routing logic instead leaves the flit in the input buffer and retries during the next cycle. During each cycle, flits in the row buffers are routed via the 8×6 switch to an output buffer in a tile in the same column. Also during each cycle, the output logic sends a flit from an output buffer to the router port associated with that tile. The tiles of the routers and the network interface controllers are referred to as “network components.”
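The per-cycle movement of a flit from a tile's input buffer to a row buffer, including the backpressure behavior when the row buffer is full, can be sketched as follows. The FIFO implementation, the buffer depth, and the route_to_row selection function are hypothetical, and only the input-buffer-to-row-buffer step is shown.

/* Sketch of one cycle of a tile's routing logic. Buffer depth and route
 * selection are assumptions; a real tile would also drive the 8x6 switch
 * and the output multiplexor in the same cycle. */
#include <stddef.h>

#define BUF_DEPTH 4
#define ROW_TILES 8                           /* tiles per row */

struct flit { unsigned char bytes[16]; unsigned dest; };

/* Simple ring buffer standing in for the input and row buffers. */
struct fifo {
    struct flit slots[BUF_DEPTH];
    size_t head, count;
};

static int fifo_empty(const struct fifo *f) { return f->count == 0; }
static int fifo_full(const struct fifo *f) { return f->count == BUF_DEPTH; }
static struct flit fifo_peek(const struct fifo *f) { return f->slots[f->head]; }
static void fifo_pop(struct fifo *f) { f->head = (f->head + 1) % BUF_DEPTH; f->count--; }
static void fifo_push(struct fifo *f, struct flit fl)
{
    f->slots[(f->head + f->count) % BUF_DEPTH] = fl;
    f->count++;
}

struct tile {
    struct fifo input_buffer;
    struct fifo *row_buffers[ROW_TILES];      /* row buffers of tiles in this row */
};

/* Hypothetical stand-in for the routing-table lookup (sketched below). */
static int route_to_row(const struct flit *fl) { return fl->dest % ROW_TILES; }

/* One cycle: route the head flit over the row bus to a row buffer in the
 * same row, or leave it in the input buffer if that row buffer is full. */
void tile_cycle(struct tile *t)
{
    if (fifo_empty(&t->input_buffer))
        return;
    struct flit fl = fifo_peek(&t->input_buffer);
    int target = route_to_row(&fl);
    if (fifo_full(t->row_buffers[target]))
        return;                               /* backpressure: retry next cycle */
    fifo_pop(&t->input_buffer);
    fifo_push(t->row_buffers[target], fl);
}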
The routing logic of the tiles routes the flits based on a routing table for each of the tiles. Each routing table contains 32 entries, and each entry includes a match and a mask. The routing logic at an input port of a tile applies the match of each entry in sequence to each packet to find the first matching entry. The routing logic then routes the packet (on a flit-by-flit basis) to an output port identified by the mask of that matching entry. Other router architectures may have one or more routing tables per router and may not be tile-based. Each routing table may also have any number of entries (e.g., 64 or 128).
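The table lookup itself might be sketched as below under one plausible reading of the match/mask scheme: the match is compared against the packet's destination, and the mask identifies the eligible output port(s) as a bitmask. Both the matching rule (equality on the destination) and the meaning of the mask are assumptions for illustration.

/* Illustrative routing-table lookup; the matching rule and the use of the
 * mask as a bitmask of output ports are assumptions. */
#include <stdint.h>

#define TABLE_ENTRIES 32
#define NUM_PORTS 48               /* 40 network ports + 8 netlink ports above */

struct route_entry {
    uint32_t match;                /* destination to match */
    uint64_t mask;                 /* bitmask of eligible output ports */
};

struct routing_table {
    struct route_entry entries[TABLE_ENTRIES];
};

/* Scan the entries in sequence and return the port mask of the first entry
 * whose match equals the packet's destination; 0 if no entry matches. */
uint64_t lookup_output_ports(const struct routing_table *t, uint32_t dest)
{
    for (int i = 0; i < TABLE_ENTRIES; i++)
        if (t->entries[i].match == dest)
            return t->entries[i].mask;
    return 0;
}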
The routing tables of a network interconnect are typically initialized to avoid deadlocks and to ensure proper ordering of packets. A deadlock may occur, for example, when the routers along a cycle of routing paths each cannot send a flit because the buffers of the next router in the cycle are full, so that every router is waiting on another and none can make progress. There are well-known routing algorithms for avoiding deadlocks, such as that described in U.S. Pat. No. 5,533,198, entitled “Direction Order Priority Routing of Packets Between Nodes in a Networked System.” When routed through a network, certain types of packets need to have their order of delivery guaranteed. For example, a program may store data in a remote memory location and later load that data from that same remote memory location. To store the data, the processor executing the program sends a store request via the network to the remote memory location. To load the data, the processor sends a load request via the network to the remote memory location. If the requests were to travel on different routes through the network, it might be possible (e.g., depending on network congestion) for the load request to arrive at the remote memory location before the store request. In such a case, the load request would load the old value from the remote memory location. Networks employ various techniques to ensure that “ordered packets” are received in the same order as they were sent. For example, a network may ensure that ordered packets all travel through the same route. Unordered packets, in contrast, do not depend on their ordering for proper functioning. For example, two load requests to the same memory location will function properly regardless of which is received first (assuming no intervening store request).
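One common way to keep ordered packets on a single route, consistent with the same-route technique mentioned above, is to select the route deterministically from fields that are identical for every packet of an ordered stream, such as the source and destination. The sketch below shows one such deterministic selection; the particular hash is an arbitrary illustrative choice.

/* Deterministic route selection: all ordered packets between the same
 * source and destination hash to the same route index and therefore take
 * the same path. Unordered packets could instead be spread across routes
 * (e.g., adaptively by congestion). num_routes must be nonzero. */
#include <stdint.h>

uint32_t select_route(uint32_t source, uint32_t destination, uint32_t num_routes)
{
    uint64_t h = (uint64_t)source * 2654435761u ^ destination;
    h ^= h >> 16;
    return (uint32_t)(h % num_routes);
}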
Links of a network can fail for various reasons. For example, a link may simply break or become disconnected at one end, or the router to which a link is connected may lose power. Whenever a link fails, the network is no longer fully connected, and ordered packets may not be able to travel on the same route. Various techniques have been used to recover from failed links. One technique terminates all jobs executing on the nodes, restarts the system with new routes that avoid the failed links, and then restarts the terminated jobs, which may continue from a checkpoint. Another technique provides redundant links and, when a link fails, routes packets onto a redundant link. If the redundant link also fails, however, then another approach, such as restarting the system, is needed.
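As an illustration of computing new routes that avoid failed links, the sketch below runs a breadth-first search over the router connectivity graph, skipping links marked as failed. It is a sketch only: a real reroute must also preserve the deadlock avoidance and ordering guarantees described above, which this search ignores.

/* Find the next hop from src toward dst over working links only.
 * adjacency[i][j] is 1 if a working link connects routers i and j, and 0
 * if the link is absent or has failed. Returns -1 if dst is unreachable
 * (i.e., the failures have partitioned the network). */
#include <string.h>

#define MAX_ROUTERS 64

int next_hop_avoiding_failures(const unsigned char adjacency[MAX_ROUTERS][MAX_ROUTERS],
                               int num_routers, int src, int dst)
{
    int prev[MAX_ROUTERS];
    int queue[MAX_ROUTERS], head = 0, tail = 0;

    memset(prev, -1, sizeof prev);         /* -1 marks unvisited routers */
    prev[src] = src;
    queue[tail++] = src;
    while (head < tail) {
        int u = queue[head++];
        if (u == dst) {
            while (prev[u] != src)         /* walk back to the hop after src */
                u = prev[u];
            return u;
        }
        for (int v = 0; v < num_routers; v++)
            if (adjacency[u][v] && prev[v] < 0) {
                prev[v] = u;
                queue[tail++] = v;
            }
    }
    return -1;                             /* network partitioned */
}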