Massively parallel processing (“MPP”) systems may have tens of thousands of nodes connected via a communications mechanism. Each node may include one or more processors (e.g., an AMD Opteron processor), memory (e.g., between 1 and 8 gigabytes), and a communications interface (e.g., HyperTransport technology) connected via a network interface controller (“NIC”) to a router with router ports. Each router may be connected via its router ports to some number of other routers and then to other nodes to form a routing topology (e.g., torus, hypercube, or fat tree) that is the primary system network interconnect. Each router may include routing tables specifying how to route incoming packets from a source node to a destination node. The nodes may be organized into modules (e.g., a board) with a certain number (e.g., 4) of nodes and routers each, and the modules may be organized into cabinets with multiple (e.g., 24) modules in each cabinet. Such systems may be considered scalable when an increase in the number of nodes results in a proportional increase in their computational capacity. An example network interconnect for an MPP system is described in Alverson, R., Roweth, D., and Kaplan, L., “The Gemini System Interconnect,” 2010 IEEE Annual Symposium on High Performance Interconnects, pp. 83-87, Mountain View, Calif., Aug. 18-20, 2010, which is hereby incorporated by reference.
The nodes of an MPP system may be designated as service nodes or compute nodes. Compute nodes are primarily used to perform computations. A service node may be dedicated to providing operating system and programming environment services (e.g., file system services, external I/O, compilation, and editing) to application programs executing on the compute nodes and to users logged in to the service nodes. The operating system services may include I/O services (e.g., access to mass storage), processor allocation services, login capabilities, and so on. The service nodes and compute nodes may employ different operating systems that are customized to support the processing performed by the node.
An MPP system may include a supervisory system comprising a hierarchy of controllers for monitoring components of the MPP system as described in U.S. Patent Application No. 2008/0134213, entitled “Event Notifications Relating to System Failures in Scalable Systems,” filed on Sep. 18, 2007, which is hereby incorporated by reference. At the lowest level of the hierarchy, the supervisory system may include a controller associated with each node that is implemented as software that may execute on the node or on special-purpose controller hardware. At the next higher level of the hierarchy, the supervisory system may include a controller for each module that may be implemented as software that executes on special-purpose controller hardware. At the next higher level of the hierarchy, the supervisory system may include a controller for each cabinet that also may be implemented in software that executes on special-purpose controller hardware. The supervisory system may then include other levels of controllers for groups of cabinets referred to as slices, groups of slices referred to as sections, and so on. At the top of the hierarchy is a controller designated as the supervisory controller or system management workstation, which provides a view of the overall status of the components of the multiprocessor system. The hierarchy of controllers forms a tree organization with the supervisory controller being the root and the controllers of the nodes being the leaf controllers. Each controller communicates with its parent and child controllers using a supervisory communication network that is independent of (or out of band from) the primary system network interconnect. For example, the supervisory communication network may be a high-speed Ethernet network.
The controllers monitor the status of the nodes, network interface controllers, and routers. A leaf controller (or node controller) may monitor the status of the hardware components of the node and the system services executing on the node. The next higher level controller (module controller or L0 controller) may monitor the status of the leaf controllers of the nodes of the module, power to the module, and so on. The next higher level controller (cabinet controller or L1 controller) may monitor the status of the next lower level controllers, power to the cabinet, cooling of the cabinet, and so on.
FIG. 1 is a block diagram that illustrates an example controller hierarchy of a supervisory system. The controller hierarchy 100 includes a root or supervisory controller 101. The supervisory controller is the parent controller for the section controllers 102. A section is a grouping of slices. Each section controller is a parent controller of slice controllers 103. A slice is a grouping of cabinets. Each slice controller is a parent controller of cabinet controllers 104. A cabinet physically contains the modules. Each cabinet controller is a parent controller of module controllers 105 within the cabinet. A module is a physical grouping of a number (e.g., four) of nodes. Each module controller is a parent controller of node controllers 106 on a module. The lines between the controllers represent the logical communications path between the controllers, which may be implemented as a supervisory communications network that is out of band from the primary system network interconnect, which is not shown in FIG. 1.
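By way of illustration only, the tree organization of FIG. 1 may be sketched in Python as follows. The class name `Controller`, the helper `build_hierarchy`, the controller names, and the section, slice, and cabinet counts are hypothetical and chosen for the example; only the level ordering, the 24 modules per cabinet, and the four nodes per module come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Level names follow the hierarchy of FIG. 1, root first.
LEVELS = ["supervisory", "section", "slice", "cabinet", "module", "node"]


@dataclass
class Controller:
    level: str
    name: str
    # Excluded from repr/compare so the parent<->child cycle does not recurse.
    parent: Optional["Controller"] = field(default=None, repr=False, compare=False)
    children: List["Controller"] = field(default_factory=list)

    def add_child(self, name: str) -> "Controller":
        """Create a controller one level below this one and link it in."""
        child_level = LEVELS[LEVELS.index(self.level) + 1]
        child = Controller(child_level, name, parent=self)
        self.children.append(child)
        return child

    def path_to_root(self) -> List[str]:
        """Levels a status report passes through on its way up to the root."""
        node, path = self, []
        while node is not None:
            path.append(node.level)
            node = node.parent
        return path


def build_hierarchy(fanout):
    """Build the tree; fanout[i] is the child count of each controller at depth i."""
    root = Controller("supervisory", "smw")  # hypothetical name
    frontier = [root]
    for count in fanout:
        frontier = [c.add_child(f"{c.name}/{i}") for c in frontier for i in range(count)]
    return root, frontier  # frontier holds the leaf (node) controllers


# Hypothetical sizing: 1 section, 2 slices per section, 2 cabinets per slice,
# then 24 modules per cabinet and 4 nodes per module as in the examples above.
root, leaves = build_hierarchy([1, 2, 2, 24, 4])
print(len(leaves))                # 384
print(leaves[0].path_to_root())
```

In this sketch, the parent and child links stand in for the logical communications paths of FIG. 1; the out-of-band supervisory communications network that would carry the messages is not modeled.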
FIG. 2 is a block diagram that illustrates an example network interface and routing device of a network interconnect. A network device 200 includes two network interface controllers (“NICs”) 210 and 211. Each network interface controller is connected via a HyperTransport connection 220 or a HyperTransport connection 221 to a node (not shown). The network interface controllers are connected to a router 230 via a netlink 260. The network device also includes a supervisory component 240 with a connection to a local controller 250. The packets from the network interface controllers are routed via the netlink to the router over a router input selected for load balancing purposes. The router routes the packets to one of 40 network connections. Each packet may comprise a variable number of fixed-sized flow control units, referred to as “flits.”
FIG. 3 is a block diagram that illustrates the connections of an example network device. The network device 300 includes 40 router ports 301 for connection to other routers in the network interconnect. The router ports are grouped into links of four ports each: the network device includes four links in each of the x and z directions and two links in the y direction.
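The port count may be checked with a short calculation. The sketch below assumes that the four links apply to each of the x and z directions separately, which is the reading under which the links account for all 40 router ports; the constant and function names are hypothetical.

```python
PORTS_PER_LINK = 4

# Assumed reading: four links in x, four links in z, two links in y.
LINKS_PER_DIRECTION = {"x": 4, "y": 2, "z": 4}


def ports_per_direction():
    """Router ports available in each direction of the topology."""
    return {d: n * PORTS_PER_LINK for d, n in LINKS_PER_DIRECTION.items()}


def total_router_ports():
    """Sum of router ports over all directions."""
    return sum(ports_per_direction().values())


print(ports_per_direction())  # {'x': 16, 'y': 8, 'z': 16}
print(total_router_ports())   # 40
```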
FIG. 4 is a block diagram that illustrates the layout of an example router. The router 400 comprises 48 tiles arranged into a matrix of six rows and eight columns. The router provides 40 connections to the network and eight connections to the network interface controllers via the netlink. Each tile 410 includes an input buffer 411, routing logic 412, a row bus 413, row buffers 414, an 8×6 switch 415, a column bus 416, output buffers 417, and an output multiplexor 418. The packets are received at a tile via the router port connected to the input buffer and processed on a flit-by-flit basis by the routing logic. During each cycle of the tile, the routing logic retrieves a flit (if available) from the input buffer and routes the flit via a line of the row bus to one of the row buffers of a tile in the same row. If that row buffer is full, then the routing logic leaves the flit in the input buffer and repeats the process during the next cycle. At each cycle, flits in the row buffers are routed via the 8×6 switch to an output buffer in a tile in the same column. During each cycle, the output logic sends a flit from an output buffer to the router port associated with that tile. The tiles of the routers and the network interface controllers are referred to as “network components.”
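The per-cycle behavior of the tiles may be sketched, for illustration only, as a simplified software model. The model below is not the hardware design: the buffer capacities, the single output buffer per tile, the assumption of one row buffer per tile of the same row, the stall when an output buffer is full, and all names (`Flit`, `Tile`, `Router`, `cycle`) are assumptions made for the example.

```python
from collections import deque, namedtuple

ROWS, COLS = 6, 8     # 48 tiles, as described above
ROW_BUF_CAP = 2       # hypothetical capacity; none is given in the description
OUT_BUF_CAP = 2       # hypothetical capacity; none is given in the description

# A flit tagged with the tile whose router port it should leave through.
Flit = namedtuple("Flit", "dest_row dest_col payload")


class Tile:
    def __init__(self):
        self.input_buffer = deque()
        # Assumed: one row buffer per tile of the same row (eight switch inputs).
        self.row_buffers = [deque() for _ in range(COLS)]
        self.output_buffer = deque()
        self.sent = []    # flits that left through this tile's router port
        self.stalls = 0   # cycles in which the routing logic could not move a flit


class Router:
    def __init__(self):
        self.tiles = [[Tile() for _ in range(COLS)] for _ in range(ROWS)]

    def cycle(self):
        """Advance each flit by at most one stage; later stages run first."""
        tiles = self.tiles
        # Output stage: each tile sends one flit to its router port.
        for row in tiles:
            for tile in row:
                if tile.output_buffer:
                    tile.sent.append(tile.output_buffer.popleft())
        # Switch stage: row buffers feed output buffers in the same column.
        for r in range(ROWS):
            for c in range(COLS):
                for rb in tiles[r][c].row_buffers:
                    if rb:
                        out = tiles[rb[0].dest_row][c].output_buffer
                        if len(out) < OUT_BUF_CAP:
                            out.append(rb.popleft())
        # Routing stage: input buffers feed row buffers in the same row.
        for r in range(ROWS):
            for c in range(COLS):
                tile = tiles[r][c]
                if not tile.input_buffer:
                    continue
                flit = tile.input_buffer[0]
                target = tiles[r][flit.dest_col].row_buffers[c]
                if len(target) < ROW_BUF_CAP:
                    target.append(tile.input_buffer.popleft())
                else:
                    tile.stalls += 1  # flit stays in the input buffer; retry next cycle


# A single flit entering at tile (0, 0) and leaving at tile (5, 7).
router = Router()
router.tiles[0][0].input_buffer.append(Flit(5, 7, "a"))
for _ in range(3):
    router.cycle()
print(router.tiles[5][7].sent)
```

In this model an uncontended flit needs three cycles to cross the router: one to reach a row buffer in the destination column, one to reach the output buffer in the destination row, and one to leave through the router port. When several input tiles target the same output tile, the output buffer and then the row buffers fill, and the stall counters of the input tiles begin to increase.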
Depending on the characteristics of the jobs executing on the compute nodes, the network interconnect may not be able to transmit requests from an originating node to a destination node and receive a corresponding response in a timely manner. For example, if many nodes (e.g., 999 nodes in a 1,000-node network) executing an execution thread of the job rapidly send requests to a single destination node also executing an execution thread of the job, then the buffers of the tiles that lead to the destination node may become full. If the buffers are full, then the routing logic of the tiles will spend cycles waiting for space to become available in the buffers. If the network interconnect cannot deliver packets in a timely manner to even a single node, the speed at which all the jobs execute on the nodes of the network can be negatively impacted.
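The many-to-one pattern in the example above may be illustrated with a deliberately coarse model in which each destination node has a single bounded buffer that drains one request per cycle. The function `simulate`, the buffer capacity, the cycle count, and the two traffic patterns are assumptions made for the sketch; the model ignores routing, intermediate tiles, and responses, so it shows only how stall cycles accumulate at the senders.

```python
from collections import deque


def simulate(num_nodes, dest_of, cycles, buffer_cap=4):
    """Return the total number of sender stall cycles.

    dest_of(src, cycle) gives the destination node of the request that
    node `src` tries to send in the given cycle.
    """
    buffers = [deque() for _ in range(num_nodes)]
    stalls = 0
    for cycle in range(cycles):
        # Each destination consumes one buffered request per cycle.
        for buf in buffers:
            if buf:
                buf.popleft()
        # Each source tries to inject one request per cycle.
        for src in range(num_nodes):
            dst = dest_of(src, cycle)
            if dst == src:
                continue  # a node does not send requests to itself
            if len(buffers[dst]) < buffer_cap:
                buffers[dst].append(src)
            else:
                stalls += 1  # buffer full: the sender waits and retries
    return stalls


N = 1000

# Hot spot: the other 999 nodes all target node 0.
hot_spot = simulate(N, lambda src, cycle: 0, cycles=10)

# Shifted traffic: in every cycle each destination receives exactly one request.
shifted = simulate(N, lambda src, cycle: (src + 1 + cycle % (N - 1)) % N, cycles=10)

print(hot_spot, shifted)
```

Under these assumptions the shifted pattern never fills a buffer, whereas the hot-spot pattern stalls nearly every sender in every cycle, because the single destination can accept only one request per cycle.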