Massively parallel processing (“MPP”) systems may have tens of thousands of nodes connected via a communications mechanism. Each node may include a processor, a memory, and a communications interface to a network interconnect. The nodes of an MPP system may be designated as service nodes or compute nodes. Compute nodes are primarily used to perform computations. A service node may be dedicated to providing operating system and programming environment services (e.g., file systems, external I/O, compilation, editing, etc.) to applications that execute on the compute nodes and to users logged in to the service nodes. The operating system services may include I/O services (e.g., access to mass storage), processor allocation services, login capabilities, and so on. The service nodes and compute nodes may employ different operating systems that are customized to support the processing performed by the node. Because MPP systems have thousands of nodes with many different possible points of failure, failures are likely to occur regularly. The monitoring for and reporting of these failures can be provided by an event notification system as described in U.S. Pat. No. 7,984,453, entitled “Event Notifications Relating to System Failures in Scalable Systems,” which is hereby incorporated by reference.
An application may execute in parallel, with instances of the application executing on thousands of compute nodes. To execute such an application, a service node may allocate the needed compute nodes and then distribute the application to each allocated compute node, which proceeds to execute one or more instances of the application. The compute nodes may establish connections with other compute nodes so that control information can be sent between the compute nodes. An application level placement scheduler system (“ALPS system”) provides such allocation, distribution, and connection establishment. Aspects of the ALPS system are described in U.S. Patent Publication No. 2010-0121904, entitled “Resource Reservations in a Multiprocessor Computing Environment,” which is hereby incorporated by reference.
The ALPS system employs a tree structure organization for the connections between the compute nodes for sending control information. The compute nodes and the connections thus form a control tree with the compute nodes having a parent-child relationship. The control tree has a fan-out number (e.g., 32) that represents the maximum number of child nodes of a parent node. The fan-out number may be configurable. After the ALPS system allocates the compute nodes, it passes a placement list that identifies those compute nodes to one of the allocated compute nodes that is designated as the root node of the control tree. The ALPS system communicates with the allocated compute nodes through the compute node that is the root node of the control tree. The root node identifies nodes from the placement list to be its child nodes. The root node establishes connections with its child nodes and subscribes to receive notifications of failures relating to its child nodes from the event notification system. The root node provides the placement list to each of its child nodes so that each child node can, acting as a parent node, establish connections with other nodes in the placement list as its own child nodes. Each node may employ an algorithm to uniquely identify its child nodes. The application is also loaded by each of the nodes in the control tree.
FIG. 1 illustrates an example organization of a control tree. The ALPS system 150 executing on a service node establishes a connection with control tree 100 via the root node 101. Control tree 100 includes nodes represented as circles and connections represented as lines between the circles. In this example, the fan-out number is 4 and the number of allocated nodes is 256. The root node 101 receives from the ALPS system an indication of the application and a placement list identifying the 256 nodes including the root node.
The root node selects four nodes from the 255 remaining nodes (i.e., nodes other than the root node) to be child nodes 102-105, establishes a connection with each child node, and passes the placement list to each child node. Each child node then repeats a similar process for establishing connections with its own child nodes. The sub-tree of each child node is assigned a block of compute nodes that includes the child node itself and the compute nodes that are to be descendant nodes of that child node. Child node 102 is assigned 64 compute nodes including child node 102 itself; child node 103 is assigned 64 compute nodes including child node 103 itself; child node 104 is assigned 64 compute nodes including child node 104 itself; and child node 105 is assigned 63 compute nodes including child node 105 itself. Each child node 102-105 uses a selection algorithm to select four nodes from its block to be its child nodes, establishes a connection with each of its child nodes, and passes the placement list to its child nodes. This process is repeated at each child node until a child node, referred to as a leaf node, is the only node in its block.
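The block assignment in this example can be illustrated with a minimal sketch. The selection algorithm itself is not specified above, so the sketch assumes one possible scheme: the nodes remaining after the head of a block are split into contiguous, near-equal sub-blocks, and the first node of each sub-block becomes a child node. The function names `split_blocks`, `build_control_tree`, and `subtree_size` are hypothetical and are not part of the ALPS system.

```python
from typing import Dict, List


def split_blocks(nodes: List[int], fan_out: int) -> List[List[int]]:
    """Split nodes into at most fan_out contiguous blocks of near-equal size."""
    count = min(fan_out, len(nodes))
    if count == 0:
        return []
    base, extra = divmod(len(nodes), count)
    blocks, start = [], 0
    for i in range(count):
        # The first `extra` blocks each take one additional node.
        size = base + (1 if i < extra else 0)
        blocks.append(nodes[start:start + size])
        start += size
    return blocks


def build_control_tree(placement_list: List[int], fan_out: int) -> Dict[int, List[int]]:
    """Map each node to its child nodes (assumed scheme, not the ALPS algorithm)."""
    children: Dict[int, List[int]] = {}
    pending = [placement_list] if placement_list else []
    while pending:
        block = pending.pop()
        head, rest = block[0], block[1:]  # the head of a block is its sub-tree root
        sub_blocks = split_blocks(rest, fan_out)
        children[head] = [sub[0] for sub in sub_blocks]
        pending.extend(sub_blocks)
    return children


def subtree_size(children: Dict[int, List[int]], node: int) -> int:
    """Count a node plus all of its descendant nodes."""
    return 1 + sum(subtree_size(children, c) for c in children[node])


if __name__ == "__main__":
    tree = build_control_tree(list(range(256)), fan_out=4)
    # Sizes of the blocks assigned to the four child nodes of the root node.
    print([subtree_size(tree, c) for c in tree[0]])  # prints [64, 64, 64, 63]
```

Under these assumptions, 256 allocated nodes and a fan-out number of 4 yield blocks of 64, 64, 64, and 63 nodes for the four child nodes of the root node, matching the example, and every node whose block contains only itself becomes a leaf node.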
The ALPS system and the nodes in the control tree subscribe to the event notification system to receive failure messages relating to the nodes. Each node may subscribe to receive events relating to its parent node and its child nodes. In addition, the ALPS system and the nodes may themselves generate failure messages for failures that the event notification system may not detect. When a failure is detected, the ALPS system directs the termination of the application. Each of the nodes is directed to stop executing the application and close its connections. The ALPS system then deallocates the nodes and other resources that were allocated to the application. Depending on the sophistication of the application, the application may need to have its execution restarted from the beginning or from a checkpoint. In either case, however, the ALPS system would need to again allocate nodes for the application and build a new control tree from scratch.
As the number of nodes allocated to an application increases, the chance of a failure increases and the chance of the application completing without receiving a failure message decreases. As a result, significant computing resources may be expended terminating applications, allocating nodes, building new control trees, and restarting terminated applications.
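The scaling effect can be illustrated with a simple model. The sketch below assumes that nodes fail independently and that each node has the same probability of failing during one run of the application. Neither assumption is stated above, and the per-node probability used in the usage lines is a hypothetical value chosen only for illustration.

```python
def prob_no_failure(num_nodes: int, per_node_failure_prob: float) -> float:
    """Probability that no allocated node fails, assuming independent failures."""
    return (1.0 - per_node_failure_prob) ** num_nodes


if __name__ == "__main__":
    # Hypothetical per-node failure probability for a single application run.
    p = 1e-4
    for n in (256, 4096, 65536):
        print(n, prob_no_failure(n, p))
```

Under this model, the probability of completing without a failure falls geometrically as the number of allocated nodes grows, so an application spanning tens of thousands of nodes may rarely complete without receiving a failure message.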