In one conventional high-performance computing arrangement, a user is allocated multiple nodes in a network to perform a computing task. The nodes execute application threads involved in the task, and coordinate the execution of the threads by passing network messages among themselves. Given the complexity of this computing arrangement, it is possible for the network, nodes, and/or application threads to fail. Accordingly, software agents are provided in the nodes in the network to periodically request the storing of information related to the nodes' internal states in order to facilitate recovery in the event of such failure.
In this conventional arrangement, the state information obtained by the software agents may be of limited utility unless, when the state information is obtained, the network and applications are in a consistent, quiescent state. Unfortunately, this requirement may delay the obtaining of such information until after the network and applications have entered a consistent, quiescent state.
Additionally, in this conventional arrangement, the state information may be stored in network-attached hard disk storage that is remote from one or more of the nodes. The access time associated with such storage may be higher than is desirable, and therefore may increase the time involved in storing and/or retrieving such information beyond that which is desirable.
Also, after a failure-related restart of an application, the assignment of one or more computing tasks to one or more nodes in the network may have changed relative to that which may have prevailed prior to the restart. In this conventional arrangement, determining and compensating for these differences may involve global communications to reconfigure the network, and also may involve intervention of application-level software.
Furthermore, in this conventional arrangement, an application may issue a command intended for another node by indicating a logical address for that node. Software processes translate the logical address into a physical address to be used to communicate with the node. The software processes may perform the translation based upon rigidly predetermined algorithmic address assignments and/or one or more lookup tables in which all possible logical and physical addresses are stored, in full, in memory. Unfortunately, this may consume undesirably large amounts of host processor bandwidth and/or memory space, and may result in address translation operations being carried out more slowly than is desired. It may also introduce undesirable inflexibility into address assignments, and may make node-to-computing-task re-assignment (e.g., of the type discussed above) more difficult than is desired.
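By way of illustration only, the conventional software translation described above might be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from any particular system: the function names `build_full_table`, `translate`, and `reassign`, and the use of small integers as addresses, are hypothetical and chosen solely for the example.

```python
from typing import Dict, List


def build_full_table(physical_addresses: List[int]) -> Dict[int, int]:
    """Hypothetical full lookup table: one stored entry per logical address.

    Logical addresses are assumed to be 0..N-1, so memory use grows
    linearly with the number of nodes.
    """
    return {logical: physical
            for logical, physical in enumerate(physical_addresses)}


def translate(table: Dict[int, int], logical: int) -> int:
    """Software translation performed by the host for each command."""
    if logical not in table:
        raise ValueError(f"no physical address for logical address {logical}")
    return table[logical]


def reassign(table: Dict[int, int], logical: int, new_physical: int) -> None:
    """Moving a task to a different node means rewriting its table entry."""
    table[logical] = new_physical


# Usage: three nodes with assumed physical addresses.
table = build_full_table([0x1A, 0x2B, 0x3C])
before = translate(table, 1)   # physical address prior to re-assignment
reassign(table, 1, 0x4D)       # e.g., after a failure-related restart
after = translate(table, 1)    # physical address after re-assignment
```

In such a sketch, each translation is a host software operation and the table holds every logical-to-physical pairing in full, which is consistent with the processor bandwidth and memory costs noted above. If each node were to hold its own copy of the table, a re-assignment may need to be propagated to every copy.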
Although the following Detailed Description will proceed with reference being made to illustrative embodiments, many alternatives, modifications, and variations thereof will be apparent to those skilled in the art. Accordingly, it is intended that the claimed subject matter be viewed broadly.