Computing technology has advanced at a remarkable pace, with each subsequent generation of computing system increasing in performance, functionality, and storage capacity, often at reduced cost. However, despite these advances, many scientific and business applications still demand massive computing power, which can only be met by extremely high performance computing systems. One particular type of computing system architecture that is often used in high performance applications is a parallel processing computing system.
Generally, a parallel processing computing system comprises a plurality of homogeneous computing nodes and is configured with a distributed application. Some parallel processing computing systems may have hundreds or thousands of individual computing nodes. Each computing node is generally of modest computing power and typically includes one or more single-core processing units, or computing cores. As such, each computing node may be a computing system configured with an operating system and the distributed application. The distributed application provides work for each computing node and is operable to control the workload of the parallel processing computing system. Generally speaking, the distributed application provides the parallel processing computing system with a workload that can be divided into a plurality of jobs. Typically, each computing node, or each computing core, is configured to process one job and therefore perform a specific function. Thus, the parallel processing architecture enables the parallel processing computing system to receive a workload and configure the computing nodes to cooperatively perform one or more jobs such that the workload supplied by the distributed application is processed substantially in parallel. Some parallel processing computing systems are based on the BlueGene computing system architecture developed by International Business Machines (“IBM”) of Armonk, N.Y., which is well known in the art.
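The division of a workload into jobs described above can be illustrated with a minimal sketch. It assumes the workload is a simple list of numbers and uses a thread pool to stand in for the computing nodes; the names `split_workload`, `process_job`, and `run_workload` are hypothetical and are not taken from any particular system.

```python
from concurrent.futures import ThreadPoolExecutor


def split_workload(workload, num_jobs):
    """Divide a workload (a list of items) into num_jobs roughly equal jobs."""
    size, extra = divmod(len(workload), num_jobs)
    jobs, start = [], 0
    for i in range(num_jobs):
        # The first `extra` jobs each take one additional item.
        end = start + size + (1 if i < extra else 0)
        jobs.append(workload[start:end])
        start = end
    return jobs


def process_job(job):
    """Hypothetical per-node function: a sum of squares over one job."""
    return sum(x * x for x in job)


def run_workload(workload, num_nodes):
    """Process one job per simulated node and combine the partial results."""
    jobs = split_workload(workload, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(process_job, jobs))
    return sum(partials)
```

For example, `run_workload(list(range(10)), 4)` splits ten items into four jobs and returns the same total as a serial sum of squares. A real system would distribute jobs across separate nodes rather than threads in one process.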
Parallel processing computing systems have found application in numerous different computing scenarios, particularly those requiring high performance. For instance, airlines rely on parallel processing to process customer information, forecast demand, and decide what fares to charge. The medical community uses parallel processing computing systems to analyze magnetic resonance images and to study models of bone implant systems. As such, parallel processing computing systems typically perform most efficiently on work that contains several computations that can be performed at once, as opposed to work that must be performed serially. The overall performance of the parallel processing computing system is increased because multiple computing cores can handle a larger number of tasks in parallel than could a single computing system. Other advantages of some parallel processing computing systems include their scalable nature and their modular nature.
Conventional parallel processing computing systems are generally used to process work that often requires long runtimes. However, as the size of the parallel processing computing system increases, the mean time between failures for that parallel processing computing system typically decreases faster than the time required to process the work decreases. More simply put, as more components are added to the parallel processing computing system, there is generally less time between failures of components even though the runtime for work decreases. Thus, when a parallel processing computing system reaches a large size, the average runtime for work often exceeds the mean time between failures for that parallel processing computing system. As such, work with long runtimes often fails to complete. One solution to this problem generally includes periodically checkpointing the parallel processing computing system such that the work may be restarted and continued from a known point. Checkpointing generally includes bringing the parallel processing computing system to a known state, saving that state, then resuming normal operations. Thus, time, money, and effort are typically expended on checkpointing that could otherwise be used for processing work.
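The checkpoint and restart cycle described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not a description of any particular system: the "work" is a running sum, the known state is a small JSON file, and `SimulatedFailure`, `save_checkpoint`, `load_checkpoint`, and `run` are hypothetical names.

```python
import json
import os


class SimulatedFailure(Exception):
    """Stands in for a component failure during a long run."""


def save_checkpoint(path, state):
    # Write to a temporary file first, then rename it into place, so a
    # failure during the write cannot leave a half-written checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # Resume from the last saved state, or start fresh if none exists.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def run(path, num_steps, checkpoint_every, fail_at=None):
    """Sum the step numbers 0..num_steps-1, checkpointing periodically."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], num_steps):
        if step == fail_at:
            raise SimulatedFailure(f"failure at step {step}")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If a run of 100 steps with a checkpoint every 10 steps fails at step 57, calling `run` again with the same path resumes from step 50 instead of step 0. Steps 50 through 56 are repeated, which is the cost of restarting from the last known point, and every checkpoint write takes time away from processing work.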
However, conventional parallel processing computing systems are often large and expensive to implement, often using tens of thousands of homogeneous nodes that are typically configured to perform only one task. One recent improvement has been to use hybrid computing nodes to implement hybrid architecture parallel processing computing systems. In hybrid architecture parallel processing computing systems, the hybrid computing nodes typically include a combination of a host element and at least one accelerator element. Each host element typically includes at least one multithreaded processor and manages at least one accelerator element, while each accelerator element typically includes at least one multi-element processor to perform work. In many cases, each hybrid node includes a host element and multiple accelerator elements of a different processing architecture, which are specifically designed or optimized to handle specific problems or tasks. As such, the hybrid nodes of hybrid architecture parallel processing computing systems are typically able to process many tasks at once, thus processing work faster and more efficiently than the homogeneous nodes of conventional parallel processing computing systems. Therefore, hybrid architecture parallel processing computing systems typically provide many times the raw processing power of conventional parallel processing computing systems with fewer processors, less space, less heat, and lower overall cost.
However, checkpointing hybrid architecture parallel processing computing systems is often more complex than checkpointing conventional parallel processing systems. For example, a computing node of a conventional parallel processing system is often configured to perform one task, while a hybrid node of a hybrid architecture parallel processing system may be configured to perform multiple tasks simultaneously, including at least one task per core or thread of the multithreaded and/or multi-element processors of that hybrid node. Conventional applications for hybrid architecture parallel processing systems, however, may be unaware of the multithreaded and/or multi-element processors, and/or threads and elements thereof, and thus be unaware of where, exactly, a task is being processed. For example, one or more of the multithreaded and/or multi-element processors of a hybrid node may be configured as “shallow” processing units that simply execute simplified instruction streams, or “computation kernels,” without the aid of extraneous software. Shallow processing units are typically not configured with conventional operating systems or applications, and thus the conventional applications do not have complete control over the shallow processing units. The shallow processing units are typically controlled by at least one control unit which typically configures the shallow processing units with the computation kernels and manages the shallow processing units. For example, the at least one control unit may use shallow processing units to execute multiple instructions on a single data value, or otherwise perform generalized calculations, functions, or executions in a parallel manner.
As such, a hybrid node of the hybrid architecture parallel processing system configured with shallow processing units may execute individual computation kernels faster, as there is typically no application, operating system, or other management software, other than the at least one control unit, that requires processing time of those multithreaded and/or multi-element processors of the hybrid node. However, the shallow processing units typically complete work in an asynchronous manner, often making it difficult to predict or ascertain the particular state of the workload and/or computation kernel at any given time. Thus, conventional checkpointing of hybrid architecture parallel processing computing systems typically remains inefficient and wasteful, as an entire hybrid architecture parallel processing computing system may have to be halted so as to bring the system to a known state.
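The difficulty described above, namely that asynchronously completing kernels leave the workload in an unpredictable state until everything is halted, can be sketched as follows. This is an illustrative model only: worker threads stand in for shallow processing units, and the `ControlUnit` class, its methods, and `square_kernel` are hypothetical names rather than part of any real hybrid system.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def square_kernel(x):
    """Hypothetical computation kernel with an uneven completion time."""
    time.sleep(0.01 * (x % 3))
    return x * x


class ControlUnit:
    """Hypothetical control unit dispatching kernels to shallow units."""

    def __init__(self, num_units):
        self._pool = ThreadPoolExecutor(max_workers=num_units)
        self._futures = {}

    def launch(self, kernel_id, kernel, data):
        # Kernels start asynchronously; their completion order is not known.
        self._futures[kernel_id] = self._pool.submit(kernel, data)

    def partial_state(self):
        # Non-blocking view: only kernels that happen to be finished now.
        # Its contents depend on timing, so it is not a known state.
        return {k: f.result() for k, f in self._futures.items() if f.done()}

    def quiesce_and_snapshot(self):
        # Conventional approach: wait for every launched kernel to finish,
        # and only then record the state.
        wait(list(self._futures.values()))
        return {k: f.result() for k, f in self._futures.items()}

    def shutdown(self):
        self._pool.shutdown()
```

In this model, `partial_state` may return a different subset of results each time it is called, while `quiesce_and_snapshot` always returns the full set but only after all units have stopped doing new work. That enforced wait is the inefficiency the paragraph above attributes to conventional checkpointing.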
Consequently, there is a need to checkpoint an application of a hybrid architecture parallel processing computing system in such a manner that accounts for the hybrid nature of the computing nodes and brings the application to a known state.