Parallel computing is the simultaneous use of more than one central processing unit (“CPU”) to execute a program. Ideally, parallel processing makes a program run faster because more CPUs are executing it. Parallel computing takes many forms, including clusters and grid computing, in which networked computers execute portions of a program simultaneously. Developing parallel applications is notoriously difficult, and parallel applications are commonly considered the most difficult type of application to debug.
Previous attempts to debug parallel applications have been either too cumbersome or too restrictive. In one solution, each process making up the parallel application is started in a suspended mode at each computer or node of the cluster. A debugging client is also executed at each node. Each debugging client attaches to the respective process running on that particular node, and the processes are restarted simultaneously. The resulting debugger information is then collected from each node to debug the application as a whole. This solution scales poorly: as the number of nodes in the grid grows, so does the number of debugger clients required, making the debugging of large-scale parallel applications very difficult.
Another solution, utilized in the TotalView® debugger by Etnus, involves the use of a separate debugging application programming interface (“API”) that is built directly into the message passing interface (“MPI”). Application designers add code from the debugging API to allow them to more easily debug the resulting parallel applications. This solution is inflexible because it ties the application to the particular debugger chosen and provides no support for older parallel applications. In addition, because this solution is tied to the MPI, it only allows debugging of the portions of the application that are actually executed in parallel. Parallel applications, however, frequently execute only some of their code in parallel, so the remaining portions cannot be debugged in this way.
What is needed are scalable and flexible systems and methods for parallel debugging that are independent of both the application being debugged and the underlying MPI.