Cluster computing is becoming increasingly important as high-performance computing gains importance across the domains it touches, from scientific and financial computing to entertainment and manufacturing, to name but a few.
Cluster computing systems allow multiple computing nodes to work together in accomplishing a computational task. The cluster presents a unified system image, such that a client looking into the cluster sees not any single node of the cluster but the cluster system as a whole. The computing nodes are typically connected through one or more computing networks such that each node in the cluster is capable of communicating with every other cluster node. The computers in a cluster typically share a disk, a disk array, or other nonvolatile mass storage subsystems, such as RAM drives. Computers that are merely networked, such as clients of the Internet or a LAN, are not considered a cluster because they necessarily appear to users as a collection of connected computers rather than a single computing system. “Users” in this context can include both human users and application programs, where programs include tasks, threads, processes, routines, and other interpreted or compiled software.
Although every node in a cluster can be the same type of computer, a major advantage of clusters is their support for heterogeneous nodes. As the computing power available in all types of computing devices continues to increase, a cluster may well include computing systems such as a graphics workstation, a diskless computer, a laptop, a symmetric multiprocessor, and multiple versions of servers.
In a computing cluster, it must be possible to run an application program on the cluster without requiring that the application program distribute itself among the nodes. This is accomplished in part by providing cluster system software that manages use of the cluster nodes by application programs. Such complex software systems, however, bring implementation and operational difficulties of their own. Software errors, omissions, or incompatibilities may halt (or crash) all useful processing on a node. The goal of maintaining cluster availability dictates rapid detection of the crash and rapid compensation, either by restoring the node or by proceeding without it. Detection and compensation may be performed by cluster system software or by a cluster-aware application.
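The detection and compensation steps described above can be illustrated with a minimal sketch. It assumes a heartbeat-and-timeout scheme, which is one common way to detect a crashed node and is not named in the text; the class `HeartbeatMonitor`, its methods, and the `timeout_s` parameter are hypothetical names chosen for illustration only.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper: tracks the last heartbeat seen from each node."""

    timeout_s: float                              # assumed silence threshold
    clock: Callable[[], float] = time.monotonic   # injectable for testing
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, node: str) -> None:
        """Record that `node` reported in at the current time."""
        self.last_seen[node] = self.clock()

    def failed_nodes(self) -> List[str]:
        """Detection: nodes silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )

    def compensate(self, node: str) -> None:
        """Compensation by proceeding without the node: drop it from membership."""
        self.last_seen.pop(node, None)
```

In this sketch, compensation takes the "proceed without it" branch; restoring the node would instead require a restart mechanism that lies outside what the sketch models.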
Debuggers may also be used by programmers to identify the source of certain problems. Currently, however, no fully satisfactory parallel debuggers exist. Moreover, conventional debugging breakpoints are not suited to debugging large-scale cluster applications and deployed applications. Traditional debugging involves placing breakpoints throughout the code using a fairly arcane command string; each breakpoint can then be examined dynamically by completely stopping the program, running some sort of macro that logs information, and then allowing the program or process to continue execution. Applying such techniques to parallel processes, however, can severely impact operation of the cluster. Accordingly, there is an unmet need for an improved debugging mechanism in cluster computing systems and/or distributed application environments.
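The impact of stop-log-continue breakpoints on parallel processes can be seen in a small sketch. It is an illustration under stated assumptions, not a model of any particular debugger: threads stand in for cluster nodes, a `threading.Barrier` stands in for a synchronization point between them, and `conventional_breakpoint`, `worker`, and `run` are hypothetical names. Only one worker carries the breakpoint, yet every peer stalls at the barrier while that worker is halted.

```python
import threading
import time


def conventional_breakpoint(log, label, state, pause_s):
    """Stop the caller, log a snapshot of its state, then let it continue."""
    time.sleep(pause_s)               # stands in for the process being halted
    log.append((label, dict(state)))  # the "macro that logs information"


def worker(rank, barrier, log, pause_s):
    for step in range(3):
        if rank == 0:                 # only rank 0 carries the breakpoint
            conventional_breakpoint(
                log, f"rank0-step{step}", {"step": step}, pause_s
            )
        barrier.wait()                # all peers wait here for the halted rank


def run(n_workers=4, pause_s=0.05):
    """Run the workers and return the log plus total wall-clock time."""
    barrier = threading.Barrier(n_workers)
    log = []
    threads = [
        threading.Thread(target=worker, args=(rank, barrier, log, pause_s))
        for rank in range(n_workers)
    ]
    start = time.monotonic()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log, time.monotonic() - start
```

Calling `run()` returns the logged snapshots and the elapsed time. Because each of the three steps waits on the paused worker, the whole group takes roughly three pause intervals to finish, even though three of the four workers never hit a breakpoint.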