A distributed computer system includes multiple autonomous computers, each referred to as a node, that communicate with each other through a computer network. The computers interact to achieve a common goal: a problem is divided into many tasks, which are distributed across the nodes of the system. A distributed computer system can often process several computations concurrently and run parallel computer applications on its nodes. A computer cluster is one form of such a distributed computer system.
A computer cluster is a group of linked computers that work together so that they can appear as a single computer. A computer cluster can be deployed to improve performance, reliability, or availability over that of a single computer in the cluster, and computer clusters are often more cost effective than single computers of comparable speed and performance. A computer cluster often includes a group of identical or similar computers tightly connected to each other through a fast local area network. One category of such a computer cluster is the Beowulf Cluster, which is typically used for high performance parallel computing.
Parallel, or concurrent, computer applications include concurrent tasks that can be executed on distributed computer systems such as a Beowulf Cluster. Concurrent applications can solve complex problems with higher performance than sequential applications, but parallel programming also presents significant challenges to developers. Parallel applications are written so that their concurrent tasks are distributed over the computers, or nodes, of the system. To write effective parallel code, a developer often identifies opportunities to express parallelism and then maps the execution of the code onto the multi-node hardware. These tasks can be time consuming, difficult, and error-prone because there are so many independent factors to track. With current developer tools, it remains difficult to understand the behavior of parallel applications and their interactions with other processes that share the processing resources or libraries of a distributed computer system.