a. Field of the Invention
The present invention relates to distributed computer systems. More specifically, it relates to systems and methods for finding and eliminating errors in the user programs of a distributed application.
b. Related Art
In the last two decades, parallel computing has emerged as a formidable and viable programming paradigm in its own right. The emergence can be attributed to two main factors. Firstly, parallel computing provides a mechanism for speeding up computations. Secondly, using multiple computing units enables one to take advantage of resources that may not be available on a particular machine.
One of the manifestations of parallel computing is in the form of a number of processing elements each possessing its own memory and being connected to a common communication network. Programs that run on such a multiplicity of machines are called distributed applications. Various technologies, called middleware, have emerged to enable and enhance distributed programming. Writing such programs can be a difficult and error-prone task. Detecting, locating and eliminating errors in these programs can be a costly and time-consuming process.
The mechanisms for distributed application error detection and resolution in use today typically require a programmer to use single-process debuggers, such as ipmd, dbx and gdb, to control the execution of each of the components of the distributed application, while laboriously controlling their relative speeds of execution and keeping track of the interactions between the various components. This approach is unsatisfactory for a number of reasons. Depending on the relative speeds of execution of the different components and the time taken for messages to traverse the communication network, a distributed application can give rise to a number of different execution sequences. Only a few of these execution sequences may be erroneous, so the chances of the programmer reproducing the same erroneous execution sequence are small. Also, in order to replay the erroneous execution sequence, the programmer may need to remember or manually record large amounts of information.
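By way of illustration only, the growth in the number of possible execution sequences can be sketched as follows. The sketch assumes two hypothetical components, p and q, each performing three local events in a fixed order, and counts every interleaving of those events that preserves each component's own order. The function name `interleavings` and the event labels are illustrative assumptions and do not appear in any of the debuggers named above.

```python
from itertools import combinations

def interleavings(a, b):
    """Yield every merge of a and b that preserves the internal order of each."""
    n, m = len(a), len(b)
    # Choose which of the n + m global positions are occupied by events of a.
    for positions in combinations(range(n + m), n):
        slots = set(positions)
        ia, ib = iter(a), iter(b)
        yield tuple(next(ia) if i in slots else next(ib) for i in range(n + m))

# Two hypothetical components, each performing three local events.
p = ("p1", "p2", "p3")
q = ("q1", "q2", "q3")

sequences = list(interleavings(p, q))
print(len(sequences))  # 20, i.e. C(6, 3)
```

Even in this small case there are 20 distinct global orderings, and the count grows combinatorially with the number of components and events, which is consistent with the difficulty of reproducing one particular erroneous sequence by hand.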
There have been several proposals and attempts to provide improved solutions to distributed application error detection and resolution. One approach is to collect information during the execution of the distributed application so as to reconstruct the sequence of global states of the program. The sequence of states is then inspected by a separate process to find the error. A problem with such schemes is that they are inefficient inasmuch as they require a large quantity of information to be recorded and collected in a centralized process. Further, such schemes do not easily scale as the number of components in the distributed application increases.
Another approach is to log the relative order of events that have occurred in the execution, thus enabling the user to replay the same execution. The success of this scheme rests on the ability to log the factors that influence the order of events in the execution. This approach, too, does not scale as the number of components in the application and the number of interactions between them increase.
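A minimal sketch of such order logging and replay, offered only as an illustration and not as the implementation of any particular prior scheme, is given below. It assumes a single receiver and three hypothetical senders, and it uses a random shuffle as a stand-in for variable network timing. The function `run`, the sender names, and the state update rule are all assumptions introduced for this example.

```python
import random

def run(messages, log=None, rng=None):
    """Deliver one message per sender to a single receiver.

    Record mode (log is None): the arrival order is chosen at random, standing
    in for network timing, and the order of sender ids is returned as the log.
    Replay mode (log given): messages are delivered in the logged order.
    """
    pending = {sender: payload for sender, payload in messages}
    if log is None:
        order = list(pending)
        (rng or random).shuffle(order)
    else:
        order = list(log)
    state = 0
    for sender in order:
        # Non-commutative update, so the final state depends on delivery order.
        state = state * 10 + pending[sender]
    return state, order

messages = [("A", 1), ("B", 2), ("C", 3)]

# Record one execution, then replay it from the logged order of events.
recorded_state, event_log = run(messages, rng=random.Random())
replayed_state, _ = run(messages, log=event_log)
assert replayed_state == recorded_state
```

The log in this sketch holds one entry per delivered message, so its size grows with the number of components and interactions, which is the scaling concern noted above.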