1. Technical Field
This invention generally relates to data processing, and more specifically relates to the sharing of tasks between computers on a network.
2. Background Art
Since the dawn of the computer age, computer systems have become indispensable in many fields of human endeavor including engineering design, machine and process control, and information storage and access. In the early days of computers, organizations such as banks, industrial firms, and government agencies would purchase a single computer that satisfied their needs, but by the early 1950s many companies had multiple computers, and the need to move data from one computer to another became apparent. At this time, computer networks began to be developed to allow computers to work together.
Networked computers are capable of performing tasks that no single computer could perform. In addition, networks allow low cost personal computer systems to connect to larger systems to perform tasks that such low cost systems could not perform alone. Most companies in the United States today have one or more computer networks. The topology and size of the networks may vary according to the computer systems being networked and the design of the system administrator. It is very common, in fact, for companies to have multiple computer networks. Many large companies have a sophisticated blend of local area networks (LANs) and wide area networks (WANs) that effectively connect most computers in the company to each other.
With multiple computers hooked together on a network, it soon became apparent that networked computers could be used to complete tasks by delegating different portions of the task to different computers on the network, which can then process their respective portions in parallel. In one specific configuration for shared computing on a network, the concept of a computer “cluster” has been used to define groups of computer systems on the network that can work in parallel on different portions of a task.
When different computers cooperate to perform a given task, it is desirable to have some fault-tolerance so the computers will know whether or not the task was successfully completed. One way to provide fault-tolerance is to have one of the computer systems act as a leader that monitors completion of the task by the different computers. However, providing a leader is a complex and problematic solution, and there is no guarantee that the leader will run without errors. Another way to provide fault-tolerance is to define global state data that resides in a data structure that may be accessed by any of the computer systems. This scheme allows all the participating computer systems to know if a failure occurs, but this requires some globally-accessible data store. However, accessing this store can result in substantial performance penalties for remote nodes because wide area networks (WANs) typically have poor performance. In addition, a globally-accessible data store provides a single point of failure. A globally-accessible data store also requires that all nodes recognize and have the capability to communicate with the data store (e.g., all nodes need a global file system, a compatible file system, etc.). Without a mechanism for providing improved fault-tolerance in a networked computing system, the computer industry will continue to suffer from known fault-tolerance mechanisms and methods that are excessively inefficient and complex.
According to the preferred embodiments, a clustered computer system includes multiple computer systems (or nodes) on a network that can become members of a group to work on a particular task. Each node includes group state data that represents the status of all members of the group. A group state data update mechanism in each node updates the group state data at acknowledge (ACK) rounds, so that all the group state data in all nodes are synchronized and identical if all members respond properly during the ACK round. Each node also includes a main thread and one or more work threads. The main thread receives messages from other computer systems in the group, and routes messages intended for the work thread to either a response queue or a work queue in the work thread, depending on the type of the message. If the message is a response to a currently-executing task, the message is placed in the response queue. Otherwise, the message is placed in the work queue for processing at a later time.
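The routing rule and the ACK-round update described above can be illustrated with a minimal Python sketch. It is not the implementation of the preferred embodiments. All names here (`Message`, `WorkThread`, `route`, `ack_round`, the `kind` and `task_id` fields, and the "ok"/"failed" status strings) are hypothetical, and the sketch assumes that every node sees the same set of acknowledgements in a round, so that applying the same update on each node leaves all copies of the group state data identical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Message:
    task_id: int       # hypothetical: task the message belongs to
    kind: str          # hypothetical: "response" or "request"
    payload: str = ""


@dataclass
class WorkThread:
    current_task: Optional[int] = None          # task now executing, if any
    response_queue: deque = field(default_factory=deque)
    work_queue: deque = field(default_factory=deque)


def route(work: WorkThread, msg: Message) -> str:
    """Main-thread routing, returning the name of the queue chosen.

    A response to the currently executing task goes to the response
    queue; every other message is deferred to the work queue.
    """
    if msg.kind == "response" and msg.task_id == work.current_task:
        work.response_queue.append(msg)
        return "response"
    work.work_queue.append(msg)
    return "work"


def ack_round(group_state: Dict[str, str], acks: Dict[str, bool]) -> Dict[str, str]:
    """Return a new copy of the group state after one ACK round.

    Assumption: a member that did not acknowledge is marked "failed".
    Because the update depends only on the shared set of ACKs, every
    node that runs it ends up with the same group state.
    """
    return {
        member: "ok" if acks.get(member, False) else "failed"
        for member in group_state
    }
```

In this sketch a work thread would drain its response queue while a task is in progress and turn to its work queue only once that task completes, which matches the deferred processing described for messages that are not responses to the current task.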