Field of the Invention
The invention relates generally to distributed computer systems and, more particularly, to addressing the failure of a portion of the distributed computer system to complete a distributed task.
Related Art
Distributed computer systems are made up of several computers or nodes that are connected to each other via a communications network. Modern distributed computer systems are capable of performing enormous computing tasks. To do so, they typically take a large computing task and break it down into smaller tasks or “work units,” which can then be distributed amongst several computers or nodes for execution. A work unit is any discrete task designed to be processed by a computer. For instance, in the context of database computing, a work unit might be a subset of data from a query fragment.
A problem can arise when one or more of the computers to which a work unit has been distributed fails to return a result in a timely manner. This can require the entire large computing task to be attempted again, which leads to delays and inefficient use of resources.
What is needed is a distributed computer system that addresses the failure of one of its nodes to execute a work unit. Additionally, what is needed is a method that can detect failure situations and retry the pending work allocated to remote nodes instead of failing or waiting indefinitely for a response.
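By way of illustration only, the behavior called for above might be sketched as follows. This is a minimal, single-process simulation under stated assumptions, not an implementation of any particular embodiment: nodes are modeled as ordinary Python callables, a missed response is modeled by a hypothetical NodeTimeout exception, and every name used (split_into_work_units, run_task, healthy_node, failing_node, unit_size, max_attempts) is invented for this example.

```python
from typing import Callable, List, Sequence


class NodeTimeout(Exception):
    """Hypothetical signal that a node did not return a result in time."""


def split_into_work_units(data: Sequence[int], unit_size: int) -> List[Sequence[int]]:
    """Break a large task (here, a list of numbers) into smaller work units."""
    return [data[i:i + unit_size] for i in range(0, len(data), unit_size)]


def healthy_node(unit: Sequence[int]) -> int:
    """Simulated node that executes its work unit successfully."""
    return sum(unit)


def failing_node(unit: Sequence[int]) -> int:
    """Simulated node that never answers in a timely manner."""
    raise NodeTimeout("no response from node")


def run_task(
    data: Sequence[int],
    nodes: Sequence[Callable[[Sequence[int]], int]],
    unit_size: int = 2,
    max_attempts: int = 3,
) -> int:
    """Distribute work units across nodes, retrying only the unit that failed."""
    partial_results = []
    for index, unit in enumerate(split_into_work_units(data, unit_size)):
        for attempt in range(max_attempts):
            # Assumption: each retry is sent to the next node in round-robin order.
            node = nodes[(index + attempt) % len(nodes)]
            try:
                partial_results.append(node(unit))
                break  # this work unit is done; move on to the next one
            except NodeTimeout:
                continue  # failure detected; retry the pending work unit
        else:
            # Every attempt failed: report it rather than wait indefinitely.
            raise RuntimeError(f"work unit {index} failed after {max_attempts} attempts")
    return sum(partial_results)


if __name__ == "__main__":
    print(run_task([1, 2, 3, 4, 5, 6], [healthy_node, failing_node]))
```

In this sketch only the work unit assigned to the unresponsive node is retried, so the results already returned by other nodes are kept and the entire task need not be attempted again. A real system would detect failure through network timeouts or similar mechanisms rather than an exception raised in the same process.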