In the past, a computing system included a number of computers connected by a network, which allowed the computers to communicate and pass information and/or data between themselves. The networks ranged from a local system, such as a local area network (LAN), to a very large and expansive network, such as a wide area network (WAN) or the Internet. The computing system also had various operating systems, storage mediums, and other processing resources interconnected and accessible via the network through various software products.
Further, software information service systems and applications were included in the computing system. These systems and applications ranged from commercially available software applications (e.g. spreadsheets, word processors, databases) to custom developed software products tailored for specific use within specific computing systems.
Customers or clients (i.e., companies, organizations, groups, or individuals) can also be viewed as part of the computing system. Generally, multiple types of clients are associated with a distributed computing system, including application users, application developers, and system administrators. Within the distributed computing system, the many different clients typically require concurrent access to a number of different application systems or processes, so it is necessary to allocate resources and prevent redundant use of resources.
In these types of systems, there is a recurrent problem termed the “online validation problem” (OVP). In the OVP, a group of processes validate some request for service (e.g., if OVP is used to solve a concurrency control problem, the process is requesting to enter the critical section). As the result of the validation, only one process accepts the request (i.e., following the previous example, only one process enters the critical section), and all the others reject the request. A request is accepted if it has not been accepted before, and rejected otherwise.
More specifically, a group of processes want to validate some request for service R. Processes may crash during the execution of the validation so a solution must respect two requirements: (a) if a process accepts R, then no other process accepts R, and a process does not accept R more than once; and (b) if no process, among the ones validating R, crashes, then there is a process that eventually accepts R.
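The two requirements can be expressed as checks over a recorded execution. The following Python sketch is illustrative only: the trace format (a list of `(process_id, request_id)` acceptance events) and the function names are assumptions made for this example, and requirement (b), which says "eventually", is approximated by inspecting a finished trace.

```python
from collections import Counter

def satisfies_requirement_a(accept_events):
    """Requirement (a): each request is accepted at most once overall.

    accept_events is a hypothetical trace: one (process_id, request_id)
    pair per acceptance. Two acceptances of the same request violate (a),
    whether they come from different processes or from the same one.
    """
    accepts_per_request = Counter(request for _, request in accept_events)
    return all(count <= 1 for count in accepts_per_request.values())

def satisfies_requirement_b(validators, crashed, accept_events, request):
    """Requirement (b): if no validating process crashes, some process accepts.

    The requirement only constrains crash-free executions, so it holds
    vacuously when any validator crashed.
    """
    if validators & crashed:
        return True
    return any(r == request and p in validators for p, r in accept_events)
```

A protocol that always refuses R passes the first check on every trace but fails the second on any crash-free trace, which is why both requirements are needed together.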
Due to process crashes, solving OVP is not simple, as described next.
(i) Making sure that R is not accepted more than once can be trivially satisfied by a protocol that systematically refuses R. Such a solution, however, does not ensure requirement (b). Fulfilling both requirements (a) and (b) requires processes to exchange information about the validation of R in such a way that if two (or more) processes validate R, only one accepts it. This is no trivial task because processes have incomplete information about one another.
(ii) One way to ensure requirement (a) is to have a centralized process that decides whether R should be accepted or rejected. Nevertheless, such a solution has a single point of failure: if the process responsible for the decision crashes, requirement (b) will not be satisfied. A solution to this problem is to detect the crash of processes, but this means that failure detection has to be accurate, something difficult to achieve in most practical cases.
(iii) Another way is to have a transactional database system solve OVP. However, in order to guarantee that only one process accepts R in case of failures, the transactional database may have to delay the validation until after some processes recover. Transaction-based solutions may therefore delay the result of the validation and make extra assumptions (i.e., no permanent process crash).
(iv) Solutions relying on perfect failure detection mechanisms can result in unexpected behaviors if the failure detectors make mistakes. Given the unpredictable behavior of commercial network systems, it is hard to tell whether a non-responding process has crashed or is just too busy. It is actually very dangerous to design a system that relies on perfect accuracy of failure detectors.
It is difficult to design systems that can rely on unreliable failure detectors because such systems must be flexible enough to accept wrong failure suspicions, yet strong enough to guarantee that some useful computation is performed.
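The difficulty of telling a crashed process from a busy one can be illustrated with a timeout-based failure detector. This is a minimal sketch under assumed conditions: the heartbeat timestamps, the timeout value, and the process names are hypothetical, and the text does not prescribe any particular detection mechanism.

```python
def suspected_processes(last_heartbeat, now, timeout):
    """Return the processes whose last heartbeat is older than `timeout`."""
    return {p for p, seen in last_heartbeat.items() if now - seen > timeout}

# Hypothetical heartbeat timestamps, in seconds.
last_heartbeat = {"p1": 9.5, "p2": 4.0, "p3": 1.0}

# Hypothetical ground truth, unknown to the detector:
# p3 has crashed, while p2 is alive but too busy to send heartbeats.
crashed = {"p3"}

suspects = suspected_processes(last_heartbeat, now=10.0, timeout=5.0)
false_suspicions = suspects - crashed  # processes wrongly suspected
```

Here the detector suspects both `p2` and `p3`, although only `p3` has crashed. Raising the timeout reduces such false suspicions but delays the detection of real crashes, so a protocol built on this kind of detector has to remain correct when a suspicion is wrong.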
Solutions to problems similar to OVP largely rely on transactional databases to prevent several processes from accepting the same request R. The key idea is to synchronize transactions (e.g., by means of locking) at some central validation server that allows only one transaction related to R to be active at a time. In such a scheme, the first transaction to lock the database record related to R accepts R, and all the others reject R. Relying on a centralized database may block the system in the event of a single crash, and so does not fulfill requirement (b) presented in the previous section. A solution to this problem is to use a highly available database, which is expensive.
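The centralized scheme described above can be sketched as follows. This is an illustration under stated assumptions, not a database implementation: an in-process `threading.Lock` stands in for the lock on the database record related to R, threads stand in for processes, and the class and function names are hypothetical.

```python
import threading

class CentralValidator:
    """Hypothetical central validation server for requests."""

    def __init__(self):
        self._lock = threading.Lock()   # stands in for the record lock
        self._accepted = set()          # requests already accepted

    def validate(self, request_id):
        """Return True (accept) for the first caller, False (reject) after."""
        with self._lock:
            if request_id in self._accepted:
                return False
            self._accepted.add(request_id)
            return True

def run_demo(num_processes=8, request_id="R"):
    """Let several threads validate the same request and collect outcomes."""
    validator = CentralValidator()
    results = []
    results_lock = threading.Lock()

    def worker():
        outcome = validator.validate(request_id)
        with results_lock:
            results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(num_processes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Exactly one entry in the list returned by `run_demo()` is `True`, which corresponds to requirement (a). The sketch also shows the weakness noted in the text: all state lives in the single `CentralValidator`, so if it becomes unavailable no request can be accepted.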
Database systems supporting asynchronous data replication, such as Tandem Remote Data Facility (RDF) and Microsoft SQL Server, are immediately ruled out because such systems provide weak consistency and may allow R to be accepted more than once, even in executions without crashes. Synchronous data replication systems, such as Oracle Parallel Server (OPS) and Informix Extended Parallel Server (XPS), use clusters with or without shared disks and do not suffer from the same problem. However, a failover requires log-based recovery: if one process takes over for a failed process, it must reconcile its own state with the log of the failed process. Moreover, to use a parallel database system as a highly available transaction processing system, database processes on different machines must access the same disks. This requires special hardware and software, such as high availability clusters.
Several protocols have been proposed to implement “quorum systems”. Quorum systems are a distributed, fault-tolerant synchronization mechanism for replicated databases and objects in general. Although several variants exist, they essentially all detect conflicting requests by means of quorum intersections. Briefly, in order to treat a request r, a server has to gather the approval of a quorum of servers, for example Q(r). If requests r1 and r2 conflict, Q(r1) and Q(r2) are such that their intersection contains at least one server, which detects the conflict and refuses either r1 or r2. Quorums are a safety mechanism (i.e., they prevent multiple processes from accepting r), but thus far, no liveness guarantees have been associated with them (i.e., it may happen that, even in an execution where no process crashes, no process accepts r). Moreover, when used with replicated databases, quorums may lead to distributed deadlocks, which are expensive to resolve.
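The safety property and the missing liveness guarantee can both be seen in a small model. The sketch below assumes majority quorums and a rule under which each server approves only the first request it sees. Both are choices made for this example, since the text does not fix a particular quorum construction, and all names are hypothetical.

```python
from itertools import combinations

def majority_quorums(servers):
    """All subsets containing a strict majority of the servers."""
    size = len(servers) // 2 + 1
    return [frozenset(q) for q in combinations(servers, size)]

def all_quorums_intersect(quorums):
    """Safety basis: every pair of quorums shares at least one server."""
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))

def is_accepted(first_seen, request, quorums):
    """A request is accepted if some whole quorum approved it.

    first_seen maps each server to the request it saw first; in this
    model a server approves that request and refuses any conflicting one.
    """
    approvals = {s for s, r in first_seen.items() if r == request}
    return any(q <= approvals for q in quorums)

servers = ["s1", "s2", "s3", "s4"]
quorums = majority_quorums(servers)

# Crash-free execution in which conflicting requests r1 and r2 each
# reach two of the four servers first.
split_votes = {"s1": "r1", "s2": "r1", "s3": "r2", "s4": "r2"}
```

Because any two majority quorums intersect, `r1` and `r2` can never both be accepted. In the `split_votes` execution, however, neither request gathers three approvals, so `is_accepted` returns `False` for both even though no server crashed. This is the liveness gap described above.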
Mutual exclusion protocols solve the resource allocation problem, which, roughly speaking, requires that at most one process be in the critical section at a time and that, if several processes request access to the critical section, one be granted permission. Given the similarities between OVP and the resource allocation problem, one might expect a mutual exclusion protocol to solve OVP. However, few studies on resource allocation address high availability issues. One study has proposed a modular algorithm for resource allocation in distributed systems that tolerates the failure of some components of the system. However, the study assumes one processor for each resource, and the failure of such a processor renders the resource unavailable (although other resources can still be accessed).
Solutions to the OVP have long been sought, but have long eluded those skilled in the art.