1. Field of the Invention
The present invention relates to the operation of distributed processing computer systems. In particular, it relates to failure recovery in those systems that have a plurality of processing nodes, each one having access to a number of shared resources controlled by a master process. Still more particularly, the present invention relates to the management of shared access, including the passing of a token or write lock that grants permission to one of a number of distributed processes, allowing that process to update a data item.
2. Background and Related Art
Distributed computer systems are created by linking a number of computer systems using a communications network. Distributed systems frequently have the ability to share data resident on an individual system. Sharing can take many forms. Simple file sharing allows any of the distributed processes to access files regardless of the physical system on which they reside. Device sharing similarly allows use of physical devices regardless of location. Replicated data systems implement data sharing by providing a replica copy of a data object to each process using that data object. Replication reduces the access time for each processor by eliminating the need to send messages over the network to retrieve and supply the necessary data. A replicated object is a logical unit of data existing in one of the computer systems but physically replicated to multiple distributed computer systems. Replicated copies are typically maintained in the memories of the distributed systems.
Replicated data objects also speed the update process by allowing immediate local update of a data object. Replication introduces a control problem, however, because many copies of the data object exist. The distributed system must have some means for controlling data update to ensure that all copies of the data remain consistent.
Prior art systems control data consistency by establishing a master data object copy in one of the distributed systems. The master copy is always assumed to be valid. A data object update by a system other than that holding the master copy requires sending the update request to the master for update and propagation to all replicas. This approach has the disadvantage of slowing local response time while the master data object update and propagation are performed.
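By way of illustration only, the master-copy approach described above may be sketched as follows. The class and method names (MasterCopy, Replica, register, write) are hypothetical and are not drawn from any prior art system; the network is modeled as direct method calls, which is a simplifying assumption.

```python
class MasterCopy:
    """Hypothetical master holding the copy that is always assumed valid."""

    def __init__(self, value):
        self.value = value
        self.replicas = []

    def register(self, replica):
        # A new replica starts with the current master value.
        replica.value = self.value
        self.replicas.append(replica)

    def update(self, new_value):
        # Update the master copy first, then propagate to every replica.
        self.value = new_value
        for replica in self.replicas:
            replica.value = new_value


class Replica:
    """Hypothetical replica held by a non-master system."""

    def __init__(self, master):
        self.value = None
        self.master = master
        master.register(self)

    def read(self):
        # Reads are served locally, with no network message.
        return self.value

    def write(self, new_value):
        # No local update is made here: the request goes to the master,
        # and the local copy changes only when propagation reaches it.
        self.master.update(new_value)
```

In this sketch a write by one replica becomes visible locally only after the round trip through the master, which corresponds to the slowed local response time noted above.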
Another means for controlling replicated data is described in Moving Write Lock for Replicated Objects, commonly assigned, filed on Oct. 16, 1992 as Ser. No. 07/961,757 and having attorney docket number AT992-046. The apparatus and method of that invention require that a single "write lock" exist in a distributed system and be passed to each process on request. Data object updates can only be performed by the holder of the "write lock." The "write lock" holder may update the local object copy and then send that update to the master processor for its update and propagation to other processes. The above patent application is incorporated by reference.
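The write lock scheme summarized above may be sketched, purely as an illustration, in the following form. The names Master, Process, request_lock and apply_update are hypothetical, and the sketch assumes that the current holder surrenders the lock immediately on request; the referenced application may handle lock transfer differently.

```python
class Master:
    """Hypothetical master process: keeps the master copy and tracks the lock."""

    def __init__(self, value):
        self.value = value
        self.processes = []
        self.lock_holder = None  # the single write lock in the system

    def attach(self, process):
        process.local = self.value
        self.processes.append(process)

    def request_lock(self, process):
        # Assumption: the previous holder gives up the lock at once.
        self.lock_holder = process

    def apply_update(self, process, new_value):
        # Only the holder of the write lock may update the data object.
        if process is not self.lock_holder:
            raise PermissionError("only the write lock holder may update")
        self.value = new_value
        # Propagate to the other processes; the sender already updated locally.
        for other in self.processes:
            if other is not process:
                other.local = new_value


class Process:
    """Hypothetical distributed process holding a replica of the data object."""

    def __init__(self, name, master):
        self.name = name
        self.master = master
        self.local = None
        master.attach(self)

    def update(self, new_value):
        self.master.request_lock(self)             # obtain the write lock
        self.local = new_value                     # immediate local update
        self.master.apply_update(self, new_value)  # then send to the master
```

The local copy is updated before the master is contacted, so in this sketch there is an interval during which the new value exists only at the lock holder.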
The method for determining which of a number of distributed processes is to be master is described in pending patent application Ser. No. 07/961,750 filed Oct. 16, 1992 and entitled Determining a Winner of a Race in a Data Processing System, commonly assigned and bearing attorney docket number AT992-117. The "race" between each process potentially controlling a resource results in the assignment of master status to the process first establishing write control over a Share Control File. Once control has been established by one process, other processes are assigned "shadow" status. Master process failure causes reevaluation of master status. This patent application is also incorporated by reference.
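As a rough illustration of such a race, the following sketch uses exclusive creation of a file as a stand-in for "establishing write control over a Share Control File". This substitution is an assumption made for the example only; the mechanism actually used in the referenced application is not described here. The function names and the file name share_control are likewise hypothetical.

```python
import os
import tempfile


def race_for_master(control_path):
    """Return ("master", fd) for the first caller, ("shadow", None) otherwise."""
    try:
        # O_CREAT | O_EXCL succeeds for exactly one caller while the file exists.
        fd = os.open(control_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return "shadow", None
    return "master", fd


def simulate_master_failure(control_path, fd):
    """Release write control, as if the master process had failed."""
    os.close(fd)
    os.remove(control_path)


def demo():
    """Run two contenders, fail the master, then let a shadow race again."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "share_control")
        first, fd = race_for_master(path)    # wins the race
        second, _ = race_for_master(path)    # loses, becomes a shadow
        simulate_master_failure(path, fd)    # master status is reevaluated
        third, new_fd = race_for_master(path)
        os.close(new_fd)
        return [first, second, third]
```

Calling demo() runs the three steps in order: the first contender becomes master, the second becomes a shadow, and after the simulated failure a new race selects a new master.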
The technical problem addressed by the present invention is providing fault-tolerant features to a distributed processing system that controls resource sharing by designating a master process for each shared resource. The problem of systems using write locks or tokens to manage replicated data objects is also addressed. Fault tolerance is required to ensure that no data or updates are lost due to the failure of a master process. Prior art systems, including those referenced above, require the master determination and write lock control to be reinitialized after a master process failure. This could result in loss of data if locally updated data has not been propagated to the master or other replicas.