A digital data processing system, or computer, typically includes a processor, associated memory and input/output units enabling a user to load programs and data into the computer and obtain processed data therefrom. In the past, computers were expensive, and so, to be cost effective, each had to support a number of users. More recently, however, the cost of computers, particularly the processors and memories, has decreased substantially, and so it is relatively cost effective to provide a computer to one user or, at most, a few users.
A benefit of providing only a single computer for a large number of users has been that the users can easily share information. Thus, for example, if all persons working in a bookkeeping or accounting department use a single common computer, they can maintain common accounting and bookkeeping databases up to date, and when necessary, accounting reports can be generated from those databases. However, if they use separate computers, the data is stored in separate databases, on each computer, and so generating accounting reports can be more difficult.
As a result, networks have been developed to provide a distributed computer system, that is, a system that permits diverse computers to communicate and transfer data among them. In addition, these distributed networks allow the sharing of expensive input/output devices, such as printers and mass storage devices, and input/output devices that may be rarely used, such as links to the public telecommunications network. In a distributed network, each computer is a node which communicates with other nodes over one or several wires. In addition, nodes may be provided that store and manage databases or other data files on mass storage devices, or that manage printers or links to the public telecommunications network.
A problem arises with distributed networks, however, since resources used in the distributed network, such as programs and data files, the input/output devices and so forth, are typically stored in, or connected to, only some of the nodes. If clients on other nodes need to use such resources, they must be able to find them. Each client node has had to maintain a file identifying the location, within the distributed network, of all resources that are available to the client. With a large distributed network with many resources, this arrangement requires substantial amounts of memory at each client node devoted only to the storage of such location information. In addition, maintaining location information in the diverse client nodes in an updated and current condition is difficult and requires processing by the client node and transfers over the network which could otherwise be used for more useful information processing purposes.
More recently, naming services have been developed that maintain the identification of the locations of the resources available in a network. Naming services maintain the location information in only a few locations in the network, and provide the information to a client node on request. In addition, the naming services update the location information over the network without client node processing.
In a distributed network data processing system having a naming service, the naming service is simultaneously available at many locations, or nodes, in the network. Some of these nodes are clients of the service, and others are servers. The servers are programs that provide the naming service to the clients. The clients are themselves programs that use the naming service provided by the servers in the distributed network system. The servers must all run similar or identical versions of the software that implements the naming service. When it becomes necessary or desirable to upgrade the naming service to change its behavior or add new features, the software at all of the server nodes must be changed.
One prior art approach to this problem requires that all servers be taken down, reconfigured, and brought back up again. A second prior art approach performs the upgrade as an atomic action using a multi-phase commitment protocol. Because it is rarely feasible to shut down all of the servers simultaneously to install new software, the first approach is undesirable. In the second approach, the multi-phase commitment protocol has several disadvantages, such as locking out client access while in execution and requiring that at least a majority of the replicas of any replicated piece of the system be brought up simultaneously. This protocol is also limited in application because it cannot scale to very large systems.
Each name that is processed by the naming service of a distributed network processing system denotes a single, unique object. Names for objects are recorded in directories, which themselves have names. A directory contains entries comprising both object entries and child pointer entries. An object entry consists of the object's name and a set of attributes for the object, the most prominent of which is the network address where the object currently resides. Child pointer entries link the directories together into a rooted tree, in which there is a single path from the root directory, through a set of child directories, to the desired named object.
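By way of illustration only, the directory structure described above can be sketched in Python. The class names, the dot-separated path syntax, and the address value "node-17" are assumptions made for this sketch and are not drawn from any particular naming service.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ObjectEntry:
    """An object's name plus its attributes (for example, a network address)."""
    name: str
    attributes: Dict[str, str]


@dataclass
class Directory:
    """A named directory holding object entries and child pointer entries."""
    name: str
    objects: Dict[str, ObjectEntry] = field(default_factory=dict)
    children: Dict[str, "Directory"] = field(default_factory=dict)


def resolve(root: Directory, path: str) -> Optional[ObjectEntry]:
    """Follow child pointers from the root, then look up the object entry.

    The dot-separated path syntax is an assumption of this sketch.
    """
    *dir_names, object_name = path.split(".")
    current: Optional[Directory] = root
    for dir_name in dir_names:
        current = current.children.get(dir_name)
        if current is None:
            return None  # no such child directory on the path
    return current.objects.get(object_name)


# Hypothetical namespace: root -> sales -> printer1
root = Directory("root")
sales = Directory("sales")
root.children["sales"] = sales
sales.objects["printer1"] = ObjectEntry("printer1", {"address": "node-17"})
```

In this sketch, resolve(root, "sales.printer1") walks the single path from the root directory through the child directory "sales" to the object entry for "printer1".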
A tree of directories, starting at a root, is called a namespace. A namespace is stored in a partitioned, partially replicated database. The database is partitioned because parts of the namespace are stored in different locations. The database is partially replicated because part of the namespace may be simultaneously stored in multiple locations. The directory is a unit for the purposes of both partitioning and replication. A collection of copies of directories stored on a particular node is called a clearinghouse. The partitioning is accomplished by controlling which directories are stored in which clearinghouses. The replication is accomplished by storing a directory in more than one clearinghouse.
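As a rough sketch of the partitioning and replication just described, the placement of directories in clearinghouses may be modeled as a simple mapping. The clearinghouse names and directory names below are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical placement: clearinghouse name -> names of the directories it stores.
clearinghouses: Dict[str, Set[str]] = {
    "ch_a": {"root", "root.sales"},
    "ch_b": {"root.sales", "root.eng"},
    "ch_c": {"root.eng"},
}


def replicas_of(directory: str) -> List[str]:
    """Return the clearinghouses that hold a copy of the given directory."""
    return sorted(ch for ch, dirs in clearinghouses.items() if directory in dirs)


def is_replicated(directory: str) -> bool:
    """A directory is replicated when more than one clearinghouse stores it."""
    return len(replicas_of(directory)) > 1
```

Here the namespace is partitioned because no single clearinghouse stores every directory, and partially replicated because "root.sales" and "root.eng" are each stored in two clearinghouses while "root" is stored in only one.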
Clearinghouses are either "up" or "down". When a clearinghouse is "up" at a given node, that node is acting as a nameserver. A nameserver can be controlling more than one clearinghouse simultaneously, especially when the failure of one nameserver has resulted in a clearinghouse "moving" to a new nameserver.
A copy of a directory stored in a particular clearinghouse is called a replica. In order to simplify the algorithms for name creation and general namespace maintenance, one of the replicas of a directory is designated to be the master replica for that directory. Creation of new child pointer entries is permitted only through the master replica for the parent directory. Creation of object entries, in addition to any update or deletion, may be directed to another kind of replica storing the appropriate directory, called a secondary replica. A third kind of replica, the read-only replica, only responds to lookup requests and is not permitted to perform creations, updates, or deletions on behalf of clients.
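The permissions of the three kinds of replica can be summarized in a short sketch. The enumeration names are hypothetical, and the sketch assumes that the master replica accepts every operation and that secondary replicas also answer lookup requests, neither of which is stated explicitly above.

```python
from enum import Enum


class ReplicaType(Enum):
    MASTER = "master"
    SECONDARY = "secondary"
    READ_ONLY = "read-only"


class Operation(Enum):
    LOOKUP = "lookup"
    CREATE_OBJECT = "create object entry"
    UPDATE = "update"
    DELETE = "delete"
    CREATE_CHILD_POINTER = "create child pointer entry"


# Assumption: the master accepts everything; a secondary accepts everything
# except creation of child pointer entries; a read-only replica only looks up.
PERMITTED = {
    ReplicaType.MASTER: set(Operation),
    ReplicaType.SECONDARY: set(Operation) - {Operation.CREATE_CHILD_POINTER},
    ReplicaType.READ_ONLY: {Operation.LOOKUP},
}


def is_permitted(replica: ReplicaType, op: Operation) -> bool:
    """Return True if a replica of this type may perform the operation."""
    return op in PERMITTED[replica]
```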
The naming service maintains a distributed database on behalf of its clients. This database does not have the usual characteristics of a distributed database since it provides very loose consistency guarantees to allow high levels of partitioning and replication. A client may get different answers depending on which replica of a directory is queried if updates are still being propagated through the system.
Updates to a namespace are timestamped and applied such that the update with the latest timestamp wins. The updating algorithms are designed such that all updates are "total". This means that an update can always be applied irrespective of the history of past updates. The updates are also "idempotent", meaning that multiple applications of an update to the database have the same effect as a single application of the update. Finally, the updates are "commutative", meaning that the updates can be applied in any order with identical results.
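A minimal sketch of a latest-timestamp-wins update rule with these properties follows. The function names, the integer timestamps, and the attribute values are hypothetical, and the sketch assumes that no two distinct updates to the same attribute carry the same timestamp.

```python
from typing import Dict, Iterable, Tuple

# Replica state: attribute name -> (timestamp, value)
ReplicaState = Dict[str, Tuple[int, str]]


def apply_update(replica: ReplicaState, attr: str, timestamp: int, value: str) -> None:
    """Keep the value with the latest timestamp; older updates are ignored.

    Assumes distinct updates to one attribute never share a timestamp.
    """
    current = replica.get(attr)
    if current is None or timestamp > current[0]:
        replica[attr] = (timestamp, value)


def replay(sequence: Iterable[Tuple[str, int, str]]) -> ReplicaState:
    """Apply a sequence of (attribute, timestamp, value) updates to an empty replica."""
    replica: ReplicaState = {}
    for attr, timestamp, value in sequence:
        apply_update(replica, attr, timestamp, value)
    return replica


# Hypothetical updates, deliberately listed out of timestamp order.
updates = [("address", 3, "node-9"), ("address", 1, "node-2"), ("address", 2, "node-5")]
```

Under the stated assumption, replaying the updates in the listed order, in reverse order, or twice over leaves the replica in the same state, which corresponds to the commutative and idempotent properties described above.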
The primary algorithm for producing convergence among the replicas of a directory is called the skulker. The skulker operates independently on each directory in a namespace. The skulking operation ensures that all replicas of a directory have absorbed all updates applied to any replica prior to the time the skulk started. The more frequently skulks are run, the more up to date the replicas of a directory are kept.
Each skulking operation gathers up all updates made to all of the replicas since the last skulk and applies them to the clearinghouse where the skulk is running. Each skulk also spreads all the gathered updates to all replicas of the directory. Finally, each skulk informs all replicas of the timestamp of the latest update all of them are guaranteed to have seen. This timestamp is known as the "AllUpTo" value of the directory.
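The three steps of a skulk (gather, spread, and advance the "AllUpTo" value) can be illustrated with a simplified in-memory sketch. The Replica class, the integer timestamps, and the skulk_start parameter are assumptions of this sketch; it does not model network transfers, unavailable replicas, or updates that arrive with a timestamp older than the current "AllUpTo" value.

```python
from typing import Dict, List, Tuple

Update = Tuple[int, str, str]  # (timestamp, attribute, value)


class Replica:
    """Hypothetical in-memory replica of one directory."""

    def __init__(self) -> None:
        self.state: Dict[str, Tuple[int, str]] = {}
        self.log: List[Update] = []  # every update this replica has seen
        self.all_up_to = 0           # the directory's "AllUpTo" value

    def apply(self, update: Update) -> None:
        """Record the update and keep the value with the latest timestamp."""
        timestamp, attr, value = update
        if update not in self.log:
            self.log.append(update)
        current = self.state.get(attr)
        if current is None or timestamp > current[0]:
            self.state[attr] = (timestamp, value)


def skulk(replicas: List[Replica], skulk_start: int) -> None:
    # Step 1: gather updates made since the last skulk and before this one started.
    gathered = {
        update
        for replica in replicas
        for update in replica.log
        if replica.all_up_to < update[0] <= skulk_start
    }
    # Step 2: spread the gathered updates to every replica.
    for replica in replicas:
        for update in sorted(gathered):
            replica.apply(update)
    # Step 3: record the latest timestamp all replicas are now known to have seen.
    if gathered:
        latest = max(update[0] for update in gathered)
        for replica in replicas:
            replica.all_up_to = latest
```

In this sketch, an update whose timestamp is later than skulk_start is left for the next skulk, consistent with the guarantee that a skulk covers only updates applied before it started.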