Information drives business. A hardware or software failure affecting a data center can cause days or even weeks of unplanned downtime and data loss that could threaten an organization's productivity. For businesses that increasingly depend on data and information for their day-to-day operations, this unplanned downtime can also hurt their reputations and bottom lines. Businesses are becoming increasingly aware of these costs and are taking measures to plan for and recover from hardware and software failures.
Most complex business applications are run not on a single computer system, but in a distributed system in which multiple computer systems, referred to as nodes, each contribute processing resources and perform different tasks. In such an environment, disruption due to hardware and software failures can be lessened or prevented using a strategy known as clustering. In a clustered environment, computer systems and storage devices are interconnected, typically at high speeds within a local data center, for the purpose of improving reliability, availability, serviceability, and/or performance via load balancing. Redundant interconnections between the computer systems are typically included as well, and the collection of computer systems, storage devices, and redundant interconnections is referred to herein as a cluster. In some implementations, the cluster appears to users as a single highly available system. Different types of clusters may be established to perform independent tasks, to manage diverse hardware architectures performing similar tasks, or when local and backup computer systems are far apart physically.
In some clustering environments, only one of the computer systems in the cluster provides processing resources with respect to a particular software application. In other clustering environments, processing for a single software application is distributed among nodes in the cluster to balance the processing load.
Within a single computer system, multiple threads executing a given software application may access and/or update the same data. The term ‘thread’ is used to describe the context in which a computer program is being executed. This context includes the program code, the data for execution of the program code, a stack, a program counter indicating a memory location from which the next instruction will come, and state information. Coordination is necessary to ensure that one thread does not read shared data at the same time that another thread is updating that data, thereby possibly resulting in data inconsistency depending upon the timing of the two operations. In clustering environments where processing for a given software application is “load balanced,” threads that share data can be running on different nodes within a cluster.
Coordination between threads accessing shared data is often implemented using locks. Typically, a lock is software that protects a piece of shared data; for example, in a file system, a lock can protect a file or a disk block. In a distributed system, a lock can also protect shared “state” information distributed in memories of each node in the system, such as the online or offline status of a given software application. All shared data is protected by a lock, and locks are typically managed by a lock manager, which often provides an interface to be used by other application programs.
A lock is requested before the calling application program can access data protected by the lock. A calling application program can typically request an “exclusive” lock to write or update data protected by the lock or a “shared” lock to read data protected by the lock. If the calling application program is granted an exclusive lock, then the lock manager guarantees that the calling program is the only thread holding the lock. If the calling program is granted a shared lock, then other threads may also be holding shared locks on the data, but no other thread can hold an exclusive lock on that data.
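The shared and exclusive rules just described can be sketched as a small compatibility check. This is only an illustration: the names `Level` and `compatible` are hypothetical and do not come from any particular lock manager.

```python
from enum import Enum


class Level(Enum):
    NONE = 0       # no lock held
    SHARED = 1     # read access; may coexist with other shared holders
    EXCLUSIVE = 2  # write access; must be the only holder


def compatible(requested, held_by_others):
    """Return True if `requested` can coexist with every level already held."""
    if requested is Level.NONE:
        return True
    for held in held_by_others:
        if held is Level.NONE:
            continue
        # Any pairing that involves an exclusive lock is a conflict.
        if requested is Level.EXCLUSIVE or held is Level.EXCLUSIVE:
            return False
    return True


print(compatible(Level.SHARED, [Level.SHARED]))     # shared with shared
print(compatible(Level.EXCLUSIVE, [Level.SHARED]))  # exclusive with shared
```

Under these assumptions, any number of shared holders can coexist, while an exclusive request succeeds only when no other thread holds the lock at all.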
The lock manager cannot always grant a lock request right away. Consider an example where one thread has an exclusive lock L on a given set of data, and a second thread requests shared access to the given set of data. The second thread's request cannot be granted until the first thread has released the exclusive lock on the given set of data.
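A minimal single-process sketch of this deferred grant follows. The class `SimpleLockManager`, its method names, and the first-come, first-served queue policy are assumptions made for illustration; the text above does not specify how waiting requests are ordered.

```python
from collections import deque


class SimpleLockManager:
    """Hypothetical manager for one lock; real threads are not modeled."""

    def __init__(self):
        self.holders = {}       # thread name -> "shared" or "exclusive"
        self.waiting = deque()  # (thread name, mode) in arrival order

    def _can_grant(self, mode):
        if not self.holders:
            return True
        # With holders present, only shared-with-shared is allowed.
        return mode == "shared" and all(
            m == "shared" for m in self.holders.values()
        )

    def request(self, thread, mode):
        """Grant immediately if possible; otherwise queue the request."""
        if not self.waiting and self._can_grant(mode):
            self.holders[thread] = mode
            return True
        self.waiting.append((thread, mode))
        return False

    def release(self, thread):
        """Release a lock and grant queued requests that no longer conflict."""
        del self.holders[thread]
        granted = []
        while self.waiting and self._can_grant(self.waiting[0][1]):
            waiter, mode = self.waiting.popleft()
            self.holders[waiter] = mode
            granted.append(waiter)
        return granted


manager = SimpleLockManager()
manager.request("thread1", "exclusive")  # granted at once
manager.request("thread2", "shared")     # queued behind the exclusive lock
print(manager.release("thread1"))        # thread2 is granted on release
```

The second thread's request is held in the queue and is granted only when the first thread releases its exclusive lock, matching the example in the text.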
A lock can be placed on data that are stored on a shared disk. Locks can also be placed on shared data stored in memory for each node, where the data must be consistent for all nodes in a cluster. For example, nodes in a cluster can share information indicating that a file system is mounted. A lock can be placed on the shared state information when the state of the file system changes from mounted to not mounted, or vice versa.
FIGS. 1 through 4 provide examples of prior art messaging used to implement locks for data sharing. FIG. 1 is a block diagram illustrating prior art initialization of a lock. Two nodes, node 110A and node 110B, share data 152 protected by a lock 150. Lock 150 is managed by lock manager 160, which includes a module on each node: lock agent 130 on node 110A and lock master 140 on node 110B. In many environments, a single lock master exists for each lock, and the lock master resides on one of the nodes. In the example shown, lock master 140 resides on node 110B. Lock master 140 tracks the access levels for a given lock in use on all nodes. Lock master 140 also maintains a queue of unsatisfied locking requests, which lock master 140 grants as threads unlock the corresponding lock. Different locks may have lock masters on different nodes, and all nodes agree on which node masters a given lock.
Each node can have a program that handles access to data protected by each lock. In this example, lock agent 130, a module of lock manager 160, runs on node 110A to provide access to data 152 protected by lock 150. Node 110B may also include another lock agent (not shown) to handle locks for clients on node 110B. If lock agent 130 itself does not have the access level requested by a client, such as client 120, running on node 110A, lock agent 130 calls lock master 140 to request the desired access level for node 110A. Lock master 140 keeps track of the access levels, also referred to as lock levels, held by all of the lock agents, such as lock agent 130, on each node.
Initialization of a lock, such as lock 150, is initiated by a client, or thread, such as client 120 of node 110A. A client calls a lock agent, such as lock agent 130, for the lock protecting the data of interest, such as lock 150. In the embodiment shown in FIG. 1, initialization is performed before the client is ready to use the data and allows a lock agent to prepare for that client's use of the lock. For example, the lock agent may allocate data structures or perform other functions to prepare for the client's use of the lock.
In action 1.1, client 120 running on node 110A requests lock agent 130 to initialize lock 150 on data 152. In action 1.2, lock agent 130 sets up data structures necessary for client 120 to use data 152 protected by lock 150. No communication with lock master 140 is needed to set up the data structures, which are discussed further below with reference to FIG. 3. In action 1.3, lock agent 130 informs client 120 that lock 150 is initialized.
Subsequent requests to initialize locks from client 120 or other clients (not shown) on node 110A can be granted by lock agent 130 by performing actions such as actions 1.1, 1.2, and 1.3. In other embodiments, initializing a lock may include communication with a lock master, such as lock master 140.
FIG. 2 is a block diagram illustrating a prior art first request for access to data protected by a lock that has been initialized and grant of the first request in the environment of FIG. 1. In action 2.1, client 120 requests shared access to data 152 protected by lock 150, which was initialized as described with reference to FIG. 1 above. In action 2.2, lock agent 130 determines that access to lock 150 has not yet been granted to lock agent 130. In action 2.3, lock agent 130 requests shared access to data 152 protected by lock 150 from lock master 140 running on node 110B. Lock master 140 determines in action 2.4 that no other client is currently holding lock 150, and therefore that no contention exists for data 152 protected by lock 150. Contention indicates that other nodes already hold conflicting access levels for this lock. For example, if a node holds shared access to a lock, then no node can be granted exclusive access to data protected by the lock until the shared access is relinquished.
In action 2.5, lock master 140 grants shared access to data 152 protected by lock 150 to lock agent 130. Now that lock agent 130 has been granted shared access to data 152, lock agent 130 can grant shared access to any client running on node 110A that wishes to read data 152. A grant of access to a lock agent, such as lock agent 130, can be viewed as a grant of access to data protected by the lock, here lock 150, corresponding to the lock agent, for the entire node on which the lock agent is running. Lock agent 130 handles requests for access by client processes running on its respective node, in this case, node 110A. In action 2.6, lock agent 130 grants shared access to data 152 protected by lock 150 to client 120.
As shown in FIG. 2, each time lock agent 130 does not have the level of access requested by a client, such as client 120, lock agent 130 sends a message to lock master 140. When a lock agent must communicate with a lock master in order to obtain access to data protected by a lock on behalf of a client, locking is referred to herein as being performed in accordance with a “normal” lock protocol, and the lock itself is referred to as a “normal” lock.
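The uncontended flow of FIGS. 1 and 2 can be sketched as follows. The class and method names are hypothetical, the master here does no contention handling, conflicts between clients on the same node are not modeled, and the sketch assumes that a node holding a higher level can satisfy a lower-level request locally.

```python
NONE, SHARED, EXCLUSIVE = 0, 1, 2  # access levels, lowest to highest


class LockMaster:
    """Records the level granted to each node (FIG. 2 case: no contention)."""

    def __init__(self):
        self.node_levels = {}  # (lock_id, node) -> level

    def request(self, lock_id, node, level):
        # Action 2.4 would check for contention; this sketch assumes none.
        self.node_levels[(lock_id, node)] = level
        return level  # action 2.5: grant to the requesting lock agent


class LockAgent:
    def __init__(self, node, master):
        self.node = node
        self.master = master
        self.levels = {}        # lock_id -> level granted to this node
        self.messages_sent = 0  # inter-node requests sent to the master

    def initialize(self, lock_id):
        # Actions 1.1 to 1.3: local bookkeeping only, no message to the master.
        self.levels.setdefault(lock_id, NONE)

    def acquire(self, lock_id, level):
        # The lock must have been initialized first.
        if self.levels[lock_id] < level:  # action 2.2
            self.messages_sent += 1       # action 2.3
            self.levels[lock_id] = self.master.request(lock_id, self.node, level)
        return level                      # action 2.6: grant to the client


master = LockMaster()
agent = LockAgent("110A", master)
agent.initialize("lock150")
agent.acquire("lock150", SHARED)  # first request goes to the master
agent.acquire("lock150", SHARED)  # later request is satisfied locally
print(agent.messages_sent)
```

Only the first shared request causes a message to the lock master; a later request at a level the agent already holds is answered on the node itself.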
Messaging between nodes is very expensive when compared to normal instruction execution; for example, on a typical computer system, a program can execute 250,000 instructions in the time it takes to send, receive, and process a message. Communicating with other processes on the same node is much less expensive, and therefore it is desirable, when possible, to minimize messages between nodes in favor of communications between processes on the same node. Using lock agents, such as lock agent 130, helps to minimize messaging because the lock agent can grant the access level that the lock agent itself has been granted. However, the lock agent/lock master scheme still requires significant messaging whenever the lock agent has not already been granted the desired access level and the lock master is running on a different node.
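As a rough illustration of this cost, the arithmetic below uses the 250,000-instruction figure quoted above. The function name, the request counts, and the assumption of two messages per remote request (one request and one grant, as in actions 2.3 and 2.5) are hypothetical.

```python
INSTRUCTIONS_PER_MESSAGE = 250_000  # figure quoted in the text


def messaging_cost(requests, granted_locally, messages_per_remote_request=2):
    """Instruction-equivalents spent on inter-node messages.

    Assumes every request not granted locally costs a fixed number of
    messages and that each message costs the same.
    """
    remote = requests - granted_locally
    return remote * messages_per_remote_request * INSTRUCTIONS_PER_MESSAGE


# 100 lock requests, 90 of which the lock agent can grant on its own node.
print(messaging_cost(100, 90))
```

Under these assumptions, even ten remote requests cost the equivalent of five million instructions, which is why granting locally is so valuable.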
FIG. 3 is a block diagram illustrating a prior art first request for access to data protected by a lock with contention between nodes and a grant of the first request. FIG. 3 is shown in the environment of FIGS. 1 and 2. In action 3.1, client 120 requests exclusive access to data 152 protected by lock 150. In action 3.2, lock agent 130 determines that exclusive access to data 152 protected by lock 150 has not yet been granted to lock agent 130. In action 3.3, lock agent 130 requests exclusive access to data 152 protected by lock 150 from lock master 140 running on node 110B, in accordance with the normal lock protocol.
Lock master 140 determines in action 3.4 that data protected by lock 150 is currently held at a shared access level by lock agent 330 running on node 110C, in contrast to the finding in FIG. 2 that no contention was present. Because the data protected by lock 150 is currently held at a shared access level, exclusive access cannot be granted to lock agent 130. Lock master 140 has three options at this point: (1) wait until the client of lock agent 330 holding lock 150 releases lock 150; (2) grant shared access rather than exclusive access to lock agent 130; or (3) request lock agent 330 to release lock 150.
In this example, lock master 140 takes the third option, and in action 3.5, lock master 140 requests lock agent 330 to lower the access level with which lock agent 330 holds data 152 protected by lock 150. Lowering the access level with which a lock agent holds data protected by a lock is also referred to herein as “lowering the access level for the lock,” and locks can be referred to as having an access level. Lowering the access level is also referred to herein as “releasing the access level” or releasing the lock. A request to lower the access level can also be referred to as a revocation request.
In response to the revocation request to lower the lock access level for lock 150, in action 3.6, lock agent 330 waits for clients on node 110C to finish using data 152 so that it can lower the access level of lock 150. In action 3.7, lock agent 330 sends a message indicating that the access level of lock 150 is lowered to a “no lock” access level. Lock master 140 records the fact that lock agent 330 no longer holds lock 150 in a data structure, which is described with reference to FIG. 4 below. Contention no longer exists, so exclusive access is now available to lock agent 130.
In action 3.8, lock master 140 grants exclusive access to data 152 protected by lock 150 to lock agent 130. Now that lock agent 130 has exclusive access to data 152, lock agent 130 can grant exclusive access to data 152 protected by lock 150 to client 120 in action 3.9.
In this example, an additional message was sent by lock master 140 in action 3.5 between nodes to handle contention for data 152 between nodes 110A and 110C. The other two options described above, waiting until the client of the lock agent holding the lock has released the lock, and granting shared access rather than exclusive access, do not require lock master 140 to send additional messages to lock agent 330. Waiting until lock 150 is released would eliminate action 3.5, where lock master 140 requests lock agent 330 to revoke access to data 152 protected by lock 150. However, access to data 152 by client 120 would be delayed until lock agent 330 voluntarily releases lock 150 on data 152. Granting shared access instead of exclusive access would change actions 3.8 and 3.9 to grant shared rather than exclusive access, and would eliminate action 3.5. However, a grant of shared rather than exclusive access would not satisfy the need of client 120, possibly resulting in additional messaging for client 120 to obtain the access level needed.
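The contended flow of FIG. 3, using the third option (revocation), can be sketched as a message log. All names are hypothetical, one lock is modeled, and the wait for local clients in action 3.6 is omitted because the sketch has no clients.

```python
NONE, SHARED, EXCLUSIVE = 0, 1, 2


def conflicts(a, b):
    """Two levels conflict when both are held and at least one is exclusive."""
    return NONE not in (a, b) and EXCLUSIVE in (a, b)


class Agent:
    def __init__(self, node):
        self.node = node
        self.level = NONE

    def lower(self):
        # Action 3.6 would wait for local clients here; none are modeled.
        self.level = NONE


class Master:
    def __init__(self, agents):
        self.agents = {a.node: a for a in agents}
        self.held = {a.node: NONE for a in agents}  # level held by each node
        self.log = []  # inter-node messages as (sender, receiver, kind)

    def request(self, node, level):
        self.log.append((node, "master", "request"))          # action 3.3
        for other, held in self.held.items():
            if other != node and conflicts(level, held):      # action 3.4
                self.log.append(("master", other, "revoke"))  # action 3.5
                self.agents[other].lower()
                self.log.append((other, "master", "lowered")) # action 3.7
                self.held[other] = NONE
        self.held[node] = level
        self.agents[node].level = level
        self.log.append(("master", node, "grant"))            # action 3.8


agent_130, agent_330 = Agent("110A"), Agent("110C")
master = Master([agent_130, agent_330])
master.request("110C", SHARED)     # node 110C already holds shared access
master.log.clear()
master.request("110A", EXCLUSIVE)  # contended request from node 110A
for message in master.log:
    print(message)
```

In this sketch the contended request produces four inter-node messages, compared with two for the uncontended request of FIG. 2.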
FIG. 4 is an example of prior art data structures maintained by the lock agent and lock master of FIGS. 1 through 3. Lock agent 130 of FIG. 1 (not shown) maintains lock agent data structure 432 to track access levels granted to the node on which lock agent 130 resides, node 110A of FIG. 1 (not shown). For each lock, lock agent data structure 432 includes lock identifier 434, the current access level for this node 436, and state information 438. State information 438 enables lock agent 130 to manage multiple requests for the lock identified by lock identifier 434.
Lock master 140 of FIG. 1 (not shown) maintains lock master data structure 442 to track access levels granted to each node. In some embodiments, lock master 140 may track each lock request from each thread on every node, and the data structure in such an embodiment would track lock- and thread-level information. In this example, lock master data structure 442 includes lock identifier 444, access level for node X 446X, access level for node Y 446Y, access level for node Z 446Z, and state information 448.
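The two data structures of FIG. 4 might be represented as shown below. The class names, field names, and field types are assumptions; the text identifies only what each element tracks, not how it is stored.

```python
from dataclasses import dataclass, field


@dataclass
class LockAgentEntry:
    """One entry of lock agent data structure 432 (per lock, per node)."""
    lock_id: str                               # lock identifier 434
    node_level: str = "none"                   # access level for this node 436
    state: dict = field(default_factory=dict)  # state information 438


@dataclass
class LockMasterEntry:
    """One entry of lock master data structure 442 (per lock, all nodes)."""
    lock_id: str                                     # lock identifier 444
    node_levels: dict = field(default_factory=dict)  # levels 446X, 446Y, 446Z
    state: dict = field(default_factory=dict)        # state information 448


agent_entry = LockAgentEntry("lock150", node_level="shared")
master_entry = LockMasterEntry(
    "lock150", node_levels={"X": "shared", "Y": "none", "Z": "none"}
)
print(agent_entry)
print(master_entry)
```

The agent entry holds a single level for its own node, while the master entry holds one level per node, which is what lets the lock master detect contention between nodes.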
As illustrated in this example, communication to request and grant locks in a multi-node environment has heretofore been very expensive and has significantly reduced the time available for processing instructions. What is needed is a system that minimizes messaging between nodes, while allowing locks to be used to enable data sharing among multiple threads running on the nodes.