Most complex computer software applications are run not on a single computer system, but in a distributed system in which multiple computer systems, referred to as nodes, each contribute processing resources and perform different tasks. The computer systems in the distributed system can be connected via a network to each other when geographically dispersed or may operate as a cluster of nodes. In either configuration, typically each node is connected to one or more storage devices either directly or via a network connection. A common configuration is for each node to have its own dedicated storage devices as well as shared storage accessible to some or all nodes in the distributed system.
FIG. 1 is a block diagram illustrating a prior art distributed network in which the invention operates. Node 110A, node 110B and node 110C (collectively referred to as nodes 110) are examples of nodes in the distributed system. Hardware components of a computer system that can be used to provide each of nodes 110 are described in further detail with reference to FIG. 9. Shared storage 150 stores data shared by nodes 110. In one embodiment, shared storage 150 is a data volume to which shared data are written, and from which shared data are read, by each of nodes 110.
Nodes 110 are connected to each other and to shared storage 150 via a set of communication links making up a network 102. One of skill in the art will recognize that nodes 110 may be part of a cluster of nodes where all storage is shared, in contrast to the example given in FIG. 1 showing an underlying network rather than connections within the cluster. Network 102 is represented in FIG. 1 as having link 102AB connecting nodes 110A and 110B, link 102BC connecting nodes 110B and 110C, and link 102AC connecting nodes 110A and 110C. Network 102 is also shown as including link 102AS from node 110A to shared storage 150, link 102BS from node 110B to shared storage 150, and link 102CS from node 110C to shared storage 150. One of skill in the art will recognize that different physical network configurations can be used to implement these communication links. For example, the node-to-node communication links and node-to-storage links may communicate over physically separate networks, such as the node-to-node links over an Ethernet Transmission Control Protocol/Internet Protocol (TCP/IP) network and the node-to-storage links over a separate fibre channel storage area network (SAN). The protocols used for communicating storage information typically differ from the protocols used to communicate between nodes, although the use of different protocols is not a requirement of the invention.
In an alternative implementation, both the node-to-node links and the node-to-storage links can be implemented over the same physical network if that network can carry both input/output (I/O) storage communication and inter-node communication simultaneously. Examples of such implementations are TCP/IP over an underlying fibre channel storage area network (SAN), a multiplexing of multiple protocols over Infiniband (IB), or a storage protocol such as Internet Small Computer System Interface (iSCSI) layered over TCP/IP on an Ethernet network supporting a high bit rate (e.g., one to ten gigabits per second (Gbps)).
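By way of illustration only, the link topology of FIG. 1 and the two physical configurations described above (separate networks versus a single shared network) can be modeled as a small data structure. The following Python sketch is not part of the described system: the class name `Link`, the helper functions, and the network labels such as "ethernet-tcpip" and "fc-san" are hypothetical names chosen for this example, while the link and node identifiers are the reference numerals of FIG. 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One communication link of network 102 (illustrative model only)."""
    name: str      # reference numeral from FIG. 1, e.g. "102AB"
    end_a: str     # first endpoint (a node)
    end_b: str     # second endpoint (a node or shared storage 150)
    network: str   # hypothetical label for the physical network carrying the link


def build_links(node_network: str, storage_network: str) -> list:
    """Return the six links of FIG. 1, tagged with the network that carries them."""
    return [
        Link("102AB", "110A", "110B", node_network),
        Link("102BC", "110B", "110C", node_network),
        Link("102AC", "110A", "110C", node_network),
        Link("102AS", "110A", "150", storage_network),
        Link("102BS", "110B", "150", storage_network),
        Link("102CS", "110C", "150", storage_network),
    ]


def physical_networks(links: list) -> set:
    """Return the set of distinct physical networks used by the links."""
    return {link.network for link in links}


# Separate networks: node-to-node over Ethernet TCP/IP, node-to-storage over a SAN.
separate = build_links("ethernet-tcpip", "fc-san")

# Shared network: both kinds of traffic carried by one physical network,
# for example iSCSI layered over TCP/IP on Ethernet.
shared = build_links("ethernet-tcpip", "ethernet-tcpip")
```

In this model, `physical_networks(separate)` contains two labels and `physical_networks(shared)` contains one, which reflects the distinction drawn above without changing the logical set of six links.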
To enable nodes in the distributed system 100 to share data, nodes can be allowed to read from and write to shared storage 150. However, the possibility exists that, while one node is reading data from a region of shared storage 150, another node may simultaneously be writing data to that region and/or performing another type of operation, such as a reconfiguration of shared storage 150, that invalidates the data being read. A solution is needed to recognize invalidating operations, such as a write operation overlapping with the read operation, and to inform a reader when the data need to be read again because the invalidating operation has corrupted the copy of the data read by the reader and/or caused that copy to be out of date. Preferably, the reader is informed of a location from which current valid data can be obtained.
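The problem stated above can be made concrete with a short sketch. The Python example below merely illustrates what it means for a write operation to overlap a read operation, both in the region of shared storage touched and in time; it is not the solution called for above, and it does not address reconfiguration or the location of current valid data. The names `Operation`, `regions_overlap`, `times_overlap`, and `read_is_invalidated` are hypothetical, as is the assumption that each operation can be described by a byte offset, a length, and start and end times.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """A read or write against shared storage 150 (hypothetical model)."""
    node: str      # issuing node, e.g. "110A"
    offset: int    # first byte of the region touched
    length: int    # number of bytes touched
    start: float   # time at which the operation began
    end: float     # time at which the operation completed


def regions_overlap(a: Operation, b: Operation) -> bool:
    """True if the two operations touch at least one common byte."""
    return a.offset < b.offset + b.length and b.offset < a.offset + a.length


def times_overlap(a: Operation, b: Operation) -> bool:
    """True if the two operations were in progress at the same time."""
    return a.start < b.end and b.start < a.end


def read_is_invalidated(read: Operation, writes: list) -> bool:
    """True if any write overlapped the read in both region and time,
    in which case the reader's copy may be corrupt or out of date."""
    return any(regions_overlap(read, w) and times_overlap(read, w) for w in writes)
```

For example, a read by node 110A of bytes 0 through 99 during the interval from time 0.0 to time 2.0 would be flagged if node 110B wrote bytes 50 through 59 during the interval from 1.0 to 1.5, but not if that write touched a disjoint region or began after the read completed.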