The present invention relates generally to distributed data storage systems and more particularly to determining resource availability in such systems during transient failures.
In the past, the storage network, which linked a host such as a computer and its related data storage system, was fairly slow. As a result, data was generally not distributed across the data storage system because such distribution would increase access time.
With the advent of fiber optic technology for the data storage network, it became possible to have extremely fast access to data even when it is placed remotely and on a number of different data storage devices. One of the advantages of placing data, or “striping” data, on a number of different data storage devices is that the data storage devices can be accessed in parallel so the bandwidth can be substantially increased. Essentially, striping involves placing a first set of bytes on the first data storage device, the next set of bytes on the next data storage device, etc., and then wrapping around so that, with three data storage devices, the fourth set of bytes is placed on the first data storage device, and so on. With three data storage devices, there is essentially three times the bandwidth over that of a single data storage device. This is essentially how a RAID array (redundant array of inexpensive disks) works.
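The round-robin placement described above can be illustrated with a minimal Python sketch. The function names `device_for_unit` and `stripe`, and the fixed `unit_size` parameter, are assumptions made for illustration only and are not taken from any particular RAID implementation.

```python
def device_for_unit(unit_index: int, num_devices: int) -> int:
    """Return the zero-based device that holds a given stripe unit.

    Units are assigned in order and wrap around after the last device.
    """
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    return unit_index % num_devices


def stripe(data: bytes, num_devices: int, unit_size: int) -> list[list[bytes]]:
    """Split data into fixed-size units and place them round-robin.

    Returns one list of units per device (hypothetical in-memory layout).
    """
    if unit_size <= 0:
        raise ValueError("unit_size must be positive")
    devices: list[list[bytes]] = [[] for _ in range(num_devices)]
    for offset in range(0, len(data), unit_size):
        unit_index = offset // unit_size
        target = device_for_unit(unit_index, num_devices)
        devices[target].append(data[offset:offset + unit_size])
    return devices


if __name__ == "__main__":
    # Four two-byte units over three devices: the fourth unit wraps to device 0.
    for number, units in enumerate(stripe(b"abcdefgh", 3, 2)):
        print(number, units)
```

With three devices, units 0, 1, and 2 land on separate devices and can be transferred in parallel, which is the source of the roughly threefold bandwidth mentioned above.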
In addition, in RAID arrays, there is hardware support for striping and for “mirroring”. “Mirroring” provides for replication of data on a number of separate devices for more rapid access and redundancy. The reason hardware support is required, for example for mirroring, is that if the data is read-only, it can be read from whichever data storage device is faster, but if writes are performed, the write must be propagated to all copies. Further, if two hosts are writing to the same data at the same time, it is necessary for the writes to be consistent. While the hardware support for these storage arrays is fairly well developed, the same is not true for networks of data storage devices.
For distributed data storage systems, problems occur when some data storage devices fail. The failed data storage devices stop responding to messages and send no further messages. This has the effect of logically separating the failed data storage devices from the rest. Portions of the data storage network can also fail, which can lead to a “partitioning”. In a partitioning, the hosts and the data storage devices in the data storage system are split into two or more “partitions”. Within a partition, all the hosts and data storage devices can communicate with each other, but no communications are possible between partitions. In many cases, the data storage system cannot distinguish between a partitioning and the failure of one or more data storage devices. Thus, it is not possible to determine resource availability.
In particular, data storage systems that provide “virtual stores” to users present a special problem. A “virtual store” is a logical structure that appears to the host application as if it were a data storage device of a given capacity, but in reality the data in the virtual store is spread over multiple real data storage devices. For example, data can be mirrored to improve its availability and can be striped to improve bandwidth. Both of these approaches result in multiple data storage devices being involved in storing the data for the virtual store. When the virtual store is updated, all the data storage devices holding part of the virtual data space being updated must be updated. If not, one data storage device will lose synchronization with the others, and a host that tries to read from that data storage device will see inconsistent data.
During partitioning, a host will be able to reach some data storage devices, but not necessarily all. Further, two hosts in two different partitions will only be able to reach devices in their own partitions. If left uncontrolled, the data storage devices in different partitions will lose synchronization if the two hosts write only to the devices within their own partitions. If data are supposed to be mirrored, or if there are consistency requirements between different data, this is a major problem.
The typical solution is to “lock out” access to data in all but at most one partition. That is, at most one partition is chosen as “active”, and only hosts in that partition can access data. In all other partitions, hosts will be locked out or denied access until the data storage devices or the network are repaired.
The most common way of ensuring that the data are accessible in at most one partition is to require that there be a “quorum” of data storage devices in the partition. Typically, a “quorum” is defined as a majority of the data storage devices that store copies of the data. Under this definition, it is entirely possible that no partition will contain a majority of the devices, and so the data will be totally inaccessible.
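A strict-majority quorum test of the kind described above can be sketched in a few lines of Python. The function name `has_quorum` and the device names are hypothetical; the sketch only illustrates the majority rule and is not a description of any specific system.

```python
def has_quorum(reachable: set[str], replicas: set[str]) -> bool:
    """Return True if the reachable devices form a strict majority of replicas.

    Devices in `reachable` that do not hold a copy of the data are ignored.
    """
    present = reachable & replicas
    return 2 * len(present) > len(replicas)


if __name__ == "__main__":
    replicas = {"d1", "d2", "d3", "d4"}
    # An even split leaves neither partition with a majority,
    # so the data is inaccessible in both.
    print(has_quorum({"d1", "d2"}, replicas))
    print(has_quorum({"d3", "d4"}, replicas))
    # Three of four devices form a majority.
    print(has_quorum({"d1", "d2", "d3"}, replicas))
```

Because two disjoint partitions cannot both hold more than half of the replicas, at most one partition can pass this test, but an even split shows that it is possible for none to pass.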
In a distributed data storage system, a quorum is not enough for correct operation. In addition, it is important that all of the data space in the virtual store be covered by data storage devices in the partition. For example, a virtual store can have its data space divided into three parts. Each part is mirrored so that six separate data storage devices each hold a portion of the data for the virtual store. A simple majority of the data storage devices can be formed by taking both of the mirrors of the first two-thirds of the data space. However, there may be no devices in the partition storing any of the last third of the data. This means that all the data would be unavailable despite having a quorum because of the lack of complete “coverage” of the data. Thus, a distributed data storage system requires both a quorum of devices and coverage of the data space.
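The six-device example above can be worked through in a short Python sketch that checks both conditions. The layout dictionary, the device names `d1` through `d6`, and the function names are assumptions introduced for illustration only.

```python
def has_quorum(reachable: set[str], layout: dict[str, int]) -> bool:
    """Strict majority of the devices that store data for the virtual store."""
    present = reachable & set(layout)
    return 2 * len(present) > len(layout)


def has_coverage(reachable: set[str], layout: dict[str, int]) -> bool:
    """Every part of the data space is held by at least one reachable device."""
    all_parts = set(layout.values())
    covered = {layout[device] for device in reachable if device in layout}
    return covered == all_parts


def can_access(reachable: set[str], layout: dict[str, int]) -> bool:
    """Access requires both a quorum of devices and coverage of the data space."""
    return has_quorum(reachable, layout) and has_coverage(reachable, layout)


# Hypothetical layout: three parts of the data space, each mirrored twice.
LAYOUT = {"d1": 0, "d2": 0, "d3": 1, "d4": 1, "d5": 2, "d6": 2}

if __name__ == "__main__":
    # Both mirrors of the first two parts: a majority, but part 2 is missing.
    partition = {"d1", "d2", "d3", "d4"}
    print(has_quorum(partition, LAYOUT), has_coverage(partition, LAYOUT))
    # Four devices that together hold all three parts.
    print(can_access({"d1", "d3", "d5", "d6"}, LAYOUT))
```

The first partition holds four of six devices yet fails the coverage test, which is the situation the paragraph above describes.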
In the past, mechanisms for establishing a quorum were only concerned with the replication of a single datum.
The data storage system was considered as moving through a sequence of “epochs” with a failure or repair defining the transition from one epoch to the next. At each epoch boundary, a protocol is run in each partition to determine what data storage devices are available in the partition and whether access will be allowed in the partition during that epoch.
At the end of the protocol, the data storage devices in at most one partition will have determined that they have a quorum so that access can continue in that partition. Those data storage devices may then elect to regenerate replicas into other data storage devices in that partition so that a proper degree of redundancy is available. This complicates the protocol for deciding when a partition has a quorum, because the population of data storage devices from which a quorum must be drawn changes over time. To handle the changing population of replicas, each replica maintains an epoch number and a list of the replicas active in that epoch.
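One way to picture the per-replica bookkeeping described above is the following Python sketch. It is a simplified illustration under stated assumptions, not the protocol of any particular prior system: the `ReplicaState` record, the `may_continue` function, and the rule of taking a majority of the member list from the most recent epoch seen in the partition are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaState:
    """State kept by one replica: its epoch number and that epoch's members."""
    name: str
    epoch: int
    members: frozenset[str]


def may_continue(reachable: list[ReplicaState]) -> bool:
    """Decide whether a partition may allow access in the next epoch.

    Assumption: the quorum is drawn from the member list recorded by the
    replica with the highest epoch number among those reachable.
    """
    if not reachable:
        return False
    latest = max(reachable, key=lambda state: state.epoch)
    present = {state.name for state in reachable} & latest.members
    return 2 * len(present) > len(latest.members)


if __name__ == "__main__":
    # Replica c failed earlier, so a and b advanced to epoch 2 without it.
    a = ReplicaState("a", 2, frozenset({"a", "b"}))
    b = ReplicaState("b", 2, frozenset({"a", "b"}))
    c = ReplicaState("c", 1, frozenset({"a", "b", "c"}))
    print(may_continue([a, b]))  # both members of epoch 2 are present
    print(may_continue([a]))     # one of two is not a majority
    print(may_continue([c]))     # stale replica alone cannot form a quorum
```

The second case in the usage example shows the two-replica weakness: once the population shrinks to two, both must be reachable for a majority to exist.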
Protocols of this type are known to provide good availability as long as there are three or more replicas. When there are only two replicas, both must be available in a partition to have more than half available, so that the failure of at least one renders the data unavailable. This results in lower availability than with a single replica. Thus, there is no truly effective way of determining data storage resource availability during data system failures for distributed data storage systems.
The present invention provides a data storage system including a virtual data store having a plurality of portions of data and a plurality of data storage devices connectable to said virtual store capable of storing portions of said data of said virtual store. A coordinator is connectable to at least one of said plurality of data storage devices and is responsive to information therein to allow recovery of said data storage system after a partitioning of said plurality of data storage devices when said at least one of said plurality of data storage devices contains all of said plurality of portions of said data to have complete coverage of said virtual store.
The above and additional advantages of the present invention will become apparent to those skilled in the art from a reading of the following detailed description when taken in conjunction with the accompanying drawings.