The present invention relates to data storage systems, and more particularly, to a method for reading and writing data across multiple storage devices.
To simplify the following discussion, the present invention will be discussed in terms of a data processing system in which multiple computers are connected to multiple storage devices via a network. There are a number of situations in which a computer on the network needs to write data to more than one disk at the same time, and this write operation must be completed before any other computer can access the data records written. Such write operations are often referred to as "atomic writes."
For example, consider a network in which a mirrored copy of a disk is maintained on a separate server to allow recovery from errors and server failures. Each time a computer writes a record to a file on this disk, the same record must also be written to the mirrored disk. If two computers on the network write data to the same record, network delays can cause inconsistencies in the data storage system. The first computer sends two messages, one to each disk, with its data. Denote the message sent to the first disk by A1 and the message sent to the second disk by A2. Similarly, the second computer sends two messages, B1 and B2, to the first and second disks, respectively. Because of network delays, the first disk could receive messages in the order A1 followed by B1 while the second disk receives messages in the order B2 followed by A2. After both disks have been updated, the first disk will have B1 for the record in question, and the second disk will have A2 for the record.
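By way of illustration only, the race described above can be reproduced with a short simulation. The following Python sketch is not part of the system described; the `Disk` class, the `deliver` helper, the record key, and the data values are assumptions introduced solely for the example.

```python
class Disk:
    """A toy disk that applies each write message as soon as it arrives."""

    def __init__(self):
        self.records = {}

    def receive(self, key, value):
        # Last writer wins: a later arrival overwrites an earlier one.
        self.records[key] = value


def deliver(disk, messages):
    """Deliver messages to a disk in the given arrival order."""
    for key, value in messages:
        disk.receive(key, value)


# Two computers, A and B, each write the same record to both mirrors.
A1 = A2 = ("record", "data from A")
B1 = B2 = ("record", "data from B")

first_disk, second_disk = Disk(), Disk()
deliver(first_disk, [A1, B1])   # first disk receives A1, then B1
deliver(second_disk, [B2, A2])  # second disk receives B2, then A2

# The mirrors now disagree on the contents of the same record.
mirrors_agree = first_disk.records == second_disk.records
```

In this sketch the first disk ends with the data from B and the second with the data from A, so `mirrors_agree` is false even though every message was delivered.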
The errors resulting from the scenario discussed above are at least detectable, since the record in question is supposed to be the same on each disk. However, this is not always the case. Consider the case in which a database is spread across multiple disks. An update to the database may require that records on two different disks be updated. Since these records are not mirrors of one another, an inconsistency resulting from network delays, or messages being lost, may not be detectable.
To prevent such errors, any system dealing with multiple disk storage must have two properties. First, either all of the disks must process a message, or none do. Second, if two multi-disk operations are issued concurrently, the resulting disk contents must be the same as would occur if each disk processed its part of the operation in the same order as the other disks.
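The second property can be made concrete with a small check. The following Python sketch, offered under stated assumptions rather than as any part of the described system, takes the order in which each disk processed its operations and looks for a single global order with which every disk agrees. It assumes that each operation has a unique identifier and appears at most once per disk; the function name `global_order` is hypothetical. Agreement on a common order is treated here as a sufficient condition for the property.

```python
from graphlib import CycleError, TopologicalSorter


def global_order(per_disk_orders):
    """Return one order consistent with every disk's processing order,
    or None if no such order exists."""
    sorter = TopologicalSorter()
    for order in per_disk_orders:
        for op in order:
            sorter.add(op)
        # Each operation must follow the one processed just before it
        # on the same disk.
        for earlier, later in zip(order, order[1:]):
            sorter.add(later, earlier)
    try:
        return list(sorter.static_order())
    except CycleError:
        # The disks disagree about the relative order of some operations.
        return None


consistent = global_order([["A", "B"], ["A", "B"]])
inconsistent = global_order([["A", "B"], ["B", "A"]])
```

Here `consistent` is a valid common order, whereas `inconsistent` is `None` because the two disks processed operations A and B in opposite orders.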
Three prior art methods have been utilized. The first involves locking the data records to be read or written. Any processor that wants to perform an atomic read or write first locks the data by sending messages to all of the disks involved. The processor then performs the read or write operation and then unlocks the data. The lock assures that operations occur in the same order on all disks by forcing processors other than the one holding the lock to wait until the lock is released before issuing messages involving the affected disk records. This method has a number of problems. First, reading or writing data requires at least three message exchanges on the network between the processor wishing to operate on the data and the disks. Second, each disk must keep track of the locks affecting it and deal with processor failures in which a lock is not released because a processor goes down or has some other error. Third, such systems are subject to "deadlocks" in which a transaction for the lock holder cannot be completed until data is received from a second processor, which is locked out. To prevent such deadlocks, complex systems must be implemented which further reduce the performance of the storage system.
The second method for dealing with disk inconsistencies is to allow writes and reads to occur with a single message and then to check for consistency at the end of the processes. In the case of database systems, the consistency check is made during the transaction commitment. If the transaction violates the properties discussed above, it must be aborted and re-executed. Such systems perform poorly if conflicts are frequent because the aborting and re-execution is wasteful. In addition, two messages are required per processor in checking for consistency at each operation.
The third method for dealing with disk inconsistencies requires that all participants to a conversation, i.e., all possible processors and disks whose communications overlap, exchange messages which include timing and coordination information. The overhead in this solution, as measured in time, number of messages, and amount of data that must be transmitted, can be considerable.
Broadly, it is the object of the present invention to provide an improved method for operating a data storage system in which data is updated on multiple storage devices.
It is a further object of the present invention to provide a method that requires fewer messages to be sent than prior art methods for assuring consistency.
It is a still further object of the present invention to provide a method that can correct inconsistent copies of data when messages are lost.
These and other objects of the present invention will become apparent to those skilled in the art from the following detailed description of the invention and the accompanying drawings.
The present invention is a storage system for storing and retrieving data records. The system includes a storage medium, a controller, and a message log. The storage medium stores data records, the data records being indexed by addresses which specify the location of the data records in the storage medium. The controller receives write messages from processors coupled to the controller. Each write message includes a data segment to be written to the storage medium at a specified address, and coordination information specifying a timestamp and the addresses of other data records on other storage systems that were written in the same write operation. The log stores the write messages prior to the data contained therein being written to the storage medium. The log may be stored in a separate memory or be part of the storage medium. The controller includes a clock. Periodically, the controller reads the timestamps of the messages in the log and compares the timestamps to the clock to determine the message having the oldest timestamp. If the oldest message has a timestamp that is less than the controller's clock value by more than a predetermined amount, the controller writes the data segment contained in the message to the storage medium at the specified address in the message. The controller also receives read messages. Each read message includes information specifying a range of addresses in the storage medium to be read. The controller generates one or more response messages to a read message. Each response message includes an address range in the storage medium that was written in response to a single one of the write messages, the data records stored in the storage medium corresponding to the address range, and the coordination information received in that write message.
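The behavior summarized above may be pictured with the following minimal Python sketch. It is a simplified illustration under stated assumptions, not the claimed implementation: the names `WriteMessage`, `Controller`, `peers`, and `delay` are hypothetical, each write is assumed to cover a single address rather than an address range, the storage medium is modeled as a dictionary, and the controller's clock value is passed in as `now` so that the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class WriteMessage:
    address: int       # where the data segment is to be written
    data: bytes        # the data segment
    timestamp: float   # coordination information: timestamp
    peers: tuple = ()  # coordination information: addresses on other systems


@dataclass
class Controller:
    delay: float                                # the predetermined amount
    medium: dict = field(default_factory=dict)  # address -> applied message
    log: list = field(default_factory=list)     # writes not yet applied

    def receive_write(self, msg):
        # Writes are logged first, not applied on arrival.
        self.log.append(msg)

    def flush(self, now):
        # Apply logged writes, oldest timestamp first, once each is older
        # than the controller's clock value `now` by more than `delay`.
        self.log.sort(key=lambda m: m.timestamp)
        while self.log and now - self.log[0].timestamp > self.delay:
            msg = self.log.pop(0)
            self.medium[msg.address] = msg

    def read(self, start, end):
        # One response per applied write in [start, end], carrying the data
        # together with the coordination information of that write.
        return [
            (addr, m.data, m.timestamp, m.peers)
            for addr, m in sorted(self.medium.items())
            if start <= addr <= end
        ]


ctrl = Controller(delay=5.0)
ctrl.receive_write(WriteMessage(7, b"from B", timestamp=2.0))  # arrives first
ctrl.receive_write(WriteMessage(7, b"from A", timestamp=1.0))  # arrives second
ctrl.receive_write(WriteMessage(9, b"recent", timestamp=8.0))
ctrl.flush(now=10.0)
responses = ctrl.read(0, 10)
```

In this sketch the two writes to address 7 are applied in timestamp order rather than arrival order, so the write stamped 2.0 prevails, while the write stamped 8.0 remains in the log because it is not yet older than the clock value by more than `delay`.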
In one embodiment of the present invention, the read messages also include a time parameter, and the controller only generates response messages corresponding to write messages having timestamps less than the time parameter.
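The time-limited read of this embodiment may be sketched as a simple filter. The Python fragment below is illustrative only; the function name `responses_before`, the tuple layout `(address, data, timestamp)`, and the sample values are assumptions made for the example.

```python
def responses_before(stored_writes, start, end, time_limit):
    """Return (address, data, timestamp) for each stored write whose address
    lies in [start, end] and whose timestamp is less than time_limit."""
    return [
        (address, data, timestamp)
        for address, data, timestamp in stored_writes
        if start <= address <= end and timestamp < time_limit
    ]


stored = [
    (3, b"old", 1.0),
    (4, b"newer", 6.0),
    (20, b"out of range", 0.5),
]
visible = responses_before(stored, start=0, end=10, time_limit=5.0)
```

Only the write at address 3 is returned here: the write at address 4 has a timestamp that is not less than the time parameter, and the write at address 20 lies outside the requested address range.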