The present invention relates generally to data transactions, and specifically to distributed transaction support of data written to a non-volatile memory.
A transaction is a sequence of information exchange and related work (such as database updating) that is treated as a unit of atomicity for ensuring data integrity. In a transaction, data is transformed from one consistent state to another. For a transaction to be completed and data changes to be made permanent (or “committed”), a transaction has to be made atomic with respect to failure, i.e., it has to be completed in its entirety. If something happens before the transaction is successfully completed, the transaction is aborted, and any changes to the data must be undone, so that the effect is as if the transaction never existed. Hereinbelow, the term “complete” when applied to a transaction means that the transaction is either committed or aborted.
The inherent difficulty of transaction support is exacerbated when participants in the transaction are part of a distributed system. It is then necessary to ensure that the transaction is committed or that it is aborted atomically and consistently by all of the participants. For example, some of the participants in a transaction may fail, and it is possible in a distributed system that some of the other participants may not know of the failure. Also, participants who have recovered after a failure must determine the fate of the transaction.
In Concurrency Control and Recovery in Database Systems, by Bernstein et al. (Addison-Wesley, 1987), which is incorporated herein by reference, a description is given in chapter 7 of atomic commitment protocols (ACPs) which ensure transaction consistency over multiple sites of a distributed system. The authors describe a two-phase-commit (2PC) protocol as an example of an ACP. The 2PC protocol comprises a first phase wherein all participants of a transaction are polled as to whether the transaction should be committed or aborted. In a second phase of the 2PC protocol a coordinator of the transaction decides, on the basis of the poll, if the transaction is to be committed or aborted, and transmits that decision to the participants.
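By way of illustration only, the two phases of a generic 2PC protocol may be sketched as follows. The function name two_phase_commit and the representation of each participant as a callable returning its vote are assumptions made for this sketch and do not appear in the cited reference; a participant that fails to answer is treated here as voting to abort.

```python
from typing import Callable, Dict


def two_phase_commit(participants: Dict[str, Callable[[], bool]]) -> str:
    """Return "commit" only if every participant votes yes, else "abort"."""
    # Phase 1: poll every participant for its vote.
    votes = {}
    for name, vote in participants.items():
        try:
            votes[name] = bool(vote())
        except Exception:
            # Assumption: a participant that cannot answer counts as "no".
            votes[name] = False
    # Phase 2: the coordinator decides on the basis of the poll. A real
    # coordinator would then transmit this decision to every participant.
    return "commit" if all(votes.values()) else "abort"
```

A single negative or missing vote is sufficient to abort the transaction, which is what makes the decision atomic across the participants.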
Methods for efficiently storing data, and recovering the stored data in the event of a computer system failure, are known in the art. The methods rely on storing information additional to the data to a non-volatile memory, typically a disk, and using the additional information to recover the stored data when the failure occurs.
U.S. Pat. No. 5,345,575 to English et al., whose disclosure is incorporated herein by reference, describes a disk controller comprising a memory. The memory contains a table mapping logical addresses of data blocks stored on a disk to labels identifying physical storage locations. In addition to writing the data to a storage location, the disk controller writes the associated logical address of each storage location, a time stamp, and data indicating where in a sequence of data blocks a specific data block occurs. The additional information is used to recover from system failures by reading from substantially the whole disk.
U.S. Pat. No. 5,481,694 to Chao et al., whose disclosure is incorporated herein by reference, describes an electronic data storage system comprising a memory, a plurality of magnetic disk units, and a controller. The memory comprises a table cross-referencing logical addresses with physical addresses on the disk units, a list of physical addresses containing obsolete data, and a list of physical addresses for segments on the disk units which are able to receive data. When data are written to the disk units, a tag comprising the logical address and a sequence number for multiblock writes is written with the data. To recover from a system failure, a checkpoint log and checkpoint segments stored on the disk units recover the table and lists.
In an article by de Jonge et al., “The Logical Disk: A New Approach to Improving File Systems,” in Proceedings of the 14th Symposium on Operating Systems Principles, pp. 15-28 (December 1993), which is incorporated herein by reference, the authors describe a logical disk wherein an interface is defined for disk storage which separates file management and disk management. The interface uses logical block numbers and block lists, and supports multiple file systems. The authors claim to support an Atomic Recovery Unit (ARU). During recovery all logical disk commands belonging to the same ARU are treated as a single invisible operation. Thus, the logical disk will always recover to either a state that existed before, or to a state that existed after performing all operations of an ARU. However, concurrent ARUs are not supported.
In an article by English et al., “Loge: a self-organizing disk controller,” in Proceedings of the USENIX Winter 1992 Technical Conference, pp. 237-251 (January 1992), which is incorporated herein by reference, the authors describe a system for storing data to a disk using a translation table and an allocation map. A trailer tag comprising a block address and a time stamp is written to the disk together with the stored data. The information in the trailer tag enables the system to recover from a failure.
In an article by Chao et al., “Mime: a high performance parallel storage device with strong recovery guarantees,” HPL-CSP-92-9 (published by the Hewlett-Packard Company, November 1992), which is incorporated herein by reference, the authors describe a disk storage architecture similar to that of Loge, as described above. In Mime, the trailer tag comprises a block address, a sequence number for multiblock writes, and a last-packet-in-multiblock-write flag. As in Loge, the trailer tag information enables the system to recover from a failure.
Mime supports atomic multi-block writes with a limited form of transaction support in the form of a visibility group. Mime guarantees that in the case of a failure all block writes within an active visibility group are aborted.
It is an object of some aspects of the present invention to provide an improved system for supporting data transactions.
It is a further object of some aspects of the present invention to provide an improved system for performing concurrent data transactions when participants in the transaction are distributed over a network.
In preferred embodiments of the present invention, one or more storage devices, preferably non-volatile disks, are used for storing data contents of transactions initiated by one or more clients of the storage devices. Each of the storage devices is managed by control circuitry, preferably a storage server, which writes the data contents of the transactions to selected block-frames of the storage device. Such storage devices are herein termed transaction supporting logical disks (TSLDs). The storage servers have volatile memory in which they hold data structures whose values are used, inter alia, to track transaction data written to the TSLDs and to link dynamically the physical and logical addresses of the block-frames to which the data are written.
In order to safeguard a TSLD against failure of a specific storage server, values in the data structures of the server are stored to that TSLD at periodic intervals, using checkpoint operations. Between checkpoint operations, values in the data structures are also stored together with the data contents of the transactions in the block-frames of each TSLD. Preferably, values in the data structures, and the checkpoint data, enable block-frames to be conveniently found in the event of a storage server failure, so that the data contents within the block-frames can be recovered. In the event of a failure, the storage server reads the stored checkpoint data and “replays” the process of TSLD operations, including committing and aborting transactions, since the last checkpoint was performed. The replaying process enables the storage server to recover its state and the state of any ongoing transactions at the time of failure. Alternative methods of storage and of recovery are described in the above-mentioned provisional patent application and in another patent application entitled “Enhanced Stable Disk Storage,” filed Jul. 6, 2000, which is assigned to the assignee of the present application, and whose disclosure is incorporated herein by reference.
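The interplay of checkpointing and replay may be sketched, under simplifying assumptions, as follows. The class StorageServer and its attributes disk_checkpoint and disk_journal are hypothetical names standing in for the checkpoint data and for the values stored together with each block-frame; only a logical-to-physical table is modelled, and transactions are omitted.

```python
import copy


class StorageServer:
    """Hypothetical server whose volatile table can be rebuilt from its TSLD."""

    def __init__(self):
        self.table = {}            # volatile: logical address -> physical address
        self.disk_checkpoint = {}  # on the TSLD: table values at the last checkpoint
        self.disk_journal = []     # on the TSLD: values stored with each block-frame

    def write(self, logical, physical):
        # The table update is stored along with the data, not as a separate write.
        self.table[logical] = physical
        self.disk_journal.append((logical, physical))

    def checkpoint(self):
        # Store the whole table; earlier journal entries are no longer needed.
        self.disk_checkpoint = copy.deepcopy(self.table)
        self.disk_journal = []

    def recover(self):
        # Start from the checkpoint and replay every operation made since.
        self.table = copy.deepcopy(self.disk_checkpoint)
        for logical, physical in self.disk_journal:
            self.table[logical] = physical
```

In this sketch a failure is simulated by discarding the volatile table; calling recover() then rebuilds it by reading only the checkpoint and the entries written after it.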
Preferred embodiments of the present invention preferably support a two-phase-commit (2PC) protocol, as described generally in the Background of the Invention. In a first phase of a transaction, a specific TSLD client, called the transaction coordinator, initiates the transaction by designating one of the TSLDs participating in the transaction to be a primary TSLD. The primary TSLD acts as a recorder by tracking the state of the transaction, and the client instructs the primary TSLD to begin the transaction. The client instructs the other TSLDs participating in the transaction, via their managing servers, to “soft-write” appropriate data contents of the transaction to the respective TSLDs, thus generating non-committed block values, and to inform the client in each case that this has been done. In a soft-write operation the logical and physical addresses of block-frames written to are provisionally linked. Once the client knows that the first phase has been successfully accomplished, in a second phase it instructs the primary TSLD to commit the transaction and then the client issues commit instructions to each of the participating TSLDs to commit the transaction, thus generating committed block values. Commit operations are performed by permanently linking the provisionally linked addresses. If the first phase is not successfully completed within a predetermined time period, the client instructs the primary TSLD to abort the transaction, and then the client issues abort instructions to each of the participating TSLDs to abort the transaction, whereupon the address linkage existing before the transaction is reinstated.
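The client-side flow described above may be sketched as follows. The class FakeTsld and the function run_transaction are hypothetical names introduced for illustration; the predetermined time period is not modelled, and a soft-write that returns False is assumed to stand for a first phase that did not complete in time.

```python
class FakeTsld:
    """Hypothetical in-memory stand-in for a TSLD and its managing server."""

    def __init__(self, name, fail_soft_write=False):
        self.name = name
        self.fail_soft_write = fail_soft_write
        self.log = []  # operations received, in order

    def record_state(self, txn, state):
        # Used only on the primary TSLD, which records the transaction state.
        self.log.append(("state", txn, state))

    def soft_write(self, txn, primary_name, data):
        if self.fail_soft_write:
            return False
        self.log.append(("soft-write", txn))
        return True

    def commit(self, txn):
        self.log.append(("commit", txn))

    def abort(self, txn):
        self.log.append(("abort", txn))


def run_transaction(txn, primary, writes):
    """writes is a list of (tsld, data) pairs; returns "commit" or "abort"."""
    # First phase: begin on the primary, then soft-write to the participants.
    primary.record_state(txn, "begin")
    phase_one_ok = all(t.soft_write(txn, primary.name, d) for t, d in writes)
    # Second phase: the primary records the decision before the participants act.
    decision = "commit" if phase_one_ok else "abort"
    primary.record_state(txn, decision)
    for tsld in dict.fromkeys(t for t, _ in writes):  # each participant once
        if decision == "commit":
            tsld.commit(txn)
        else:
            tsld.abort(txn)
    return decision
```

Recording the decision on the primary TSLD first means that a participant recovering from a failure can, in principle, learn the fate of the transaction from the primary rather than from the client.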
The system described above has a number of significant advantages in transaction operations over systems for transactions known in the art. Specifically:
The same information in the data structures is used to enable a TSLD to recover substantially completely from server failure, and to allow an efficient implementation of a two-phase-commit protocol.
The system supports a distributed transaction over multiple TSLDs, and also supports concurrent data transactions on one or more TSLDs.
There is no central log of transaction states. Each transaction may select a different primary TSLD, and thus transaction state logging is distributed.
A TSLD supports concurrent active transactions and can recover from multiple transaction failures.
Because all the information in the data structures is stored to a TSLD, TSLDs are movable from a failed server to a working server, and are then able to regenerate address linkages and information about open transactions.
The incremental storage of data structure information to a TSLD is performed with no extra input/output operations to the TSLD.
Soft-write operations, together with committing and aborting of transactions, are supported even when the system comprises a plurality of disks and/or a plurality of servers distributed over a network.
Checkpoint operations may be implemented in a flexible manner, for example as a background operation or by identifying components in the data structures which have changed since the last checkpoint.
Recovery of a TSLD after a failure is fast, since only block-frames stored since the last checkpoint operation of this TSLD need to be read.
Recovery of a TSLD after a failure generates sufficient information about open transactions to enable the participants of the transaction to query the transaction state and achieve consensus whether to commit or abort the transaction. In the case of failure of a client, there is sufficient information on the TSLDs to commit or abort the transaction, which information may be acted on by other clients.
There is therefore provided, according to a preferred embodiment of the present invention, apparatus for supporting a data transaction, including:
at least one TSLD which is adapted to receive data as a succession of data blocks for storage at respective physical locations therein; and
control circuitry, which is configured to convey the succession of data blocks to the at least one TSLD while writing to the at least one TSLD, together with at least some of the data blocks, at least one parameter indicative of a physical location of a subsequent data block in the succession, and to transform a state of the data so as to complete the data transaction using the at least one parameter.
Preferably, the control circuitry includes at least one communication link which couples the control circuitry to the at least one TSLD, and the control circuitry manages the at least one TSLD via the at least one link.
Preferably, the apparatus includes a data transmission network which conveys the data to the control circuitry and which is coupled thereto.
Further preferably, the data transmission network is coupled to and conveys the data to the at least one TSLD, and the control circuitry manages the at least one TSLD via the data transmission network.
Preferably, the apparatus includes a client which conveys the data to the control circuitry.
Preferably, the control circuitry includes at least one volatile memory which is adapted to store one or more data structures containing data indicative of one or more properties of at least some of the data blocks, at least some of which data are written by the control circuitry to the at least one TSLD, so that the contents of the at least one volatile memory can be regenerated from the at least some of the data, in the one or more data structures that are stored in the at least one TSLD.
Further preferably, one of the data structures includes a translation table which maps logical block addresses of the succession of data blocks to respective physical addresses and which tags which of the data blocks is currently engaged in an open data transaction.
Preferably, one of the data structures includes an auxiliary translation table which for each respective open data transaction maps logical block addresses associated with an open transaction to respective physical addresses of the succession of data blocks existing prior to initiation of the data transaction and to an identity of the open data transaction.
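One possible arrangement of the translation table and the auxiliary translation table is sketched below. The class name Tables, the tuple layout of the entries, and the rule that a block engaged in one open transaction cannot be soft-written by another are assumptions made for this illustration.

```python
class Tables:
    """Hypothetical translation table plus auxiliary translation table."""

    def __init__(self):
        # logical address -> (physical address, open transaction id or None)
        self.translation = {}
        # transaction id -> {logical address: physical address before the txn}
        self.auxiliary = {}

    def soft_write(self, txn, logical, new_physical):
        old_physical, owner = self.translation.get(logical, (None, None))
        if owner is not None and owner != txn:
            raise RuntimeError("block is engaged in another open transaction")
        saved = self.auxiliary.setdefault(txn, {})
        if logical not in saved:
            saved[logical] = old_physical  # keep the pre-transaction address once
        self.translation[logical] = (new_physical, txn)  # provisional link

    def commit(self, txn):
        # Make the provisional links permanent by removing the open tag.
        for logical in self.auxiliary.pop(txn, {}):
            physical, _ = self.translation[logical]
            self.translation[logical] = (physical, None)

    def abort(self, txn):
        # Reinstate the linkage that existed before the transaction.
        for logical, old in self.auxiliary.pop(txn, {}).items():
            if old is None:
                del self.translation[logical]
            else:
                self.translation[logical] = (old, None)
```

Because the auxiliary table is keyed by transaction, several open transactions can coexist in this sketch provided they touch different logical blocks.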
Preferably, the control circuitry writes the respective logical block addresses to the succession of data blocks.
Preferably, the control circuitry writes an identity of a data transaction to the succession of data blocks.
Preferably, the control circuitry writes an identity of a recorder of a state of a data transaction to the succession of data blocks.
Preferably, one of the data structures includes an allocation bitmap which maps an availability of each of the succession of data blocks.
Further preferably, one of the data structures includes a pointer value to the location of a next available data block in the succession.
Preferably, one of the data structures includes a pointer value to the location of a first available block in the succession.
Preferably, the at least one TSLD includes a disk having a disk head, and the control circuitry writes the data blocks to the disk in a series of passes of the disk head over a surface of the disk in a single direction.
Further preferably, each of the series of passes has a checkpoint-number, and one of the data structures includes a value indicative of the current checkpoint-number.
Preferably, one of the data structures includes a working allocation bitmap copy which stores an availability of each of the successive data blocks at the conclusion of one or more passes of the disk head.
Preferably, the control circuitry writes the at least some of the data in the one or more data structures to the at least one TSLD at the conclusion of one or more of the passes of the disk head.
Preferably, one of the data structures includes a transaction state log which is adapted to store at least one record of the data transaction, wherein the at least one record includes an identity of the data transaction and an identity of the at least one TSLD participating in the data transaction, and wherein the at least one record includes at least one of a group of records consisting of: begin-transaction, commit-transaction, and abort-transaction, which respectively record that the transaction has begun, is committed, and is aborted.
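A transaction state log of this kind might, for example, be represented as an append-only list of records. The names LogRecord and TransactionStateLog, and the convention that the most recent record for a transaction gives its state, are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class LogRecord:
    kind: str                      # "begin", "commit", or "abort"
    txn_id: str                    # identity of the data transaction
    participants: Tuple[str, ...]  # identities of the participating TSLDs


class TransactionStateLog:
    def __init__(self):
        self.records: List[LogRecord] = []

    def append(self, kind, txn_id, participants):
        if kind not in ("begin", "commit", "abort"):
            raise ValueError(f"unknown record kind: {kind}")
        self.records.append(LogRecord(kind, txn_id, tuple(participants)))

    def state_of(self, txn_id) -> Optional[str]:
        # The latest record for the transaction determines its state.
        for rec in reversed(self.records):
            if rec.txn_id == txn_id:
                return rec.kind
        return None  # the log holds no record of this transaction
```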
Preferably, the control circuitry writes a trailer to the succession of data blocks, the trailer including one or more data fields indicative of a use of each of the data blocks.
Preferably, the one or more data fields include an identity of the data transaction and an identity of a primary TSLD which acts as a recorder of the data transaction state.
Preferably, the one or more data fields include a logical address of the data block.
Preferably, the one or more data fields include a tag indicative of a state of the data transaction.
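As a sketch only, a trailer carrying the fields named above could be serialized into a fixed-size record. The field widths, the byte order, the numeric tag values, and the inclusion of the physical location of the subsequent data block in the same record are all assumptions of this example and are not taken from the description.

```python
import struct
from typing import NamedTuple

# Assumed layout: little-endian, three 64-bit fields, one 32-bit, one 8-bit.
TRAILER_FORMAT = "<QQQIB"
TAGS = {"soft-write": 1, "begin": 2, "commit": 3, "abort": 4}  # assumed values


class Trailer(NamedTuple):
    logical_address: int  # logical address of the data block
    next_physical: int    # physical location of the subsequent data block
    txn_id: int           # identity of the data transaction
    primary_tsld: int     # identity of the primary TSLD (the recorder)
    tag: int              # state of the data transaction


def pack_trailer(trailer: Trailer) -> bytes:
    return struct.pack(TRAILER_FORMAT, *trailer)


def unpack_trailer(raw: bytes) -> Trailer:
    return Trailer(*struct.unpack(TRAILER_FORMAT, raw))
```

Writing such a record in the same block-frame as the data is one way the information could reach the TSLD without a separate input/output operation.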
Preferably, the apparatus includes a disk controller which controls the at least one TSLD, wherein the disk controller is able to recover contents of a volatile memory from data written by the control circuitry to the at least one TSLD.
Preferably, the at least one parameter includes a plurality of parameters.
Preferably, the data transaction comprises a plurality of concurrent data transactions.
Preferably, the at least one TSLD comprises a disk managed by a server.
Preferably, the at least one TSLD comprises a plurality of TSLDs.
Preferably, the at least one parameter includes a transaction state log which records one or more states of the data transaction.
Preferably, the at least one parameter includes at least a part of a distributed transaction state log which records one or more states of the data transaction.
Preferably, the at least one storage device is adapted to be movable so as to be operative by other control circuitry, in order to recover information stored on the at least one storage device.
There is further provided, according to a preferred embodiment of the present invention, a method for performing a data transaction, including:
receiving data in at least one TSLD as a succession of data blocks, and, together with each of at least some of the data blocks, at least one parameter indicative of a physical location of a subsequent data block in the succession; and
transforming a state of the data in the at least one TSLD, responsive to the at least one parameter, so as to complete the data transaction.
Preferably, the at least one parameter includes a plurality of parameters.
Preferably, receiving data in the at least one TSLD includes conveying the data to the at least one TSLD via a data transmission network.
Preferably, conveying the data includes coupling the at least one TSLD to the data transmission network.
Preferably, conveying the data includes:
coupling the at least one TSLD to the data transmission network via control circuitry; and
coupling the control circuitry to the data transmission network.
Further preferably, receiving the data includes receiving data from a client.
Preferably, the method includes storing in at least one volatile memory one or more data structures containing data indicative of one or more properties of at least some of the data blocks, and writing at least some of the data that are in the data structures to the at least one TSLD, so that the contents of the at least one volatile memory can be regenerated from the at least some of the data in the one or more data structures that are stored in the at least one TSLD.
Preferably, storing the one or more data structures includes storing a translation table which maps logical block addresses of the succession of data blocks to respective physical addresses and which tags which of the data blocks is currently engaged in an open data transaction.
Preferably, storing the one or more data structures includes storing an auxiliary translation table which for each respective open data transaction maps logical block addresses associated with an open transaction to respective physical addresses of the succession of data blocks existing prior to initiation of the data transaction and to an identity of the open data transaction.
Preferably, receiving the data includes writing the respective logical addresses to the succession of data blocks.
Preferably, the method includes using the translation table to locate a specific data block, so as to read data from the specific data block.
Preferably, storing the one or more data structures includes storing an allocation bitmap which indicates an availability of each of the successive locations.
Further preferably, writing the at least some of the data to the at least one TSLD includes writing data to one of the succession of data blocks using the steps of:
scanning the one or more data structures to determine a physical location of an available data block in the at least one TSLD;
writing the data and at least some contents of the one or more data structures into the physical location; and
updating the one or more data structures responsive to the determined physical location.
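The three steps above may be sketched as follows, assuming an in-memory list in place of the block-frames of the TSLD. The class BlockStore, the wrap-around search from a next-free pointer, and the release of the block previously linked to the logical address are assumptions of this sketch; transactions are not modelled.

```python
class BlockStore:
    """Hypothetical single-TSLD store without transaction support."""

    def __init__(self, n_blocks):
        self.frames = [None] * n_blocks      # stand-in for block-frames on disk
        self.allocated = [False] * n_blocks  # allocation bitmap
        self.translation = {}                # logical address -> physical address
        self.next_free = 0                   # where the next search begins

    def _find_free(self):
        n = len(self.frames)
        for offset in range(n):
            candidate = (self.next_free + offset) % n
            if not self.allocated[candidate]:
                return candidate
        raise RuntimeError("no available block-frame")

    def write(self, logical, data):
        # Step 1: scan the data structures for an available physical location.
        physical = self._find_free()
        # Step 2: write the data together with some data-structure contents.
        self.frames[physical] = (data, {"logical": logical})
        # Step 3: update the data structures for the chosen location.
        old = self.translation.get(logical)
        self.allocated[physical] = True
        if old is not None:
            self.allocated[old] = False  # the superseded block becomes available
        self.translation[logical] = physical
        self.next_free = (physical + 1) % len(self.frames)
        return physical
```

Note that an overwrite of a logical address lands in a new physical location rather than in place, so the previous contents remain intact until the data structures have been updated.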
Preferably, scanning the one or more data structures includes checking if a logical block address is involved in an open transaction.
Preferably, writing the data and at least some contents of the one or more data structures includes the steps of:
writing a value corresponding to a soft-write operation; and
writing an identity of one of the at least one TSLD as a primary TSLD, which acts as a recorder of the data transaction, so as to perform a temporary write operation.
Preferably, writing the data and at least some contents of the one or more data structures includes writing a value corresponding to a begin operation, and updating the one or more data structures includes writing a begin record including an identity of the transaction and an identity of the primary TSLD, so as to perform the begin operation.
Preferably, writing the data and at least some contents of the one or more data structures includes writing a value corresponding to a commit operation, and updating the one or more data structures includes:
removing a tag indicative of an open state of the transaction;
writing a commit record including an identity of the transaction and an identity of the primary TSLD, so as to perform the commit operation; and
updating a transaction state log.
Preferably, writing the data and at least some contents of the one or more data structures includes writing a value corresponding to an abort operation, and updating the one or more data structures includes:
removing a tag indicative of an open state of the transaction;
writing an abort record including an identity of the transaction and an identity of the primary TSLD, so as to perform the abort operation; and
updating a transaction state log.
Preferably, the method includes scanning the one or more data structures to determine a state of the data transaction.
Preferably, the method includes performing a checkpoint operation including the steps of:
locking the one or more data structures;
writing the contents of the one or more data structures to a checkpoint location in the at least one TSLD; and
altering at least some of the contents of the one or more data structures responsive to writing the contents to the at least one TSLD.
Preferably, the method includes performing a recovery operation including the steps of:
reading the contents of the one or more data structures from the at least one TSLD; and
updating the one or more data structures in the at least one volatile memory responsive to the contents.
Preferably, performing the recovery operation includes reading the contents of all of the one or more data structures written to since performing the checkpoint operation.
Further preferably, performing the recovery operation includes reading the contents of all of the one or more data structures written to since performing the checkpoint operation in a time not substantially greater than the time taken to write all of the one or more data structures written to since performing the checkpoint operation.
Preferably, performing the recovery operation includes determining a state of an open transaction and enabling one or more clients of the open transaction to query the state and decide whether to commit or abort the open transaction.
Preferably, the method includes performing a transaction coordination operation including the steps of:
choosing one of the at least one TSLDs to be a recorder of a state of the data transaction;
submitting a begin transaction operation to the recorder;
waiting for a predetermined time for a response, indicative of the state of the data transaction, from the recorder;
conditional on the response, submitting one or more soft-write operations to the at least one TSLD; and
conditional on the response, deciding whether to commit or abort the transaction, by issuing a commit or abort transaction operation.
Further preferably, the method includes performing a transaction completion operation including the steps of:
reading a first set of the succession of data blocks from the at least one TSLD containing contents indicative of a state of the data transaction; and
responsive to the contents of the first set of the succession of data blocks, writing a second set of the succession of data blocks to the at least one TSLD and storing a set of the one or more data structures to the at least one volatile memory.
Preferably, the data transaction comprises a plurality of concurrent data transactions.
Preferably, the at least one TSLD comprises a disk managed by a server.
Preferably, the at least one TSLD comprises a plurality of TSLDs.
There is further provided, according to a preferred embodiment of the present invention, apparatus for electronic data storage, including:
a TSLD, adapted to receive a succession of data blocks for storage at respective physical locations therein; and
control circuitry, which is configured to convey the succession of data blocks to the TSLD while writing to the TSLD, together with at least some of the data blocks, one or more parameters indicative of the physical location of a subsequent data block in the succession.
Preferably, the one or more parameters include a plurality of parameters.
There is further provided, according to a preferred embodiment of the present invention, a method for electronic data storage, including:
providing a succession of data blocks for storage at respective physical locations in a TSLD;
determining for each of at least some of the data blocks in the succession one or more parameters indicative of the physical location of a subsequent data block in the succession; and
storing the succession of the data blocks and the one or more parameters in the TSLD.
Preferably, the one or more parameters include a plurality of parameters.
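The storage method above may be illustrated by a sketch in which each stored block carries the physical location of the block that follows it. The function names, the use of a dictionary as a stand-in for the TSLD, and the rule that the chain ends at the first location holding no block are assumptions; a real disk would need a further means, such as a sequence or checkpoint number, to tell fresh blocks from stale ones.

```python
def store_succession(disk, free_locations, blocks):
    """Store each block with the physical location of its successor.

    disk is a dict mapping physical location -> (data, next location).
    Returns the location of the first block in the succession.
    """
    if len(free_locations) < len(blocks) + 1:
        raise ValueError("need one more free location than blocks")
    for i, data in enumerate(blocks):
        # The parameter stored with block i names where block i + 1 will go.
        disk[free_locations[i]] = (data, free_locations[i + 1])
    return free_locations[0]


def read_succession(disk, start):
    """Follow the stored next-location parameters from a known start."""
    out = []
    location = start
    while location in disk:  # assumed end marker: nothing stored there yet
        data, location = disk[location]
        out.append(data)
    return out
```

With such a chain, only the blocks written after a known starting point need to be visited, rather than substantially the whole disk.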
There is further provided, according to a preferred embodiment of the present invention, a computer software product for performing a data transaction, including a computer-readable medium having program instructions recorded therein, which instructions, when read by a computer, cause the computer to receive data in at least one TSLD as a succession of data blocks, and, together with each of at least some of the data blocks, at least one parameter indicative of a physical location of a subsequent data block in the succession, and to transform a state of the data, in the at least one TSLD, responsive to the at least one parameter, so as to complete the data transaction.
There is further provided, according to a preferred embodiment of the present invention, a computer software product for electronic data storage, including a computer-readable medium having program instructions recorded therein, which instructions, when read by a computer, cause the computer to provide a succession of data blocks for storage at respective physical locations in a TSLD, to determine for each of at least some of the data blocks in the succession one or more parameters indicative of the physical location of a subsequent data block in the succession, and to store the succession of the data blocks and the one or more parameters in the TSLD.