The present invention relates generally to electronic data storage systems, and specifically to nonvolatile storage systems which are able to recover from system failure.
Methods for efficiently storing data, and recovering the stored data in the event of a computer system failure, are known in the art. The methods rely on storing information additional to the data to a non-volatile memory, typically a disk, and using the additional information to recover the stored data when the failure occurs.
U.S. Pat. No. 5,345,575 to English et al., whose disclosure is incorporated herein by reference, describes a disk controller comprising a memory. The memory contains a table mapping logical addresses of data blocks stored on a disk to labels identifying physical storage locations. In addition to writing the data to a storage location, the disk controller writes the associated logical address of each storage location, a time stamp, and data indicating where in a sequence of data blocks a specific data block occurs. The additional information is used to recover from system failures by reading from substantially the whole disk.
U.S. Pat. No. 5,481,694 to Chao et al., whose disclosure is incorporated herein by reference, describes an electronic data storage system comprising a memory, a plurality of magnetic disk units, and a controller. The memory comprises a table cross-referencing logical addresses with physical addresses on the disk units, a list of physical addresses containing obsolete data, and a list of physical addresses for segments on the disk units which are able to receive data. When data are written to the disk units, a tag comprising the logical address and a sequence number for multiblock writes is written with the data. To recover from a system failure, a checkpoint log and checkpoint segments stored on the disk units recover the table and lists.
U.S. Pat. No. 5,708,793 to Franaszek et al., whose disclosure is incorporated herein by reference, describes a method for optimizing a disk for a random write workload. The method comprises maintaining a mapping of logical to physical addresses within a disk controller. Data are written to the disk at a free disk location, the location being chosen to minimize time taken to write to the location.
In an article by de Jonge et al., "The Logical Disk: A New Approach to Improving File Systems," in Proceedings of the 14th Symposium on Operating Systems Principles, pp. 15-28, December 1993, which is incorporated herein by reference, the authors describe a logical disk wherein an interface is defined to disk storage which separates file management and disk management. The interface uses logical block numbers and block lists, and supports multiple file systems.
In an article by English et al., "Loge: a self-organizing disk controller," in Proceedings of the USENIX Winter 1992 Technical Conference, pp. 237-251, January 1992, which is incorporated herein by reference, the authors describe a system for storing data to a disk using a translation table and an allocation map. A trailer tag comprising a block address and a time stamp is written to the disk together with the stored data. The information in the trailer tag enables the system to recover from a failure.
In an article by Chao et al., "Mime: a high performance parallel storage device with strong recovery guarantees," HPL-CSP-92-9, published by Hewlett-Packard Company, November 1992, which is incorporated herein by reference, the authors describe a disk storage architecture similar to that of Loge, as described above. In Mime, the trailer tag comprises a block address, a sequence number for multiblock writes, and a last-packet-in-multiblock-write flag. As in Loge, the trailer tag information enables the system to recover from a failure.
It is an object of some aspects of the present invention to provide apparatus and methods for improved storage of electronic data in a non-volatile memory.
It is a further object of some aspects of the present invention to provide apparatus and methods for improved recovery of data in the event of a failure in a computing system.
In preferred embodiments of the present invention, an enhanced storage system (ESS) for data storage comprises a non-volatile on-disk storage medium which is written to and read from by a disk arm and a disk head, which are typically industry-standard components. The ESS uses data structures which are maintained in volatile memory, some of which are used to generate incremental system data regarding read and write operations to the storage medium. The data structures comprise, inter alia, a table which translates between logical addresses and disk sector addresses, and an allocation bitmap which shows whether a disk sector address is available to be written to. The translation table is referred to by the ESS before any read, write, allocate, or delete operation to the disk is performed, and the allocation bitmap is updated before and after each write.
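As a rough illustration of the two volatile data structures named above, the following Python sketch models a translation table and an allocation bitmap. The class name VolatileState and its methods are hypothetical and are not taken from the specification. Freeing the old sector as soon as a logical address is rebound is a simplifying assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class VolatileState:
    """Illustrative in-memory structures; all names here are assumptions."""
    num_sectors: int
    # Translation table: logical address -> disk sector address.
    translation: dict = field(default_factory=dict)
    # Allocation bitmap: True means the sector is in use, False means free.
    allocated: list = field(default_factory=list)

    def __post_init__(self):
        if not self.allocated:
            self.allocated = [False] * self.num_sectors

    def lookup(self, logical):
        """Consult the translation table; returns None if unmapped."""
        return self.translation.get(logical)

    def is_free(self, sector):
        """Consult the allocation bitmap for one sector."""
        return not self.allocated[sector]

    def bind(self, logical, sector):
        """Map `logical` to `sector`, releasing any sector it used before."""
        old = self.translation.get(logical)
        if old is not None:
            self.allocated[old] = False  # simplification: freed immediately
        self.allocated[sector] = True
        self.translation[logical] = sector
```

A read in this sketch would call lookup first, and a write would call bind once a free sector has been chosen.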
The physical locations for successive writes to the disk are allocated so as to keep the disk arm moving, insofar as possible, in a preferred direction. Each time user data are written to a given block on the disk, a tag containing incremental system data is also written to the same block. The system data are used subsequently, if needed, to enable the system to recover in case a failure, such as a power failure, occurs before the locations of all of the blocks have been written to the disk in a checkpoint operation, described below. (The locations of the blocks are stored in the translation table.) The incremental system data point forward to the next block to be written to, so that blocks are "chained" together and can be conveniently found and recovered.
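The forward chaining described above can be sketched as follows. This is a minimal model under stated assumptions, not the claimed implementation: the disk is a Python list, ascending sector numbers stand in for the preferred direction of the arm, and the names ChainedDisk and next_free are hypothetical. The point it shows is that the location of the next write is reserved before the current block is written, so the tag in the current block can point forward to it.

```python
def next_free(allocated, start):
    """First free sector at or after `start`, wrapping around at most once."""
    n = len(allocated)
    for step in range(n):
        sector = (start + step) % n
        if not allocated[sector]:
            return sector
    raise RuntimeError("no free sector")


class ChainedDisk:
    """Toy disk whose blocks each carry a tag pointing to the next write."""

    def __init__(self, num_sectors):
        self.sectors = [None] * num_sectors
        self.allocated = [False] * num_sectors
        self.translation = {}
        # The sector for the first write is reserved in advance.
        self.next_sector = 0
        self.allocated[0] = True

    def write(self, logical, data):
        here = self.next_sector
        # Reserve the successor first so the tag can point forward to it.
        nxt = next_free(self.allocated, here + 1)
        self.allocated[nxt] = True
        # User data and the tag go to the same block in one write.
        self.sectors[here] = {"data": data, "logical": logical, "next": nxt}
        self.translation[logical] = here
        self.next_sector = nxt
        return here
```

The sketch does not release sectors holding overwritten data and does not trigger a checkpoint when the allocation wraps around; both are left out to keep the chaining step visible.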
Periodically and/or on demand, preferably when the disk arm has to move opposite to the preferred direction, the storage system writes checkpoint data to the disk. The checkpoint data comprise the translation table and the allocation bitmap and data pointing to the beginning of a block chain. Most preferably, the checkpoint data are written to a predetermined region of the disk. Thus the checkpoint data can be used as a starting point when recovering from a failure.
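One possible way to picture the checkpoint data is as a single record holding the translation table, the allocation bitmap, and the start of the block chain. The Python sketch below is an assumption-laden illustration: the JSON encoding, the function names, and the use of ascending sector numbers as the preferred direction are all choices of the sketch, not of the specification.

```python
import json


def needs_checkpoint(current_sector, next_sector):
    """True when reaching next_sector would move the arm against the
    preferred direction (modeled here as ascending sector numbers)."""
    return next_sector <= current_sector


def encode_checkpoint(translation, allocated, chain_head):
    """Pack the checkpoint data into bytes for a fixed region of the disk."""
    record = {
        # Stored as pairs so that integer logical addresses survive JSON.
        "translation": sorted(translation.items()),
        "allocated": [int(bit) for bit in allocated],
        "chain_head": chain_head,
    }
    return json.dumps(record).encode("utf-8")


def decode_checkpoint(raw):
    """Inverse of encode_checkpoint; the starting point for recovery."""
    record = json.loads(raw.decode("utf-8"))
    translation = {logical: sector for logical, sector in record["translation"]}
    allocated = [bool(bit) for bit in record["allocated"]]
    return translation, allocated, record["chain_head"]
```

Because the record is self-describing and lives at a known location, a host that has lost its volatile memory can decode it without any other state.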
The enhanced storage system of the present invention comprises a rich set of disk operations and thus has a number of advantages over systems known in the art:
Because the majority of write operations to the disk occur in a preferred direction of motion of the disk arm, disk write time is improved. (If most reads are supplied by cache hits, disk write time is optimized.)
In the event of a volatile memory failure, a complete recovery is possible from checkpoint data and incremental system data that have been stored on the disk.
Since the ESS chains together blocks which are written to the disk, recovery from a failure is linear with the number of block write operations since the last checkpoint. Thus recovery takes substantially the same amount of time as was taken for the write operations performed since the last checkpoint, so that recovery time is optimized.
As a natural extension of the forward chaining of blocks, the ESS supports allocate-and-write operations and deletion of blocks in a manner that withstands failures, thus avoiding leakage of blocks, unlike other methods known in the art.
No extra input or output disk operations are required at the time of reading from or writing to the disk. All information necessary for a complete recovery from a disk failure is incorporated into blocks comprising user data as the data blocks themselves are written to the disk.
All information for a complete disk recovery is written to the disk, so that the disk may be transferred from one disk host and used in another disk host.
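The claim that recovery is linear in the number of block writes since the last checkpoint can be illustrated with a short Python sketch. It starts from the checkpoint data and follows the forward pointers, reading each chained block once. The dictionary layout, the function name recover, and the stop condition (a checkpoint number, here called epoch, stored in each tag and compared with the one in the checkpoint) are assumptions of the sketch.

```python
def recover(sectors, checkpoint):
    """Rebuild the translation table and bitmap from a checkpoint plus the
    chain of blocks written after it. Returns (translation, allocated, reads)."""
    translation = dict(checkpoint["translation"])
    allocated = list(checkpoint["allocated"])
    epoch = checkpoint["epoch"]
    sector = checkpoint["chain_head"]
    reads = 0
    while True:
        block = sectors[sector]
        reads += 1
        # An empty sector or a stale checkpoint number ends the chain.
        if block is None or block["epoch"] != epoch:
            break
        old = translation.get(block["logical"])
        if old is not None:
            allocated[old] = False  # the superseded copy is released
        translation[block["logical"]] = sector
        allocated[sector] = True
        sector = block["next"]  # follow the forward pointer
    return translation, allocated, reads
```

The number of reads is the number of chained blocks plus one terminating read, so the work grows with the writes performed since the checkpoint rather than with the size of the disk.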
In some preferred embodiments of the present invention, a disk is partitioned so that a first part is operated as a data storage system according to the present invention as described herein, and a second part of the disk is operated as a conventional storage system, without special means for failure recovery.
Although some preferred embodiments are described herein with reference to a single disk, in other preferred embodiments of the present invention, a plurality of separate disks are operated by a storage system according to the present invention as described herein.
There is therefore provided, in accordance with a preferred embodiment of the present invention, apparatus for electronic data storage, including:
a non-volatile memory, adapted to receive a succession of data blocks for storage at respective locations therein; and
a controller, which is configured to convey the succession of data blocks to the non-volatile memory while writing to the non-volatile memory, together with at least some of the data blocks, a pointer value to the location of a subsequent data block in the succession.
Preferably, the apparatus includes a volatile memory which stores one or more data structures containing data indicative of one or more properties of at least some of the data blocks, at least some of which data are written by the controller to the non-volatile memory, so that the contents of the volatile memory can be regenerated from the at least some of the data in the one or more data structures that are stored in the non-volatile memory.
Preferably, one of the data structures includes a translation table which maps logical addresses of the succession of data blocks to respective physical addresses.
Preferably, the controller writes the respective logical addresses to the succession of data blocks.
Further preferably, one of the data structures includes an allocation bitmap which maps an availability of each of the successive locations.
Preferably, one of the data structures includes the pointer value to the location of the subsequent data block in the succession.
Preferably, one of the data structures includes a pointer value to a first location in the succession.
Preferably, the non-volatile memory includes a disk having a disk head, and the controller writes the data blocks to the disk in a series of passes of the disk head over a surface of the disk in a single direction.
Further preferably, each of the series of passes has a checkpoint-number, and one of the data structures includes a value indicative of the checkpoint-number of the current data block in the succession.
Preferably, the controller writes the at least some of the data in the one or more data structures to the non-volatile memory at the conclusion of one or more of the passes of the disk head.
Preferably, the controller writes a type tag indicative of a use of each of the data blocks to each respective data block.
Preferably, the apparatus includes a host server which manages the non-volatile memory, wherein the host server is able to recover contents of a volatile memory from data written by the controller to the non-volatile memory.
Preferably, the non-volatile memory includes a portion to which the controller does not write the succession of data blocks with the pointer value.
There is further provided, in accordance with a preferred embodiment of the present invention, a method for electronic data storage, including:
providing a succession of data blocks for storage at respective locations in a non-volatile memory;
determining for each of at least some of the data blocks in the succession a pointer value to a data block to be written to in a subsequent storage operation; and
storing the succession of the data blocks and the pointer values in the non-volatile memory.
Preferably, the method includes storing in a volatile memory one or more data structures containing data indicative of one or more properties of at least some of the data blocks, and writing at least some of the data that are in the data structures to the non-volatile memory, so that the contents of the volatile memory can be regenerated from the at least some of the data in the one or more data structures that are stored in the non-volatile memory.
Preferably, storing the one or more data structures includes storing a translation table which maps logical addresses of the succession of data blocks to respective physical addresses.
Preferably, the method includes using the translation table to locate a specific data block, so as to read data from the specific data block.
Preferably, storing the one or more data structures includes storing an allocation bitmap which maps an availability of each of the successive locations.
Preferably, writing the at least some of the data to the non-volatile memory includes writing data to one of the succession of data blocks using the steps of:
scanning the one or more data structures to determine an available location in the non-volatile memory;
writing the data and at least some contents of the one or more data structures into the available location; and
updating the one or more data structures responsive to the determined available location.
Preferably, scanning the one or more data structures includes allocating a logical address to the available location.
Preferably, writing data to one of the succession of data blocks includes writing a list of logical addresses of data blocks that are to be deleted.
Preferably, the method includes performing a checkpoint operation including the steps of:
locking the one or more data structures;
writing the contents of the one or more data structures to a checkpoint location in the non-volatile memory; and
altering at least some of the contents of the one or more data structures responsive to writing the contents to the non-volatile memory.
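The three checkpoint steps above (lock, write, alter) might be sketched in Python as follows. The class CheckpointingStore, the attribute checkpoint_area standing in for the on-disk checkpoint location, and the choice of which fields are altered after the write (the chain head and a checkpoint number) are assumptions made for illustration only.

```python
import copy
import threading


class CheckpointingStore:
    """Toy model of the lock / write / alter sequence of a checkpoint."""

    def __init__(self, num_sectors):
        self.lock = threading.Lock()
        self.state = {
            "translation": {},
            "allocated": [False] * num_sectors,
            "chain_head": 0,
            "checkpoint_number": 0,
        }
        self.checkpoint_area = None  # stands in for the checkpoint location

    def checkpoint(self, next_write_sector):
        with self.lock:  # step 1: lock the data structures
            snapshot = copy.deepcopy(self.state)
            # The chain following this checkpoint begins at the next write.
            snapshot["chain_head"] = next_write_sector
            snapshot["checkpoint_number"] = self.state["checkpoint_number"] + 1
            # Step 2: write the contents to the checkpoint location.
            self.checkpoint_area = snapshot
            # Step 3: alter the in-memory contents to match what was written.
            self.state["chain_head"] = next_write_sector
            self.state["checkpoint_number"] = snapshot["checkpoint_number"]
        return snapshot["checkpoint_number"]
```

Taking a deep copy under the lock means later writes cannot change the stored checkpoint, which is the property recovery depends on.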
Further preferably, the method includes performing a memory reconstruction operation including the steps of:
reading the contents of the one or more data structures from the non-volatile memory; and
updating the one or more data structures in the volatile memory responsive to the contents.
Preferably, performing the memory reconstruction operation includes reading the contents of all of the one or more data structures written to since performing the checkpoint operation, so that there is no leakage of data blocks.
Preferably, performing the memory reconstruction operation includes reading the contents of all of the one or more data structures written to since performing the checkpoint operation in a time substantially equal to the time taken to write all of the one or more data structures written to since performing the checkpoint operation.
Preferably, writing the contents of the one or more data structures to the non-volatile memory includes writing the contents with a low priority of operation to an alternate checkpoint location.
The present invention will be more fully understood from the following detailed description of the preferred embodiments thereof, taken together with the drawings, in which: