To strike a proper balance between performance, reliability, and cost, some disk-based data storage systems temporarily read data from and write data to system memory (e.g., volatile random access memory) prior to writing the data to disk storage. Because system memory is volatile, to prevent data loss in the case of a system failure, non-volatile memory is utilized to store a log of all operations that have been written into system memory, but not yet written to disk storage. Accordingly, the performance increase realized from utilizing system memory for temporarily reading and writing data is achieved without negatively impacting the reliability of the data storage system.
FIG. 1 illustrates an example of a network-attached storage system 10 configured to operate as described above. The storage system 10 provides a high-availability data storage service to one or more clients, such as client 12, over a network 14. As illustrated in FIG. 1, the storage system 10 includes a system memory 16 and a non-volatile memory with an operations log, NVLOG 18. In addition, the storage system 10 is connected to a group of storage devices 20 (e.g., disk drives or disk shelves).
When the storage system 10 receives a write command from the client 12, the storage system 10 logs a write operation in NVLOG 18 and then writes the data to system memory 16 on behalf of the client. If a subsequent client-initiated read command is received at the storage system 10, the storage system 10 reads the data from system memory, or from the storage devices 20, depending on whether the data are in system memory or the storage devices 20. When the system memory 16 reaches some predetermined capacity, or the operations log 18 reaches some predetermined capacity, data previously written to system memory 16 are written to disk storage 20 and the corresponding operations are cleared from the operations log, thereby freeing system memory 16 and the operations log 18 for processing new read/write commands.
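The write, read, and flush behavior described above can be sketched in a few lines of Python. This is only an illustrative toy model under stated assumptions, not the implementation of storage system 10: the class name `StorageSystem`, the `flush_threshold` parameter, and the use of in-process lists and dictionaries to stand in for NVLOG 18, system memory 16, and storage devices 20 are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StorageSystem:
    """Toy model of the write path: log first, then buffer, flush at a threshold."""

    flush_threshold: int = 3                     # hypothetical "predetermined capacity"
    nvlog: list = field(default_factory=list)    # stands in for NVLOG 18
    memory: dict = field(default_factory=dict)   # stands in for system memory 16
    disk: dict = field(default_factory=dict)     # stands in for storage devices 20

    def write(self, key, value):
        # Log the operation in non-volatile memory, then write to system memory.
        self.nvlog.append((key, value))
        self.memory[key] = value
        # Flush once the operations log reaches the assumed capacity.
        if len(self.nvlog) >= self.flush_threshold:
            self.flush()

    def read(self, key):
        # Serve from system memory when the data are there, otherwise from disk.
        if key in self.memory:
            return self.memory[key]
        return self.disk.get(key)

    def flush(self):
        # Commit buffered data to disk, then free memory and the operations log.
        self.disk.update(self.memory)
        self.memory.clear()
        self.nvlog.clear()
```

In this sketch a single threshold on the log length triggers the flush; the text above describes either the system memory or the operations log reaching capacity as a possible trigger.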
In the event of a system failure (e.g., a power failure), the data stored in volatile system memory 16 may be lost. To ensure that the data are not permanently lost, a recovery process is executed. FIG. 2 illustrates a prior art example of a recovery process executed during a boot-up routine. For instance, referring to FIG. 2, at method operation 22 a recovery process is initiated during the boot-up procedure. Generally, the recovery process is initiated during the first boot-up sequence after the system failure event to ensure that a client does not attempt to access data stored in system memory and lost during the system failure event. Until the boot-up sequence has completed and the file system has been initialized, client-initiated requests directed to the storage devices 20 of the storage system 10 are not processed. This prevents a client from reading incorrect data before such data have been returned to their proper state by the recovery routine.
At operation 24, the operations that were previously recorded in the operations log, NVLOG 18, of the non-volatile memory are “replayed”. That is, each operation stored in the operations log is processed to condition the state of the system memory 16 as it was when the failure event occurred. Next, at method operation 26, the system memory (or the relevant portion thereof) is flushed (e.g., written) to the storage devices 20 of the storage system 10. Finally, at operation 28 the storage system 10 begins processing client-initiated requests directed to the storage devices 20 of the storage system 10.
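As a rough illustration of operations 24 and 26, the following Python sketch replays a list of logged operations into a fresh in-memory dictionary and then flushes the result to a dictionary standing in for the storage devices 20. The function name `recover` and the `(key, value)` representation of a logged operation are assumptions made for this example. Clearing the log after the flush is likewise an assumption, consistent with the normal flush behavior described earlier.

```python
def recover(nvlog, disk):
    """Replay logged operations into memory, flush to disk, and clear the log."""
    memory = {}
    # Operation 24: replay each logged operation in order, so that later writes
    # to the same key overwrite earlier ones, as they did before the failure.
    for key, value in nvlog:
        memory[key] = value
    # Operation 26: flush the reconstructed memory contents to disk.
    disk.update(memory)
    # Assumed: the operations are cleared once their data are on disk.
    nvlog.clear()
    return disk
```

Only after `recover` returns would a system following FIG. 2 begin processing client-initiated requests (operation 28).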
As illustrated in FIG. 2 by the dash outlined box with reference number 30, one problem with this approach is that the storage system 10 cannot process client-initiated requests during the time that the operations in the operations log are being replayed (e.g., method operation 24) and system memory is being flushed to disk (e.g., method operation 26). Some client applications may be very sensitive to delays, and a timeout error during a data storage operation (e.g., a client-initiated read/write operation) may lead the client to fail or malfunction in some manner. Other client applications, such as a stock exchange trading or quotation application, are extremely time sensitive and therefore require every data storage operation to have a low latency in order for the application to function properly. Therefore, decreasing the time that the data storage system is unable to process client-initiated requests is desirable.
When two storage systems are configured in a cluster such that one serves as a back-up to the other in the case of a system failure event, a similar problem occurs during the takeover procedure that is initiated after the system failure event. In general, a takeover procedure is the procedure by which a surviving storage system prepares to process client-initiated requests on behalf of the failed storage system. When a takeover procedure takes too long, clients may experience delays and/or timeouts, thereby causing the clients to fail or malfunction in some manner. This problem is illustrated in FIGS. 3 through 5.
FIG. 3 illustrates an example of two data storage systems (e.g., storage system A and storage system B) configured in a cluster such that either system can serve as a back-up system to the other in the event one system fails. For instance, during normal operating mode, each data storage system A and B operates independently of the other. In normal operating mode, storage system A provides clients with access to storage devices A, and storage system B provides clients access to storage devices B. Storage system A is said to “own” storage devices A, while storage system B “owns” storage devices B. However, in the case that a system failure occurs at either storage system, a takeover routine is initiated by the surviving storage system to ensure that clients can continue to access data stored on the data storage devices of the failed storage system. Accordingly, as illustrated in FIG. 3, storage system A is coupled not only to storage devices A, but also to storage devices B. Similarly, storage system B is coupled to both storage devices A and storage devices B. Furthermore, each of storage systems A and B includes an interconnect adapter (not shown) by which they are connected to one another via an interconnect cable.
Referring again to FIG. 3, each storage system is shown to include its own system memory (e.g., system memory A and B). In addition, each storage system A and B has a non-volatile memory (e.g., NVLOG A and NVLOG B) where an operations log and log mirror are stored. For example, storage system A is shown to include NVLOG A, which is partitioned to include a first portion (e.g., operations log A) for storing operations directed to storage system A, and a second portion (e.g., operations log mirror (B)) for storing operations directed to storage system B. When a client directs a write command to storage system A, an operation is logged in operations log A of NVLOG A, and the associated data is written to system memory A, where it is stored until a later time when the data is flushed to storage devices A. In addition, the operation is mirrored to operations log mirror (A) of NVLOG B. This allows storage system B to replicate the state of storage system A's system memory (e.g., system memory A), if necessary, during a takeover routine.
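The mirroring arrangement can be sketched as follows. This is a simplified model under stated assumptions: the `Node` class, its attribute names, and the `pair` helper are hypothetical, and a direct method call stands in for the transfer over the interconnect.

```python
class Node:
    """Toy model of one storage system in a two-node cluster."""

    def __init__(self, name):
        self.name = name
        self.memory = {}      # stands in for this system's system memory
        self.oplog = []       # local operations log (e.g., operations log A)
        self.mirror = []      # mirror of the partner's operations log
        self.partner = None

    def write(self, key, value):
        op = (key, value)
        # Log locally and write the data into local system memory.
        self.oplog.append(op)
        self.memory[key] = value
        # Mirror the operation into the partner's non-volatile memory.
        if self.partner is not None:
            self.partner.mirror.append(op)


def pair(a, b):
    """Connect two nodes so that each mirrors its operations to the other."""
    a.partner, b.partner = b, a
```

After `pair(a, b)`, a write on node `a` appears in `a.oplog` and in `b.mirror`, while `b.memory` is left untouched. The partner holds only the logged operations until a takeover requires them to be replayed.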
Referring now to FIG. 4, storage systems A and B are shown after storage system A has experienced a system failure. Accordingly, storage system B is referred to as the surviving storage system, while storage system A is referred to as the failed storage system. When storage system B detects that storage system A has failed, storage system B initiates a takeover procedure so that storage system B can continue providing clients with access to the data on storage devices A.
FIG. 5 illustrates a prior art method for performing such a takeover routine associated with a system failure event at one system in a cluster configuration. At method operation 50 data storage system B detects a system failure event at data storage system A. The system failure event may be detected by B, or alternatively, storage system A may notify storage system B of the system failure event. In any case, after the system failure event, client-initiated requests directed to storage devices A of storage system A will be redirected to and received at surviving storage system B. Storage system B, however, cannot begin processing client-initiated requests directed to storage devices A until storage system B has completed the takeover procedure and initialized the file system of storage devices A. In particular, storage system B must update its system memory to reflect the state of storage system A's system memory at the time storage system A failed. Otherwise, a client-initiated request directed to data that was stored in storage system A's system memory at the time of the system failure event may be processed with incorrect data from storage devices A.
To update its system memory, at operation 52 the surviving storage system B “replays” the operations in the operations log mirror (A) of NVLOG B, thereby writing into system memory B the data contents that were in storage system A's system memory (e.g., system memory A) at the time storage system A failed. Next, at operation 54, the data stored in system memory B (or the relevant portion thereof) are flushed to storage devices A. In this case, the relevant portion of system memory B is the portion of system memory B storing the data generated by replaying the operations log mirror (A). Finally, once the operations log mirror (A) has been replayed, and the system memory flushed to storage devices A, storage system B can resume processing client-initiated requests directed to the storage devices of the failed storage system (e.g., storage devices A of storage system A) at operation 56. In many prior art systems, after the takeover routine has executed, there is a presumption that the storage system has been reset, for example, back to a clean state. That is, there is a presumption that after the takeover routine, the operations log is empty or free, and that all client data have been committed to storage devices.
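The takeover sequence, including the period during which requests for storage devices A cannot be served, might be modeled as in the sketch below. The `Survivor` class, the `serving_a` flag, and the choice to raise `RuntimeError` for a request that arrives too early are illustrative assumptions. A real system would more likely queue or delay such requests, which is what exposes clients to timeouts.

```python
class Survivor:
    """Toy model of surviving storage system B during a takeover."""

    def __init__(self, mirror_log, disk_a):
        self.mirror_log = mirror_log   # operations log mirror (A) in NVLOG B
        self.disk_a = disk_a           # stands in for storage devices A
        self.serving_a = False         # False until the takeover completes

    def read_a(self, key):
        # Requests directed to storage devices A cannot be processed yet.
        if not self.serving_a:
            raise RuntimeError("takeover in progress")
        return self.disk_a.get(key)

    def take_over(self):
        # Operation 52: replay the mirrored operations into the relevant
        # portion of system memory B (modeled as a separate dictionary).
        replayed = {}
        for key, value in self.mirror_log:
            replayed[key] = value
        # Operation 54: flush that portion to storage devices A.
        self.disk_a.update(replayed)
        # Assumed clean state after takeover: the mirror log is emptied.
        self.mirror_log.clear()
        # Operation 56: begin processing requests directed to storage devices A.
        self.serving_a = True
```

In this model, every call to `read_a` made before `take_over` finishes fails, which corresponds to the unavailable interval that the replay and flush steps impose on clients.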
As with the recovery procedure described above in connection with FIGS. 1 and 2, the takeover procedure creates a situation in which client-initiated requests are subject to timeout. For instance, the dash outlined box with reference number 58 in FIG. 5 indicates the time during which the surviving storage system B cannot process client-initiated requests. Specifically, storage system B cannot process client-initiated requests during the time that the operations in the operations log mirror (A) are being replayed (e.g., method operation 52) and system memory B is being flushed to storage devices A (e.g., method operation 54). Therefore, decreasing the time that the data storage system is unable to process client-initiated requests during a takeover procedure is desirable.