To strike a proper balance between performance, reliability, and cost, some disk-based data storage systems temporarily read data from and write data to system memory (e.g., volatile random access memory), prior to writing data to disk storage. Because system memory is volatile, to prevent data loss in the case of a system failure, non-volatile memory is utilized to store a log of all operations that have been written into system memory, but not yet written to disk storage. Accordingly, the performance increase realized from utilizing system memory for temporarily reading and writing data is achieved without negatively impacting the reliability of the data storage system.
FIG. 1 illustrates an example of a network-attached storage system 10 configured to operate as described above. The storage system 10 provides a high-availability data storage service to one or more clients, such as client 12, over a network 14. As illustrated in FIG. 1, the storage system 10 comprises a system memory 16 and a non-volatile memory with an operations log, such as NVLOG 18. In addition, the storage system 10 is connected to a group of storage devices 20 (e.g., disk drives or disk shelves).
When the storage system 10 receives a write command from the client 12, the storage system 10 logs a write operation in NVLOG 18 and then writes the data to system memory 16 on behalf of the client. If a subsequent client-initiated read command is received at the storage system 10, the storage system 10 reads the data from system memory, or from the storage devices 20, depending on whether the data are in system memory or the storage devices 20. When the system memory 16 reaches some predetermined capacity, or the operations log 18 reaches some predetermined capacity, data previously written to system memory 16 can be written to disk storage 20 and the corresponding operations can be cleared from the operations log, thereby freeing system memory 16 and the operations log 18 for processing new read/write commands.
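The write path described above can be illustrated with a minimal sketch. It is not an implementation of the storage system 10; the class name, the dictionary-based stand-ins for memory and disk, and the `flush_threshold` parameter are all assumptions made for illustration.

```python
class StorageSystem:
    """Toy model of the write path: log the operation, then buffer the data."""

    def __init__(self, flush_threshold=3):
        self.nvlog = []    # stands in for the non-volatile operations log
        self.memory = {}   # stands in for volatile system memory
        self.disk = {}     # stands in for the storage devices
        # Hypothetical capacity limit; the text only says "predetermined".
        self.flush_threshold = flush_threshold

    def write(self, key, data):
        self.nvlog.append((key, data))   # log the write operation first
        self.memory[key] = data          # then write the data to memory
        if len(self.nvlog) >= self.flush_threshold:
            self.flush()

    def read(self, key):
        # Serve from system memory when the data are there, else from disk.
        if key in self.memory:
            return self.memory[key]
        return self.disk.get(key)

    def flush(self):
        # Commit buffered data to disk, then free memory and the log.
        self.disk.update(self.memory)
        self.memory.clear()
        self.nvlog.clear()
```

In this sketch a flush is triggered only by the log reaching its limit; a memory-capacity trigger would follow the same pattern.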
In the event of a system failure (e.g., a power failure), data stored in volatile system memory 16 may be lost. To ensure that the data are not permanently lost, a recovery process can be executed. FIG. 2 illustrates an example of a current recovery process executed during a boot-up routine. For example, referring to FIG. 2, at method operation 22 a recovery process can be initiated during a boot-up procedure (e.g., reboot of the failed system). Generally, the recovery process may be initiated during a first boot-up sequence after the system failure event to prevent a client from accessing data that were stored in system memory and potentially lost during the system failure event. Until the boot-up sequence has completed and the file system has been initialized, client-initiated requests directed to the storage devices 20 of the storage system 10 are not processed. This procedure prevents a client from reading incorrect data before such data have been returned to their proper state by the recovery routine.
At operation 24, the operations that were previously recorded in the operations log, such as NVLOG 18, of the non-volatile memory are “replayed”. That is, for example, respective operations stored in the operations log are processed to restore the system memory 16 to the state it was in when the failure event occurred. At method operation 26, the system memory (or the relevant portion thereof) can be flushed (e.g., written) to the storage devices 20 of the storage system 10. At operation 28, the storage system 10 begins processing new client-initiated requests directed to the storage devices 20 of the storage system 10.
As illustrated in FIG. 2 by the dash outlined box with reference number 30, one problem with this approach is that the storage system 10 may not be able to process client-initiated requests during the time that the operations in the operations log are being replayed (e.g., method operation 24) and system memory is being flushed to disk (e.g., method operation 26). Some client applications may be very sensitive to delays, and a timeout error during a data storage operation (e.g., a client-initiated read/write operation) may lead the client to fail or malfunction in some manner. Other client applications, such as a stock exchange trading or quotation application, are extremely time sensitive, and their data storage operations typically must complete with low latency in order for the application to function properly. Therefore, decreasing the time that the data storage system is unable to process client-initiated requests is desirable.
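The replay-then-flush sequence of operations 24 through 28 can be sketched as follows. This is an illustrative model only; the function name `recover`, the tuple format of logged operations, and the clearing of the log after the flush are assumptions, not details taken from FIG. 2.

```python
def recover(nvlog, disk):
    """Replay logged operations, flush the result, and clear the log."""
    memory = {}                  # system memory is empty after the failure
    for key, data in nvlog:      # operation 24: replay in logged order
        memory[key] = data
    disk.update(memory)          # operation 26: flush memory to disk
    nvlog.clear()                # assumption: log is freed once data are on disk
    return disk                  # operation 28: requests may be served from here


# Example: three logged writes survive a failure; later writes win on replay.
disk = {"a": 0}
nvlog = [("a", 1), ("b", 2), ("a", 3)]
recover(nvlog, disk)
```

No client request is served until `recover` returns, which corresponds to the unavailability window marked by reference number 30.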
When two storage systems are configured in a cluster such that one serves as a back-up to the other in the case of a system failure event, a similar problem occurs during the takeover procedure that is initiated after the system failure event. In general, a takeover procedure comprises a surviving storage system (e.g., non-failed) preparing to process client-initiated requests on behalf of a failed storage system. When a takeover procedure takes longer than desired, clients may experience delays and/or timeouts, thereby causing the clients to fail or malfunction in some manner, for example. An example of this problem is illustrated in FIGS. 3 through 5.
FIG. 3 illustrates an example of two data storage systems (e.g., storage system A and storage system B) configured in a cluster such that either system can serve as a back-up system to the other in the event one system fails. For example, during a normal operating mode, respective data storage systems A and B operate independently of each other. In the normal operating mode, storage system A can provide clients with access to storage devices A, and storage system B can provide clients access to storage devices B. Further, for example, storage system A is said to “own” storage devices A, while storage system B “owns” storage devices B.
However, when a system failure occurs at either storage system, a takeover routine can be initiated by the surviving storage system to facilitate clients continuing to access data stored on the data storage devices of the failed storage system. Accordingly, as illustrated in FIG. 3, storage system A is coupled not only to storage devices A, but also to storage devices B. Similarly, storage system B is coupled to both storage devices A and storage devices B. Furthermore, respective storage systems A and B comprise an interconnect adapter (not shown) by which they may be connected to one another via an interconnect cable, for example.
Referring again to FIG. 3, respective storage systems are shown to comprise system memory (e.g., system memory A and B). Further, respective storage systems A and B comprise a non-volatile memory (e.g., NVLOG A and NVLOG B) where an operations log and log mirror may be stored. For example, storage system A can comprise NVLOG A, which may be partitioned to comprise a first portion (e.g., operations log A) for storing operations directed to storage system A, and a second portion (e.g., operations log mirror (B)) for storing operations directed to storage system B. In this example, when a client directs a write command to storage system A, an operation is logged in operations log A of NVLOG A, and the associated data is written to system memory A, where it is stored until a later time when the data is flushed to storage devices A. Additionally, in this example, the operation can be mirrored to operations log mirror (A) of NVLOG B, thereby allowing storage system B to replicate a state of storage system A's system memory (e.g., system memory A), if necessary, during a takeover routine.
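The mirrored logging arrangement can be sketched as below. The `Node` class and its attribute names are hypothetical, and the direct append to the partner's list merely stands in for a transfer over the interconnect.

```python
class Node:
    """Toy model of one storage system in a two-node cluster."""

    def __init__(self, name):
        self.name = name
        self.memory = {}        # e.g., system memory A
        self.own_log = []       # e.g., operations log A in NVLOG A
        self.mirror_log = []    # e.g., mirror of the partner's operations log
        self.partner = None

    def write(self, key, data):
        op = (key, data)
        self.own_log.append(op)                  # log locally
        if self.partner is not None:
            self.partner.mirror_log.append(op)   # mirror to partner's NVLOG
        self.memory[key] = data                  # buffer data until flushed


# Wire up storage systems A and B as mutual back-ups.
a, b = Node("A"), Node("B")
a.partner, b.partner = b, a
a.write("x", 1)   # logged in A's own log and mirrored into B's NVLOG
```

After the write, storage system B holds enough information to rebuild the state of system memory A without having stored the data in its own system memory.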
Referring now to FIG. 4, one embodiment of storage systems A and B is illustrated where storage system A has experienced a system failure. In this embodiment, storage system B may be referred to as the surviving storage system, while storage system A may be referred to as the failed storage system. When storage system B detects that storage system A has failed, storage system B can initiate a takeover procedure so that storage system B can continue providing clients with access to the data on storage devices A.
FIG. 5 illustrates a current method for performing such a takeover routine associated with a system failure event at one system in a cluster configuration. At method operation 50, data storage system B detects a system failure event at data storage system A. In one embodiment, the system failure event may be detected by storage system B, or alternatively, storage system A may notify storage system B of the system failure event. In any case, after the system failure event, client-initiated requests directed to storage devices A of storage system A can be redirected to, and received at, surviving storage system B. In this example, however, storage system B may not be able to begin processing client-initiated requests directed to storage devices A until storage system B completes a takeover procedure and initializes a file system of storage devices A. In particular, for example, storage system B can update its system memory to reflect the state of storage system A's system memory at the time storage system A failed. In this example, this can prevent a client-initiated request, directed to data that were stored in storage system A's system memory at the time of the system failure event, from being processed with incorrect data from storage devices A.
In this embodiment, to update its system memory, at operation 52, the surviving storage system B “replays” operations stored in the operations log mirror (A) of NVLOG B, thereby writing into system memory B the data contents that were in storage system A's system memory (e.g., system memory A) at the time storage system A failed. At operation 54, the data stored in system memory B (or the relevant portion thereof) are flushed to storage devices A. In one example, the relevant portion of system memory B may comprise the portion of system memory B storing the data generated by replaying the operations log mirror (A). After the operations log mirror (A) has been replayed, and the system memory flushed to storage devices A, storage system B can resume processing new client-initiated requests directed to the storage devices of the failed storage system (e.g., storage devices A of storage system A) at operation 56. In many current or prior systems, after the takeover routine has executed, there is a presumption that the storage system has been reset, for example, back to a clean state. That is, for example, there is a presumption that after the takeover routine, the operations log is empty or free, and that all client data have been committed to storage devices.
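Operations 52 through 56 can be sketched as follows. The function name `take_over`, the `"from_A"` region used to represent the relevant portion of system memory B, and the dictionary stand-ins are assumptions made for illustration only.

```python
def take_over(mirror_log, memory_b, disks_a):
    """Replay A's mirrored log on B, flush to A's disks, and clear the log."""
    # Operation 52: replay into the portion of B's memory reserved for A's data.
    region = memory_b.setdefault("from_A", {})
    for key, data in mirror_log:
        region[key] = data
    # Operation 54: flush only the relevant portion to storage devices A.
    disks_a.update(region)
    region.clear()
    mirror_log.clear()   # the presumed clean state: the log is empty
    # Operation 56: B may now process requests directed to storage devices A.


memory_b = {"own": {"k": 9}}          # B's own buffered data is left untouched
disks_a = {"x": 0}
mirror_log_a = [("x", 1), ("y", 2)]   # operations mirrored from A before failure
take_over(mirror_log_a, memory_b, disks_a)
```

As in the single-system case, requests directed to storage devices A wait until the function returns.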
As with the recovery procedure described above in connection with FIGS. 1 and 2, the takeover procedure creates a situation in which client-initiated requests are subject to timeout. For example, the dash outlined box with reference number 58 in FIG. 5 indicates the time during which the surviving storage system B may not process client-initiated requests. Specifically, storage system B may not process client-initiated requests during the time that the operations in the operations log mirror (A) are being replayed (e.g., method operation 52) and system memory B is being flushed to storage devices A (e.g., method operation 54). Therefore, for example, decreasing the time that the data storage system is unable to process client-initiated requests during a takeover procedure is desirable.