Some storage systems use a transactions log in a non-volatile memory device to record client transactions that have been processed by the storage system but not yet committed to disk storage. For instance, a storage system may process a client-initiated request to write data to a file by writing the data to a block data cache in system memory and recording the write request in the transactions log. Accordingly, if the storage system encounters a system failure event before it “flushes” (e.g., writes) to disk storage that portion of the active file system stored in the cache, the transactions log can be processed to recover any active file system data that may have been in system memory, and therefore lost, during the system failure event. In this context, “active file system data” is simply the most current file system data. For example, if a client-initiated write command results in new data modifying or replacing old data, the new data is the active file system data.
Processing the transactions log in this manner is often referred to as “replaying” the transactions log. This replay process generally occurs during file system initialization in one of two contexts. In a stand-alone storage system, the replay process occurs during the first boot-up of the storage system after a system failure event. For instance, during the first boot-up after a system failure event, the storage system processes the transactions recorded in the transactions log to restore the block data cache in system memory to the state it was in when the system failure event occurred. Next, the storage system flushes the data contents of the block data cache to disk storage, thereby creating a consistency point for the active file system. Accordingly, any data in system memory (e.g., the block data cache) that may have been lost during the system failure event are recovered by replaying the appropriate transactions recorded in the transactions log and flushing the resulting data to disk storage.
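The write path and the stand-alone replay described above can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not an actual storage system implementation: the class StorageSystem and its methods write, flush, crash, and replay are hypothetical names introduced only for illustration, and disk storage, the non-volatile log, and the block data cache are modeled as in-memory Python containers.

```python
from dataclasses import dataclass, field


@dataclass
class StorageSystem:
    """Toy model (hypothetical): disk storage, a non-volatile log, a volatile cache."""
    disk: dict = field(default_factory=dict)   # block number -> data (persistent)
    nvlog: list = field(default_factory=list)  # transactions log (survives a failure)
    cache: dict = field(default_factory=dict)  # block data cache (lost on a failure)

    def write(self, block, data):
        # Process the client write in system memory and record it in the log.
        self.cache[block] = data
        self.nvlog.append((block, data))

    def flush(self):
        # Consistency point: commit cached data to disk; the logged
        # transactions are then no longer needed for recovery.
        self.disk.update(self.cache)
        self.nvlog.clear()

    def crash(self):
        # System failure event: the contents of system memory are lost.
        self.cache.clear()

    def replay(self):
        # First boot-up after a failure: re-apply the logged transactions in
        # order to rebuild the cache, then flush to create a consistency point.
        for block, data in list(self.nvlog):
            self.cache[block] = data
        self.flush()
```

In this sketch, a write that was logged but not flushed before crash() is absent from disk until replay() runs, after which disk reflects the most recent data for every logged block.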
In the context of a high-availability clustered configuration (sometimes referred to as a failover cluster), where two storage systems are configured such that one storage system will take over and process client requests on behalf of the other storage system in the event that one system fails, the replay process occurs during a failover or takeover procedure. For instance, after a system failure event at one storage system, the surviving storage system replays the transactions in the transactions log of the failed storage system (or a transactions log mirror) to generate any active file system data that may have been in the block data cache in system memory of the failed storage system, and therefore lost, at the time of the system failure event. The data generated by replaying the transactions in the failed storage system's transactions log are written into the surviving storage system's system memory, and then flushed to the failed storage system's disk storage as part of the takeover procedure. Once the active file system data of the failed storage system has been flushed to disk storage, the surviving storage system begins processing client-initiated requests on behalf of the failed storage system.
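The takeover sequence can be sketched in the same spirit. The following is a simplified illustration that assumes each write is mirrored synchronously to the partner; the names Node, pair, and takeover are hypothetical, and a real clustered system would involve failure detection and disk ownership changes that are omitted here.

```python
class Node:
    """Hypothetical model of one storage system in a two-node cluster."""

    def __init__(self, name):
        self.name = name
        self.disk = {}         # disk storage owned by this node
        self.memory = {}       # volatile block data cache
        self.log = []          # this node's own transactions log
        self.log_mirror = []   # mirror of the partner's transactions log
        self.partner = None
        self.serving = {name}  # nodes whose client requests are answered here

    def write(self, block, data):
        # Process the write in memory, log it, and mirror the log entry.
        self.memory[block] = data
        self.log.append((block, data))
        if self.partner is not None:
            self.partner.log_mirror.append((block, data))


def pair(a, b):
    # Configure two nodes as takeover partners.
    a.partner, b.partner = b, a


def takeover(survivor, failed):
    # Replay the failed node's mirrored transactions into a staging area
    # held in the survivor's system memory.
    staged = {}
    for block, data in survivor.log_mirror:
        staged[block] = data
    # Flush the regenerated data to the failed node's disk storage.
    failed.disk.update(staged)
    survivor.log_mirror.clear()
    # Only now does the survivor begin serving the failed node's clients.
    survivor.serving.add(failed.name)
```

Note that in this sketch the survivor is added to the serving set only after the flush completes, which is the window that clients perceive as unavailability.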
Whether the replay process is part of a recovery process in a stand-alone configuration or part of a takeover process in a high-availability clustered configuration, the replay process is perceived as “down time” by client applications. That is, the storage system is non-responsive to client-initiated requests during the replay process. This is problematic as many clients do not expect, or cannot properly handle, delays and/or timeouts in the servicing of requests directed to highly reliable storage systems. For example, some client applications, such as stock exchange trading or quotation applications, are extremely time sensitive and require low latency for data storage operations. Other client applications may fail or malfunction in some manner if a client request is not serviced due to the replay process taking too long to complete.
One reason the replay process may take a long time to complete is the number and nature of the disk read and disk write operations that must be processed during the transactions log replay procedure. Generally, the more disk read and disk write operations that must be processed during the transactions log replay procedure, the longer it will take for the file system to be initialized and for the storage system to begin processing client-initiated commands.
Although both disk read and disk write operations can delay the completion of the transactions log replay procedure, the delay due to disk read operations is often more significant with certain data storage systems. For instance, because the replay procedure occurs after a system failure event, those data storage systems that implement block data caches will likely perform a significant number of disk read operations because system memory (and therefore the cache) is essentially empty. That is, the portion of system memory dedicated for use as a block data cache does not contain any file system data, a situation commonly referred to as a cold cache. Accordingly, during a recovery or takeover procedure after a system failure, nearly all read operations will result in data being read from disk. Due to the seek time required by the reading component of the storage device (e.g., a disk drive head) to search for and locate individual disk blocks, disk read operations can be particularly costly in terms of time.
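The effect of a cold cache on the number of disk reads can be illustrated with a small read-through cache that counts its misses. This is an illustrative sketch only; BlockCache and its disk_reads counter are hypothetical names, and the backing disk is modeled as a Python dictionary.

```python
class BlockCache:
    """Hypothetical read-through block cache that counts reads served from disk."""

    def __init__(self, disk):
        self.disk = disk      # block number -> data (stands in for disk storage)
        self.blocks = {}      # cached blocks held in system memory
        self.disk_reads = 0   # number of reads that had to go to disk

    def read(self, block):
        if block not in self.blocks:
            # Cache miss: the block must be fetched from disk.
            self.disk_reads += 1
            self.blocks[block] = self.disk[block]
        return self.blocks[block]

    def clear(self):
        # Models the loss of system memory contents in a system failure event.
        self.blocks.clear()
```

With a warm cache, repeated reads of the same blocks add nothing to disk_reads; after clear(), every first access to a block goes to disk again, which is the situation a replay procedure faces.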
Furthermore, some data storage systems implement file systems based on a copy-on-write, transactional object model. For example, storage systems from Network Appliance® of Sunnyvale, Calif. generally implement a file system known as WAFL®, which uses a copy-on-write, transactional object model. With such systems, blocks containing active file system data are never overwritten in place; instead, a new block is allocated, modified data is written to it, and then any metadata blocks referencing it are similarly read, reallocated, and written. To reduce the overhead of this process and improve overall efficiency, the data from multiple write commands are grouped together and written to disk at once. For example, data from several client-initiated write commands are first processed into system memory, thereby enabling the storage system to efficiently organize the data prior to writing the data to disk. As a consequence, the system overhead associated with allocating disk blocks and writing data to the newly allocated disk blocks is minimal. However, the process of freeing those disk blocks that are no longer storing active file system data (as a result of replaying the transactions log) can delay completion of replaying the transactions log.
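The copy-on-write behavior described above can be sketched generically as follows. This is not WAFL or any vendor's implementation; CowStore, write_batch, and to_free are hypothetical names, and the file system is reduced to a single metadata block that maps file offsets to data block numbers.

```python
import itertools


class CowStore:
    """Hypothetical copy-on-write store with one level of metadata."""

    def __init__(self):
        self.blocks = {}               # block number -> contents
        self._next = itertools.count() # source of never-reused block numbers
        self.root = self._alloc({})    # metadata block: file offset -> block number
        self.to_free = []              # blocks that no longer hold active data

    def _alloc(self, contents):
        number = next(self._next)
        self.blocks[number] = contents
        return number

    def write_batch(self, updates):
        # Several client writes are grouped and applied together.
        new_map = dict(self.blocks[self.root])   # read the current metadata
        for offset, data in updates.items():
            old = new_map.get(offset)
            new_map[offset] = self._alloc(data)  # new block, never in place
            if old is not None:
                self.to_free.append(old)         # old data block is superseded
        self.to_free.append(self.root)           # old metadata block as well
        self.root = self._alloc(new_map)         # metadata is reallocated too

    def read(self, offset):
        return self.blocks[self.blocks[self.root][offset]]
```

In this sketch, allocation is a cheap counter increment, whereas every overwrite leaves behind superseded blocks in to_free whose status must later be updated, which is the cost discussed next.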
Generally, in order to free disk blocks, a data structure indicating the status of the blocks (e.g., allocated or free) must be read into system memory, modified, and then written back to disk. Consequently, with some storage systems, the operations required for freeing disk blocks during a transactions log replay result in multiple disk reads, where each block being read is in a different location on disk. Both the number and nature of the resulting disk reads can cause a significant delay in the completion of replaying a transactions log.
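Why freeing blocks can cause many scattered reads can be shown by counting how many distinct blocks of the status data structure must be read for a given set of freed disk blocks. The sketch assumes the status structure is a bitmap stored in fixed-size map blocks; the function name map_blocks_touched and the value of BITS_PER_MAP_BLOCK (a 4 KiB map block tracking 32768 disk blocks) are assumptions made for illustration, not details taken from any particular storage system.

```python
from collections import defaultdict

# Assumption: one 4 KiB bitmap block holds 4096 * 8 status bits.
BITS_PER_MAP_BLOCK = 4096 * 8


def map_blocks_touched(freed_blocks, bits_per_map_block=BITS_PER_MAP_BLOCK):
    """Group freed disk blocks by the bitmap block that records their status.

    Each key in the result is a bitmap block that must be read into memory,
    modified, and written back in order to mark its disk blocks as free.
    """
    touched = defaultdict(list)
    for block in freed_blocks:
        touched[block // bits_per_map_block].append(block)
    return dict(touched)
```

Under these assumptions, freeing disk blocks 10, 11, and 12 touches a single bitmap block, while freeing blocks 10, 40000, 70000, and 100000 touches four different bitmap blocks, each requiring its own read from a different location on disk when the cache is cold.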