Traditional data storage controller devices can store data redundantly across multiple data storage devices. A data storage controller device may employ various forms of data storage devices, e.g., hard disk drives, solid state drives, tape devices, etc. The data storage devices are typically implemented as one or more storage volumes that comprise a cluster of data storage devices, in which the volumes define an overall logical arrangement of storage space. To improve performance, these data storage controllers can temporarily store the various data storage commands they receive from client computing devices in a region of local system memory and/or in the system memory of other storage controllers with which they can communicate within a cluster. However, data in system memory is volatile and can be lost before it is stored persistently on the data storage devices, e.g., in case of a power failure. To reduce the likelihood of data loss in such circumstances, the storage controller may also store the data in non-volatile random access memory (NVRAM), typically in the form of a log file or a journal. By logging the incoming data-modifying operations in NVRAM, the storage controller is able to return an acknowledgement to the client computing devices immediately rather than wait for the operation to reach the slower data storage devices.
The NVRAM log file can accumulate storage operations until a consistency point (CP) is triggered. CPs are triggered at specific time intervals or by specific events (e.g., the NVRAM is almost full). At each CP, the data is committed from the storage controller system memory to the underlying data storage devices, and the NVRAM is cleared of the log of temporary data-modifying commands.
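The journaling and consistency point behavior described above can be illustrated with a minimal Python sketch. It is a toy model and not an actual controller implementation: the class name `JournalingController`, the `cp_threshold` parameter, and the use of in-memory lists and dictionaries to stand in for NVRAM, system memory, and the data storage devices are all assumptions made for illustration. Only the event-based trigger (the log filling up) is modeled; a timer-based trigger is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class JournalingController:
    """Toy model of a storage controller that journals writes in NVRAM."""
    cp_threshold: int = 4                          # hypothetical: log size that triggers a CP
    nvram_log: list = field(default_factory=list)  # stands in for the NVRAM journal
    memory: dict = field(default_factory=dict)     # stands in for volatile system memory
    disk: dict = field(default_factory=dict)       # stands in for slow persistent storage

    def write(self, key: str, value: str) -> str:
        # Journal the operation and stage it in memory, then acknowledge
        # without waiting for the persistent storage devices.
        self.nvram_log.append((key, value))
        self.memory[key] = value
        if len(self.nvram_log) >= self.cp_threshold:
            self.consistency_point()               # event-based trigger: log is full
        return "ack"

    def consistency_point(self) -> None:
        # Commit the staged data to persistent storage, then clear the journal.
        self.disk.update(self.memory)
        self.memory.clear()
        self.nvram_log.clear()


ctrl = JournalingController(cp_threshold=3)
ctrl.write("a", "1")
ctrl.write("b", "2")
print(len(ctrl.nvram_log), ctrl.disk)   # 2 {}
ctrl.write("c", "3")                    # third write reaches the threshold and triggers a CP
print(len(ctrl.nvram_log), ctrl.disk)   # 0 {'a': '1', 'b': '2', 'c': '3'}
```

In this sketch the client receives "ack" as soon as the operation is journaled, and the persistent store is only updated in bulk at the consistency point.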
Typically, the nodes in the cluster are also paired to form high-availability (HA-pair) zones, such that during normal operation each node in an HA-pair mirrors its NVRAM operations to its respective partner. If one node in an HA-pair is interrupted unexpectedly, e.g., because of a power failure or other problems, the system is generally able to recover by having the failing node's HA-partner take over its storage devices, commit the temporarily staged data operations in NVRAM to the persistent storage devices (also referred to as “replay of the NVRAM log”), and start serving the data owned by the failing node. In a case where there is no partner node, the failing node reboots and performs the same tasks, including replay of the NVRAM log, before it can start serving the data again.
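The mirroring and takeover sequence described above can be sketched as follows. This is a simplified illustration under stated assumptions, not an actual HA implementation: the `Node` class, the `pair` helper, the `take_over` method, and the dictionary standing in for a node's storage devices are all hypothetical names introduced here.

```python
class Node:
    """Toy model of one storage controller node in an HA-pair."""

    def __init__(self, name):
        self.name = name
        self.nvram_log = []    # this node's own journaled operations
        self.partner_log = []  # mirror of the partner's journaled operations
        self.partner = None
        self.disk = {}         # stands in for the storage devices this node owns

    def write(self, key, value):
        # Journal locally and mirror the operation to the HA-partner's NVRAM.
        op = (key, value)
        self.nvram_log.append(op)
        if self.partner is not None:
            self.partner.partner_log.append(op)
        return "ack"

    def take_over(self, failed):
        # Replay the mirrored log onto the failed node's storage devices,
        # after which this node could start serving the failed node's data.
        replayed = 0
        for key, value in self.partner_log:
            failed.disk[key] = value
            replayed += 1
        self.partner_log.clear()
        return replayed


def pair(a, b):
    # Hypothetical helper that links two nodes as HA-partners.
    a.partner, b.partner = b, a


node_a, node_b = Node("A"), Node("B")
pair(node_a, node_b)
node_a.write("x", "1")
node_a.write("y", "2")
# Node A fails before a consistency point; node B replays the mirrored log.
replayed = node_b.take_over(node_a)
print(replayed, node_a.disk)   # 2 {'x': '1', 'y': '2'}
```

The sketch replays every mirrored operation in order, so the time spent in `take_over` grows with the number of journaled operations.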
However, while the partner node is taking over for the failed node and is in the process of committing or replaying the NVRAM log, the partner node may need to warm its cache by accessing and retrieving some metadata (related to the journaled operations) from the persistent storage devices (typically hard disks) of the failed storage controller, which can be orders of magnitude slower than accessing system memory. As a result, client devices can experience a significant and noticeable outage window during which requested data or files on the failed storage node are inaccessible. With technological advances, larger NVRAM sizes have become more prevalent, causing an undesirable increase in the time it takes to replay the NVRAM log and thereby leading to longer recovery times.