As the value and use of information continue to increase, individuals and businesses seek additional ways to process and store information. One option available to users is information handling systems. An information handling system generally processes, compiles, stores, and/or communicates information or data for business, personal, or other purposes, thereby allowing users to take advantage of the value of the information. Because technology and information handling needs and requirements vary between different users or applications, information handling systems may also vary regarding what information is handled, how the information is handled, how much information is processed, stored, or communicated, and how quickly and efficiently the information may be processed, stored, or communicated. The variations in information handling systems allow for information handling systems to be general or configured for a specific user or specific use such as financial transaction processing, airline reservations, enterprise data storage, or global communications. In addition, information handling systems may include a variety of hardware and software components that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.
Information handling systems often use storage resources (e.g., hard disk drives, solid state drives, and/or arrays thereof) to store data. To provide resiliency against failure of individual storage resources, many systems employ redundancy of storage resources, oftentimes using a Redundant Array of Inexpensive Disks, or RAID. One type of RAID often utilized is parity-based RAID, such as RAID 5. In its simplest form, parity-based RAID works by writing stripes of data across three or more physical storage resources, writing a data strip to each of N−1 of the physical storage resources of the RAID array and a parity strip to the remaining physical storage resource of the RAID array, where “N” equals the number of devices in the RAID array. Each parity strip may be written as the logical exclusive OR (XOR) of the data strips within the same stripe as the parity strip. Accordingly, if a physical storage resource of a RAID array fails, the data and/or parity stored on the failed storage resource can be rebuilt by performing a logical XOR of the corresponding strips on the surviving storage resources.
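The XOR relationship described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming a hypothetical four-device stripe with two-byte strips; the helper name xor_strips and the sample values are not from any particular RAID implementation.

```python
from functools import reduce

def xor_strips(strips):
    """Return the byte-wise XOR of a list of equal-length strips."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))

# Hypothetical stripe: three data strips (N-1) plus one parity strip (N = 4).
data = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
parity = xor_strips(data)

# Simulate failure of the device holding data[1] and rebuild its strip
# from the surviving data strips and the parity strip.
rebuilt = xor_strips([data[0], data[2], parity])
assert rebuilt == data[1]
```

Because XOR is its own inverse, the same helper serves both to compute parity and to regenerate any single missing strip.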
Increasingly, storage systems are employing persistent memory as a write-back cache for RAID arrays, in order to improve storage performance in terms of throughput and latency. Oftentimes, such persistent memory is implemented with a volatile memory backed by a battery or other energy storage device and a non-volatile memory. Thus, responsive to a power fault of a power source for powering the write-back cache, the battery may provide the volatile memory with sufficient electrical energy to write cached data to the non-volatile memory, such that when power is again applied, cached data backed up to the non-volatile memory can be used to flush written data to the physical storage resources, thus reducing or eliminating data loss that would otherwise occur if cached data were lost before being flushed to the physical storage resources.
However, despite the advantages of existing persistent memory-based write-back cache implementations, certain conditions and scenarios exist in which use of such an implementation may still lead to data corruption. For example, one such scenario is known as a “write-hole problem,” wherein a first power loss occurs, and a second power loss then occurs shortly thereafter, during the cache flush of the data that was stored to non-volatile memory of the write-back cache in response to the first power loss. During this second power loss, another backup of data in the volatile portion of the cache may not take place, as the “dirty” cache data in the non-volatile memory of the cache may not be erased until all of that data is confirmed as having been flushed. Data corruption may arise because, during the cache flush interrupted by the second power event, data may have been written to a particular strip but not its associated calculated parity, or the calculated parity may have been written but not its associated data. After the second power event, when another cache flush is attempted from the data stored in the non-volatile memory, incorrect parity information or incorrect data information may be used in the read-modify-write operation that calculates the parity information to be written as part of that cache flush, which can result in the parity information being calculated incorrectly. Thus, if the RAID array subsequently becomes degraded due to a failure of a storage resource within the array, the incorrect parity may lead to incorrect calculation of data for the rebuilt physical storage resource replacing the failed physical storage resource, and accordingly, data may be corrupted.
To illustrate the write-hole problem, consider that data D2′ is new data for which parity P′ needs to be calculated as P′=P⊕D2⊕D2′, where P is the “old” parity data, D2 is the data to be overwritten by data D2′, and ⊕ is a logical XOR. In a cache flush, each of D2′ and P′ would need to be written to disk.
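As a brief check of the read-modify-write formula, the sketch below compares a parity computed from old parity, old data, and new data against a parity recomputed from every data strip in the stripe. The one-byte strips d1, d2, d3 and the helper xor are assumptions made for illustration.

```python
def xor(a, b):
    """Byte-wise XOR of two equal-length strips."""
    return bytes(x ^ y for x, y in zip(a, b))

# Hypothetical one-byte strips for a three-data-strip stripe.
d1, d2, d3 = b"\x0a", b"\x0b", b"\x0c"
p = xor(xor(d1, d2), d3)            # "old" parity P

d2_new = b"\xf0"                    # new data D2'
p_rmw = xor(xor(p, d2), d2_new)     # read-modify-write: P' = P xor D2 xor D2'
p_full = xor(xor(d1, d2_new), d3)   # full recomputation over the stripe

assert p_rmw == p_full
```

The two results agree only when the P and D2 read back are the matching old values, which is the assumption an interrupted flush can violate.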
If both P′ and D2′ are written during a cache flush, or neither is written, no data corruption results. However, if a second abrupt power loss occurs shortly after a first power loss such that P′ is written but D2′ is not, a read-modify-write operation during the cache flush occurring after the second power loss may lead to corruption of parity data: P″=P′⊕D2⊕D2′, which reduces to the old parity P even though D2′ is then written. On the other hand, if D2′ is written and P′ is not, the read-modify-write operation reads D2′ from disk in place of D2, and may likewise lead to corruption of parity data: P″=P⊕D2′⊕D2′, which also reduces to P. Thus, a subsequent rebuild or data regeneration using the P″ parity would lead to data corruption.
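The four possible outcomes of the interrupted flush can be simulated with a small model. This is a sketch under stated assumptions: the dictionary-based disk, the strip values, and the functions replay_flush and parity_consistent are hypothetical, and the replay is modeled as a plain read-modify-write that trusts whatever parity and data are currently on disk.

```python
def xor(*strips):
    """Byte-wise XOR of any number of equal-length strips."""
    out = bytes(len(strips[0]))
    for s in strips:
        out = bytes(x ^ y for x, y in zip(out, s))
    return out

def replay_flush(disk, d2_new):
    """Replay the flush after the second power loss using read-modify-write."""
    disk["P"] = xor(disk["P"], disk["D2"], d2_new)  # uses on-disk P and D2
    disk["D2"] = d2_new

def parity_consistent(disk):
    """True if on-disk parity equals the XOR of the on-disk data strips."""
    return disk["P"] == xor(disk["D1"], disk["D2"], disk["D3"])

d1, d2, d3, d2_new = b"\x0a", b"\x0b", b"\x0c", b"\xf0"
p = xor(d1, d2, d3)          # old parity P
p_new = xor(p, d2, d2_new)   # intended parity P'

# On-disk state left behind by the flush that the second power loss interrupted.
states = {
    "neither_written": {"D1": d1, "D2": d2, "D3": d3, "P": p},
    "both_written": {"D1": d1, "D2": d2_new, "D3": d3, "P": p_new},
    "parity_only": {"D1": d1, "D2": d2, "D3": d3, "P": p_new},
    "data_only": {"D1": d1, "D2": d2_new, "D3": d3, "P": p},
}

results = {}
for name, disk in states.items():
    replay_flush(disk, d2_new)
    results[name] = parity_consistent(disk)

# Rebuilding D1 from the stale parity left by the "parity_only" case
# yields a value other than the original D1.
torn = states["parity_only"]
rebuilt_d1 = xor(torn["P"], torn["D2"], torn["D3"])
```

In this model, results is True for the two cases in which the first flush wrote both strips or neither, and False for the two torn cases, where the replayed parity collapses back to the old P while the stripe holds D2′.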