Multi-device storage systems utilize multiple discrete storage devices, generally storage drives (solid-state drives, hard disk drives, hybrid drives, tape drives, etc.), for storing large quantities of data. These multi-device storage systems are generally arranged in an array of drives interconnected by a common communication fabric and, in many cases, controlled by a storage controller, redundant array of independent disks (RAID) controller, or general controller that coordinates storage and system activities across the array of drives. The data stored in the array may be stored according to a defined RAID level, a combination of RAID schemas, or other configurations for providing desired data redundancy, performance, and capacity utilization. In general, these data storage configurations may involve some combination of redundant copies (mirroring), data striping, and/or parity (calculation and storage), and may incorporate other data management, error correction, and data recovery processes, sometimes specific to the type of drives being used (e.g., solid-state drives versus hard disk drives).
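By way of illustration only, the parity calculation mentioned above can be sketched with a byte-wise XOR across the blocks of one stripe. The helper name xor_blocks, the four-byte block contents, and the three-data-plus-one-parity layout are assumptions chosen for the example and do not describe any particular RAID level or controller implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe: three data blocks, each held by a different drive.
stripe = [b"AAAA", b"BBBB", b"CCCC"]

# Parity block, assumed to be stored on a fourth drive.
parity = xor_blocks(stripe)

# Simulate losing the drive that holds the second block, then rebuild that
# block from the surviving data blocks and the parity block.
rebuilt = xor_blocks([stripe[0], stripe[2], parity])
assert rebuilt == stripe[1]
```

Because XOR is its own inverse, any single missing block of the stripe may be recovered in this way, which is the property that parity-based configurations rely on.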
Some multi-device storage systems employ storage devices capable of communicating with one another over the interconnecting fabric and/or network fabric. In some cases, these storage devices may be capable of peer-to-peer communication without the involvement of a storage control plane, such as a storage controller or host controller, as an intermediary. These peer storage devices may be capable of exchanging messages and/or transferring host data across the interconnecting fabric independent of the storage control plane. Reducing communication, data transfer, processing, and/or data management at the storage control plane may reduce bottlenecks and improve scalability as the number and capacity of storage devices increase.
Storage devices, particularly storage devices using flash memory for durable storage in transactional applications, are susceptible to data corruption over time. For example, data bits in flash memory may be corrupted by read, program, and erase sequences in which memory cells in physical proximity to those being accessed are unintentionally stressed to the point where stored charge levels change enough to induce bit errors. Even read operations impose stress that may affect both the cells being read and adjacent memory cells, particularly under high-volume repeated reads. While the error correction codes (ECC) used to encode the host data may enable recovery of some bit errors, accumulated corruption may, over time, exceed the correction capability of the ECC.
To combat the long-term effects of data corruption, some storage devices implement a data scrub process whereby data units are read and rewritten to enable ECC to correct accumulated errors and/or identify data units that have exceeded the capability of ECC to recover. These data scrubs may be triggered by a periodic schedule, read/write and/or endurance thresholds, and/or events, such as read or write errors. In some storage architectures, data scrubs are scheduled and managed at the storage control plane, and individual storage devices respond to data management commands to initiate targeted data scrubs. Management of data scrubs at the storage control plane may create processing and scheduling bottlenecks, underutilize available compute resources at the storage devices, and reduce scalability of storage arrays.
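By way of illustration only, the read, correct, and rewrite cycle of a data scrub can be sketched as follows. The sketch is a toy model under stated assumptions: a three-copy majority vote stands in for a real ECC, a CRC32 value (assumed to be stored without error) is used to recognize data units that the toy code cannot recover, and the names DataUnit, write_unit, decode, and scrub are hypothetical and do not correspond to any storage device interface.

```python
import zlib
from dataclasses import dataclass


@dataclass
class DataUnit:
    """Hypothetical stored unit: three payload copies plus a CRC32 value."""
    copies: list  # three bytearray copies (toy stand-in for a real ECC)
    crc: int      # assumed to be stored error-free in this sketch


def write_unit(payload: bytes) -> DataUnit:
    """Store a payload as three fresh copies with its CRC32."""
    return DataUnit([bytearray(payload) for _ in range(3)], zlib.crc32(payload))


def decode(unit: DataUnit) -> bytes:
    """Correct errors with a bitwise majority vote across the three copies."""
    a, b, c = unit.copies
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))


def scrub(units: dict) -> dict:
    """Read every unit, rewrite correctable ones, and report the outcome."""
    report = {"clean": [], "rewritten": [], "uncorrectable": []}
    for unit_id, unit in units.items():
        payload = decode(unit)
        if zlib.crc32(payload) != unit.crc:
            # Errors exceed what the toy code can correct.
            report["uncorrectable"].append(unit_id)
        elif any(bytes(copy) != payload for copy in unit.copies):
            # Rewriting the corrected payload clears the accumulated errors.
            units[unit_id] = write_unit(payload)
            report["rewritten"].append(unit_id)
        else:
            report["clean"].append(unit_id)
    return report


# Usage: unit 0 is left intact, unit 1 has one corrupted copy, and unit 2 has
# the same bit flipped in two copies, which defeats the majority vote.
units = {n: write_unit(b"host data %d" % n) for n in range(3)}
units[1].copies[0][0] ^= 0x01
units[2].copies[0][0] ^= 0x01
units[2].copies[1][0] ^= 0x01
report = scrub(units)
assert report == {"clean": [0], "rewritten": [1], "uncorrectable": [2]}
```

In this sketch the scrub runs entirely against locally held data units. Where such a loop is driven by commands from the storage control plane for every targeted scrub, the scheduling and reporting work accumulates at the control plane, which corresponds to the bottleneck described above.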
Therefore, there still exists a need for storage architectures that enable peer-to-peer communication for data scrub offloading from the storage control plane.