The capacity of flash-based solid-state drive (SSD) arrays continues to grow, with arrays reaching hundreds of terabytes. Such arrays are built from multiple flash nodes. Each flash node may in turn contain multiple flash memory units supplied by different vendors. On occasion, a flash translation layer (FTL) may detect at runtime that a flash memory unit is unresponsive, which may necessitate replacement of the unresponsive flash memory unit or of the flash node containing it.
Replacement of the unresponsive flash memory unit might not be necessary, however, because unresponsiveness at runtime may have various causes, such as a failure of the flash memory unit itself, environmental conditions (e.g., temperature, power supply, etc.), or software or hardware bugs. To limit operational cost, negative publicity, and revenue loss, and to improve manufacturing quality in the future, a vendor and/or manufacturer of the flash memory unit may want to diagnose the cause of the unresponsiveness and thereby avoid unnecessary replacement of the unresponsive flash memory unit.
Current state-of-the-art systems provide very limited information for diagnosing the cause of an unresponsive flash memory unit at runtime. Apart from information about the flash memory unit itself, important information affecting normal operation of the flash memory unit is generally not available, especially information about the runtime environment at the moment the unresponsiveness is detected. This lack of information may greatly hinder accurate diagnosis of the detected unresponsiveness of the flash memory unit.