Maintenance and support for systems such as data storage systems (e.g., a storage array system) often require human observation of the state of system resources such as central processing unit (CPU) usage, memory footprint, network traffic, system temperature, solid-state disk (SSD) wear, hard disk drive (HDD) wear, and other system components and conditions. Resolution of anomalous conditions requires human intervention, and this intervention effort can range from fairly simple steps to very involved and complicated processes.
Even for processes that involve only simple steps, mistakes in carrying them out can lead to expensive downtime for the system and, in the worst cases, to customer data loss. This intervention effort starts with awareness that there is an anomalous condition with the storage array that adversely affects its ability to accomplish its primary functions. The current state of the storage array's ability to accomplish its primary functions is referred to as its “system health.” Existing techniques for monitoring system health, particularly in the case of storage array systems, pose many challenges.
For example, in a storage array system with storage devices such as SSDs and HDDs, maintenance typically requires human intervention in the form of observing changes in disk end-of-life (EOL) estimates. As is known, SSDs and HDDs have finite operational life spans due to wear, which must be estimated and monitored. This human intervention requires consistent monitoring of disk statistics and usage measurements. The task is challenging, especially in storage arrays with large numbers of SSDs and HDDs.