A disk in a computer is the main device for storing the data that applications operate on. Whatever its type, be it an HDD, an SSD or even a magnetic tape, the disk will eventually fail after long use. If data backup or archiving is not properly carried out before the time of failure, the data on the disk will be lost. This may be disastrous, since the disk may hold data important to the user as well as the operating system and configuration data of the computer system. Usually, a disk shows some signs prior to its malfunction. For example, stored data disappears or programs fail too often. A user may easily notice the signs and take action to replace the disk and save the data therein. This is feasible because the computer may have only a few disks and the user can observe the disks through the performance of the computer every day.
The architecture on which a cloud-based service system runs encounters the same challenge with disks. However, the situation is more complex because the architecture usually includes a huge number of disks for data storage and access. Depending on the data stored, one disk may be accessed much more often than others. Frequent use is a factor that shortens the lifetime of disks. However, it is very hard to observe the physical performance of each disk regularly. Executing data backups often and replacing failed disks is not an economical way for the administrator of the cloud-based service system to settle the problem. Therefore, some techniques have been disclosed to monitor disks in clusters and predict the lifetime of the disks. For example, US patent application No. US2016232450 provides a storage device lifetime monitoring system for monitoring lifetimes of storage devices and a storage device lifetime monitoring method thereof. The method has steps to collect operation activity information corresponding to the storage devices, store multiple training data having the operation activity information and corresponding operation lifetime values, construct a storage device lifetime predicting model according to the operation activity information and the corresponding operation lifetime values of the training data, input the operation activity information of the storage devices into the storage device lifetime predicting model to generate a predicted lifetime value corresponding to each of the storage devices, and re-construct the storage device lifetime predicting model according to the operation activity information and the predicted lifetime value of each storage device. Thus, the lifetime of the storage devices can be accurately predicted.
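The steps of the method summarized above (collect, store training data, construct a model, predict, re-construct) can be illustrated with a minimal sketch. It is not the implementation of the cited application: the single feature (abnormal log records per 1,000 hours), the straight-line least-squares model, the disk names and all numeric values are assumptions chosen only to make the flow concrete.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical feature: abnormal log records per 1,000 hours of use.
    error_rate: float
    # Observed (or previously predicted) lifetime in hours.
    lifetime_hours: float


def fit_line(samples):
    """Construct a model: least-squares line lifetime = slope * rate + intercept."""
    n = len(samples)
    mean_x = sum(s.error_rate for s in samples) / n
    mean_y = sum(s.lifetime_hours for s in samples) / n
    sxx = sum((s.error_rate - mean_x) ** 2 for s in samples)
    sxy = sum((s.error_rate - mean_x) * (s.lifetime_hours - mean_y)
              for s in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, error_rate):
    """Generate a predicted lifetime value for one storage device."""
    slope, intercept = model
    return slope * error_rate + intercept


# Stored training data (illustrative values, not measurements).
training = [Sample(1.0, 5000.0), Sample(2.0, 4500.0), Sample(3.0, 4000.0)]
model = fit_line(training)

# Operation activity information collected from disks still in service.
in_service = {"disk-a": 1.5, "disk-b": 2.5}
predicted = {name: predict(model, rate) for name, rate in in_service.items()}

# Re-construct the model with the activity information and predicted values.
training += [Sample(in_service[name], predicted[name]) for name in in_service]
model = fit_line(training)
```

A model of this kind sees only what the logs summarize, which is the limitation discussed next.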
The patent application cited above uses data (operation activity information) from logs, e.g. the system log, application log, or database log, for training and for predicting lifetime. Although the data in the logs may not tell the exact condition of a disk, some hint of the health status of the disk can still be obtained from the records, as there is a relation between the abnormal records and the real lifetime of the corresponding disk. The method is thus able to use historical data effectively in prediction. If the method could precisely determine the lifetime of all disks from what the logs reveal, the real lifetime of a specific model should fall within a certain range, for example, from 4,000 to 5,000 hours of use, given the same manufacturing processes and quality requirements. In fact, however, among disks of the same model, some work for only a short time, some work for a very long time, and only the majority have lifetimes within the predicted range. Even when similar operation activity information is available for two disks, their lifetimes may be far from each other. This means that some key factors are missing from the analysis.
For similar logs from two disks with different lifetimes, if we look at the I/O patterns of performance data, e.g. IOPS (Input/Output Operations Per Second), latency, and throughput, or at related information, e.g. CPU (Central Processing Unit) load or memory usage of the host, it can be found that the two disks run differently, and this difference could be the factor causing the different lifetimes. For example, two disks have similar access and failure records over a year. One was accessed intensively during three months, while the other was accessed evenly throughout the year. Therefore, a method that provides a more precise life expectancy of disks in a cloud-based service system and, furthermore, extends the life expectancy of the disks by analyzing the I/O patterns is desired.
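The two-disk example above can be made concrete with a short sketch. The monthly figures, the function name io_pattern_summary and the two chosen measures (peak-to-mean ratio and coefficient of variation) are assumptions for illustration only. They show how two disks with the same yearly access total can still differ clearly in their I/O pattern.

```python
from statistics import mean, pstdev


def io_pattern_summary(monthly_ops):
    """Summarize how evenly a disk's I/O is spread over the months."""
    avg = mean(monthly_ops)
    return {
        "total": sum(monthly_ops),
        # Busiest month relative to the average month (1.0 means perfectly even).
        "peak_to_mean": max(monthly_ops) / avg,
        # Coefficient of variation: spread of the monthly load relative to its mean.
        "cv": pstdev(monthly_ops) / avg,
    }


# Hypothetical monthly I/O counts, in thousands of operations.
# Both disks total 1,200 over the year, so a yearly log summary looks alike.
bursty_disk = [10, 10, 10, 370, 370, 370, 10, 10, 10, 10, 10, 10]
even_disk = [100] * 12

bursty = io_pattern_summary(bursty_disk)
even = io_pattern_summary(even_disk)

print(bursty)
print(even)
```

A yearly total alone cannot separate the two disks, whereas either pattern measure does, which is why the I/O pattern is a candidate input for a more precise life expectancy estimate.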