In general, the amount of data generated across industries has been increasing rapidly, leading to a need for intelligent, proactive, and accurate data management. For example, in the science and engineering disciplines, the volume of generated data has been increasing at a massive rate. Petabyte-scale data centers, for example, are generally used for storing and managing data from various scientific and engineering simulations. As the scientific and engineering fields progress technically, both the speed at which data is generated and the number of data types to be managed have been increasing. Accordingly, data storage and management are integral to the operation of such data centers. However, significant planning and estimation are required to predict the storage requirements of such data centers. Further, as more and more data is generated over time, such data centers end up with islands of storage built on different technologies from different vendors. This causes the administrative costs of maintaining such data centers to increase exponentially over time, and these administrative costs can contribute significantly to the total cost of ownership (TCO) of petabyte-scale data centers. Hence, there is a need for an easy-to-use and flexible data management technology that helps manage the massive amounts of data requiring storage and that reduces the TCO. Further, there is a need for a completely automated and intelligent data management technology.
As one particular example of data management, consider data centers which store genomes and genetic information as data. With the advent of next-generation genetic sequencers, the data generated per sequencer has increased exponentially, and in excess of 25 TB of data may be generated on a daily basis. In addition, the cost of sequencing has dropped drastically, in turn leading to greater and greater data generation per data center as more and more genetic sequencers are brought into operation. A primary goal of genome applications is to analyze the massive amounts of data generated by the sequencers and to generate analysis results, which are used in downstream analysis to study the significance of genomics and other life sciences data. In the case of engineering applications, by contrast, downstream analysis typically includes building simulation models from upstream analysis results. A key challenge in genome data centers is to manage the processed data while also managing the large amounts of new data from the sequencers. In view of the foregoing problem, there is a need for a data management technology which can proactively predict the usage of data and migrate lower-priority, processed data to cheaper storage while keeping primary storage capacity available for newly generated sequencer data. As another example, oil and gas exploration similarly involves applications which generate large amounts of data from seismic studies, which require data management and can be subjected to unpredictable workloads. In the case of oil and gas data, volumes of up to 50 TB may be generated on a weekly basis.
Several data management technologies and solutions have been proposed in the prior art to reduce administrative costs and help manage the massive amounts of data which require storage. In a heat map-based approach, cold data pages are migrated to cheaper storage tiers and hot data pages are migrated to a high-performance primary storage tier. The number of read/write operations per second is used as a reference to classify data pages as hot or cold. The migration is made transparent to applications by using a page mapping table to map logical pages to the physical pages where the data currently resides. The decision whether to migrate further depends on a threshold set for the primary storage. For example, cold data is not migrated to a secondary storage tier if enough capacity is left in the primary storage tier. If a surge of new data then needs to be stored on the primary storage tier, providing such storage could become a bottleneck. Thus, a heat map-based approach does not provide proactive data management. Further, because no data management occurs in the heat map-based approach until additional primary storage capacity is needed, the processing load caused by migration arrives at exactly that time and can reduce access speeds. In addition, it is difficult to manage the impact that new applications and updates to existing applications will have on serving the existing data from storage.
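The threshold-gated behavior of the heat map-based approach can be illustrated with a minimal Python sketch. It is not taken from any particular prior-art system: the names `Page`, `plan_demotions`, `apply_demotions`, `fill_threshold`, and `hot_iops`, as well as the sample sizes and rates, are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Page:
    logical_id: int
    size_gb: float
    iops: float            # observed read/write operations per second
    tier: str = "primary"  # "primary" or "secondary"


def plan_demotions(pages, primary_capacity_gb, fill_threshold, hot_iops):
    """Return logical ids of cold pages to move to the secondary tier.

    Nothing is demoted while primary utilization is at or below the
    threshold, which is the reactive behavior described in the text.
    """
    primary = [p for p in pages if p.tier == "primary"]
    used_gb = sum(p.size_gb for p in primary)
    limit_gb = fill_threshold * primary_capacity_gb
    if used_gb <= limit_gb:
        return []

    # Coldest pages first, so the least-used data leaves primary storage first.
    cold = sorted((p for p in primary if p.iops < hot_iops), key=lambda p: p.iops)
    demoted = []
    for page in cold:
        if used_gb <= limit_gb:
            break
        demoted.append(page.logical_id)
        used_gb -= page.size_gb
    return demoted


def apply_demotions(pages, demoted_ids, page_table):
    """Update tiers and the logical-to-physical page mapping table."""
    by_id = {p.logical_id: p for p in pages}
    for logical_id in demoted_ids:
        by_id[logical_id].tier = "secondary"
        # Applications keep using the logical id; only the mapping changes.
        page_table[logical_id] = ("secondary", logical_id)


# Hypothetical sample data: one hot page and two cold pages.
pages = [
    Page(1, 40.0, iops=500.0),
    Page(2, 30.0, iops=2.0),
    Page(3, 20.0, iops=0.5),
]
page_table = {p.logical_id: ("primary", p.logical_id) for p in pages}

# 90 GB used of 100 GB with an 80% threshold, so migration is triggered.
to_move = plan_demotions(pages, primary_capacity_gb=100.0,
                         fill_threshold=0.8, hot_iops=100.0)
apply_demotions(pages, to_move, page_table)
```

In this sketch no page moves until utilization has already crossed the threshold, so the migration work coincides with the demand for new primary capacity, which is the drawback noted above.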
In another approach, an attempt is made to provide proactive data management using performance and availability requirements that are predefined for different data types based on their temporal characteristics. However, such a solution requires that the requirements for each data type be manually predefined. As such, the foregoing management solution fails to provide fully automated data management. In addition, relying on temporal characteristics can result in erroneous data management because the types of applications which access the data are not considered.
Further, an approach to data management has been provided in which data usage behavior is learned and a knowledge base is created as a reference for managing other data with similar characteristics. In this approach, every data object is assigned a management class using assignment logic. The assignment logic uses predefined rules to search the knowledge base for a similar data object. This similarity search uses static attributes such as the data object type, the node where the data object was created, and the size of the data object. If a match is found, the matched data object's usage history is retrieved, such as its creation time, when it was last used, when it was compressed, when it was downloaded, which application created it, and so on. This usage history is used to assign a management class to the new data object. If a match is not found in the knowledge base, the data object is added to the knowledge base and its usage is tracked for future management class assignment. This approach uses temporal analysis to learn data management needs and apply them to similar data object types. However, under varying workloads, it cannot accurately determine when to apply data management processing. For instance, if a data object A of type X was processed by an application and later compressed on a specific date, another data object B of type X will be expected to be compressed after the same time interval due to the temporal nature of the analysis. However, if the data object B is processed under a different system load, the compression may happen sooner or later than estimated by this approach. Hence, temporal analysis cannot accurately determine when a data management action has to be taken.
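The assignment logic described above can be sketched in Python as follows. This is an illustrative reading of the text rather than the prior-art implementation: `StaticAttrs`, `size_bucket`, `derive_class`, `assign_management_class`, the `days_until_compressed` field, and the rule that turns a usage history into a management class are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StaticAttrs:
    """Static attributes used for the similarity search."""
    object_type: str
    origin_node: str
    size_bucket: str  # hypothetical coarse size class: "small" or "large"


@dataclass
class KnowledgeEntry:
    # Tracked usage history, e.g. {"days_until_compressed": 30}.
    usage_history: dict = field(default_factory=dict)


def size_bucket(size_bytes, large_threshold=1 << 30):
    """Bucket sizes so that objects of similar size can match (assumed rule)."""
    return "large" if size_bytes >= large_threshold else "small"


def derive_class(history):
    """Hypothetical rule mapping a usage history to a management class."""
    days = history.get("days_until_compressed")
    if days is None:
        return "default"
    return f"compress-after-{days}d"


def assign_management_class(knowledge_base, object_type, origin_node, size_bytes):
    key = StaticAttrs(object_type, origin_node, size_bucket(size_bytes))
    entry = knowledge_base.get(key)
    if entry is None:
        # No match: add the object and start tracking its usage.
        knowledge_base[key] = KnowledgeEntry()
        return "default"
    # Match: reuse the matched object's usage history.
    return derive_class(entry.usage_history)


kb = {}

# Data object A of type X is new, so it is added to the knowledge base.
class_a = assign_management_class(kb, "X", "node-1", 2 << 30)

# Later, tracking records that A was compressed 30 days after creation.
kb[StaticAttrs("X", "node-1", "large")].usage_history["days_until_compressed"] = 30

# Data object B of type X inherits the same interval, whatever the system load.
class_b = assign_management_class(kb, "X", "node-1", 3 << 30)
```

Because the lookup key contains only static attributes, data object B receives the interval learned from data object A even if B is processed under a very different system load, which is the inaccuracy the paragraph describes.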