Data has a lifecycle. As data progresses through its lifecycle, it experiences varying levels of activity. When data is created, it is typically used heavily. As it ages, it is typically used less frequently. In recognition of this, developers have worked to create systems and methods for ensuring that heavily used data are readily accessible, while less frequently used data are stored in a more remote location. Correlations between location of data storage and frequency of access to data are necessary because storage has an inherent cost. Generally speaking, the faster the storage medium is at providing access to data, the costlier the storage medium.
Two concepts bearing on the tradeoffs between data usage and storage costs are relevant for purposes of the embodiments disclosed herein. First, data storage systems involving tiered storage have emerged. These systems include multiple tiers of non-volatile storage with each tier providing a different quality of service. For example, a system may include a first tier (Tier 1) for SSDs (solid state drives) or cache, a second tier (Tier 2) for SAS (Serial-Attached SCSI) drives, and a third tier (Tier 3) for SATA (Serial Advanced Technology Attachment) drives. In alternate arrangements of tiered storage, a cloud-based storage system could be implemented as Tier 2 or Tier 3 storage. As advances in data storage media and speeds are realized over time, the types of storage used and the tiers in which they are used may vary.
Tiered data systems manage placement of data on the different storage tiers to make the best use of disk drive speed and capacity. For example, frequently accessed data may be placed on Tier 1 storage. Less frequently accessed data may be placed on Tier 2 storage. And seldom accessed data may be placed on Tier 3 storage.
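By way of illustration only, the placement rule described above can be sketched as a simple threshold lookup. The threshold values, the accesses-per-day metric, and the names `Tier` and `choose_tier` are hypothetical and are chosen solely for this example; an actual system may use different criteria.

```python
from enum import IntEnum


class Tier(IntEnum):
    TIER_1 = 1  # fastest storage, e.g. SSDs or cache
    TIER_2 = 2  # e.g. SAS drives
    TIER_3 = 3  # e.g. SATA drives


# Hypothetical thresholds, in accesses per day (assumed for illustration).
FREQUENT_THRESHOLD = 100
SELDOM_THRESHOLD = 10


def choose_tier(accesses_per_day: float) -> Tier:
    """Map an access frequency to a storage tier."""
    if accesses_per_day >= FREQUENT_THRESHOLD:
        return Tier.TIER_1  # frequently accessed data
    if accesses_per_day >= SELDOM_THRESHOLD:
        return Tier.TIER_2  # less frequently accessed data
    return Tier.TIER_3      # seldom accessed data
```

Under these assumed thresholds, a data set accessed 500 times per day would be placed on Tier 1, and one accessed twice per day would be placed on Tier 3.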
The second concept of importance is the notion of categorizing data so that it can be stored on the most appropriate tier. Temperature variance has been used as a framework for distinguishing between data that is frequently used, i.e., “hot” data, and data that is less frequently used, i.e., “cold” data.
A significant challenge in the categorization of data within tiered data storage systems is the effect time has on data categorization. Typically, data are hot for a limited amount of time. In addition, determining the data temperature consumes computing resources, which requires prudence in judging how frequently to assess the temperature of the vast amounts of data that can be stored in a database. Furthermore, moving data among the tiers also consumes substantial computing resources, which again necessitates tradeoffs in terms of overall resource allocation.
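The effect of time on data temperature can be illustrated with a minimal sketch that counts accesses within a recent time window. The sliding-window measure, the function name `temperature`, and its parameters are assumptions made for this example only; the text above does not prescribe any particular way of computing temperature.

```python
from typing import Iterable


def temperature(access_times: Iterable[float], now: float, window: float) -> int:
    """Count accesses that fall within the last `window` time units.

    Sliding-window counting is an assumed measure, used here only to
    show that the same data can be hot at one time and cold at another.
    """
    return sum(1 for t in access_times if now - window < t <= now)


# Hypothetical access timestamps: heavy use early, little use later.
accesses = [1.0, 2.0, 3.0, 4.0, 60.0]
early = temperature(accesses, now=5.0, window=10.0)    # shortly after creation
late = temperature(accesses, now=100.0, window=10.0)   # much later
```

Each call scans the access history, which is one reason that assessing temperature too often across a large database can become costly.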
Some data storage systems perform automatic storage tiering. These systems monitor the activity of storage elements and move data between storage tiers to best utilize available resources and promote efficiency. For example, a Tier 2 data set may be moved to Tier 1 if the automated system determines the data have become hotter. Similarly, data may be demoted from Tier 1 to Tier 2 if the system determines the data have become colder.
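By way of example, and without limitation, the promotion/demotion decision described above might be sketched as follows. The function name `plan_moves`, the temperature scale, and the `hot` and `cold` thresholds are hypothetical. Moving a data set by one tier at a time is likewise an assumption of this sketch.

```python
from typing import Dict, List, Tuple

FASTEST_TIER = 1
SLOWEST_TIER = 3


def plan_moves(
    current_tier: Dict[str, int],
    temperature: Dict[str, float],
    hot: float = 100.0,   # assumed promotion threshold
    cold: float = 10.0,   # assumed demotion threshold
) -> List[Tuple[str, int, int]]:
    """Return (data_set, from_tier, to_tier) for each recommended move."""
    moves = []
    for data_set, tier in current_tier.items():
        temp = temperature.get(data_set, 0.0)
        if temp >= hot and tier > FASTEST_TIER:
            moves.append((data_set, tier, tier - 1))  # promote: data became hotter
        elif temp < cold and tier < SLOWEST_TIER:
            moves.append((data_set, tier, tier + 1))  # demote: data became colder
    return moves
```

For instance, a Tier 2 data set whose temperature rises above the assumed `hot` threshold would be listed for promotion to Tier 1, while a Tier 1 data set that has cooled below `cold` would be listed for demotion to Tier 2.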
Automated storage tiering algorithms are typically run on a central processing unit, which is itself part of the data storage system. The system resources required to compute data temperatures for purposes of assessing whether data should be reallocated to a different tier are significant, especially for large enterprise databases. In addition, once the system determines which data should be moved from one tier to another, executing the various read/write/copy functions necessary to move the data is itself resource intensive. Further compounding the data movement issue is the fact that the temperature of the data is in constant flux.
To address these and other issues, typical automated storage tiering systems perform pre-scheduled reviews of data use statistics. Within a given scheduled review window, the system's CPU is tasked with evaluating data temperature, identifying candidate data segments for promotion/demotion within the tiered storage, and moving the identified data segments to a new tier. The window within which these tasks are performed is finite. If the system is unable to complete all of the tasks, some of the data will not be moved, and the system will begin again with evaluating data temperature in the next scheduled review period. For each cycle where data are unable to be relocated, database performance could degrade and storage level objectives may be missed.
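The finite review window can be illustrated with a minimal sketch in which evaluation and each move consume abstract units of time from a fixed budget. The function name `run_review_cycle`, the cost units, and the policy of stopping at the first move that no longer fits are all assumptions made for this example.

```python
from typing import List, Tuple


def run_review_cycle(
    candidates: List[Tuple[str, int]],  # (segment, move_cost), in priority order
    window: int,                        # total time units in the review window
    evaluation_cost: int,               # time spent evaluating data temperature
) -> Tuple[List[str], List[str]]:
    """Return (moved, not_moved) segments for one scheduled review window."""
    remaining = window - evaluation_cost  # evaluation uses part of the window
    moved: List[str] = []
    not_moved: List[str] = []
    window_open = remaining > 0
    for segment, cost in candidates:
        if window_open and cost <= remaining:
            remaining -= cost
            moved.append(segment)
        else:
            # The window is exhausted: this and all later segments stay put
            # and must be re-evaluated in the next scheduled review period.
            window_open = False
            not_moved.append(segment)
    return moved, not_moved
```

With a window of 12 units, an evaluation cost of 2, and candidate moves costing 4, 5, and 2 units, the first two segments would be moved and the third would remain on its current tier until the next cycle.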
Review cycle times are typically governed by the service level agreement. By way of example, and without limitation, a review cycle may be once a day, every hour, or as often as every ten minutes. Once a review cycle has been completed, the system is tasked with relocating data that has been flagged for tier relocation. If all of the data that has been flagged for relocation is not relocated within the allocated timeframe, the analysis must start afresh because data temperature is in constant flux. In other words, historic read/write/pre-fetch statistics are not reusable from one review cycle to the next.
Users can specify criteria to be used when making determinations for data promotion/demotion. These criteria are typically part of the service level agreement for the data storage system. In addition, users can alter the timeframe for, and the period within which, promotion/demotion analytics are gathered and executed. Even with this flexibility, however, a number of inefficiencies remain. First, by way of example and without limitation, the use of CPU processing power to perform backend functions reduces the amount of CPU power available for client-facing operations. Second, if during a given promotion/demotion evaluation cycle there is insufficient time to perform the recommended promotions/demotions, database performance continues to degrade by virtue of improper tier locations for data. Third, if data relocation is not completed within a given cycle, some of the CPU power devoted to calculating relocation candidates is wasted because that task must be performed again in the next review cycle. There is thus a need for a backend data promotion/demotion engine to address these and other shortcomings in the art.