Many information technology (“IT”) operations and activities can be scheduled to run one or more times within some periodic cycle (daily, weekly, monthly, quarterly, etc.). One such application can be data backup. Data backups can be essential to the preservation and recovery of data in the event of data loss, for example. To avoid interfering with daily user activities, data backups can be performed during periods of low application server utilization, typically on weeknights and on weekends. The backup job workload can be the same or different depending on how much data needs to be protected and when. In some applications, backup jobs can be scheduled and/or configured using a commercial backup application, operating system shell scripting, and/or in any other manner.
Backup applications employ a plurality of techniques to manage data designated for backup. One such technique includes deduplication. Deduplication can be used to eliminate redundancy in the execution of periodically executed backup tasks. In some cases, deduplication can reduce data storage capacity consumption as well as inter-site network bandwidth. It can do so by identifying and eliminating similar and/or identical sequences of bytes in a data stream. Deduplication can also include computation of cryptographic and/or simple hashes and/or checksums, as well as one or more forms of data compression (e.g., file compression, rich media data compression, delta compression, etc.).
In one form of data deduplication for backup data, the deduplicated data can be stored as a collection of thousands to millions (or any other number) of version clusters. Each version cluster can represent a related collection of data zones that have been determined to be similar to each other in their size and content. Deduplication can be achieved by maintaining an anchor against which all other similar zones can be delta compressed.
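As a minimal illustration of the anchor-based arrangement described above, the following Python sketch models a version cluster as one anchor plus a list of deltas. It is not the method of any particular backup product. The names `VersionCluster`, `add_zone`, and `read_zone` are hypothetical, and delta compression is approximated here by supplying the anchor as a preset dictionary to zlib's DEFLATE, which can only reference roughly the last 32 KB of that dictionary.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class VersionCluster:
    """Hypothetical model: one anchor zone plus deltas of similar zones."""
    anchor: bytes
    deltas: list = field(default_factory=list)

    def add_zone(self, zone: bytes) -> int:
        """Delta-compress a similar zone against the anchor; return its index."""
        # Assumption: the anchor serves as a zlib preset dictionary, so byte
        # sequences shared with the anchor become short back-references.
        comp = zlib.compressobj(zdict=self.anchor)
        self.deltas.append(comp.compress(zone) + comp.flush())
        return len(self.deltas) - 1

    def read_zone(self, index: int) -> bytes:
        """Rebuild a zone from its delta; the anchor must be available."""
        decomp = zlib.decompressobj(zdict=self.anchor)
        return decomp.decompress(self.deltas[index]) + decomp.flush()
```

In this sketch, every read or write of a dependent zone requires the anchor to be in memory, which is why the cost of fetching anchors matters for overall throughput.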
During the delta compression process, most new zones that are created by partitioning an incoming backup data stream can be matched with an anchor of an existing version cluster. Delta compression can be performed against these matched pairs. The new zone from the incoming backup stream can be resident in main memory; however, the anchor that it has been matched with can be fetched from secondary storage (e.g., a magnetic disk drive, a RAID-configured group of disk drives, etc.). As each new incoming zone is processed, fetching the existing anchor from the magnetic disk can involve relatively slow electro-mechanical operations, which can include a random-distance head seek operation and/or a rotational delay before data can be transmitted from the disk(s) to memory for delta compression.
By way of background, a data storage device used for storing and retrieving information can include rapidly rotating disks that can be coated with magnetic material. The data can be read in a random-access manner, where individual blocks of data can be stored and/or retrieved in any order. To achieve that, the data storage device can include one or more rigid, rapidly rotating disks with magnetic heads arranged on a moving actuator arm to read and write data to the surfaces. The above electro-mechanical operations can be incurred by the data storage device before each write and/or read operation takes place. The random-distance head seek operation can involve moving the heads radially between the edge and the center of the disk to land on the requested cylinder and/or track, where the data to be read and/or written is stored. The seek time can be measured in milliseconds (“ms”). Further, the rotational delay relates to the time it takes for a sector within a track to become available under the head. The rotational delay can be determined by the rotational speed of the drive, which is typically in the range of 5000 to 15000 revolutions per minute (“RPM”). At 7200 RPM, one revolution takes approximately 8.33 ms, so the maximum rotational latency can be approximately 8.33 ms, with the average latency at half that value. Thus, a combination of the above latencies can be in the tens of milliseconds prior to data transfer. When accessing data sequentially along a track, once the first sector is accessed, the latencies can be near zero.
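The rotational latency arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and the seek time passed to `random_access_ms` is an assumed input rather than a figure taken from any particular drive.

```python
def rotational_latency_ms(rpm: float) -> tuple[float, float]:
    """Return (maximum, average) rotational latency in milliseconds."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    ms_per_revolution = 60_000.0 / rpm  # 60,000 ms in one minute
    # Worst case waits a full revolution; on average, half a revolution.
    return ms_per_revolution, ms_per_revolution / 2.0


def random_access_ms(seek_ms: float, rpm: float) -> float:
    """Estimated delay before transfer: seek time plus average rotational latency."""
    _, average = rotational_latency_ms(rpm)
    return seek_ms + average
```

For example, `rotational_latency_ms(7200)` yields approximately 8.33 ms and 4.17 ms, matching the figures given above.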
However, conventional backup systems, such as those using NAND-flash solid-state disks (“SSDs”), are not capable of identifying subsets of the “most popular” existing anchors (the anchors that can have more dependent data versions (or delta-compressed versions) attached to them than other anchors) that can be joined with new zones of data in a backup stream for the purposes of delta compression. Thus, the conventional systems suffer from increased data seek and recovery times as well as rotational delay time. In addition, the data transfer rate associated with conventional solid-state disk drive systems for moving data into memory is substantially reduced (e.g., by a factor of five).
Further, a deduplication system can contain millions of version clusters, each with a single anchor from which new delta-compressed versions can be created. Because conventional solid-state disk drives can be more costly than magnetic disk drives, they can be cost-effective if deployed as a small (e.g., 5-10%) fraction of the overall disk storage capacity of a system. In that regard, the conventional systems cannot perform an adaptive, scheduled, periodic caching of data that can determine which anchors are the most popular so that they can be cached in a solid-state disk drive to maximize the capabilities of the cache in a deduplication session.
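To make the notion of anchor popularity concrete, the following Python sketch ranks anchors by their number of dependent delta-compressed versions and fills a fixed cache budget with the highest-ranked ones. It is only an illustration under stated assumptions: the function name, the tuple layout, and the greedy selection policy are all hypothetical and are not drawn from any existing system.

```python
def select_anchors_for_cache(anchors, cache_bytes):
    """Greedily pick the most popular anchors that fit in cache_bytes.

    anchors: iterable of (anchor_id, size_bytes, dependent_count) tuples.
    Returns the list of chosen anchor_ids, most popular first.
    """
    chosen, used = [], 0
    # Popularity here means the number of dependent delta-compressed versions.
    ranked = sorted(anchors, key=lambda a: a[2], reverse=True)
    for anchor_id, size, _count in ranked:
        # Skip anchors that would overflow the budget; smaller ones may still fit.
        if used + size <= cache_bytes:
            chosen.append(anchor_id)
            used += size
    return chosen
```

With a cache sized at a small fraction of total capacity, `cache_bytes` would be set accordingly, and the selection could be recomputed on a schedule as popularity counts change.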