De-duplication may be referred to as “dedupe”. A dedupe data set may include an index, a repository of sub-blocks, and re-creation information. The index may be configured to facilitate locating a stored sub-block. The re-creation information may be configured to facilitate assembling related sub-blocks into larger items (e.g., files). The repository of sub-blocks may be configured to ease accessing stored sub-blocks. A dedupe system creates and/or manages the data set to facilitate determining whether a sub-block under consideration is a duplicate sub-block or a unique sub-block and to facilitate reducing either the amount of duplicate data stored or the amount of duplicate data transmitted. The dedupe data set is stored on physical media on physical devices.
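By way of illustration only, the three logical parts of a dedupe data set described above can be sketched in a few lines of Python. The sketch is not taken from any particular dedupe system. The class name DedupeDataSet, the method names ingest and recreate, the fixed sub-block size, and the use of SHA-256 fingerprints as index keys are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DedupeDataSet:
    """Hypothetical in-memory dedupe data set (illustrative names only)."""

    # Index: fingerprint of a sub-block -> its position in the repository.
    index: Dict[bytes, int] = field(default_factory=dict)
    # Repository: the unique sub-blocks, in the order they were first seen.
    repository: List[bytes] = field(default_factory=list)
    # Re-creation information: item name -> ordered repository positions.
    recipes: Dict[str, List[int]] = field(default_factory=dict)

    def ingest(self, name: str, data: bytes, size: int = 4) -> None:
        """Split data into fixed-size sub-blocks (an assumption) and store
        only the sub-blocks that are not already in the repository."""
        positions = []
        for start in range(0, len(data), size):
            sub_block = data[start:start + size]
            key = hashlib.sha256(sub_block).digest()
            pos = self.index.get(key)
            if pos is None:  # unique sub-block: store it and index it
                pos = len(self.repository)
                self.repository.append(sub_block)
                self.index[key] = pos
            positions.append(pos)  # duplicate or unique, record the reference
        self.recipes[name] = positions

    def recreate(self, name: str) -> bytes:
        """Assemble the stored sub-blocks back into the larger item."""
        return b"".join(self.repository[p] for p in self.recipes[name])
```

In this sketch, ingesting an item whose sub-blocks repeat adds each distinct sub-block to the repository once, while the re-creation information still records every occurrence so that the item can be reassembled.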
One physical medium and device on which a dedupe data set may be stored is random access memory (RAM). RAM is generally readily accessible to a processor executing dedupe processes and provides relatively fast random access as compared to other random access devices (e.g., disk) and other media. When sub-blocks are stored in RAM, the sub-blocks can be acquired using random accesses that may involve a bus access but no external input/output (i/o). Similarly, when information for re-creating a larger item (e.g., file) is stored in RAM, the information can be quickly accessed. Additionally, when the index is stored in RAM, index locations can be accessed using efficient random accesses.
Unfortunately, RAM is currently a finite resource and is also a relatively expensive storage medium as compared to other physical devices (e.g., disk, tape). Thus a device performing a de-duplication process likely has access to a finite amount of RAM. Since RAM is finite, neither all the sub-blocks for a dedupe data set, nor the re-creation information, nor the index can be stored completely in RAM. Therefore, at least some sub-blocks, re-creation information, and/or index portions are stored on some media and some device other than RAM. Conventionally, there may have been insufficient attention paid to how sub-blocks, re-creation information, and/or index portions should be arranged on these other storage media and devices. When attention was directed at how dedupe data, re-creation information, and/or index portions should be stored on other media and devices, the attention was typically applied at the generic, theoretical level, rather than at the actual observed data set level.
Disk is one additional storage medium and device used in dedupe. A disk generally includes a spinnable platter(s) with a movable read/write head(s). Disks, like RAM, are generally considered to be random access devices. While both RAM and disk provide random access, disk accesses generally take longer than RAM accesses because the platter(s) need to be spun to a certain sector and the read/write head(s) need to be positioned over a certain track. Since the spinning and repositioning can be performed within a short enough period of time, the disk is considered to be random access. Thus, sub-blocks, re-creation information, and/or index portions may be available on disk through random accesses, although these random accesses are slower than random accesses to RAM. The disk may also be slower because the disk may not be as directly connected to a processor as RAM. For example, a disk may be connected through a disk controller that provides access to an external device.
Although random access is useful, sequential access, even to memory or a disk, may provide improved input and/or output in certain situations, particularly when large amounts of data are being read and/or written. Thus, schemes for improving disk access for dedupe systems have been attempted. These schemes typically involve finding ways to minimize the number of disk i/o operations because, although the disk access is random, it is still significantly slower than RAM access.
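As one hypothetical example of reducing the number of disk i/o operations, pending reads can be sorted by offset and adjacent or nearly adjacent reads merged into a single larger sequential read. The sketch below is illustrative only and is not drawn from any specific scheme referred to above. The function name coalesce_reads, the (offset, length) request format, and the max_gap parameter are assumptions made for the example.

```python
from typing import List, Tuple


def coalesce_reads(
    requests: List[Tuple[int, int]], max_gap: int = 0
) -> List[Tuple[int, int]]:
    """Merge (offset, length) read requests into fewer, larger reads.

    Requests are sorted by offset. A request is folded into the previous
    read when it starts no more than max_gap bytes past that read's end.
    """
    merged: List[Tuple[int, int]] = []
    for offset, length in sorted(requests):
        if merged and offset <= merged[-1][0] + merged[-1][1] + max_gap:
            last_offset, last_length = merged[-1]
            # Extend the previous read to cover this request as well.
            end = max(last_offset + last_length, offset + length)
            merged[-1] = (last_offset, end - last_offset)
        else:
            merged.append((offset, length))
    return merged
```

A larger max_gap trades reading some unneeded bytes for fewer separate operations, which may be worthwhile when each repositioning is expensive relative to the transfer itself.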
Tape is another storage medium used for storing sub-blocks, re-creation information, and/or an index. Unlike RAM or disk, which generally include both the storage medium and the device for accessing the storage medium, a tape may reside external to its access device (e.g., tape drive). Thus, while RAM is generally always in the same place in an apparatus (e.g., computer) and while a disk is generally always in the same place in a system (e.g., server), a tape may be moved from tape drive to tape drive. Also, while the same RAM and disk are generally always available in a system, different tapes may be available to a system. Thus, different considerations may exist for planning for tape usage in de-duplication systems.
FIG. 1 illustrates the logical components 100 and the physical components 110 described above. A dedupe data set may include an index 102, re-creation information 104, and a sub-block repository 106. These logical items may be arranged in a variety of data structures. For example, an index 102 may be arranged as a linear index, as a binary tree, as an n-ary tree, and in other ways. The data structures are stored on physical devices 110. The physical devices can include, but are not limited to, RAM 112, disk 114, and tape 116. In different embodiments, the data structures are stored on combinations of the physical devices 110.
A tape in a tape drive is conceptually equivalent to a disk with respect to reading and writing. Both require repositioning the media so that the read/write head(s) can access data stored at a certain location. Both have well-defined maximum times for positioning any location on the media for access by the read/write head. However, tapes and tape drives are generally not considered random access media and devices due to the time required to (re)position a tape for reading and/or writing. Tapes are more generally considered to be sequential access media. While tapes may have slower access times than disk for some operations, tapes may have vastly superior access times for other operations. For example, for large-scale sequential input/output, tapes may significantly outperform disk. Also, since tapes in an extensible tape library provide theoretically infinite storage, tapes are suitable for many de-duplication applications.
One operation performed by dedupe systems is finding sub-blocks in one data item (e.g., file) that are related to sub-blocks in another data item (e.g., file) so that the duplicate items can be removed. Finding related (e.g., duplicate, similar) sub-blocks may involve accessing both an index and a repository. As described above, RAM, disk, and tape may have different strengths and weaknesses and may have different performance characteristics for different operations. Therefore, example methods and devices concern storing data on tape and retrieving data from tape in manners that increase efficiency for some de-duplication operations.
The foregoing statements are not intended to constitute an admission that any patent, publication or other information referred to herein is prior art with respect to this disclosure. Rather, these statements serve to present a general discussion of technology and associated issues in the technology.