A storage system typically comprises one or more storage devices into which information may be entered, and from which information may be obtained, as desired. The storage system includes a storage operating system that functionally organizes the system by, inter alia, invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage environment, a storage area network and a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a disk array, wherein the term “disk” commonly describes a self-contained rotating magnetic medium storage device. The term disk in this context is synonymous with hard disk drive (HDD) or direct access storage device (DASD).
Archival data storage is a central part of many industries that operate in compliance and/or regulated environments, e.g., banks, government facilities/contractors, securities brokerages, etc. In many of these environments, it is necessary to store selected data, e.g., electronic-mail messages, financial documents and/or transaction records, in an immutable manner, possibly for long periods of time. Typically, data backup operations are performed to ensure the protection and restoration of such data in the event of a failure. However, backup operations often result in the duplication of data on backup storage resources, such as disks and/or tape, causing inefficient consumption of the storage space on the resources.
One form of long-term archival storage is the storage of data on magnetic tape medium. A noted disadvantage of physical tape medium is the slow data access rate and the added requirements for managing a large number of physical tapes. In response to these noted disadvantages, several storage system vendors provide storage systems configured as virtual tape library (VTL) systems that emulate tape storage devices using a plurality of disk drives. In typical VTL environments, the storage system serving as the primary storage performs a complete backup operation of the storage system's file system (or other data store) to the VTL system. Multiple complete backups may occur over time, thereby resulting in an inefficient consumption of storage space on the VTL system. This may occur due to, e.g., identical data appearing in a plurality of backups, thereby resulting in the same data being stored in a plurality of locations (e.g., data blocks) on the VTL system. It is thus desirable to eliminate duplicate data (de-duplication) on the storage resources, such as disks associated with a VTL system, and ensure the storage of only a single instance of data to thereby achieve storage compression.
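By way of illustration only, the single-instance storage described above may be sketched as follows. The sketch assumes fixed-size blocks identified by a SHA-256 digest; the block size, the class name SingleInstanceStore and its methods are hypothetical and are not drawn from any particular VTL system. It also treats equal digests as equal data, whereas a practical system might additionally compare the bytes.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size, in bytes


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Split a backup image into fixed-size blocks (the last may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class SingleInstanceStore:
    """Keeps one copy of each distinct block, keyed by its SHA-256 digest."""

    def __init__(self) -> None:
        self.blocks: dict[bytes, bytes] = {}

    def write_backup(self, data: bytes) -> list[bytes]:
        """Store a backup and return its recipe: the ordered block digests."""
        recipe = []
        for block in split_blocks(data):
            digest = hashlib.sha256(block).digest()
            # Only the first occurrence of a block consumes space.
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        return recipe

    def read_backup(self, recipe: list[bytes]) -> bytes:
        """Rebuild a backup image from its recipe."""
        return b"".join(self.blocks[d] for d in recipe)

    def stored_bytes(self) -> int:
        """Total space consumed by the distinct blocks."""
        return sum(len(b) for b in self.blocks.values())


# Two complete backups that differ in one block out of two.
monday = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE
tuesday = b"A" * BLOCK_SIZE + b"C" * BLOCK_SIZE

store = SingleInstanceStore()
monday_recipe = store.write_backup(monday)
tuesday_recipe = store.write_backup(tuesday)
```

In this example the two backups total four blocks, yet the store holds only three, because the block common to both backups is stored once; each backup remains fully recoverable from its recipe.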
Conventional de-duplication systems typically require that the data first be written to disk and then, at a later point in time, de-duplicated. This requirement typically arises because the de-duplication system cannot process the data at sufficient speed to enable real-time de-duplication of a new incoming data set, e.g., a tape backup stream. A noted disadvantage of such conventional de-duplication systems is that the VTL system requires sufficient space to write the entire data set prior to de-duplication, thereby negating the space savings that de-duplication would otherwise provide.
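The space requirement noted above may be illustrated with a small calculation. The following is a sketch under stated assumptions only: the function names are hypothetical, blocks are identified by SHA-256 digest, and the real-time figure is shown purely for comparison and does not describe any particular system.

```python
import hashlib


def peak_space_post_process(blocks: list[bytes]) -> int:
    """Space needed when the entire data set is written before de-duplication."""
    return sum(len(b) for b in blocks)


def peak_space_real_time(blocks: list[bytes]) -> int:
    """Space needed if each duplicate block were discarded as it arrives."""
    seen: dict[bytes, int] = {}
    for b in blocks:
        # Record the size of each distinct block once.
        seen.setdefault(hashlib.sha256(b).digest(), len(b))
    return sum(seen.values())


# Hypothetical incoming stream: ten 4096-byte blocks, three of them distinct.
stream = [b"A" * 4096] * 4 + [b"B" * 4096] * 3 + [b"C" * 4096] * 3

post_process_bytes = peak_space_post_process(stream)
real_time_bytes = peak_space_real_time(stream)
```

For this stream, the post-process approach must provision 40960 bytes to land the data set, although only 12288 bytes remain once duplicates are eliminated.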