Current storage management systems employ a number of different methods to perform storage operations on electronic data. For example, data can be stored in primary storage as a primary copy that includes production data, or in secondary storage as various types of secondary copies, including a backup copy, a snapshot copy, a hierarchical storage management (“HSM”) copy, an archive copy, and other types of copies.
A primary copy of data is generally a production copy or other “live” version of the data that is used by a software application and is generally in the native format of that application. Primary copy data may be maintained in a local memory or other high-speed storage device that allows for relatively fast data access. Primary copy data is typically intended for short-term retention (e.g., several hours or days) before some or all of the data is stored as one or more secondary copies, for example to prevent loss of data in the event a problem occurs with the data stored in primary storage.
Secondary copies include point-in-time data and are typically intended for long-term retention (e.g., weeks, months or years depending on retention criteria), before some or all of the data is moved to other storage or is discarded. Secondary copies may be indexed so users can later browse, search and restore the data. After primary copy data is backed up, a pointer or other location indicia such as a stub may be placed in the primary copy to indicate the current location of that data. Further details may be found in the assignee's U.S. Pat. No. 7,107,298, filed Sep. 30, 2002, entitled SYSTEM AND METHOD FOR ARCHIVING OBJECTS IN AN INFORMATION STORE.
One type of secondary copy is a backup copy. A backup copy is generally a point-in-time copy of the primary copy data stored in a backup format as opposed to in native application format. For example, a backup copy may be stored in a backup format that is optimized for compression and efficient long-term storage. Backup copies generally have relatively long retention periods and may be stored on media with slower retrieval times than other types of secondary copies and media (e.g., on magnetic tape), or be stored at an offsite location.
Another form of secondary copy is a snapshot copy. From an end-user viewpoint, a snapshot may be thought of as a bitmap or instant image of the primary copy data at a given point in time. A snapshot may capture the directory structure of a primary copy volume at a particular moment in time, and may also preserve file attributes and contents. In some embodiments, a snapshot may exist as a virtual file system, parallel to the actual file system. Users may gain read-only access to the record of files and directories of the snapshot. By electing to restore primary copy data from a snapshot taken at a given point in time (e.g., via a reversion process), users may also return the current file system to the prior state of the file system that existed when the snapshot was taken.
A snapshot may be created instantly, using a minimum of file space, but may still function as a conventional file system backup. A snapshot may not actually create another physical copy of all the data, but may simply create pointers that map files and directories to specific disk blocks and that indicate which blocks have changed. The snapshot may be a copy of a set of files and/or directories as they were at a particular point in the past. That is, the snapshot is an image, or representation, of a volume of data at a point in time. A snapshot may be a secondary copy of a primary volume of data, such as data in a file system, an Exchange server, a SQL database, an Oracle database, and so on. The snapshot may be an image of files, folders, directories, and other data objects within a volume, or an image of the blocks of the volume.
Snapshots may be created using various techniques, such as copy-on-write, redirect-on-write, split mirror, copy-on-write with background copy, log-structured file architecture techniques, continuous data protection techniques, and/or other techniques. Once a snapshot has been taken, subsequent changes to the file system typically do not overwrite the blocks in use at the time of a snapshot. Therefore, the initial snapshot may use only a small amount of disk space to record a mapping or other data structure representing or otherwise tracking the blocks that correspond to the current state of the file system. Additional disk space is usually only required when files and directories are actually modified later. Furthermore, when files are modified, typically only the pointers which map to blocks are copied when taking a new snapshot, not the blocks themselves. For example, in the case of copy-on-write snapshots, when a block changes in primary storage, the block is copied to secondary storage before the block is overwritten in primary storage, and the snapshot mapping of file system data is updated to reflect the changed block(s) at that particular point in time; e.g., the pointer in that snapshot now points to the old block, which now resides in secondary storage.
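The copy-on-write behavior described above can be illustrated with a minimal sketch. The `Volume` class and its method names are hypothetical and chosen only for illustration; the sketch also simplifies by letting every open snapshot keep its own preserved copy of an overwritten block, which a real system would likely avoid.

```python
class Volume:
    """Toy block volume with copy-on-write snapshots (hypothetical names)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)  # stands in for primary storage
        self.snapshots = []         # each snapshot maps block index -> preserved old block

    def take_snapshot(self):
        # A new snapshot starts as an empty mapping: no block data is copied yet.
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index, data):
        # Copy-on-write: preserve the old block for every snapshot that has
        # not already preserved this index, then overwrite primary storage.
        for snap in self.snapshots:
            if index not in snap:
                snap[index] = self.blocks[index]
        self.blocks[index] = data

    def read_snapshot(self, snap_id, index):
        # Preserved blocks come from the snapshot; unchanged blocks are still
        # read from primary storage.
        return self.snapshots[snap_id].get(index, self.blocks[index])


vol = Volume([b"A", b"B", b"C"])
s0 = vol.take_snapshot()
vol.write(1, b"B2")                          # block 1 changes after the snapshot
old_view = [vol.read_snapshot(s0, i) for i in range(3)]
```

In this sketch the snapshot consumes space only for the single block that changed, while the snapshot view still presents the volume as it was when the snapshot was taken.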
Data storage systems may utilize snapshots for a variety of reasons. One typical use of snapshots is to copy a volume of data without disabling access to the volume for a long period. After performing the snapshot, the data storage system can then copy the data set by leveraging the snapshot of the data set. As another example, a data storage system may use a snapshot and/or other point-in-time secondary copies (e.g., copies generated from a snapshot) to permit a user to revert data back to its state at a specific point in time during a reversion process.
An HSM copy is generally a copy of the primary copy data, but which typically includes only a subset of the primary copy data that meets certain criteria and is usually stored in a format other than the native application format. For example, an HSM copy might include only that data from the primary copy that is larger than a given size threshold or older than a given age threshold and that is stored in a backup format. Often, HSM data is removed from the primary copy, and an address, pointer or stub is stored in the primary copy to indicate its new location. When a user requests access to the HSM data that has been removed or migrated, systems use the stub to locate the data and often make recovery of the data appear transparent even though the HSM data may be stored at a location different from the remaining primary copy data.
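As a rough illustration of the stub mechanism described above, the following sketch migrates objects above a size threshold and resolves stubs transparently on read. The `Stub` type, the `migrate` and `read` functions, and the `hsm/` key prefix are all hypothetical, and the sketch stores migrated data unchanged rather than in a backup format.

```python
from dataclasses import dataclass


@dataclass
class Stub:
    """Placeholder left in primary storage (hypothetical structure)."""
    location: str  # key of the migrated data in secondary storage


def migrate(primary, secondary, size_threshold):
    # Move every object larger than the threshold to secondary storage and
    # leave a stub recording its new location.
    for name, value in list(primary.items()):
        if isinstance(value, bytes) and len(value) > size_threshold:
            key = f"hsm/{name}"
            secondary[key] = value
            primary[name] = Stub(location=key)


def read(primary, secondary, name):
    # The caller does not need to know whether the data was migrated.
    value = primary[name]
    if isinstance(value, Stub):
        return secondary[value.location]
    return value


primary = {"small.txt": b"hi", "big.bin": b"x" * 1000}
secondary = {}
migrate(primary, secondary, size_threshold=100)
recalled = read(primary, secondary, "big.bin")
```

An age threshold could be applied in the same way if each object carried a timestamp.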
An archive copy is generally similar to an HSM copy; however, the data satisfying criteria for removal from the primary copy is generally completely removed with no stub left in the primary copy to indicate the new location (i.e., where it has been moved to). Archive copies of data are generally stored in a backup format or other non-native application format. In addition, archive copies are generally retained for very long periods of time (e.g., years) and in some cases are never deleted. Such archive copies may be made and kept for extended periods in order to meet compliance regulations or for other permanent storage applications.
Application data over its lifetime typically moves from more expensive, quick-access storage to less expensive, slower-access storage. This process of moving data through these various tiers of storage is sometimes referred to as information lifecycle management (“ILM”). This is the process by which data is “aged” from more expensive forms of secondary storage with faster access/restore times down through less expensive secondary storage with slower access/restore times, for example, as the data becomes less important or mission critical.
In some embodiments, storage management systems may perform additional operations upon copies, including deduplication, content indexing, data classification, data mining or searching, electronic discovery (E-discovery) management, collaborative searching, encryption and compression.
One example of a system that performs storage operations on electronic data that produce such copies is the Simpana storage management system by CommVault Systems of Oceanport, N.J. The Simpana system leverages a modular storage management architecture that may include, among other things, storage manager components, client or data agent components, and media agent components as further described in U.S. Pat. No. 7,246,207, filed Apr. 5, 2004, entitled “SYSTEM AND METHOD FOR DYNAMICALLY PERFORMING STORAGE OPERATIONS IN A COMPUTER NETWORK.” The Simpana system also may be hierarchically configured into backup cells to store and retrieve backup copies of electronic data as further described in U.S. Pat. No. 7,395,282, filed Jul. 15, 1999, entitled “HIERARCHICAL BACKUP AND RETRIEVAL SYSTEM.”
The Simpana system and other storage systems may perform backup and Direct Access Recovery (“DAR”) storage operations under the Network Data Management Protocol (“NDMP”), an open standard protocol for backups of heterogeneous network-attached storage across an enterprise. Under the NDMP standard, during backup, an NDMP data server is responsible for creating backup data and sending it to an NDMP mover in a data stream format specified by the NDMP protocol. To the NDMP mover, the data stream may appear to be simply a raw stream of bytes or bits. The NDMP mover is then responsible for writing the data stream to backup or secondary storage media, such as tape. The NDMP mover may be on the same physical machine as the data server, or on a different machine. During a restore or recovery of a backed-up data object, the NDMP data server is responsible for requesting NDMP-formatted backup data from the mover and restoring the data object to a target location from that backup data, e.g., a target location in primary storage. To request a backup copy of a data object, the NDMP data server sends an offset and length that identify the location of the data object in the original NDMP data stream that was sent to the NDMP mover at backup. Using the offset and length information provided by the NDMP data server, the NDMP mover retrieves the desired data from the backup media and returns it to the NDMP data server in the form of an NDMP-formatted data stream.
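The division of labor described above can be sketched as follows. This is a toy model only: `ToyMover` and `ToyDataServer` are hypothetical names, none of the methods correspond to actual NDMP messages, and the mover is assumed to write the stream to media unmodified.

```python
class ToyMover:
    """Writes an opaque byte stream to media and serves byte ranges from it."""

    def __init__(self):
        self.media = bytearray()  # stands in for tape or other secondary media

    def write(self, chunk):
        # The mover treats the stream as raw bytes with no object boundaries.
        self.media += chunk

    def read(self, offset, length):
        return bytes(self.media[offset:offset + length])


class ToyDataServer:
    """Creates the backup stream and remembers where each object sits in it."""

    def __init__(self, mover):
        self.mover = mover
        self.catalog = {}   # object name -> (offset, length) in the original stream
        self.position = 0

    def backup(self, name, data):
        self.catalog[name] = (self.position, len(data))
        self.mover.write(data)
        self.position += len(data)

    def restore(self, name):
        # Only an offset and a length are sent to the mover.
        offset, length = self.catalog[name]
        return self.mover.read(offset, length)


mover = ToyMover()
server = ToyDataServer(mover)
server.backup("a.txt", b"alpha")
server.backup("b.txt", b"bravo-bravo")
restored = server.restore("b.txt")
```

The restore succeeds here only because the bytes on media sit at the same offsets as in the stream the data server produced.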
Unfortunately, NDMP standards do not readily facilitate restore operations if the NDMP mover modified the NDMP data stream via encryption, compression, deduplication, etc., before writing the data to tape or other secondary storage media. These modification techniques may alter the data in an unpredictable way. For example, when an NDMP data stream is deduplicated and/or compressed, the total size of the modified data that must be stored is typically much smaller than the size of the original NDMP data stream. However, the modified data is not simply a linearly “scaled down” version of the original data stream. Instead, the original data stream is scaled down unevenly in a manner that depends on the contents of the original data stream and/or the types of modification techniques that are applied to the original data stream. Since these modification techniques alter the data in an unpredictable manner, at the time of restore, the NDMP mover can no longer use the offset and length provided by the NDMP data server to correctly retrieve and return requested data objects. For example, if a data object was originally represented in an original NDMP backup data stream at offset OF1 and length L1, the modified version of that object may instead be stored in modified form with an offset OF2 and length L2; furthermore, there may be no closed-form mathematical relationship to automatically derive OF2 and L2 from OF1 and L1. Thus, if the data mover receives a request from an NDMP data server to retrieve an object using offset and length values OF1 and L1, the data mover may be unable to fulfill the request.
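The uneven scaling described above can be observed with a short experiment. The sketch below uses Python's standard `zlib` module as a stand-in for whatever modification a mover might apply, and it compresses each object separately purely for ease of illustration; the object names and the `linear_guess` helper are hypothetical.

```python
import random
import zlib


def original_extents(objects):
    # (OF1, L1): where each object sits in the unmodified stream.
    extents, offset = {}, 0
    for name, data in objects:
        extents[name] = (offset, len(data))
        offset += len(data)
    return extents, offset


def stored_extents(objects):
    # (OF2, L2): where each object's compressed form sits on media.
    extents, offset, parts = {}, 0, []
    for name, data in objects:
        packed = zlib.compress(data)
        extents[name] = (offset, len(packed))
        offset += len(packed)
        parts.append(packed)
    return extents, b"".join(parts)


rng = random.Random(0)  # fixed seed so the run is repeatable
objects = [
    ("zeros", bytes(4096)),                                     # highly compressible
    ("noise", bytes(rng.getrandbits(8) for _ in range(4096))),  # barely compressible
    ("text", b"storage operation " * 200),                      # repetitive text
]
orig, orig_size = original_extents(objects)
stored, media = stored_extents(objects)
scale = len(media) / orig_size  # overall size ratio after compression


def linear_guess(name):
    # Naive attempt to derive (OF2, L2) by scaling (OF1, L1) uniformly.
    of1, l1 = orig[name]
    return round(of1 * scale), round(l1 * scale)


ratios = {name: stored[name][1] / orig[name][1] for name, _ in objects}
```

In this sketch `ratios` differs sharply between the objects, so `linear_guess` does not land on the stored extents, which is consistent with the observation that OF2 and L2 cannot simply be computed from OF1 and L1.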
The Simpana system and other storage systems may also permit users to perform a reversion operation in order to return client data to a previous state at a specified point in time by using a previously obtained point-in-time copy, such as a snapshot copy or other secondary copy. However, this reversion operation will effectively erase all changes to that data that were made after the specified point in time. Thus, such a reversion operation is irreversible, since a user cannot undo the reversion operation in order to return data to its state at the time the reversion operation was performed.
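The loss of later changes during a reversion can be shown with a small sketch. The `ClientData` class and its methods are hypothetical, and full copies are used in place of real point-in-time copy mechanics to keep the example short.

```python
import copy


class ClientData:
    """Toy client data set with labeled point-in-time copies (hypothetical)."""

    def __init__(self, files):
        self.files = dict(files)
        self.snapshots = {}

    def snapshot(self, label):
        # Record the state of the data at this point in time.
        self.snapshots[label] = copy.deepcopy(self.files)

    def revert(self, label):
        # Replace the current state with the recorded state; anything written
        # after that point in time is discarded.
        self.files = copy.deepcopy(self.snapshots[label])


data = ClientData({"report.doc": "draft 1"})
data.snapshot("monday")
data.files["report.doc"] = "draft 2"   # changes made after the point-in-time copy
data.files["notes.txt"] = "new file"
data.revert("monday")

# No recorded copy holds "draft 2" or notes.txt, so the reversion cannot be undone.
recoverable = any("notes.txt" in snap for snap in data.snapshots.values())
```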
The need exists for systems and methods that overcome the above problems, as well as systems and methods that provide additional benefits. Overall, the examples herein of some prior or related systems and methods and their associated limitations are intended to be illustrative and not exclusive. Other limitations of existing or prior systems and methods will become apparent to those of skill in the art upon reading the following detailed description.
In the drawings, the same reference numbers and acronyms identify elements or acts with the same or similar functionality for ease of understanding and convenience.