Organizations face new challenges in meeting long-term data retention requirements, and IT professionals are responsible for maintaining compliance with a myriad of new state and federal regulations and guidelines. These regulations exist because organizations have, in the past, struggled to keep necessary information available in a usable fashion. Compounding this problem is the continued explosive growth in digital information. Documents are richer in content and often reference related works, resulting in a tremendous amount of information to manage.
In order to better understand underlying access patterns, it is helpful to first briefly describe the classification of digital information. The collection of all digital information can be generally classified as either structured or unstructured. Structured information refers to data kept within a relational database. Unstructured information is everything else: documents, images, movies, etc. Both structured and unstructured data can be actively referenced by users or applications or kept unmodified for future reference or compliance. Within both structured and unstructured information, active information is routinely referenced or modified, whereas inactive information is only occasionally referenced or may only have the potential of being referenced at some point in the future. The specific point at which information passes from active to inactive is purely subjective.
A sub-classification of digital information describes the mutability of the data as either dynamic or fixed. Dynamic content changes often or continuously, such as the records within a transactional database. Fixed content is static, read-only information that is created and never changed, such as scanned check images or e-mail messages. With regard to long-term archiving, inactive information, either structured or unstructured, is always considered to have fixed content and does not change.
Over time, information tends to be less frequently accessed and access patterns tend to become more read-only. Fixed-content read-only information is relatively straightforward to manage from an archiving perspective. Of course, dynamic information, either structured or unstructured, may contain large segments of content that are static even at the sub-file level. Examples of this type of information include database files to which content is being added and documents that are edited.
Irrespective of the type of digital information, fixed or dynamic, many organizations back up their digital data on a fixed schedule. For instance, many organizations perform a weekly backup in which all digital data is duplicated. In addition, many of these organizations perform a daily incremental backup so that day-to-day changes to the digital data may be stored. However, traditional backup systems have several drawbacks and inefficiencies. For instance, during weekly backups, where all digital data is duplicated, fixed files that have not been altered are duplicated again. As may be appreciated, this results in an unnecessary redundancy of digital information as well as increased processing and/or bandwidth requirements. Another problem, for both weekly and incremental backups, is that minor changes to dynamic files may result in inefficient duplication of digital data. For instance, a one-character edit of a 10 MB file requires that the entire contents of the file be backed up and cataloged. The situation is far worse for larger files such as Outlook Personal Folders (.pst files), where the very act of opening these files causes them to be modified, which then requires another backup.
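The cost of the one-character edit described above can be illustrated with a minimal Python sketch of a file-level incremental backup. This is only an illustration under stated assumptions, not a description of any particular backup product: the function names, the file names, the in-memory "file systems" and the use of SHA-256 to detect changed files are all hypothetical.

```python
import hashlib
from typing import Dict


def file_digest(data: bytes) -> str:
    """Whole-file fingerprint (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def incremental_backup_bytes(previous: Dict[str, bytes],
                             current: Dict[str, bytes]) -> int:
    """Bytes copied by a file-level incremental backup.

    Any file that is new or whose fingerprint changed is copied in full,
    no matter how small the change was.
    """
    copied = 0
    for name, data in current.items():
        old = previous.get(name)
        if old is None or file_digest(old) != file_digest(data):
            copied += len(data)
    return copied


# Hypothetical data: one fixed file and one 10 MB dynamic file.
fixed_file = b"scanned check image " * 1000          # never changes
original = bytearray(b"a" * 10_000_000)              # 10 MB document
edited = bytearray(original)
edited[5_000_000] = ord("b")                         # one-character edit

monday = {"check.tif": fixed_file, "report.doc": bytes(original)}
tuesday = {"check.tif": fixed_file, "report.doc": bytes(edited)}

# A full backup duplicates everything, including the unchanged fixed file.
full_backup_bytes = sum(len(data) for data in tuesday.values())
# An incremental backup skips the fixed file but re-copies the whole 10 MB file.
incremental_bytes = incremental_backup_bytes(monday, tuesday)
```

In this sketch the incremental backup avoids re-copying the fixed file, yet the single changed character still causes all 10,000,000 bytes of the edited file to be copied again.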
The typical result of these drawbacks and inefficiencies is the generation of large amounts of backup data and, in the most common backup systems, the generation of multiple data storage tapes, which then have to be stored. Typically, such tapes are stored off-line. That is, the tapes may be stored where computerized access is not immediately available. Accordingly, recovering information from a backup tape may require contacting an archiving facility, identifying a tape and waiting for the facility to locate and load the tape.
As the price of disk storage has come down, there have been attempts to alleviate the issues of tape backups by utilizing disk backups. However, these disk backups still require large amounts of storage to account for the inefficient duplication of data. Accordingly, there have been attempts to identify the dynamic changes that have occurred between a previous backup of digital data and a current set of digital data. In this regard, the goal is to create a backup of only the data that has changed (i.e., dynamic data) in relation to a previous set of digital data.
One attempt to identify dynamic changes between data backups and store only the dynamic changes is represented by Capacity Optimized Storage (COS). The goal of COS is to de-duplicate the redundancy between backup sets. That is, the goal of COS is to compare the current data set with a previously stored data set and save only the new data. Generally, COS processing divides an entire set of digital data (e.g., of a first backup copy) into data chunks (e.g., 256 kB) and applies a hashing algorithm to those data chunks. As will be appreciated by those skilled in the art, this results in a key address that represents the data according to the hash code/algorithm. When a new data set (e.g., a second backup copy) is received for backup, the data set is again divided into data chunks and the hashing algorithm is applied. In theory, if the key addresses for corresponding data chunks in the first and second data sets are identical, it is assumed that those chunks have not changed between backups. Accordingly, only those chunks which are different from the first backup set are saved, thereby reducing the storage requirements for subsequent backups. The main drawback to COS is that to significantly reduce the redundancy between backup sets, it is desirable to utilize ever smaller data chunks. However, as the size of the data chunks is reduced, the number of key addresses increases. Accordingly, the storage and indexing of the increased number of key addresses works to eliminate the benefits of the reduced amount of duplicate data.
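The chunk-and-hash approach described above, and the trade-off between chunk size and the number of key addresses, can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than an actual COS implementation: the `backup` and `restore` functions, the in-memory dictionary used as the chunk store, the fixed-size chunk boundaries and the choice of SHA-256 as the hashing algorithm are all hypothetical.

```python
import hashlib
import os
from typing import Dict, List, Tuple

CHUNK_SIZE = 256 * 1024  # 256 kB, the example chunk size from the text


def backup(data: bytes, store: Dict[str, bytes],
           chunk_size: int = CHUNK_SIZE) -> Tuple[List[str], int]:
    """Split data into fixed-size chunks and store only unseen chunks.

    Returns the ordered list of key addresses (the "recipe" needed to
    rebuild the data set) and the number of new bytes actually stored.
    """
    recipe = []
    new_bytes = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        key = hashlib.sha256(chunk).hexdigest()  # key address (assumed hash)
        if key not in store:                     # chunk not seen before
            store[key] = chunk
            new_bytes += len(chunk)
        recipe.append(key)
    return recipe, new_bytes


def restore(recipe: List[str], store: Dict[str, bytes]) -> bytes:
    """Rebuild a data set from its ordered key addresses."""
    return b"".join(store[key] for key in recipe)


# Hypothetical data: a 2 MiB first backup, then the same data with one byte changed.
first = os.urandom(2 * 1024 * 1024)
changed = bytearray(first)
changed[1_000_000] ^= 0xFF
second = bytes(changed)

# 256 kB chunks: few key addresses, but a whole 256 kB chunk is re-stored.
store: Dict[str, bytes] = {}
recipe1, new1 = backup(first, store)
recipe2, new2 = backup(second, store)

# 4 kB chunks: far less re-stored data, but 64 times as many key addresses.
small_store: Dict[str, bytes] = {}
backup(first, small_store, chunk_size=4096)
small_recipe2, small_new2 = backup(second, small_store, chunk_size=4096)
```

With 256 kB chunks, the second backup in this sketch stores 262,144 new bytes and is described by 8 key addresses. With 4 kB chunks, it stores only 4,096 new bytes but needs 512 key addresses, which illustrates how the index grows as the chunks shrink.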
Use of COS processing allows for the creation of disk-accessible data backups, thereby allowing for more ready access to backed-up data sets. In this regard, COS may be incorporated into a virtual tape library (VTL) such that it emulates a tape storage device. The system allows the user to send data to an off-site disk storage center for backup. However, this requires that an entire data set be transmitted to the VTL, where the entire data set may be optimized (e.g., COS) for storage. Further, for each subsequent backup, the entire data set must again be transferred to the off-site storage center. As may be appreciated, for large organizations having large data sets requiring backup, such an off-site storage system that requires transmission of the entire data set may involve large bandwidth requirements to transfer the data as well as high processing requirements to optimize and compare the data. Finally, organizations utilizing off-site VTLs are 100% reliant on the backup application for restoration of their data, again leaving the user potentially exposed to the unavailability of information in the case of accidental deletion or disk corruption.
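The bandwidth burden of retransmitting an entire data set for every backup can be estimated with simple arithmetic. The following Python sketch is illustrative only: the `transfer_hours` function, the 10 TB data set size and the 1 Gbit/s link speed are hypothetical figures chosen for the example, and the estimate ignores protocol overhead and compression.

```python
def transfer_hours(data_bytes: int, link_bits_per_second: float) -> float:
    """Idealized time, in hours, to send data_bytes over a link."""
    seconds = data_bytes * 8 / link_bits_per_second
    return seconds / 3600


# Hypothetical example: a 10 TB data set sent over a 1 Gbit/s link.
data_set_bytes = 10 * 10**12
link_speed = 1e9  # bits per second

hours_per_backup = transfer_hours(data_set_bytes, link_speed)
# Every subsequent backup repeats the full transfer, even if little changed.
hours_for_four_backups = 4 * hours_per_backup
```

Under these assumed figures, each backup occupies the link for roughly 22 hours, and that cost recurs with every backup regardless of how little of the data set has changed.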