Vast amounts of active and archived corporate electronic information exist on backup tape media. Retrieving and collating this data is becoming increasingly important, as the information is not only utilized for knowledge management but may also be the subject of discovery requests by attorneys engaged in litigation involving the corporation. Conventional methods of producing data from large quantities of backup tapes are difficult to implement, cost-prohibitive, or both. Managing data from backup media is particularly problematic where companies have many different tape backup systems using different backup environments.
A previous attempt to solve the problem of retrieving information from backup tapes involves restoring the tapes using a “Native Environment” (NE) approach. The NE approach recreates the original backup environment from which the tape was generated so that data from the tapes can be restored, and then moves the restored data from the replicated environment to a target storage system for further analysis.
Replicating the NE in order to restore backup tapes requires that all server names, configurations, software versions, user names, and passwords be consistent with the environment as it stood at the time of the backup. Replicating all of this information becomes quite challenging as systems age and as system names, passwords, software versions, and administrators change. Furthermore, backup software is typically designed to restore data for the purposes of disaster recovery (an all-or-nothing proposition) and not to intelligently process large amounts of data from large numbers of media to obtain only relevant information.
Even if the backup environment can be recreated, however, all the records may need to be examined. In the case of a large company, those records may contain information regarding thousands of employees, and managing all of this data is extremely burdensome. For many companies, the amount of information can exceed a terabyte. Storing over a terabyte of information takes a great deal of storage space and consumes valuable computer resources during the storing operation.
Beyond trying to manage the sheer volume of data, other problems exist. Most backup systems retrieve data for backup on a regular schedule. This means, however, that with every successive backup much of the data saved is a duplicate of data saved during the previous backup. This is especially problematic as data sensitivity increases, because backup frequency usually increases commensurately with data sensitivity. Additionally, though the data itself may be duplicative, the location where this duplicative data is found may be different, and the location where the data resides may be of importance as well, for example, when this data must be restored to its original locale.
Thus, there is a need for systems and methods to store and collate data from disparate locales that retain the location of various pieces of data without duplicating identical data.
Conventional de-duplication systems generally aim to take a duplicated set of data and remove duplicate entries. This is generally performed by having a single central system traverse all of the data in the duplicated data set. While traversing the duplicated data set, the system attempts, through some means (generally hashing), to characterize the data in some unique manner (e.g., by a hash value). As is well known, a hash function is any well-defined procedure or function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The system then generally creates a unique list of data by maintaining a list of unique hash values. When the system encounters a data item whose hash value already exists in the system, the system deletes the data item corresponding to that hash value, as it is considered to be duplicate data. This type of system is not entirely advantageous because it utilizes only a single node to traverse all of the data, which can be taxing and slow.
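By way of illustration only, the conventional single-node approach described above might be sketched as follows. The function name `deduplicate`, the use of SHA-256 as the hash function, and the sample tape paths are assumptions made for this sketch and are not drawn from any particular system.

```python
import hashlib


def deduplicate(items):
    """Single-pass de-duplication over (location, content) pairs.

    Hypothetical helper: keeps the first item seen for each hash value
    and drops every later item whose hash value is already known.
    """
    seen_hashes = set()   # the "list of unique hash values"
    unique_items = []     # the resulting unique list of data
    for location, content in items:
        # SHA-256 is an assumed choice of hash function for this sketch.
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen_hashes:
            # Hash already exists: treat the item as a duplicate and drop it.
            continue
        seen_hashes.add(digest)
        unique_items.append((location, content))
    return unique_items


# Hypothetical sample data: the same report appears on two tapes
# and under two different paths.
backup_items = [
    ("tape1:/home/alice/report.doc", b"quarterly report"),
    ("tape2:/home/alice/report.doc", b"quarterly report"),
    ("tape2:/archive/report_copy.doc", b"quarterly report"),
    ("tape2:/home/bob/notes.txt", b"meeting notes"),
]

unique = deduplicate(backup_items)
for location, _ in unique:
    print(location)
```

In this sketch a single loop on one node visits every item, which reflects the scaling limitation noted above. The locations of the dropped duplicates are also discarded, so the result no longer records every place a given piece of data was found.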