1. Field of the Invention
The invention relates generally to computer software. More specifically, the present invention provides for methods and systems for electronic data archival, backup and recovery over a network.
2. Background Art
A great deal of information is stored electronically and must be backed up or archived in a systematic fashion to protect against the loss of critical data. The traditional process of data archival and/or backup involves writing a copy or mirror image of the data to a tape or other data storage device on a regularly scheduled basis, often nightly. Once the data is copied to the data storage device, the data storage device is physically moved off-site to another, secure location where it is stored. If the original data is lost or corrupted, a copy of the data is available and can be retrieved from the off-site location. A drawback to the traditional backup process is that it is cumbersome and time-consuming. In addition, as the amount of data being backed up increases, the number and cost of storage devices needed to keep copies of the data increase as well.
Another backup process that has grown increasingly popular in recent years involves the use of wide area networks to transmit backup data to a secure site. Data management companies, such as EVault, Inc., now provide backup and archival services that allow a company to transmit backup data to a data manager on a regularly scheduled basis. The backup data is usually encrypted to protect against the release of proprietary information, and a third-party data storage manager handles the storage and recovery of the business data. By sending information to a third-party data storage manager, a company avoids the cost of paying for and maintaining its own data storage system and can take advantage of the economies of scale available to a data storage manager.
A drawback to the process of transmitting backup data to a data storage manager over networks like the internet is the time required for transmission. In many cases, the amount of data to be backed up exceeds the capacity of the communication line between a company and its data storage manager. When large or numerous files are involved, the traditional data backup process of copying or mirroring an entire data system requires a great deal of time and/or a tremendous amount of bandwidth, neither of which is typically available. Furthermore, only a small percentage of the files typically need to be backed up per session. One of the ways used to address this problem involves the use of delta extraction algorithms. A delta extraction algorithm monitors the changes made to the data files of a company between backups and, rather than transmitting an entire file, transmits only the changes to that file. This results in a much quicker backup process, as files that are unchanged are not transmitted. Additionally, it reduces the amount of storage needed for a backup.
For example, in some delta extraction processes, selected files are processed in depth-first order, ascending alphabetically. A “C:” drive under Windows NT, for instance, might be processed in the following order:
C:\autoexec.bat
C:\boot.ini
C:\Dir1\
C:\Dir1\Data1.dat
C:\Dir1\Data2.dat
C:\Dir1\Sub1\Image1.bmp
C:\Dir1\Sub1\Image2.bmp
C:\Dir1\Text1.dat
C:\Dir1\Text2.dat
C:\Dir2\Data1.dat
C:\Dir2\Data2.dat
C:\Dir2\Image1.bmp
C:\Dir2\Image2.bmp
C:\pagefile.sys.
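By way of illustration only, the ordering listed above can be sketched in Python. The in-memory `tree` dictionary and the function name `depth_first_order` are hypothetical and stand in for a real file system walk; the case-insensitive sort is an assumption inferred from the position of "Dir1" between "boot.ini" and "pagefile.sys" in the listing.

```python
def depth_first_order(tree, prefix="C:\\"):
    """Yield file paths depth first, siblings in ascending alphabetical order.

    `tree` maps a name to a nested dict (a directory) or to None (a file).
    """
    # Case-insensitive sort is an assumption based on the listing above.
    for name in sorted(tree, key=str.lower):
        child = tree[name]
        if isinstance(child, dict):
            # Descend into the directory before visiting later siblings.
            yield from depth_first_order(child, prefix + name + "\\")
        else:
            yield prefix + name


# Hypothetical, abbreviated version of the drive shown above.
tree = {
    "pagefile.sys": None,
    "Dir2": {"Image1.bmp": None, "Data1.dat": None},
    "boot.ini": None,
    "Dir1": {"Text1.dat": None, "Sub1": {"Image1.bmp": None}, "Data1.dat": None},
    "autoexec.bat": None,
}

order = list(depth_first_order(tree))
for path in order:
    print(path)
```

Sorting the siblings at each level, rather than sorting the full list of paths once at the end, is what keeps the contents of a sub-directory such as Sub1 between the neighboring entries of its parent.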
These files are sequentially compared against a delta mapping file (DTA), using alphabetic comparisons to determine whether a file is new or part of a prior backup. Next, like files are compared block by block to see whether the file has been changed (“delta changes”) since the last backup. If a new file is discovered, it is copied in its entirety. Files that were deleted since the last backup are ignored and are no longer used as part of future backups. Once this sometimes lengthy process is completed, the data (file names and changed or new blocks of data) is transmitted from a client application to a server.
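The client-side comparison described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the format of any actual DTA file: the mapping is modeled as a dictionary of per-block SHA-256 digests, the block size is set to 4 bytes for readability, and the names `block_digests` and `extract_delta` are hypothetical. Comparing lower-cased full paths is a simplification of the depth-first ordering.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, chosen only for readability (assumption)


def block_digests(data):
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]


def extract_delta(current, prior):
    """Compare current files against the prior mapping.

    current: {path: bytes}; prior: {path: [digest, ...]} (hypothetical DTA model).
    Returns a list of (path, kind, payload) records to transmit.
    """
    cur_names = sorted(current, key=str.lower)
    old_names = sorted(prior, key=str.lower)
    j = 0
    delta = []
    for name in cur_names:
        # Prior entries that sort before `name` were deleted: skip them.
        while j < len(old_names) and old_names[j].lower() < name.lower():
            j += 1
        data = current[name]
        if j < len(old_names) and old_names[j].lower() == name.lower():
            old = prior[old_names[j]]
            new = block_digests(data)
            # Keep only the blocks whose digest differs or that are new.
            changed = {k: data[k * BLOCK_SIZE:(k + 1) * BLOCK_SIZE]
                       for k, d in enumerate(new)
                       if k >= len(old) or old[k] != d}
            if changed or len(new) != len(old):
                delta.append((name, "changed", changed))
            j += 1
        else:
            # Not in the prior mapping: copy the file in its entirety.
            delta.append((name, "new", data))
    return delta


prior = {
    "C:\\boot.ini": block_digests(b"AAAABBBB"),
    "C:\\old.txt": block_digests(b"gone"),
    "C:\\same.dat": block_digests(b"12345678"),
}
current = {
    "C:\\boot.ini": b"AAAACCCC",
    "C:\\new.txt": b"hello",
    "C:\\same.dat": b"12345678",
}
delta = extract_delta(current, prior)
print(delta)
```

In this sketch the unchanged file and the deleted file produce no record at all, which is the source of the bandwidth savings. Both name lists must be sorted before they can be walked in step.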
At the server, files included in the backup are processed in the order in which they were backed up, that is, in depth-first order, ascending alphabetically. These files are sequentially merged with the previous backup data, using alphabetic comparisons to determine whether a file is new or part of a prior backup. Like files are merged or indexed block by block to update and verify the new delta block changes. New files, those that were not present in any of the prior backups, are merged in their entirety. Deleted files are ignored, that is, treated as no longer part of the backup data.
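A server-side merge of this kind could look roughly like the sketch below. The session format (a list of path and block-dictionary pairs, with an empty dictionary for an unchanged file), the 4-byte block size, and the name `merge_backup` are all assumptions made for illustration. The sketch also assumes the session arrives in ascending alphabetical order and compares lower-cased full paths as a simplification.

```python
BLOCK_SIZE = 4  # must match the block size used by the client (assumption)


def merge_backup(previous, session):
    """Merge one backup session into the previous backup data.

    previous: {path: bytes} held by the server from the last backup.
    session:  [(path, {block_index: bytes}), ...] in ascending order.
    Returns the new backup as {path: bytes}.
    """
    old_names = sorted(previous, key=str.lower)
    j = 0
    merged = {}
    for path, blocks in session:
        # Files in the previous backup that sort before `path` and are
        # absent from the session were deleted: they are not carried over.
        while j < len(old_names) and old_names[j].lower() < path.lower():
            j += 1
        if j < len(old_names) and old_names[j].lower() == path.lower():
            # Like file: overlay the changed blocks on the stored copy.
            data = bytearray(previous[old_names[j]])
            for index, block in sorted(blocks.items()):
                start = index * BLOCK_SIZE
                data[start:start + BLOCK_SIZE] = block
            merged[path] = bytes(data)
            j += 1
        else:
            # New file: the session carries every block, so join them.
            merged[path] = b"".join(b for _, b in sorted(blocks.items()))
    return merged


previous = {
    "C:\\boot.ini": b"AAAABBBB",
    "C:\\old.txt": b"gone",
    "C:\\same.dat": b"12345678",
}
session = [
    ("C:\\boot.ini", {1: b"CCCC"}),
    ("C:\\new.txt", {0: b"hell", 1: b"o"}),
    ("C:\\same.dat", {}),
]
merged = merge_backup(previous, session)
print(merged)
```

Because the session lists every file still present, including unchanged ones, the server can recognize a deleted file simply by its absence. A format that omitted unchanged files would need an explicit deletion marker instead.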
Notwithstanding the use of a delta extraction algorithm, backup processes can take a long time when large amounts of data or a complex or extensive data system is involved. Some sub-directories may contain upwards of 10,000 separate files and take so long to sort that problems such as slowdowns, timeouts, or even system crashes frequently occur. Client-side applications known in the art that perform the delta extraction algorithms rely heavily upon sorting routines. These routines sort a log of file changes to allow the changes to be matched with the baseline files that are stored off-site by the data storage manager. The client-side process of extracting the changes and transmitting the log is often time-consuming.