Enterprises as well as individuals are becoming increasingly dependent on computers. As more and more data are generated, the need for efficient and reliable data backup storage systems is increasing. There are a variety of systems in existence today, utilizing both local and network storage for backup.
FIG. 1 is a block diagram illustrating a typical network backup system. Data are generated from a variety of sources, for instance data sources 100, 102 and 104. During the backup operation, the data sources stream their data contents to backup server 106. The backup server receives the data streams, optionally processes the data streams, and sends the data to backup devices such as tape 108 and data organizer 110. Data organizer 110 processes the data received and writes the data to a storage device 112, which can be a single disk or a disk array. The data organizer can be a device separate from the backup server or a part of the backup server.
During a backup operation, the data from the data sources are copied to the backup devices. Commonly, a substantial amount of data from each of the data sources remains the same between two consecutive backups, and sometimes there are several copies of the same data. Thus, the system would be more efficient if unchanged data were not replicated.
There have been attempts to prevent redundant copying of data that stay the same between backups. One approach is to divide the data streams from the data sources into segments and store the segments in a hash table on disk. During subsequent backup operations, the data streams are again segmented and the segments are looked up in the hash table to determine whether a data segment was already stored previously. If an identical segment is found, the data segment is not stored again; otherwise, the new data segment is stored. Other approaches include storing the segments in a binary tree and determining whether an incoming segment should be stored by searching the binary tree.
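The hash-table approach described above can be illustrated with a minimal sketch. It is not taken from any particular existing system and rests on several assumptions made only for illustration: segments are of a fixed size, each segment is keyed by its SHA-256 digest, two segments with equal digests are treated as identical, and an in-memory dictionary stands in for the on-disk hash table. The names SEGMENT_SIZE, split_into_segments and backup are hypothetical.

```python
import hashlib

# Hypothetical segment size; kept tiny so the example is easy to follow.
SEGMENT_SIZE = 4


def split_into_segments(data: bytes, size: int = SEGMENT_SIZE) -> list[bytes]:
    """Divide a data stream into fixed-size segments (an assumed scheme)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def backup(stream: bytes, table: dict[bytes, bytes]) -> int:
    """Store only segments not already in the table.

    `table` is an in-memory stand-in for the on-disk hash table.
    Returns the number of segments newly stored by this backup.
    """
    newly_stored = 0
    for segment in split_into_segments(stream):
        # Assumption: equal digests are treated as identical segments.
        key = hashlib.sha256(segment).digest()
        if key not in table:
            table[key] = segment
            newly_stored += 1
    return newly_stored


if __name__ == "__main__":
    table: dict[bytes, bytes] = {}
    first = backup(b"aaaabbbbcccc", table)   # every segment is new
    second = backup(b"aaaabbbbdddd", table)  # only the last segment changed
    print(first, second, len(table))
```

In this sketch every segment of every backup triggers one table lookup. When the table resides on disk rather than in memory, each such lookup can become a disk access.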
While such approaches achieve some efficiency gains by not copying the same data twice, they incur significant disk input/output (I/O) overhead as a result of constantly accessing the disk to search for the data segments. Also, the searching techniques employed in existing systems often involve searching for a segment identifier (ID) in a database, which becomes less efficient as the size of the database grows. It would be desirable to have a backup system that reduces the disk I/O overhead and increases search efficiency while eliminating unnecessary data replication.