1. Field of the Invention
The invention relates to the field of data storage. More specifically, the invention relates to storing composite data streams.
2. Background of the Invention
The amount of data to be stored continues to grow. In particular, the size of the applications and the data generated therefrom is increasing. Moreover, systems/users are backing up multiple copies of a given set of data to maintain multiple versions. For example, snapshots of a given database stored in a server are copied and stored over time, thereby allowing a given version/snapshot of a set of data to be restored.
There are existing backup systems that use what are called composite data streams. FIG. 1 is a diagram of composite data streams generated for storage as a backup according to the prior art. In FIG. 1, at a first time a constituent user data stream 103 is being backed up. The contents of the constituent user data stream are conceptually illustrated as a series of letters “APKLZATUALMNOAKAPLY . . . ” These letters may represent a variety of different levels of granularity of data and/or boundaries, including fixed-size chunks regardless of file boundaries, different files, fixed-size chunks within file boundaries, etc. The constituent user data stream 103 is combined (e.g., multiplexed) with a constituent administrative data stream 104 to form a composite data stream 101 (e.g., a first snapshot) for backup storage. In other words, the constituent user data stream 103 is broken into data stream blocks that are interleaved with data stream blocks of the constituent administrative data stream 104 (e.g., tape markers, time stamps, hashes, error correction data, etc.).
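By way of illustration only, the interleaving described above may be sketched as follows. The block size of four letters, the placeholder administrative blocks ("<m0>", "<m1>", etc.), the function names, and the choice to place an administrative block before each user data block are all assumptions made for this sketch; the prior art systems described here structure their composite data streams in different ways.

```python
from itertools import zip_longest


def split_blocks(data, block_size):
    """Break a stream into consecutive blocks of at most block_size symbols."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def multiplex(user_stream, admin_blocks, block_size):
    """Interleave user data blocks with administrative blocks.

    Assumption for this sketch: one administrative block precedes each
    user data block. Real backup formats differ.
    """
    composite = []
    user_blocks = split_blocks(user_stream, block_size)
    for admin, user in zip_longest(admin_blocks, user_blocks):
        if admin is not None:
            composite.append(admin)
        if user is not None:
            composite.append(user)
    return composite


# Hypothetical stand-ins for streams 103 and 104 of FIG. 1.
user_stream_103 = "APKLZATUALMNOAKAPLY"
admin_stream_104 = ["<m0>", "<m1>", "<m2>", "<m3>", "<m4>"]

composite_101 = multiplex(user_stream_103, admin_stream_104, block_size=4)
print(composite_101)
```

In this sketch the composite data stream is simply a list that alternates administrative blocks and user data blocks, which is enough to show why a change in either constituent stream changes the composite.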
A dashed line in the middle of FIG. 1 separates a second backup operation performed at a later time (a second time). In particular, at this later time the user data has been modified, and thus a constituent user data stream 105 is formed. The constituent user data stream 105 is conceptually illustrated as “APKLZAUALMNOAKAPLY . . . ” Thus, the difference between the constituent user data streams 103 and 105 is that the “T” has been removed from the constituent user data stream 105. The constituent user data stream 105 is combined with a constituent administrative data stream 106 to form a composite data stream 109 (e.g., a second snapshot) for backup storage. Since the constituent user data stream 105 is different from the constituent user data stream 103, the resulting composite data stream 101 is different from the composite data stream 109 (even if the constituent administrative data streams 104 and 106 are the same). In particular, at least certain of the data stream blocks of the constituent user data stream 103 in the composite data stream 101 contain different data than the data stream blocks of the constituent user data stream 105 in the composite data stream 109. Similarly, if the constituent administrative data streams 104 and 106 were different, the resulting composite data streams 101 and 109 would be different even if the user data (the constituent user data streams 103 and 105) had remained the same.
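As a hedged illustration of why the data stream blocks differ, the following sketch splits the two example letter sequences into fixed-size blocks and lists the block positions whose contents do not match. The block size of four letters and the helper name are assumptions made only for this example.

```python
def split_blocks(data, block_size):
    """Break a stream into consecutive blocks of at most block_size symbols."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Letter sequences from FIG. 1; the "T" is absent from the second one.
blocks_103 = split_blocks("APKLZATUALMNOAKAPLY", 4)
blocks_105 = split_blocks("APKLZAUALMNOAKAPLY", 4)

# Positions at which the two streams carry different block contents.
changed = [i for i, (a, b) in enumerate(zip(blocks_103, blocks_105)) if a != b]

print(blocks_103)  # ['APKL', 'ZATU', 'ALMN', 'OAKA', 'PLY']
print(blocks_105)  # ['APKL', 'ZAUA', 'LMNO', 'AKAP', 'LY']
print(changed)     # [1, 2, 3, 4]
```

Under these assumptions, removing a single letter shifts every later block boundary, so only the first block remains identical between the two streams.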
To provide an exemplary use of composite data streams, backup clients residing on different computers of a local area network may be provided with and/or collect data to be backed up on their respective computers. This data to be backed up may or may not be in the form of a composite data stream as a result of the application(s) which created it. These backup clients may each transmit (e.g., over a network) data streams (e.g., constituent user data streams, which themselves may be composite data streams) to a backup server that forms composite data streams (e.g., by combining a constituent user data stream with one or more other constituent user data streams and/or an administrative data stream). It should thus be understood that there may be multiple layers of composite data streams. The backup server periodically transmits (e.g., directly or over a network) these composite data streams to a storage server (e.g., a network file server, a tape library emulator server, etc.) for storage, and maintains a catalog of the backups it is managing and what it has stored therein. Although forming composite data streams is common, different backup systems structure composite data streams differently (e.g., certain backup systems use fixed length blocks of user data separated by administrative data blocks; other backup systems punctuate variable length user files with administrative data; etc.).
Typically, much of the data across different snapshots remains the same (e.g., there is little difference between the constituent user data streams 103 and 105). For example, if the data is backed up for a given user on a daily basis and such user is updating only one of a number of files on a given day, the data in this file is the only data that has been modified. As a result, storage servers that store entire composite data streams are relatively inefficient in that they store large amounts of redundant data.
There are some backup systems that allow for the sharing of data across a number of different snapshots/versions to reduce the amount of data being stored. Such backup systems are referred to as segment reuse backup systems. Segment reuse backup systems typically operate by breaking up the data for each snapshot into segments. The segments of a current snapshot are compared to the segments of a previous snapshot to determine if there are matching segments. For any segments that match, only a pointer to the segment of the previous snapshot needs to be stored to back up that segment from the current snapshot. In this manner, the efficiency of the backup system is improved by reducing the storage of redundant data.
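The segment comparison described above may be sketched, purely as an illustration, as follows. The function names, the ("pointer", index) and ("data", segment) entry layout, the lookup by exact segment content, and the file-like segment values are all assumptions of this sketch and are not taken from any particular segment reuse backup system.

```python
def backup_with_reuse(previous_segments, current_segments):
    """Record the current snapshot as pointers into the previous snapshot
    where a segment matches, and as literal data where it does not."""
    # Map each previous segment's content to the index of its first occurrence.
    first_index = {}
    for i, segment in enumerate(previous_segments):
        first_index.setdefault(segment, i)

    entries = []
    for segment in current_segments:
        if segment in first_index:
            entries.append(("pointer", first_index[segment]))
        else:
            entries.append(("data", segment))
    return entries


def restore(previous_segments, entries):
    """Rebuild the current snapshot from the stored entries."""
    return [
        previous_segments[value] if kind == "pointer" else value
        for kind, value in entries
    ]


# Hypothetical snapshots in which only one of three files changed.
previous = ["fileA-v1", "fileB-v1", "fileC-v1"]
current = ["fileA-v1", "fileB-v2", "fileC-v1"]

entries = backup_with_reuse(previous, current)
print(entries)  # [('pointer', 0), ('data', 'fileB-v2'), ('pointer', 2)]
print(restore(previous, entries) == current)  # True
```

In this example only the modified segment is stored as new data, while the two unchanged segments are represented by pointers into the previous snapshot.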