Data deduplication is a data storage technique that eliminates or reduces redundant data, such that one unique instance of the data is retained on storage media instead of multiple instances of the same data. Typically, the redundant instances are replaced with a pointer to the single retained instance. For example, a typical email system might contain 100 instances of the same file attachment. If the email platform is backed up or archived without deduplication, all 100 instances are saved. When data deduplication is utilized, only one instance of the attachment is actually stored. Each subsequent instance is referenced back to the one saved copy.
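The email attachment example can be illustrated with a minimal sketch of a content-addressed store. The class name `DedupStore`, its methods, the mailbox paths, and the choice of SHA-256 as the content identifier are assumptions made for illustration only; they do not describe any particular product.

```python
import hashlib


class DedupStore:
    """Toy store that keeps one instance of each unique payload (hypothetical)."""

    def __init__(self):
        self.blobs = {}     # digest -> payload bytes (the single stored instance)
        self.pointers = {}  # logical name -> digest (the pointer to that instance)

    def put(self, name, payload):
        # Identify the payload by a hash of its content (assumed: SHA-256).
        digest = hashlib.sha256(payload).hexdigest()
        # Store the bytes only if this content has not been seen before.
        self.blobs.setdefault(digest, payload)
        # Every logical instance is recorded as a pointer to the stored copy.
        self.pointers[name] = digest
        return digest

    def get(self, name):
        # Follow the pointer back to the single stored instance.
        return self.blobs[self.pointers[name]]


store = DedupStore()
attachment = b"quarterly-report" * 1000

# 100 mailboxes each hold the same attachment.
for i in range(100):
    store.put(f"mailbox-{i}/report.pdf", attachment)

unique_instances = len(store.blobs)
references = len(store.pointers)
print(unique_instances, references)
```

In this sketch the 100 logical attachments resolve to a single stored payload, and each mailbox entry is only a pointer to it.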
Thus, data deduplication promotes more efficient use of disk space, which in turn reduces storage costs and allows for longer disk retention periods. This approach also provides for better recovery time objectives and reduces the need for tape backups. Data deduplication also reduces the amount of data that must be sent across a wide area network (WAN) for remote backups, replication, and disaster recovery.
In a network backup environment, a client system may back up data to a remote storage device over a network and coordinate the backup with a storage management server. For instance, the International Business Machines (IBM®) Tivoli® Storage Manager product provides software for client and server systems to back up client data (IBM and Tivoli are registered trademarks of IBM). The client transfers files from its file system to the storage management server. The storage management server maintains a backup database having information on the files sent to the storage management server.
When a file (i.e., a data stream) is sent from the client to the server, file attributes (e.g., file size and file modification time) and ancillary data streams associated with the file (e.g., access control lists, extended attribute streams, and generic alternate data streams) are also sent to the storage management server. The ancillary data streams associated with a file are usually unbounded in size and therefore cannot be stored as attributes in a database. Instead, the ancillary data streams are typically stored in the disk/tape storage. Therefore, these data streams are transmitted within the file's data stream.
The placement of the ancillary data streams in the file is arbitrary. That is, the ancillary data streams may be positioned in front of the file data or after the file data during data transmission. Additionally, these data streams may differ from machine to machine even if the data file is the same. For example, two users who have the exact same data file (e.g., a text file with the same content) may have different metadata (e.g., ownership, permission, creation date, etc.) associated with their respective files. Thus, the two data files will have identical data sections (i.e., including the same text content) and different metadata sections (i.e., including different information for ownership, permission, creation date, etc.).
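The effect of embedding metadata in the transmitted stream can be sketched as follows. The helper `build_stream`, the JSON encoding of the metadata, and the sample ownership and permission values are hypothetical; an actual backup client would use its own stream layout.

```python
import hashlib
import json


def build_stream(file_data, metadata, metadata_first=True):
    """Combine file data and ancillary metadata into one byte stream.

    The layout is an assumption for illustration: the metadata is encoded
    as JSON and placed either in front of or after the file data.
    """
    meta_bytes = json.dumps(metadata, sort_keys=True).encode("utf-8")
    if metadata_first:
        return meta_bytes + file_data
    return file_data + meta_bytes


def digest(data):
    return hashlib.sha256(data).hexdigest()


# Two users hold a text file with exactly the same content.
file_data_a = b"The same text content on both machines.\n"
file_data_b = b"The same text content on both machines.\n"

# Their metadata differs, and so does its placement in the stream.
stream_a = build_stream(file_data_a, {"owner": "alice", "mode": "0644"}, metadata_first=True)
stream_b = build_stream(file_data_b, {"owner": "bob", "mode": "0600"}, metadata_first=False)

data_sections_match = digest(file_data_a) == digest(file_data_b)
streams_match = digest(stream_a) == digest(stream_b)
print(data_sections_match, streams_match)
```

Under these assumptions the data sections hash identically, while the transmitted streams do not, so a comparison made on the whole stream would not recognize the two files as duplicates.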
In current systems, the storage management server lacks knowledge of which portion of the data file includes file data and which portion includes metadata or data related to the ancillary data streams. The current technique for deduplication involves comparing data chunks using brute force or a hashing algorithm, which is inefficient. Methods and systems are needed that can overcome the aforementioned shortcomings.
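One way to picture hash-based chunk comparison is the sketch below. It assumes fixed-size chunks of 8 bytes, SHA-256 digests, and a metadata prefix of differing length in front of identical file data; none of these specifics come from the systems described above, and the function name `chunk_digests` is hypothetical.

```python
import hashlib


def chunk_digests(stream, chunk_size=8):
    """Split a stream into fixed-size chunks and hash each one (assumed scheme)."""
    return [
        hashlib.sha256(stream[i:i + chunk_size]).hexdigest()
        for i in range(0, len(stream), chunk_size)
    ]


# Identical file data on both machines (64 distinct byte values).
file_data = bytes(range(64))

# Hypothetical metadata prefixes of different lengths (7 and 11 bytes).
stream_a = b"own=al;" + file_data
stream_b = b"own=robert;" + file_data

# Chunking the whole stream: the prefixes shift the chunk boundaries.
shared_whole_stream = set(chunk_digests(stream_a)) & set(chunk_digests(stream_b))

# Chunking only the file data, which requires knowing where it begins.
shared_data_only = set(chunk_digests(file_data)) & set(chunk_digests(file_data))

print(len(shared_whole_stream), len(shared_data_only))
```

In this sketch, chunking the whole stream finds no common chunks even though the file data is identical, whereas chunking only the file data finds all eight. The server described above cannot take the second path because it does not know where the file data begins.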