Data deduplication may reduce the amount of storage space used in a storage system by detecting redundant copies of data and preventing them from being stored to the storage system. For example, if multiple instances of a file exist in a deduplicated file system, a data deduplication system may store a single instance of the file and link all instances of the file to the single stored instance. Data deduplication techniques may be useful in a variety of contexts, including archival storage.
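By way of illustration only, the single-instance behavior described above may be sketched as follows. The class name `SingleInstanceStore`, its methods, and the use of SHA-256 content hashes are assumptions made for this example and are not drawn from any particular deduplication system.

```python
import hashlib


class SingleInstanceStore:
    """Hypothetical sketch: stores each distinct content once, keyed by hash."""

    def __init__(self):
        self.blobs = {}  # content digest -> the single stored instance
        self.links = {}  # file name -> content digest

    def put(self, name: str, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        # Keep the content only if no identical instance is already stored.
        self.blobs.setdefault(digest, content)
        # Link this file name to the single stored instance.
        self.links[name] = digest
        return digest

    def get(self, name: str) -> bytes:
        return self.blobs[self.links[name]]


store = SingleInstanceStore()
store.put("a/report.txt", b"quarterly numbers")
store.put("b/report-copy.txt", b"quarterly numbers")  # duplicate content
store.put("c/notes.txt", b"something else")

print(len(store.links), "files linked to", len(store.blobs), "stored instances")
```

In this sketch, the two files with identical content resolve to the same stored instance, so three file names are backed by only two stored copies.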
Unfortunately, traditional data deduplication techniques may perform poorly with some data formats. For example, email messages may contain significant amounts of duplicate information (e.g., one email message may quote another email message in its entirety), but traditional data deduplication techniques may fail to exploit the duplicate information in email messages. For instance, a short reply quoting a long email message may add little new information yet still be distinct from the quoted email message, and so may not be deduplicated against the quoted email message as a whole. Alternatively, a traditional deduplication system may divide the short reply and the long quoted email into smaller chunks and/or blocks for individual deduplication. However, because the text of the reply shifts the position of the quoted material, the chunks of the long quoted email may be unlikely to line up with the quoted portion of the short reply, again defeating deduplication efforts. Accordingly, the instant disclosure identifies and addresses a need for additional systems and methods for archiving email messages.
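The chunk-alignment problem noted above may be illustrated with a minimal sketch, assuming fixed-size chunking at 64-byte boundaries and SHA-256 chunk hashes. The message contents, the chunk size, and the helper name `chunk_digests` are hypothetical and chosen only for this example; real systems may use other chunking schemes.

```python
import hashlib


def chunk_digests(data: bytes, size: int) -> set:
    """Split data into fixed-size chunks and return the set of chunk hashes."""
    return {
        hashlib.sha256(data[i:i + size]).digest()
        for i in range(0, len(data), size)
    }


# A long original message made of distinct 40-byte lines (8000 bytes total).
original = b"".join(
    b"Line %04d of the original long message.\n" % i for i in range(200)
)

# A short reply that quotes the original in its entirety after 13 new bytes.
reply = b"Sounds good.\n" + original

CHUNK = 64  # assumed fixed chunk size in bytes

# The 13-byte prefix shifts every chunk boundary within the quoted text.
shared = chunk_digests(original, CHUNK) & chunk_digests(reply, CHUNK)
print("chunks shared with reply as stored:", len(shared))

# If chunking started where the quoted text begins, every chunk would match.
aligned = chunk_digests(original, CHUNK) & chunk_digests(reply[13:], CHUNK)
print("chunks shared when aligned to the quote:", len(aligned))
```

In this contrived example, the reply consists almost entirely of quoted material, yet the 13 bytes of new text shift every chunk boundary so that the reply shares no chunks with the original. Chunking from the start of the quoted portion would instead match all 125 chunks of the original.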