A deduplicating storage file system (“storage appliance”) can efficiently store a large amount of data by removing duplicate data. For example, the first time that a backup is run for a particular data set, the data is stored with little or no deduplication. The second and subsequent times that a backup is run for the same or a similar data set, there is a high degree of redundancy between the first and subsequent backups. Removing this redundancy can save storage space. A deduplication process includes computing a hash, digest, or “fingerprint” of a chunk of data to be stored. This fingerprint can be checked against an index of previously stored fingerprints to determine whether the chunk to be stored is a duplicate of a previously stored chunk. Looking up the fingerprint of a chunk in an on-disk index requires a read I/O (input/output) to disk to determine whether the chunk already exists in storage. If so, a reference to the previously stored chunk is written to disk, rather than the chunk of data itself. A read I/O to determine whether a fingerprint is in a fingerprint index on disk is expensive, because disk access is slow relative to memory access. A single read I/O could bring many fingerprints from disk into cache, to reduce the number of disk read I/Os during deduplication. However, at certain times, such as when data is being backed up for the first time, virtually none of the fingerprints being looked up will be in cache, so fingerprint lookups in cache provide no benefit and waste memory and processor time. In addition, as the data is backed up additional times, the data becomes spread out on disk, and the needed fingerprints may not be in cache at all.
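By way of illustration only, the fingerprint lookup described above can be sketched as follows. The sketch is not taken from any particular storage appliance: the class name DedupStore, the use of SHA-256 as the fingerprint, the fixed 4096-byte chunk size, and the in-memory dictionary standing in for the on-disk fingerprint index are all assumptions made for the example. The counter index_lookups marks each point at which a real system might incur a read I/O to disk.

```python
import hashlib


class DedupStore:
    """Toy in-memory model of a fingerprint-indexed chunk store (hypothetical)."""

    def __init__(self):
        # fingerprint -> position of the chunk in self.chunks.
        # Assumption: a dict stands in for the on-disk fingerprint index.
        self.index = {}
        # Unique chunk payloads, standing in for chunks written to disk.
        self.chunks = []
        # Each index lookup could cost a disk read I/O in a real system.
        self.index_lookups = 0

    def write(self, data, chunk_size=4096):
        """Store data and return a list of references to stored chunks."""
        references = []
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            # Assumption: SHA-256 is used as the chunk fingerprint.
            fingerprint = hashlib.sha256(chunk).digest()
            self.index_lookups += 1
            position = self.index.get(fingerprint)
            if position is None:
                # New chunk: store the data and record its fingerprint.
                position = len(self.chunks)
                self.chunks.append(chunk)
                self.index[fingerprint] = position
            # Duplicate or not, only a reference is recorded for this write.
            references.append(position)
        return references

    def read(self, references):
        """Reassemble the original data from its chunk references."""
        return b"".join(self.chunks[position] for position in references)
```

In this sketch, writing the same data set a second time adds no new chunks and returns the same references as the first write, yet every chunk still costs one index lookup. That per-chunk lookup is the cost that the caching discussed above is meant to reduce.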
Current methods of reading and writing in a deduplication storage system are inefficient and do not take into account temporal and locality properties of the data stored in a deduplication storage system.