Many entities today store large amounts of data in various forms, including backup data. While protecting data is a wise approach, storage capacity and budgets are not limitless. Large amounts of data can constrain systems and detrimentally impact performance. The sheer volume of data makes it difficult to maintain the system speeds achievable with smaller amounts of data.
To address some of these problems, some computing systems, including backup systems, deduplicate the data. While this can reduce the amount of storage required to store the data, it also introduces complexities related to the deduplication process. In order to deduplicate the data, it is necessary to identify the duplicate data. This can require a significant amount of storage, processing and overhead. Further, it is necessary to store information that will allow the system to identify data that is a duplicate of existing data.
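As an illustration of the kind of information a deduplicating system may need to keep, the following minimal sketch stores each distinct chunk of data once and records a fingerprint for it. The use of SHA-256 fingerprints, the pre-chunked input, and the names `deduplicate` and `restore` are assumptions made for this example only; they are not taken from any particular system.

```python
import hashlib


def deduplicate(chunks):
    """Store each distinct chunk once; return (store, recipe).

    Assumption: data arrives already split into chunks (bytes objects)
    and a SHA-256 digest is an acceptable fingerprint for a chunk.
    """
    store = {}   # fingerprint -> chunk bytes (index plus stored data)
    recipe = []  # ordered fingerprints needed to rebuild the input
    for chunk in chunks:
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in store:      # new data: keep one copy
            store[fp] = chunk
        recipe.append(fp)        # duplicates add only a reference
    return store, recipe


def restore(store, recipe):
    """Rebuild the original chunk sequence from the stored chunks."""
    return [store[fp] for fp in recipe]
```

In this sketch the dictionary of fingerprints is the stored information that lets the system recognize duplicates, and it grows with the number of distinct chunks, which is the source of the storage and processing overhead described above.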
Typically, the information needed to perform deduplication is stored in fast memory such as RAM or flash memory. However, this memory is often smaller and more expensive than conventional disk storage. As a result, the entire database or index used to perform deduplication often cannot be stored in the fast memory. Alternatively, storing the entire index in fast memory can prevent the fast memory from being used for other purposes. Either way, performance is affected. When less than the entire index is stored in fast memory, additional problems arise. One problem is that, in order to determine whether certain data is a duplicate, it becomes necessary to access the portion of the index stored in slower memory. Each such disk access may slow the deduplication process. Systems and methods are needed to perform and improve the deduplication process.
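The cost of keeping only part of the index in fast memory can be sketched as follows. This is a hypothetical model, not a described implementation: the class `TieredIndex`, the least-recently-used eviction policy, and the `slow_lookups` counter are assumptions introduced for illustration, and an in-memory set stands in for the full index held on slower storage.

```python
from collections import OrderedDict


class TieredIndex:
    """Small fast-memory cache in front of a full index on slower storage."""

    def __init__(self, full_index, cache_size):
        self.slow = set(full_index)   # stand-in for the on-disk index
        self.cache = OrderedDict()    # fast memory, limited capacity
        self.cache_size = cache_size
        self.slow_lookups = 0         # counts simulated disk accesses

    def contains(self, fingerprint):
        # Fast path: the entry is already in fast memory.
        if fingerprint in self.cache:
            self.cache.move_to_end(fingerprint)
            return True
        # Slow path: consult the full index (a disk access in practice).
        self.slow_lookups += 1
        found = fingerprint in self.slow
        if found:
            self.cache[fingerprint] = True
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)  # evict least recently used
        return found
```

Under these assumptions, every lookup that misses the cache, including every lookup for data that is not a duplicate, increments `slow_lookups`, which models the disk accesses that can degrade deduplication performance.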