Technical Field
Aspects of the present disclosure relate to data storage, and more particularly, to apparatus and methods for storing and deduplicating large amounts of data.
Discussion
Systems tasked with storing and accessing large amounts of data face significant challenges due to the volume of information these systems must process. These “big data” challenges include providing adequate processing performance (e.g., speed and throughput) while judiciously consuming data storage space. Data storage and protection systems constitute one class of systems that face these challenges. For example, it is preferable that data storage and protection systems maintain data throughput high enough to ingest and store terabytes, petabytes, or even more data within restricted time windows while consuming as little data storage space as possible.
Data protection and deduplication technologies have conventionally been grouped into well-defined categories. For example, deduplication methodologies are typically categorized as either inline or post-process. Inline methodologies attempt to identify duplicate data as a backup data stream is introduced to a system and before that data is written to backend data storage. Post-process methodologies allow data to be stored and subsequently identify duplicate data within the previously stored data. Similar groupings exist with respect to data protection platform architectures, which are conventionally categorized as either single node or multi-node.
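The distinction between inline and post-process deduplication can be illustrated with a minimal sketch. The sketch below is illustrative only and is not drawn from any particular system: it assumes fixed-size chunks, SHA-256 fingerprints, and an in-memory dictionary standing in for backend data storage, and the names CHUNK_SIZE, split_chunks, inline_ingest, post_process, and restore are hypothetical. Hash-based duplicate detection is used in both paths purely to keep the comparison simple; a post-process system may identify duplicates by other means.

```python
import hashlib

# Hypothetical chunk size, kept tiny so the example is easy to follow.
CHUNK_SIZE = 4


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def inline_ingest(stream: bytes, store: dict) -> list:
    """Inline: look for duplicates BEFORE anything is written to storage.

    Only chunks whose fingerprint is not yet in the store are written.
    Returns the ordered list of fingerprints needed to rebuild the stream.
    """
    recipe = []
    for chunk in split_chunks(stream):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk  # first occurrence: write it
        recipe.append(digest)      # duplicates are recorded by reference only
    return recipe


def post_process(stored_chunks: list) -> tuple:
    """Post-process: chunks were already stored as-is; deduplicate afterward.

    Returns a deduplicated store plus the recipe that replaces the raw copy.
    """
    store = {}
    recipe = []
    for chunk in stored_chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe


def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original byte stream from a recipe and a store."""
    return b"".join(store[digest] for digest in recipe)
```

In this sketch the inline path never writes a duplicate chunk but performs a fingerprint lookup on every chunk during ingest, whereas the post-process path temporarily holds every chunk, duplicates included, and pays the deduplication cost later.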
Historically, there has also been a coupling between respective data protection platform architectures and corresponding deduplication methodologies. More specifically, single node architectures have employed inline hash-based algorithms of one form or another, whereas multi-node architectures have relied on byte differential deduplication. There are advantages and disadvantages to each of these designs, as well as technical reasons behind the canonical pairing of deduplication methodology with data protection architecture.