Contemporary businesses accumulate tremendous amounts (e.g., petabytes) of data in databases that are stored on many kinds of media, such as tapes, hard disk drives, and solid state drives (SSDs). Legal requirements, government rules and regulations, and business rules and best practices require that the databases be archived and backed up frequently. Consequently, thousands of petabytes (PBs) of data are already being stored, and the amount of stored data continues to skyrocket.
Data deduplication methods and systems are used to reduce the amount of data that must be stored, thereby increasing efficiency and reducing costs. In general, a deduplication system finds identical parts in different data files and stores those identical parts only once. The deduplication system also maintains metadata so that the data files can be organized and rebuilt at a later time when they are accessed. However, the tremendous amounts of data being stored test the limits of existing deduplication methods and systems. Current deduplication methods and systems work well for several petabytes of data but are not designed for amounts of data on the scale of thousands of petabytes.
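The store-once-and-rebuild idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `DedupStore` class, the fixed-size chunking, the 4-byte chunk size, and the use of SHA-256 as the signature are not taken from any particular deduplication system.

```python
import hashlib

CHUNK_BYTES = 4  # deliberately tiny so the example is easy to follow


class DedupStore:
    """Hypothetical in-memory store that keeps each identical chunk once."""

    def __init__(self, chunk_bytes=CHUNK_BYTES):
        self.chunk_bytes = chunk_bytes
        self.chunks = {}   # signature -> chunk bytes (stored only once)
        self.recipes = {}  # file name -> ordered list of signatures (metadata)

    def write(self, name, data):
        recipe = []
        for start in range(0, len(data), self.chunk_bytes):
            chunk = data[start:start + self.chunk_bytes]
            signature = hashlib.sha256(chunk).digest()
            # Keep the chunk only if no identical chunk is already stored.
            self.chunks.setdefault(signature, chunk)
            recipe.append(signature)
        self.recipes[name] = recipe

    def read(self, name):
        # Rebuild the file from its metadata (the ordered signature list).
        return b"".join(self.chunks[sig] for sig in self.recipes[name])


store = DedupStore()
store.write("a.bin", b"AAAABBBBCCCC")
store.write("b.bin", b"AAAABBBBDDDD")
print(len(store.chunks))  # 4 unique chunks are kept for 6 chunks written
```

Both files can still be rebuilt in full from their recipes, even though the two chunks they share are stored only once.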
The use of SSDs (instead of other storage media such as random access memory (RAM)) to store large amounts of data presents some challenges. SSDs have longer read and write latencies relative to, for example, double data rate type three synchronous dynamic RAM (DDR3 DRAM). Also, SSDs are erased before being written to and can only be erased a limited number of times before wearing out.
On the other hand, SSDs have a number of advantages that make them a good choice for storing large amounts of data. For deduplication, files are split into blocks or fragments commonly referred to as “chunks” (e.g., four kilobyte (KB), 16 KB, or 256 KB chunks) with associated metadata. Each unique chunk is stored with its metadata. The metadata may be, for example, 16 bytes (B), 32 B, 128 B, or 256 B in size. For 512 PB of data, assuming each chunk is 16 KB in size and also assuming 32 B of metadata per chunk, the storage space for just the metadata is one PB. Storing this amount of data is not practical using RAM, but is practical using SSDs.
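The one-petabyte figure can be checked with a short calculation. The sketch below assumes binary units (1 KB = 1024 B, 1 PB = 1024^5 B) and assumes that every chunk is unique and therefore carries its own metadata; the function name is hypothetical.

```python
KB = 1024
PB = 1024 ** 5


def metadata_footprint_bytes(data_bytes, chunk_bytes, metadata_bytes_per_chunk):
    """Bytes of metadata needed if every chunk carries its own record."""
    num_chunks = data_bytes // chunk_bytes
    return num_chunks * metadata_bytes_per_chunk


footprint = metadata_footprint_bytes(512 * PB, 16 * KB, 32)
print(footprint / PB)  # 1.0, i.e. one PB of metadata
```

With larger metadata records the footprint scales linearly: at 128 B per chunk the same data would need four PB of metadata.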
Also, to satisfy the aforementioned regulations and requirements, the metadata needs to be written to persistent storage. When power to RAM is lost or interrupted, the data held by the RAM is lost. SSDs, by contrast, use non-volatile memory such as NAND-based flash memory, which retains data without power.
Thus, the advantages of SSDs include their capacity and non-volatility. To mitigate their longer access time (read and write latencies), data is written in parallel. The basic unit of each SSD read/write operation is referred to as a page. For a page size of 16 KB, assuming 128 B of metadata per chunk, the metadata for 128 chunks can be read or written in parallel within a page.
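As a rough sketch of how fixed-size metadata records fill a page, the code below packs 128 B records into a 16 KB page and unpacks them again. The flat back-to-back layout, the zero padding, and the function names are assumptions for illustration only, not a description of any actual SSD format.

```python
PAGE_BYTES = 16 * 1024
RECORD_BYTES = 128  # metadata per chunk, as in the example above


def records_per_page(page_bytes=PAGE_BYTES, record_bytes=RECORD_BYTES):
    return page_bytes // record_bytes


def pack_page(records):
    """Lay fixed-size records back to back and zero-pad to a full page."""
    if len(records) > records_per_page():
        raise ValueError("too many records for one page")
    for record in records:
        if len(record) != RECORD_BYTES:
            raise ValueError("every record must be exactly RECORD_BYTES long")
    return b"".join(records).ljust(PAGE_BYTES, b"\x00")


def unpack_page(page):
    """Split a page back into its fixed-size record slots."""
    return [page[i:i + RECORD_BYTES] for i in range(0, PAGE_BYTES, RECORD_BYTES)]


print(records_per_page())  # 128
```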
The metadata for each chunk includes a hash value, or signature, that uniquely identifies the chunk. Hence, to determine whether it is necessary to store a new chunk (to determine whether an identical chunk has been previously stored), the signature for the new chunk can be compared to signatures for chunks that have already been stored. If the signature for the new chunk matches an existing signature, then the new chunk does not need to be stored.
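The signature comparison can be sketched as a lookup keyed by signature. In this illustration SHA-256 stands in for the signature function, a Python dictionary stands in for the stored signatures, and the `ChunkMetadata` record with its `ref_count` field is hypothetical; like the text above, the sketch treats equal signatures as identical chunks.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ChunkMetadata:
    signature: bytes
    ref_count: int = 1  # hypothetical field: how many times the chunk was seen


class SignatureLibrary:
    """Hypothetical index of signatures for chunks that are already stored."""

    def __init__(self):
        self._by_signature = {}

    def register(self, chunk):
        """Return True if the chunk is new and therefore needs to be stored."""
        signature = hashlib.sha256(chunk).digest()
        existing = self._by_signature.get(signature)
        if existing is not None:
            existing.ref_count += 1  # match found: nothing new to store
            return False
        self._by_signature[signature] = ChunkMetadata(signature)
        return True

    def ref_count(self, chunk):
        meta = self._by_signature.get(hashlib.sha256(chunk).digest())
        return 0 if meta is None else meta.ref_count


library = SignatureLibrary()
print(library.register(b"chunk-1"))  # True: first time seen, store it
print(library.register(b"chunk-1"))  # False: signature matches, skip storing
```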
As noted above, the basic unit of an SSD read/write operation is a page. To get the signature of a chunk for comparison to other signatures, an entire page (e.g., 16 KB) is read and transferred from the SSD to the central processing unit (CPU). This transfer can consume a significant amount of resources on the CPU as well as memory bandwidth and bus bandwidth.
More specifically, a client with data to be stored on a storage server will split the data into chunks and calculate a signature for each chunk. In an implementation, the client sends each signature to a signature server that holds a library of signatures for chunks already stored on the storage server. The signature server's role is to determine whether the signatures from the client match any of the signatures in the signature library. To accomplish this, an entire page (e.g., 16 KB) is transferred to memory for each signature, and the CPU will locate and extract the signature within the page and compare the extracted signature to the signatures from the client. However, a signature may be only 32 B in size. Thus, to get a signature for comparison to other signatures, roughly 500 times more data than is needed is read and transferred (e.g., 16 KB of data is read to get a 32 B signature, a ratio of 512 to 1).
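The overhead of this conventional flow can be made concrete with a small sketch of the server-side step: a full page is brought into memory, one signature is located and extracted, and it is compared with the client's signature. The record layout (128 B records with the 32 B signature at the start of each record) and the function names are assumptions for illustration.

```python
PAGE_BYTES = 16 * 1024
RECORD_BYTES = 128    # assumed metadata record size
SIGNATURE_BYTES = 32  # assumed to sit at the start of each record


def extract_signature(page, slot):
    """Locate and extract one signature from a page already read into memory."""
    if len(page) != PAGE_BYTES:
        raise ValueError("expected a full page")
    start = slot * RECORD_BYTES
    if not 0 <= start <= PAGE_BYTES - RECORD_BYTES:
        raise IndexError("slot is outside the page")
    return page[start:start + SIGNATURE_BYTES]


def matches(page, slot, client_signature):
    return extract_signature(page, slot) == client_signature


def read_amplification():
    """Bytes transferred per byte of signature actually needed."""
    return PAGE_BYTES // SIGNATURE_BYTES


# Build a page whose slot 3 holds a known signature, then compare against it.
page = bytearray(PAGE_BYTES)
signature = bytes(range(SIGNATURE_BYTES))
page[3 * RECORD_BYTES:3 * RECORD_BYTES + SIGNATURE_BYTES] = signature
print(matches(bytes(page), 3, signature))  # True
print(read_amplification())                # 512
```

Only 32 of the 16,384 bytes moved across the bus take part in the comparison; the rest of the transfer is overhead.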
Furthermore, based on the number of clients that are requesting signature comparisons and the number of signature servers, the number of comparisons per signature server can be estimated. Each comparison requires at least two input/output (I/O) accesses, so the number of I/O operations per second (IOPS) per signature server can also be estimated. Considering CPU and SSD capabilities, the IOPS requirements turn out to be so large that a large number of signature servers are needed, and it is also necessary to use more expensive, higher bandwidth Peripheral Component Interconnect Express (PCIe) SSDs to provide the necessary capacity.
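The estimate described above reduces to simple arithmetic, sketched below. The client count, per-client signature rate, and server count are purely hypothetical inputs chosen for illustration, and the sketch assumes comparisons are spread evenly across signature servers. Because each comparison needs at least two I/O accesses, the result is a lower bound.

```python
IO_ACCESSES_PER_COMPARISON = 2  # "at least two", so results are lower bounds


def min_iops_per_server(num_clients, signatures_per_client_per_second, num_servers):
    """Lower bound on IOPS per signature server, assuming an even spread."""
    comparisons_per_second = num_clients * signatures_per_client_per_second
    comparisons_per_server = comparisons_per_second / num_servers
    return comparisons_per_server * IO_ACCESSES_PER_COMPARISON


# Hypothetical load: 1,000 clients, each sending 10,000 signatures per second,
# spread over 10 signature servers.
print(min_iops_per_server(1_000, 10_000, 10))  # 2000000.0
```

Under these made-up inputs each server would need to sustain at least two million I/O operations per second, which illustrates why either more servers or faster SSDs are required.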
In summary, conventional deduplication methods are inefficient and expensive, and they occupy significant amounts of CPU, memory, and bus resources.