Although tape has long been the dominant storage medium for data protection in large-scale data storage systems, tape-based systems are being increasingly supplanted by disk-based deduplication systems. Over time, deduplication can deliver an order of magnitude greater data reduction than traditional compression, which would imply that a deduplication system needs fewer disks and that the configured cost of a disk storage system is comparable to that of tape automation. However, it has been observed that deduplication systems use many more disks than expected because of their disk-intensive nature. The conventional way to increase system performance is to use more disks and/or faster, more expensive disks. Unfortunately, this approach can quickly make a deduplication array more expensive than a tape library. It can also waste capacity: each disk comes with a large amount of space, so adding disks for better I/O performance requires paying for capacity that is not needed.
To solve this problem, a Stream-Informed Segment Layout (SISL) architecture has been developed within certain deduplication storage systems such as the Data Domain Operating System (DDOS) by EMC Corporation. SISL optimizes deduplication throughput scalability and minimizes disk footprint by minimizing disk accesses. This allows system throughput to be CPU-centric, so that speed increases can be realized as CPU performance increases. Deduplication involves breaking an incoming data stream into segments in a repeatable way and computing a unique fingerprint for each segment. This fingerprint is then compared to all others in the system to determine whether the segment is unique or redundant, so that only unique data is stored to disk. To clients, the system appears to store the data in the usual way, but internally it does not use disk space to store the same segment repeatedly. Instead, it creates additional references to the previously stored unique segment. In a scalable deduplication system, fingerprints need to be indexed in an on-disk structure. To achieve speed, the system needs to avoid seeking to disk each time it determines whether a fingerprint is unique or redundant. SISL includes a series of techniques that are performed inline in RAM, prior to storage to disk, for quickly separating new unique segments from redundant duplicate segments. In SISL, new data segments for a backup stream are stored together in units called localities that, along with their fingerprints and other metadata, are packed into a container and appended to the log of containers. The fingerprints for the segments in the localities are kept together in a metadata section of the container, along with other file system structural elements. This keeps fingerprints and data that were written together close together on disk (that is, it maintains locality), for efficient access during writes when looking for duplicates and during reads when reconstructing the deduplicated stream.
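The segment, fingerprint, and container flow described above can be illustrated with a minimal Python sketch. It is not the SISL or DDOS implementation. The class and constant names (`DedupStore`, `Container`, `SEGMENT_SIZE`, `SEGMENTS_PER_CONTAINER`) are hypothetical, segments are cut at a fixed size purely for brevity, SHA-1 is an assumed fingerprint function, and the fingerprint index is a plain in-memory dictionary rather than an on-disk structure.

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical sizes, chosen small so the behaviour is easy to follow.
SEGMENT_SIZE = 8            # bytes per segment (fixed-size cut: an assumption)
SEGMENTS_PER_CONTAINER = 4  # unique segments packed into one container


@dataclass
class Container:
    fingerprints: list = field(default_factory=list)  # metadata section
    segments: list = field(default_factory=list)      # data section


class DedupStore:
    def __init__(self):
        self.index = {}          # fingerprint -> (container number, slot)
        self.container_log = []  # append-only log of sealed containers
        self.open = Container()  # container currently being filled

    def _seal(self):
        # Append the open container to the log if it holds any segments.
        if self.open.segments:
            self.container_log.append(self.open)
            self.open = Container()

    def write_stream(self, data: bytes) -> list:
        """Store a stream; return its recipe (ordered list of fingerprints)."""
        recipe = []
        for off in range(0, len(data), SEGMENT_SIZE):
            seg = data[off:off + SEGMENT_SIZE]
            fp = hashlib.sha1(seg).hexdigest()
            if fp not in self.index:  # unique segment: store it once
                if len(self.open.segments) == SEGMENTS_PER_CONTAINER:
                    self._seal()
                # The open container will receive this log position when sealed.
                self.index[fp] = (len(self.container_log),
                                  len(self.open.segments))
                self.open.fingerprints.append(fp)
                self.open.segments.append(seg)
            recipe.append(fp)  # duplicates only add a reference
        self._seal()
        return recipe

    def read_stream(self, recipe: list) -> bytes:
        """Reconstruct a stream from its recipe."""
        parts = []
        for fp in recipe:
            container_no, slot = self.index[fp]
            parts.append(self.container_log[container_no].segments[slot])
        return b"".join(parts)
```

Writing the same stream twice to one `DedupStore` adds only references the second time: the recipe is returned again, but no new container is appended to the log. Because segments written together land in the same container alongside their fingerprints, the sketch also preserves the locality property described above.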
The number of write streams supported in a deduplication backup system is limited mainly by the memory resources available on the server. To increase the number of streams, either the amount of resources must be increased or the existing resources must be used more efficiently. In present systems, each data stream takes one container to preserve the locality of the data. Current systems typically use NVRAM (non-volatile RAM), into which data streams are written before they are committed to disk. This provides a faster response time for client operations because data can be written asynchronously, ahead of the slower disk write process. However, NVRAM can support only a certain number of streams; that is, the NVRAM configuration (available size) limits the number of concurrent streams for user data and for file metadata. Thus, the capacity of NVRAM represents a bottleneck with regard to how many streams can be processed concurrently in the backup system. Optimizing the use of NVRAM requires minimizing the number of containers. For example, in present systems, each data stream and its associated metadata stream each use one container. Reducing the number of containers required would increase the efficiency of NVRAM usage.
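The relationship between NVRAM capacity, containers per stream, and the number of concurrent streams can be made concrete with a short calculation. The following Python sketch is illustrative only: the function name and the sizes (2 GiB of NVRAM, 4 MiB containers) are hypothetical and are not taken from any particular system, and the model assumes that all of the NVRAM is available for open containers.

```python
MiB = 1024 ** 2


def max_concurrent_streams(nvram_bytes: int,
                           container_bytes: int,
                           containers_per_stream: int) -> int:
    """Streams that fit if each stream keeps its open containers in NVRAM."""
    if container_bytes <= 0 or containers_per_stream <= 0:
        raise ValueError("sizes and counts must be positive")
    return nvram_bytes // (container_bytes * containers_per_stream)


# Hypothetical configuration, for illustration only.
NVRAM_BYTES = 2048 * MiB
CONTAINER_BYTES = 4 * MiB

# One container for data plus one for metadata per stream.
two_per_stream = max_concurrent_streams(NVRAM_BYTES, CONTAINER_BYTES, 2)  # 256

# A single container per stream.
one_per_stream = max_concurrent_streams(NVRAM_BYTES, CONTAINER_BYTES, 1)  # 512
```

Under these assumed figures, halving the number of containers each stream requires doubles the number of streams the same NVRAM can support, which is the efficiency gain referred to above.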
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions. EMC, Data Domain, Data Domain Restorer, and Data Domain Boost are trademarks of EMC Corporation.