Data storage is a critical component of computing. A computing device includes a storage area in the system to store data for access by the operating system and applications. In a distributed environment, additional data storage may be a separate device that the computing device accesses for regular operations. In an enterprise environment, data stored in the storage area of the computing device or in the additional data storage is often replicated to one or more offsite storage devices as part of a global disaster recovery (DR) strategy, protecting the entire organization by maintaining one or more copies of the data at offsite locations.
The performance of a storage system may be periodically measured against certain benchmarks. To measure that performance, there are cases where a large number of small files must be populated to mimic a customer scenario with a specific locality for a particular process or operation of the storage system, such as garbage collection or an enumeration operation. A conventional method is to utilize the existing storage file system via the corresponding file system protocol. Such an approach has to traverse the entire file system stack, which takes a long period of time to create a large number of small files with a specific locality.
In a deduplicated file system, such as the Data Domain™ file system from EMC® Corporation, two components are responsible for managing the files in the system. The first is the directory manager (DM), which maintains a hierarchical mapping from a path to the inode representing a file. The second is the content store (CS), which manages the content of the file. Each file has a content handle (CH), stored in the inode, that is created by the CS every time the file content changes. Each CH represents a file that is abstracted as a Merkle tree of segments. A file tree can have multiple levels, for example seven levels: L0 through L6. The L0 segments represent user data and are the leaves of the tree; L6 is the root of the segment tree. Segments from L1 to L6 are referred to as metadata segments or Lp segments; they represent the metadata of the file associated with the file tree. An L1 segment is an array of L0 references; similarly, an L2 segment is an array of L1 references, and so on.
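To illustrate the tree structure described above, the following is a minimal sketch, not the actual Data Domain implementation, of building a Merkle tree of segments bottom-up. The `fingerprint` helper, the fanout of four, and the use of SHA-1 are illustrative assumptions; the root fingerprint here stands in for the content handle (CH).

```python
import hashlib

def fingerprint(data: bytes) -> bytes:
    """Content-based reference to a segment (SHA-1 as a stand-in)."""
    return hashlib.sha1(data).digest()

def build_segment_tree(l0_segments, fanout=4):
    """Build the tree bottom-up: each Lk segment is an array of
    fingerprints of L(k-1) segments. Returns (levels, root_fingerprint),
    where levels[0] holds the L0 user-data segments."""
    levels = [list(l0_segments)]
    refs = [fingerprint(s) for s in l0_segments]
    while len(refs) > 1:
        # Group up to `fanout` child references into one parent segment.
        parents = [b"".join(refs[i:i + fanout])
                   for i in range(0, len(refs), fanout)]
        levels.append(parents)
        refs = [fingerprint(p) for p in parents]
    return levels, refs[0]

# Ten L0 chunks -> three L1 segments -> one root; the root fingerprint
# changes whenever any L0 content changes, as a content handle would.
data = [b"chunk-%d" % i for i in range(10)]
levels, ch = build_segment_tree(data)
```

Because every level is derived from fingerprints of the level below, modifying any L0 segment propagates up and yields a different root fingerprint, which mirrors why a new CH is created each time the file content changes.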
A segment is considered live if it can be referenced by any live content in the file system. The file system packs the segments into containers, which are written to disk in a log-structured manner. Each container is structured into sections: the first section is the metadata section, and the following sections are referred to as compression regions (CRs). A CR is a set of compressed segments. The metadata section contains all of the references or fingerprints that identify the segments in the container.
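A container with a metadata section followed by compression regions can be sketched as follows. This is an illustrative model only, not the on-disk format: the dictionary layout, the 4-byte length prefixes, the CR size of four segments, and the use of zlib and SHA-1 are all assumptions.

```python
import hashlib
import zlib

def pack_container(segments, cr_size=4):
    """Pack segments into a container model: a metadata section listing
    the fingerprint of every segment, followed by compression regions
    (CRs), each a compressed batch of segments."""
    metadata = [hashlib.sha1(s).hexdigest() for s in segments]
    compression_regions = []
    for i in range(0, len(segments), cr_size):
        batch = segments[i:i + cr_size]
        # Length-prefix each segment so a CR can be unpacked later.
        raw = b"".join(len(s).to_bytes(4, "big") + s for s in batch)
        compression_regions.append(zlib.compress(raw))
    return {"metadata": metadata, "compression_regions": compression_regions}
```

The key property modeled here is that the metadata section alone identifies every segment in the container, so a reader can check for a segment's presence by fingerprint without decompressing any CR.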
Garbage collection is a form of automatic memory management. The garbage collector, or simply collector, attempts to reclaim garbage, i.e., resources (e.g., storage resources) occupied by files that are no longer in use by the file system. A garbage collection process of the file system is responsible for enumerating all live segments reachable from the live content handles of the file system. A physical garbage collector traverses the segments of all files simultaneously using a breadth-first approach, while a logical garbage collector traverses the file system on a file-by-file basis using a depth-first approach.
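The two traversal orders can be contrasted with a short sketch. The node representation, dictionaries with hypothetical `fp` (fingerprint) and `children` keys, is an assumption for illustration; both routines enumerate the same live set, but in a different order.

```python
def enumerate_logical(trees):
    """Logical GC: depth-first, one file tree at a time."""
    live = []
    for root in trees:
        stack = [root]
        while stack:
            node = stack.pop()
            live.append(node["fp"])
            # Reverse so children are visited left-to-right.
            stack.extend(reversed(node.get("children", [])))
    return live

def enumerate_physical(trees):
    """Physical GC: breadth-first across all file trees at once,
    visiting all segments of one level before descending."""
    live = []
    level = list(trees)
    while level:
        nxt = []
        for node in level:
            live.append(node["fp"])
            nxt.extend(node.get("children", []))
        level = nxt
    return live
```

The practical difference is locality: the physical walk touches all segments of a given level together across every file, while the logical walk completes each file's tree before starting the next.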
A deduplicating system is optimized for ingesting data at a very high throughput to enable small backup windows. A typical backup system from Data Domain® can ingest data at a throughput higher than 2 GB/sec when the data is written through a few large files. In contrast, when many small files are written, the throughput drops to less than 10 MB/sec. This is due to the inherent limitations of the protocol stack and the inability of the deduplication software storage stack to deal with small files efficiently.