The present invention relates generally to data storage systems, and more particularly to systems and methods for improving storage efficiency, compactness, performance, reliability, and compatibility. In computing, a file system specifies an arrangement for storing, retrieving, and organizing data files or other types of data on data storage devices, such as hard disk devices. A file system may include functionality for maintaining the physical location or address of data on a data storage device and for providing access to data files from local or remote users or applications. A file system may include functionality for organizing data files, such as directories, folders, or other container structures for files. Additionally, a file system may maintain file metadata describing attributes of data files, such as the length of the data contained in a file; the time that the file was created, last modified, and/or last accessed; and security features, such as group or owner identification and access permission settings (e.g., whether the file is read-only, executable, etc.).
Many file systems are tasked with handling enormous amounts of data. Additionally, file systems often provide data access to large numbers of simultaneous users and software applications. Users and software applications may access the file system via local communications connections, such as a high-speed data bus within a single computer; local area network connections, such as an Ethernet network or storage area network (SAN) connection; and wide area network connections, such as the Internet, cellular data networks, and other low-bandwidth, high-latency data communications networks.
The term “data deduplication” refers to a process of eliminating redundant data for the purposes of storage or communication. Data deduplicating storage typically compares incoming data with the data already stored, and stores only the portions of the incoming data that do not match data already stored in the data storage system. Data deduplicating storage also maintains metadata to determine when portions of data are no longer in use by any files or other data entities.
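By way of illustration only, the general deduplication process described above may be sketched as follows. The sketch is not taken from any particular system; it assumes fixed-size data portions, SHA-256 digests for comparing incoming data with stored data, and simple reference counts as the metadata indicating when a portion is no longer in use. The class name DedupStore, its methods, and the chunk_size parameter are hypothetical names introduced solely for this example.

```python
import hashlib


class DedupStore:
    """Toy in-memory deduplicating store (illustrative assumption only)."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}     # digest -> chunk bytes, each stored once
        self.refcounts = {}  # digest -> number of references from files
        self.files = {}      # file name -> ordered list of chunk digests

    def write(self, name, data):
        if name in self.files:
            self.delete(name)  # overwrite: release the old references first
        digests = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if no matching chunk is already stored.
            if digest not in self.chunks:
                self.chunks[digest] = chunk
            self.refcounts[digest] = self.refcounts.get(digest, 0) + 1
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        # Reassemble the file from its ordered chunk references.
        return b"".join(self.chunks[d] for d in self.files[name])

    def delete(self, name):
        for digest in self.files.pop(name):
            self.refcounts[digest] -= 1
            # A chunk no longer referenced by any file can be reclaimed.
            if self.refcounts[digest] == 0:
                del self.refcounts[digest]
                del self.chunks[digest]
```

In this sketch, writing two files that share leading data stores the shared portions once, and deleting one of the files reclaims only those portions that the remaining file does not reference. A practical system may use variable-size portions and persistent metadata rather than the in-memory dictionaries assumed here.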
The CPU and I/O requirements for supporting an extremely large data deduplicating storage are significant, and are difficult to satisfy through vertical scaling of a single machine. Additionally, data deduplicating storage must be reliable, robust, and easily expandable to support increased data traffic and storage capacity. However, prior data deduplicating storage systems have been difficult to scale, unreliable, and vulnerable to data loss in the event of system crashes and restarts.