1. Technical Field
This application relates to recovering in deduplication systems.
2. Description of Related Art
Computer systems may include different resources used by one or more host processors. Resources and host processors in a computer system may be interconnected by one or more communication connections. These resources may include, for example, data storage devices such as those included in the data storage systems manufactured by EMC Corporation. These data storage systems may be coupled to one or more servers or host processors and provide storage services to each host processor. Multiple data storage systems from one or more different vendors may be connected and may provide common data storage for one or more host processors in a computer system.
A host processor may perform a variety of data processing tasks and operations using the data storage system. For example, a host processor may perform basic system I/O operations in connection with data requests, such as data read and write operations.
Host processor systems may store and retrieve data using a storage device containing a plurality of host interface units, disk drives, and disk interface units. The host systems access the storage device through a plurality of channels provided therewith. Host systems provide data and access control information through the channels to the storage device and the storage device provides data to the host systems also through the channels. The host systems do not address the disk drives of the storage device directly, but rather, access what appears to the host systems as a plurality of logical disk units. The logical disk units may or may not correspond to the actual disk drives. Allowing multiple host systems to access the single storage device unit allows the host systems to share data in the device. In order to facilitate sharing of the data on the device, additional software on the data storage systems may also be used.
In data storage systems where high-availability is a necessity, system administrators are constantly faced with the challenges of preserving data integrity and ensuring availability of critical system components. One critical system component in any computer processing system is its file system. File systems include software programs and data structures that define the use of underlying data storage devices. File systems are responsible for organizing disk storage into files and directories and keeping track of which part of disk storage belong to which file and which are not being used.
The accuracy and consistency of a file system is necessary to relate applications and data used by those applications. However, there may exist the potential for data corruption in any computer system and therefore measures are taken to periodically ensure that the file system is consistent and accurate. In a data storage system, hundreds of files may be created, modified, and deleted on a regular basis. Each time a file is modified, the data storage system performs a series of file system updates. These updates, when written to a disk storage reliably, yield a consistent file system. However, a file system can develop inconsistencies in several ways. Problems may result from an unclean shutdown, if a system is shut down improperly, or when a mounted file system is taken offline improperly. Inconsistencies can also result from defective hardware or hardware failures. Additionally, inconsistencies can also result from software errors or user errors.
Additionally, the need for high performance, high capacity information technology systems is driven by several factors. In many industries, critical information technology applications require outstanding levels of service. At the same time, the world is experiencing an information explosion as more and more users demand timely access to a huge and steadily growing mass of data including high quality multimedia content. The users also demand that information technology solutions protect data and perform under harsh conditions with minimal data loss and minimum data unavailability. Computing systems of all types are not only accommodating more data but are also becoming more and more interconnected, raising the amounts of data exchanged at a geometric rate.
To address this demand, modern data storage systems (“storage systems”) are put to a variety of commercial uses. For example, they are coupled with host systems to store data for purposes of product development, and large storage systems are used by financial institutions to store critical data in large databases. For many uses to which such storage systems are put, it is highly important that they be highly reliable and highly efficient so that critical data is not lost or unavailable.
A file system checking (FSCK) utility provides a mechanism to help detect and fix inconsistencies in a file system. The FSCK utility verifies the integrity of the file system and optionally repairs the file system. In general, the primary function of the FSCK utility is to help maintain the integrity of the file system. The FSCK utility verifies the metadata of a file system, recovers inconsistent metadata to a consistent state and thus restores the integrity of the file system. To verify the metadata of a file system, the FSCK utility traverses the metadata of the file system and gathers information, such as status and bitmaps for the traversed metadata. The FSCK utility stores the gathered information in a memory of the data storage system. The FSCK utility then validates the correctness of the metadata using the information stored in the memory.
File systems typically include metadata describing attributes of a file system and data from a user of the file system. A file system contains a range of file system blocks that store metadata and data. A user of a filesystem access the filesystem using a logical address (a relative offset in a file) and the file system converts the logical address to a physical address of a disk storage that stores the file system. Further, a user of a data storage system creates one or more files in a file system. Every file includes an index node (also referred to simply as “inode”) that contains the metadata (such as permissions, ownerships, timestamps) about that file. The contents of a file are stored in a collection of data blocks. An inode of a file defines an address map that converts a logical address of the file to a physical address of the file. Further, in order to create the address map, the inode includes direct data block pointers and indirect block pointers. A data block pointer points to a data block of a file system that contains user data. An indirect block pointer points to an indirect block that contains an array of block pointers (to either other indirect blocks or to data blocks). There may be as many as five levels of indirect blocks arranged in an hierarchy depending upon the size of a file where each level of indirect blocks includes pointers to indirect blocks at the next lower level. An indirect block at the lowest level of the hierarchy is known as a leaf indirect block.
A file may be replicated by using a snapshot copy facility that creates one or more replicas (also referred to as “snapshot copies”) of the file. A replica of a file is a point-in-time copy of the file. Further, each replica of a file is represented by a version file that includes an inheritance mechanism enabling metadata (e.g., indirect blocks) and data (e.g., direct data blocks) of the file to be shared across one or more versions of the file. Snapshot copies are in widespread use for on-line data backup. If a file becomes corrupted, the file is restored with its most recent snapshot copy that has not been corrupted.
A file system based snapshot copy facility is described in Bixby et al. U.S. Patent Application Publication 2005/0065986 published Mar. 24, 2005, incorporated herein by reference. When a snapshot copy of a file is initially created, it includes only a copy of the file. Therefore the snapshot copy initially shares all of the data blocks as well as any indirect blocks of the file. When the file is modified, new blocks are allocated and linked to the file to save the new data, and the original data blocks are retained and linked to the inode of the snapshot copy. The result is that disk space is saved by only saving the difference between two consecutive versions of the file.
Deduplication is a space-saving technology intended to eliminate redundant (duplicate) data (such as, files) on a data storage system. By saving only one instance of a file, disk space can be significantly reduced. For example, if a file of size 10 megabytes (MB) is stored in ten folders of each employee in an organization that has ten employees. Thus, 100 megabytes (MB) of the disk space is consumed to maintain the same file of size 10 megabytes (MB). Deduplication ensures that only one complete copy is saved to a disk. Subsequent copies of the file are only saved as references that point to the saved copy, such that end-users still see their own files in their respective folders. Similarly, a storage system may retain 200 e-mails, each with an attachment of size 1 megabyte (MB). With deduplication, the disk space needed to store each attachment of size 1 megabyte (MB) is reduced to just 1 megabyte (MB) from 200 megabyte (MB) because deduplication only stores one copy of the attachment.
Data deduplication can operate at a file or a block level. File deduplication eliminates duplicate files (as in the example above), but block deduplication processes blocks within a file and saves unique copy of each block. For example, if only a few bytes of a document or presentation or a file are changed, only the changed blocks are saved. The changes made to few bytes of the document or the presentation or the file do not constitute an entirely new file.
The sharing of file system data blocks conserves data storage for storing files in a data storage system. The snapshot copy facility is a space saving technology that enables sharing of file system data blocks among versions of a file. On the other hand, a deduplication facility enables the sharing of file system data blocks within a file, among versions of a file, between versions of a file and unrelated files, and among unrelated files. Therefore, the deduplication facility eliminates from the data storage system any file system data blocks containing duplicative data content.
While deduplication systems have helped make data management much easier, they also come with a number of challenges, especially when recovering data. It may be difficult or impossible for the FSCK utility to recover a deduplicated file system that may further be replicated by the snapshot facility.