Various types of storage servers are used in modern computing systems. One type of storage server is a file server. A file server is a storage server which operates on behalf of one or more clients to store and manage shared files in a set of mass storage devices, such as magnetic or optical disks. The mass storage devices are typically organized as one or more Redundant Array of Independent (or Inexpensive) Disks (RAID) groups. One configuration in which file servers can be used is a network attached storage (NAS) configuration. In a NAS configuration, a file server can be implemented in the form of an appliance, called a filer, that attaches to a network, such as a local area network (LAN) or a corporate intranet. An example of such an appliance is any of the NetApp Filer products made by Network Appliance, Inc. in Sunnyvale, Calif.
A file server can be used for a variety of purposes, such as to back up critical data. One particular type of data backup technique is known as “mirroring”, which involves backing up data stored at a primary site by storing an exact duplicate (an image) of the data at a remote secondary site. The goal of mirroring is that if data is ever lost at the primary site, it can be recovered from the mirror copy at the secondary site. In a simple mirroring configuration, a source file server located at the primary site may be coupled locally to a first set of mass storage devices (e.g., disks), to a set of clients through a LAN, and to a destination file server located at the remote secondary site through a wide area network (WAN) or metropolitan area network (MAN). The destination file server is coupled locally to a second set of mass storage devices (e.g., disks) at the secondary site.
In operation, the source file server receives and services various read and write requests from its clients. Write requests are generally buffered for some period of time that depends on the available system resources, network bandwidth and desired system performance. From time to time, during an event called a “consistency point”, the source file server stores new or modified data in its local mass storage devices based on the buffered write requests. Also, from time to time, new or modified data is sent from the source file server to the destination file server, so that the data stored at the secondary site can be updated to mirror the data at the primary site (i.e., to be a consistent image of the data at the primary site).
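The sequence described above (buffer writes, commit them at a consistency point, then update the mirror) can be illustrated with a minimal Python sketch. This is a toy model under stated assumptions, not the implementation of any particular file server: the class name SourceFileServer, its methods, and the use of dictionaries to stand in for mass storage devices are all hypothetical and chosen only for illustration.

```python
class SourceFileServer:
    """Toy model: buffers writes, commits them at a consistency point,
    and later sends changed blocks to a destination (mirror) store.
    All names here are hypothetical."""

    def __init__(self):
        self.buffered = {}       # block number -> data awaiting the next consistency point
        self.disk = {}           # stands in for the local mass storage devices
        self.unmirrored = set()  # blocks committed locally but not yet sent to the mirror

    def write(self, block_no, data):
        # A write request is only buffered; nothing reaches disk yet.
        self.buffered[block_no] = data

    def read(self, block_no):
        # Reads must see buffered data first, otherwise clients would read stale blocks.
        if block_no in self.buffered:
            return self.buffered[block_no]
        return self.disk.get(block_no)

    def consistency_point(self):
        # Commit all buffered writes to local storage and remember what changed.
        self.disk.update(self.buffered)
        self.unmirrored.update(self.buffered)
        self.buffered.clear()

    def mirror_update(self, destination_disk):
        # Send new or modified blocks so the destination matches the source.
        for block_no in sorted(self.unmirrored):
            destination_disk[block_no] = self.disk[block_no]
        self.unmirrored.clear()
```

In this sketch the destination lags the source between mirror updates, which mirrors the "from time to time" behavior described in the text: after consistency_point() the local store is current, but the destination matches it only after mirror_update() runs.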
In a data storage system such as a file server, it is desirable to reduce the cost of storing data. One way of achieving this is to reduce the amount of data that needs to be stored, such as by using compression. In the known prior art, certain data backup systems have used explicit compression and decompression techniques at the application level (i.e., in the client) to accomplish this. However, that approach requires special software to be built into the client applications. Other backup systems, such as those based on tape drives and disk controllers, have used built-in hardware compression to achieve similar goals, but not at the file system level. Incorporating a hardware based disk controller of this kind would require another layer of software to maintain a separate disk block mapping and is therefore undesirable for many purposes. Moreover, the failure of such a controller card or of its software would render the data inaccessible, making it a potential point of failure.
File system based compression avoids this kind of failure point. At least one known file system based approach attempts to find duplicate blocks of data by utilizing a unique cryptographic hash signature of the data. Such approaches tend to offer good compression ratios in the presence of a large number of duplicated files (e.g., multiple independent versions of the same or nearly the same file) but have experienced severe performance problems to date.
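The general idea of finding duplicate blocks by hash signature can be sketched as follows. This is only an illustration under assumptions not stated in the text: a fixed 4096-byte block size, SHA-256 as the hash function, and an in-memory dictionary as the block store. The function names dedup_store and dedup_read are hypothetical. The sketch also treats equal digests as equal blocks; a real system might additionally compare the bytes.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen for illustration


def dedup_store(data, store):
    """Split data into fixed-size blocks and keep each distinct block once.

    store maps a SHA-256 digest to the block contents.
    Returns the list of digests needed to rebuild data.
    """
    refs = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        # Only the first block with a given digest is stored.
        store.setdefault(digest, block)
        refs.append(digest)
    return refs


def dedup_read(refs, store):
    """Rebuild the original data from its list of block digests."""
    return b"".join(store[digest] for digest in refs)
```

Two files that share a block then cost only one stored copy of it, which is why such schemes do well on near-identical files. The price is a hash computation and a lookup for every block written, which hints at where performance costs can arise.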
Another problem with file system based approaches has been that compressing or decompressing extremely large data sets, such as databases, tends to require extremely large amounts of processing resources, such as CPU time and memory, especially in the presence of random input/output (I/O) workloads. For example, if a 100 Gbyte database had been compressed as a single unit, decompressing only its last 4 kbytes of data would require reading and processing the full 100 Gbytes.
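The cost of whole-file compression under random access can be demonstrated on a small scale with Python's standard zlib module. The sketch below is illustrative only. The function names and the 4096-byte chunk size are assumptions, and the chunked variant is included purely as a contrast to make the cost visible, not as a technique described in the text above.

```python
import zlib

CHUNK = 4096  # assumed chunk size for the contrast case


def compress_whole(data):
    """Compress the entire data set as one stream."""
    return zlib.compress(data)


def read_tail_whole(blob, n):
    """Return the last n bytes and the number of bytes that had to be inflated.

    A single compressed stream has no entry point near the end, so the
    whole stream is inflated just to reach the tail.
    """
    full = zlib.decompress(blob)
    return full[-n:], len(full)


def compress_chunked(data):
    """Contrast case: compress each fixed-size chunk independently."""
    return [zlib.compress(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]


def read_tail_chunked(chunks):
    """Return the last chunk and the number of bytes inflated to get it."""
    tail = zlib.decompress(chunks[-1])
    return tail, len(tail)
```

With whole-file compression, the bytes inflated grow with the size of the data set rather than with the size of the request. Scaled up to the example in the text, a 4 kbyte read from a 100 Gbyte file would process tens of millions of times more data than was requested.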