In a computer system, files are used for many purposes, such as organizing information, storing data, or containing applications or lists of commands. The term “file” as used herein refers broadly to any logical entity that can be accessed, used, or manipulated as a container by entities such as system users, applications, and other resources. A file can be associated with several properties, including, but not limited to, a file name, a file descriptor, and a set of blocks that contain the contents or data of the file. These are properties of the file, however, and not the file itself. Put another way, the properties are manifestations of the file, while the file itself is the logical entity that is being manipulated.
When a file is copied on a computer system, a duplicate of the file is created. The duplicate typically has a different file name, but initially it has the same contents as the original. The contents of the duplicate file are stored in previously unused space in the computer system. For example, if a 1-megabyte file on a computer hard drive is copied to a new file, the new file will occupy an additional 1 megabyte of storage space on the hard drive.
Replicating large files can result in an inefficient use of system resources. For example, when a copy of a file is later modified, only a small portion of the contents of the copy may differ from the original. However, because both the original and the copy occupy their own space on the system, much of the space occupied by the copy holds data that needlessly duplicates the original.
For example, consider a large word processing file. The author of the document may want to save different versions as it is being written or edited, but most of the contents of the file may remain exactly the same. As new versions are created and modified, only the data blocks for each version that are associated with the modified content will be changed, leaving unmodified the remainder of the data blocks for the file. As a result, most of the data storage blocks associated with the different versions of the file are exactly the same, yet for each separate version of the file, a separate copy of each of those unchanged data blocks will exist. As the size of the file increases and/or the number of copies increases, the number of duplicated data blocks increases, resulting in an inefficient use of the system's storage capacity.
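The duplication described above can be made concrete with a small sketch. It is illustrative only: the 4096-byte block size, the ten-block file, and the helper names `split_blocks` and `count_identical_blocks` are assumptions chosen for the example, not details taken from any particular file system.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes (hypothetical)


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split file contents into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def count_identical_blocks(original: bytes, version: bytes,
                           block_size: int = BLOCK_SIZE) -> int:
    """Count block positions whose contents are the same in both files."""
    a = split_blocks(original, block_size)
    b = split_blocks(version, block_size)
    return sum(1 for x, y in zip(a, b) if x == y)


# A ten-block "document" and a new version with one small edit.
original = bytes(10 * BLOCK_SIZE)
edited = bytearray(original)
edited[3 * BLOCK_SIZE] = 1          # the edit falls inside block index 3
version = bytes(edited)

total = len(split_blocks(version))
identical = count_identical_blocks(original, version)
print(f"{identical} of {total} blocks in the new version duplicate the original")
```

With a conventional copy, every one of the identical blocks counted here is stored twice.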
Note that it is important to distinguish copying a file from another form of file manipulation called linking. A link can be created between two file names such that both names refer to the same file. For example, in the Unix operating system, the link command can be used to associate a new file name with an existing file name and the contents of that existing file. The result is that there is still only one set of data blocks (or content), but now the file can be referred to by both the original and the new file name. If the content of the file is changed, then that change is reflected in the file regardless of which linked file name is used to refer to the file. Thus, linking is different from copying in that copying creates multiple, independent files, whereas linking yields a single file that has multiple names.
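The difference between linking and copying can be observed with a short sketch. It assumes a platform and file system on which hard links are supported; the file names and the helper `demo_link_vs_copy` are hypothetical and exist only for this illustration.

```python
import os
import shutil
import tempfile


def read_text(path: str) -> str:
    with open(path) as f:
        return f.read()


def demo_link_vs_copy() -> dict:
    """Create a file, a hard link to it, and a copy; then change the original."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original.txt")
        linked = os.path.join(d, "linked.txt")
        copied = os.path.join(d, "copied.txt")

        with open(original, "w") as f:
            f.write("version 1")

        os.link(original, linked)            # second name for the same file
        shutil.copyfile(original, copied)    # independent file with its own data

        # Rewrite the contents in place through the original name.
        with open(original, "w") as f:
            f.write("version 2")

        return {
            "linked": read_text(linked),
            "copied": read_text(copied),
            "link_is_same_file": os.path.samefile(original, linked),
            "copy_is_same_file": os.path.samefile(original, copied),
        }


result = demo_link_vs_copy()
print(result)
```

The linked name sees the change because it refers to the same file, while the copy keeps the contents it had when it was created.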
One approach for creating copies of data without duplicating the information that remains the same between the original data and a copy of that data is the “copy-on-write” (C-O-W) technique. The basic idea of copy-on-write is that an original and a copy share the portions of the data that remain the same between the original and the copy. As data is changed in either the original or the copy, new data portions are created to reflect the changes, and such data portions are now specific to the original or the copy. However, data portions that remain the same between the original and the copy continue to be shared.
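A minimal sketch of this idea follows. It is one possible way to model copy-on-write, not a description of any specific system: the classes `BlockStore` and `CowData`, and the use of reference counts to detect sharing, are assumptions introduced for illustration.

```python
class BlockStore:
    """Holds data portions and counts how many owners refer to each one."""

    def __init__(self) -> None:
        self.blocks: dict[int, bytes] = {}
        self.refcount: dict[int, int] = {}
        self._next_id = 0

    def allocate(self, data: bytes) -> int:
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.refcount[block_id] = 1
        return block_id


class CowData:
    """A sequence of data portions that may be shared with other copies."""

    def __init__(self, store: BlockStore, block_ids: list[int]) -> None:
        self.store = store
        self.block_ids = block_ids

    @classmethod
    def from_chunks(cls, store: BlockStore, chunks: list[bytes]) -> "CowData":
        return cls(store, [store.allocate(c) for c in chunks])

    def copy(self) -> "CowData":
        # Copying shares every portion; no data is duplicated yet.
        for block_id in self.block_ids:
            self.store.refcount[block_id] += 1
        return CowData(self.store, list(self.block_ids))

    def write(self, index: int, data: bytes) -> None:
        old = self.block_ids[index]
        if self.store.refcount[old] > 1:
            # Shared portion: give the writer a new, private portion.
            self.store.refcount[old] -= 1
            self.block_ids[index] = self.store.allocate(data)
        else:
            # Not shared: safe to change in place.
            self.store.blocks[old] = data

    def read(self, index: int) -> bytes:
        return self.store.blocks[self.block_ids[index]]


store = BlockStore()
original = CowData.from_chunks(store, [b"a", b"b", b"c"])
duplicate = original.copy()
blocks_after_copy = len(store.blocks)      # still 3: everything is shared
duplicate.write(2, b"C")
blocks_after_write = len(store.blocks)     # 4: one new portion for the change
```

Because the write path is the same for both objects, a change made through the original would likewise produce a new portion for the original while the copy keeps the shared one.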
For example, some versions of the Unix operating system, such as Solaris by Sun Microsystems and Mach by Carnegie Mellon University, utilize copy-on-write memory. With this approach, two processes can share memory blocks in the computer system's memory until one process writes to a particular memory block. At that point, the process that writes to the particular memory block gets its own private copy of that memory block, and the original memory block is no longer shared between the two processes.
FIGS. 1A, 1B, and 1C provide a simple illustration of the sharing of memory blocks between two processes. The system illustrated in FIGS. 1A and 1B has a memory 100 that comprises a plurality of memory blocks that store data or information. For purposes of explanation, only memory blocks 110, 120, 130, 140, 150, and 160 are shown. In FIG. 1A, memory blocks 110, 120, and 130 are associated with a process 102. The same memory blocks 110, 120, and 130 are also associated with a process 104, which initially is using the same information as process 102.
If process 104 then makes a change to some of the information that is stored in memory block 130, the information in memory block 130 is copied to an unused memory block, such as memory block 140. Then memory block 140 is modified to reflect the change in the information.
FIG. 1B shows the result of this change. Process 104 is now associated with memory blocks 110, 120, and 140, but process 104 is no longer associated with memory block 130. Meanwhile, process 102 remains associated with memory blocks 110, 120, and 130. Thus, in FIG. 1B, memory blocks 110 and 120 are shared by processes 102 and 104, since both those processes are using the same information stored in those memory blocks. However, because the information in memory block 130 that was originally shared by processes 102 and 104 is now different for the two processes, process 102 remains associated with memory block 130 while process 104 is now associated with memory block 140.
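The transition from FIG. 1A to FIG. 1B can be traced with a small model. The dictionary keys reuse the reference numerals from the figures; the placeholder contents ("A", "B", "C") and the helper `cow_write` are assumptions made for this sketch, and an unused block is represented here by the value `None`.

```python
# State of FIG. 1A: both processes are associated with blocks 110, 120, 130.
memory = {110: "A", 120: "B", 130: "C", 140: None, 150: None, 160: None}
process_102 = [110, 120, 130]
process_104 = [110, 120, 130]


def cow_write(memory: dict, writer: list, others: list, position: int,
              new_value: str) -> None:
    """Write through `writer`, copying the block first if another process shares it."""
    block = writer[position]
    if any(block in other for other in others):
        # Copy the shared block to the first unused block, then re-associate
        # the writer with that private copy.
        free = next(b for b in sorted(memory) if memory[b] is None)
        memory[free] = memory[block]
        writer[position] = free
        block = free
    memory[block] = new_value


# Process 104 changes the information that was stored in block 130.
cow_write(memory, process_104, [process_102], position=2, new_value="C'")

shared = sorted(set(process_102) & set(process_104))
print(process_102, process_104, shared)
```

After the write, the model matches FIG. 1B: process 102 still refers to block 130 with its original contents, process 104 refers to block 140 with the changed contents, and blocks 110 and 120 remain shared.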
FIG. 1C shows what would happen if no sharing of the memory blocks by the processes were allowed. In this case, process 102 is associated with memory blocks 110, 120, and 130 while process 104 is associated with memory blocks 140, 150, and 160, which initially hold the same information as memory blocks 110, 120, and 130, respectively. After process 104 changes the information in memory block 160, the contents of memory block 130 and memory block 160 are different. The contents of memory blocks 110 and 140 remain the same, and similarly the contents of memory blocks 120 and 150 remain the same. Thus, if memory blocks are not shared, the system stores exact duplicates of the contents of memory blocks 110 and 120 in memory blocks 140 and 150, respectively, which is an inefficient use of the system's memory capacity.
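The cost of the two arrangements can be compared by counting the distinct memory blocks each one uses. The sets below simply restate the block associations given for FIGS. 1B and 1C; the variable names are hypothetical.

```python
# FIG. 1B: unchanged blocks are shared between the two processes.
shared_102 = {110, 120, 130}
shared_104 = {110, 120, 140}

# FIG. 1C: no sharing; each process has its own three blocks.
unshared_102 = {110, 120, 130}
unshared_104 = {140, 150, 160}

blocks_with_sharing = len(shared_102 | shared_104)
blocks_without_sharing = len(unshared_102 | unshared_104)
blocks_saved = blocks_without_sharing - blocks_with_sharing

print(blocks_with_sharing, blocks_without_sharing, blocks_saved)
```

In this small example, sharing uses four blocks instead of six; the two blocks saved correspond to the duplicated contents of memory blocks 110 and 120.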
Another implementation of copy-on-write can be found in some file systems that use “snapshots” to provide a backup feature that allows users to retrieve older versions of a file. For example, Network Appliance offers a file system called “write anywhere file layout” (WAFL), and the Veritas file system (VxFS) contains a similar feature. With this type of backup feature, a snapshot is taken of the entire file system at a given point in time, effectively freezing the state of the files at that moment. After the snapshot is taken, if any changes are made to the files on the file system, then new data blocks are created and modified to reflect the changes to the contents of each of the changed files. This means that as files are changed following the snapshot, new data blocks are used to reflect changes in the contents of the files, but unchanged data blocks continue to be shared between the snapshot and the current working versions of the files.
With this backup approach, any current versions (or working versions) of the files being used following the snapshot are just newer versions, not copies, of the original files that were frozen at the time of the snapshot. In other words, the current version is not separate from the original frozen version. Instead, the current version reflects changes to the original version since the point at which it was frozen by taking the snapshot.
This backup feature allows the user to retrieve an earlier version of a file as that file existed at the time of the snapshot. For example, if a user deletes a file or if the user changes a file and later wants to return to an earlier version, the user can retrieve the version of that file at the time of the snapshot by accessing this backup feature of the file system. While this type of backup feature can be helpful in minimizing storage required for backups, it is limited in that it only applies to backups and it is only implemented for the entire file system.
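The snapshot behavior described above can be sketched as a toy model. This is not how WAFL or VxFS is implemented; the class `SnapshotFileSystem`, its methods, and the single-snapshot design are assumptions made purely for illustration. The model deliberately snapshots the whole file system at once, which reflects the limitation noted above.

```python
class SnapshotFileSystem:
    """Toy file system: blocks are never overwritten, only newly allocated."""

    def __init__(self) -> None:
        self.blocks: dict[int, bytes] = {}      # block id -> contents
        self.live: dict[str, list[int]] = {}    # current working versions
        self.snapshot: dict[str, list[int]] = {}
        self._next_id = 0

    def _new_block(self, data: bytes) -> int:
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        return block_id

    def create(self, name: str, chunks: list[bytes]) -> None:
        self.live[name] = [self._new_block(c) for c in chunks]

    def take_snapshot(self) -> None:
        # Freeze every file by recording its block list; no data is copied.
        self.snapshot = {name: list(ids) for name, ids in self.live.items()}

    def write(self, name: str, index: int, data: bytes) -> None:
        # Changes go to a new block, so frozen blocks stay intact.
        self.live[name][index] = self._new_block(data)

    def delete(self, name: str) -> None:
        del self.live[name]

    def read(self, name: str, from_snapshot: bool = False) -> bytes:
        table = self.snapshot if from_snapshot else self.live
        return b"".join(self.blocks[i] for i in table[name])


fs = SnapshotFileSystem()
fs.create("doc", [b"a", b"b"])
fs.take_snapshot()
fs.write("doc", 1, b"B")

current = fs.read("doc")
frozen = fs.read("doc", from_snapshot=True)
first_block_shared = fs.live["doc"][0] == fs.snapshot["doc"][0]

fs.delete("doc")
recovered = fs.read("doc", from_snapshot=True)
```

The unchanged first block is shared between the snapshot and the working version, and the earlier version remains retrievable even after the working file is deleted.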
Based on the foregoing, there exists a need for a mechanism for replicating an individual file or group of selected files on a computer system that minimizes the storage space required when there are portions of the original file and the copy that remain the same.