The invention relates generally to computer systems and data storage, and more particularly to a more efficient way to store files of a file system.
The contents of a file of a file system may be identical to the contents stored in one or more other files. While some file duplication tends to occur even on an individual user's personal computer, duplication is particularly prevalent on networks set up with a server that centrally stores the contents of multiple personal computers. For example, with a remote boot facility on a computer network, each user boots from that user's private directory on a file server. Each private directory thus ordinarily includes a number of files that are identical to files in other users' directories. As can be readily appreciated, storing the private directories on traditional file systems consumes a great deal of disk and server file buffer cache space.
Techniques that have been used to reduce the amount of used storage space include linked-file or shared memory techniques, essentially storing the data only once. However, when these techniques are used in a file system, the files are not treated as logically separate files by the linked-file or shared memory techniques. For example, if one user makes a change to a linked-file, or if the contents of the shared memory change, every other user linked to that file or using the shared memory sees the change. This is a significant drawback in a dynamic environment where files do change, even if not very frequently. For example, in many enterprises, different users need to maintain different versions of files at different times, including traditionally read-only files such as applications. As a result, linked-file or shared memory techniques would work well for files that are strictly read-only, but these techniques fail to provide the flexibility needed in a dynamic environment.
One general concept that has been employed in virtual memory and database messaging systems to reduce the amount of duplicated data stored is known as a "Copy-on-Write" technique. With the Copy-on-Write technique, at the time of a requested copy, a link between a source and destination is established, but the copy is not made. Instead, the actual copying of the data is postponed and takes place only if and when either the source or destination is modified. For example, in virtual memory Copy-on-Write, processes send messages to one another with copy semantics, using the virtual memory system to map the same memory into the address space of both processes. If a process subsequently writes into the memory, a protection fault occurs, and the system makes a copy of the page in question and maps the newly copied page into the address space of the faulting process. Copy-on-Write is useful when modification is expected to be a relatively rare occurrence, wherein the extra cost associated with detecting and carrying out the delayed copy operation is outweighed by the savings achieved by not having to make a copy most of the time. For example, in database Copy-on-Write, only one copy of a mail message is maintained for a message sent to multiple users. A copy is made only if one of the recipients modifies the original mail message, a relatively rare occurrence.
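By way of illustration only, the general Copy-on-Write concept described above may be sketched as follows. This is a minimal in-memory sketch, not the mechanism of any particular virtual memory or messaging system; the names `CowBuffer`, `_Shared`, and `shares_storage_with` are hypothetical and are introduced solely for this example.

```python
class _Shared:
    """Hypothetical holder for data that several handles may point at."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1  # number of handles currently linked to this data


class CowBuffer:
    """Hypothetical handle with copy semantics and delayed copying."""

    def __init__(self, data=b"", _shared=None):
        self._shared = _shared if _shared is not None else _Shared(data)

    def copy(self):
        # The requested "copy" only establishes a link; no bytes are duplicated.
        self._shared.refs += 1
        return CowBuffer(_shared=self._shared)

    def read(self):
        return bytes(self._shared.data)

    def write(self, offset, payload):
        # On the first write while the data is still shared, carry out the
        # postponed copy so that the other handles do not see the change.
        if self._shared.refs > 1:
            self._shared.refs -= 1
            self._shared = _Shared(self._shared.data)
        self._shared.data[offset:offset + len(payload)] = payload

    def shares_storage_with(self, other):
        return self._shared is other._shared


source = CowBuffer(b"hello world")
destination = source.copy()      # link established, nothing copied yet
destination.write(0, b"J")       # modification triggers the actual copy
```

After the write, `source` still reads as the original bytes while `destination` holds its own modified copy, which is the logical separation that the linked-file and shared memory techniques lack.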
Unlike the linked-file or shared memory techniques, the Copy-on-Write concept thus preserves the logical separation of modified data. While this works well for virtual memory and database messaging systems, file systems have a number of complexities that are not addressed by these prior Copy-on-Write techniques. For example, unlike raw data, files are renamed, deleted, and may be opened and closed by multiple users at different times. Many files also have security issues that need to be addressed. Moreover, unlike mail messages, there are many different types of files in a file system, and not all of them are good candidates for copy-on-write; frequently changed application data files, for example, are not. Detecting and carrying out delayed copying for those types of files in a file system would be costly and wasteful. In short, there has heretofore not been any way in which to represent duplicate file system data as a single instance thereof, while maintaining a logical distinction between the user files corresponding to that single instance of data, so that the semantics of private files are preserved.
Briefly, the present invention provides a system and method for storing the data of files having duplicate data by maintaining a single instance of the data, and providing logically separate links to the single instance of the data representing each file in place of the file. The present invention thus manages a single copy of identical files, while maintaining the semantics of having separate normal files.
To accomplish the single instance store (SIS) of the present invention, a normal source file is converted to a link (file) to a common store file, such as by direct user request. A common store file is a file owned by SIS (rather than by a user) that is used to contain the data from the files represented by SIS links. Logically separate links to the same common store file may be created for files having duplicate content. The SIS facility may reside above the Windows® 2000/NT® file system (NTFS) as a filter driver, while the link file may be a sparse file including a reparse point identified by a SIS tag. SIS operates transparently, in that each file system request directed to the link file (e.g., open, write, read, close, delete, and so on) ultimately reaches the SIS filter, which then handles the request as if the link file were a normal file. For example, a file open request opens both the link file and the common store file (if not already open via another link). Writes to a SIS link file are written to the link file, and the SIS filter records the written portion of the file as dirty, thus preserving the logical separation of link files from one another. Read requests are intercepted by the SIS filter, which then reads clean portions from the common store file and any dirty portions from the link file. When the file is closed, the link file and the common store file are closed (unless the common store file is open via another link file). In the event that the link file has been written, the sparse portions of the link file are filled in with clean data from the common store file, and the link file is reconverted to a normal file.
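The read, write, and close handling described above may be illustrated by the following simplified in-memory sketch. It is not the filter driver itself and involves no NTFS sparse files or reparse points; the class names `CommonStoreFile` and `LinkFile`, and the per-byte `dirty` map used in place of dirty ranges, are assumptions made only for this example.

```python
class CommonStoreFile:
    """Hypothetical stand-in for a common store file owned by SIS."""

    def __init__(self, data):
        self.data = bytes(data)  # the single instance of the data


class LinkFile:
    """Hypothetical stand-in for a SIS link; holds only written bytes."""

    def __init__(self, store):
        self.store = store
        self.size = len(store.data)
        self.dirty = {}  # byte offset -> value written through this link

    def write(self, offset, payload):
        # Writes go to the link, never to the common store file, and the
        # written portion is recorded as dirty.
        for i, value in enumerate(payload):
            self.dirty[offset + i] = value
        self.size = max(self.size, offset + len(payload))

    def read(self, offset, length):
        # Dirty bytes come from the link; clean bytes come from the store.
        end = min(offset + length, self.size)
        out = bytearray()
        for pos in range(offset, end):
            if pos in self.dirty:
                out.append(self.dirty[pos])
            elif pos < len(self.store.data):
                out.append(self.store.data[pos])
            else:
                out.append(0)  # assumption: unwritten gaps past the store read as zero
        return bytes(out)

    def close(self):
        # If the link was written, fill in the clean portions and return the
        # contents of the resulting normal file; otherwise it remains a link.
        if not self.dirty:
            return None
        return self.read(0, self.size)


store = CommonStoreFile(b"shared contents")
link_a = LinkFile(store)
link_b = LinkFile(store)
link_a.write(0, b"SHARED")  # visible through link_a only
```

In this sketch a read through `link_b` continues to return the original contents after `link_a` is written, and closing `link_a` yields a complete, independent file.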
The common store file maintains a backpointer identifying each link file that points to it. When a link file is reconverted to a normal file (i.e., it was written to and then closed), the corresponding backpointer is removed from the common store file. Likewise, when a link file is deleted, its backpointer in the common store file is deleted. When no backpointers remain in a common store file, the common store file is deleted, since it is no longer needed to store the data for any link file.
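The backpointer bookkeeping described above may be sketched as follows. The sketch keeps the backpointers in an in-memory dictionary rather than inside the common store file, and the names `CommonStoreDirectory`, `create_store_file`, `add_link`, and `remove_link` are hypothetical.

```python
class CommonStoreDirectory:
    """Hypothetical tracker of backpointers for each common store file."""

    def __init__(self):
        self.backpointers = {}  # common store file id -> set of link file ids

    def create_store_file(self, store_id):
        self.backpointers[store_id] = set()

    def add_link(self, store_id, link_id):
        self.backpointers[store_id].add(link_id)

    def remove_link(self, store_id, link_id):
        """Drop a backpointer when a link is deleted or reconverted.

        Returns True if the common store file was deleted because no
        backpointers remained.
        """
        links = self.backpointers[store_id]
        links.discard(link_id)
        if not links:
            del self.backpointers[store_id]
            return True
        return False


directory = CommonStoreDirectory()
directory.create_store_file("store-1")
directory.add_link("store-1", "user_a/app.exe")
directory.add_link("store-1", "user_b/app.exe")
first_removed = directory.remove_link("store-1", "user_a/app.exe")   # one link remains
second_removed = directory.remove_link("store-1", "user_b/app.exe")  # last link gone
```

The common store file in this example survives the first removal and is deleted on the second, when its last backpointer is removed.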
The present invention provides security via a special signature to prevent unauthorized access to the common store files. A volume check facility that repairs inconsistencies in SIS metadata is also provided. In the volume check facility, the backpointers of each common store file are checked against the links in the file system, and the backpointers or links repaired as necessary.
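One way in which such a consistency check might proceed is sketched below. The summary above does not specify the repair policy, so the policy here (rebuilding each backpointer set from the links actually found, and merely reporting links whose common store file is missing) is an assumption, as are the function name `check_volume` and its dictionary-based inputs.

```python
def check_volume(links, backpointers):
    """Compare links found on the volume against recorded backpointers.

    links:        link file id -> id of the common store file it points to
    backpointers: common store file id -> set of link ids it records

    Returns (repaired_backpointers, dangling_links). Assumed policy: the
    links found on the volume are treated as authoritative.
    """
    repaired = {store_id: set() for store_id in backpointers}
    dangling = []
    for link_id, store_id in sorted(links.items()):
        if store_id in repaired:
            # Adds a missing backpointer or confirms an existing one.
            repaired[store_id].add(link_id)
        else:
            # The link points at a common store file that does not exist.
            dangling.append(link_id)
    # Backpointers naming links that were not found are dropped implicitly.
    return repaired, dangling


links_on_volume = {"a": "s1", "b": "s1", "c": "s9"}
recorded = {"s1": {"a", "stale"}, "s2": {"old"}}
repaired, dangling = check_volume(links_on_volume, recorded)
```

Here the stale entries are dropped, the missing backpointer for link `b` is added, and link `c` is reported as dangling. A common store file left with an empty backpointer set, such as `s2`, would then be a candidate for deletion.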
Other advantages will become apparent from the following detailed description when taken in conjunction with the drawings, in which: