The invention relates generally to computer systems and data storage, and more particularly to identifying and merging files of a file system having common properties.
The contents of a file of a file system may be identical to the contents stored in one or more other files. While some file duplication tends to occur on even an individual user""s personal computer, duplication is particularly prevalent on networks set up with a server that centrally stores the contents of multiple personal computers. For example, with a remote boot facility on a computer network, each user boots from that user""s private directory on a file server. Each private directory thus ordinarily includes a number of files that are identical to files on other users"" directories. As can be readily appreciated, storing the private directories on traditional file systems consumes a great deal of disk and server file buffer cache space.
Techniques that have been used to reduce the amount of used storage space include linked-file or shared memory techniques, essentially storing the data only once. However, when these techniques are used in a file system, the files are not treated as logically separate files. For example, if one user makes a change to a linked-file, or if the contents of the shared memory change, every other user linked to that file sees the change. This is a significant drawback in a dynamic environment where files do change, even if not very frequently. For example, in many enterprises, different users need to maintain different versions of files at different times, including traditionally read-only files such as applications. As a result, linked-file techniques would work well for files that are strictly read-only, but these techniques fail to provide the flexibility needed in a dynamic environment.
Another problem with these techniques is that identifying identical files becomes a complex task as the number of files on a file system volume increases. For example, a disk drive may store thousands of files, and each time a new file is written to a disk or a file is changed, a potential for file duplication exists. At times a user may know when files are duplicates of one another, and thus can manually request that the file data be shared, however relying on a user to detect such conditions is unpredictable, and for large numbers of files, inefficient and/or impractical. One possible solution is to run a utility at system start-up that scans a file system""s files for duplicates, however this solution becomes unacceptably slow even with only a few thousand documents. Moreover, such a solution would not work well for users who seldom reboot a machine. Indeed, as more and more disk space is consumed, sharing files becomes a more valuable tool for preserving disk space, and thus a real-time solution could reclaim space when most needed. However, scanning even a relatively modest number of files in a file system volume for one or more duplicates, such as each time that a file is closed, consumes a great deal of time and machine resources, and thus is also impractical.
Briefly, the present invention provides a method and system for automatically identifying common files of a file system and merging those files into a single instance of the data, having one or more logically separate links thereto representing the original files. A groveler facility maintains a database of information about the files on a volume of a file system, the information for each file including a file size and checksum (signature) based on the file contents. The groveler includes a component that periodically acts in the background to scan the USN log, a log that dynamically records file system activity, whereby new or modified files detected in the USN log are queued as work items, each work item representing a file. The entire volume also may be scanned to add work items to the queue, which takes place initially when the queue is created, or when there is a potential problem with the USN log.
The groveler includes another component that periodically removes items from the queue, calculates the signature of the corresponding file contents, and uses the signature and file size to query the database for matching files. The groveler component then compares any matching files with the file corresponding to the work item for an exact duplicate, and if found, calls a single instance store facility to merge the files and create independent links to those files.
Other advantages will become apparent from the following detailed description when taken in conjunction with the drawings, in which: