File systems manage files and other data objects stored on computer systems. File systems were originally built into the computer operating system to facilitate access to files stored locally on resident storage media. As personal computers became networked, some file storage capabilities were offloaded from individual user machines to special storage servers that stored large numbers of files on behalf of the user machines. When a file was needed, the user machine simply requested the file from the server. In this server-based architecture, the file system is extended to facilitate management of and access to files stored remotely at the storage server over a network.
One problem that arises in distributed file systems concerns storage of identical files on the servers. While some file duplication normally occurs on an individual user's personal computer, duplication unfortunately tends to be quite prevalent on networks where servers centrally store the contents of multiple personal computers. For example, with a remote boot facility on a computer network, each user boots from that user's private directory on a file server. Each private directory thus ordinarily includes a number of files that are identical to files in other users' directories. Storing the private directories on traditional file systems consumes a great amount of disk and server file buffer cache space. From a storage management perspective, it is desirable to reduce file duplication and thereby the amount of storage space wasted on redundant files. However, any such efforts need to be reconciled with the file system that tracks the multiple duplicated files on behalf of the associated users.
To address the problems associated with storing multiple identical files on a computer, Microsoft developed a single instance store (SIS) system that is packaged as part of the Windows 2000 operating system. The SIS system reduces file duplication by automatically identifying identical files in a file system, and then merging the files into a single instance of the data. One or more logically separate links are then attached to the single instance to represent the original files to the user machines. In this way, the storage impact of duplicate files on a computer system is greatly reduced.
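The single-instance idea described above can be illustrated with a minimal in-memory sketch. This is not the actual SIS implementation. The class name `SingleInstanceStore`, its methods, and the example paths are hypothetical, and the use of a SHA-256 digest to decide that two files are identical is an assumption made for brevity; a real system would need to guard against collisions, for example by comparing file contents.

```python
import hashlib


class SingleInstanceStore:
    """Toy model: one stored copy per distinct content, many links to it."""

    def __init__(self):
        self._instances = {}  # content digest -> file bytes (the single instance)
        self._links = {}      # user-visible path -> content digest

    def put(self, path, data):
        # Assumption: equal SHA-256 digests are treated as identical content.
        digest = hashlib.sha256(data).hexdigest()
        # Keep the bytes only if this content has not been stored before.
        self._instances.setdefault(digest, data)
        # The path becomes a logically separate link to the shared instance.
        # (Instances orphaned by overwriting a path are not reclaimed here.)
        self._links[path] = digest

    def get(self, path):
        return self._instances[self._links[path]]

    def instance_count(self):
        return len(self._instances)

    def link_count(self):
        return len(self._links)


store = SingleInstanceStore()
store.put("/users/alice/boot.sys", b"boot image")
store.put("/users/bob/boot.sys", b"boot image")      # duplicate content
store.put("/users/bob/notes.txt", b"private notes")
```

In this sketch the two `boot.sys` paths remain separately addressable, yet their content is held only once, so three links are backed by two stored instances.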
Today, file storage is migrating toward a model in which files are stored on various networked computers, rather than on a central storage server. However, the problem of duplicate identical files remains, except that the duplicate files are spread out over the various networked computers. Given the large number of computers that can currently be networked together (easily into the thousands or hundreds of thousands), and the large number of files that can exist spread out over this large number of computers (easily into the millions or billions), detecting duplicate files in such an environment can be very difficult. Limitations on the bandwidth available to transfer information among the computers, as well as limitations on the computational capacity of the computers themselves, further complicate such detection.
The invention addresses these problems by allowing potentially identical objects, such as files, to be located across multiple computers.