Programs that track file activity on a file system, such as a program that automatically generates and maintains an index of a file system's documents, are becoming important user tools. This is particularly true as storage mechanisms store an ever-increasing number of files. For example, Microsoft Corporation's Index Server product indexes the contents and properties of documents on one or more file systems, and provides users with a search engine to query the index and quickly locate matching documents. A desirable feature of Index Server is that it runs in the background to automatically keep its internal indexes up-to-date as the user modifies documents. Newly created documents are indexed, modified documents are re-indexed, and deleted documents are removed from Index Server's indexes as the user changes them. Index Server monitors the file system for changes to documents using the Win32 Application Programming Interfaces (APIs) FindFirstChangeNotification and FindNextChangeNotification. These APIs function reasonably well, but they are inefficient, as they provide information on virtually all file activity, including transitory information that is not important to the overall indexing scheme. Moreover, existing mechanisms often provide change notification before the document is available for indexing (e.g., when the file is still locked open by the user).
While Index Server and other such programs have thus become important user tools, under certain circumstances such programs are halted. The user can continue to modify documents even when these background programs are halted, however, resulting in a situation in which the file system activity is not properly tracked. For example, Index Server may crash due to a program error, a condition referred to as an abrupt shutdown. For a dynamic indexing program such as Index Server, this results in out-of-date indexes. Moreover, a feature of Index Server is that during low resource conditions (e.g., when disk space is low), the program will stop accepting document change notifications. The user can continue to make document changes during the low resource condition; however, Index Server's indexes are not kept up-to-date, as the Index Server program will not accept changes until sufficient disk space becomes available. Again, such a situation results in out-of-date indexes.
Once the indexes become out-of-date, Index Server needs to determine the list of documents that were changed since Index Server was halted or since it stopped accepting change information. One possible solution is to simply re-index all files from scratch upon every start-up; however, this solution becomes unacceptably slow even with only a few thousand documents.
A more sophisticated solution is to compare each document in the file system with the information on that file maintained in Index Server's indexes, and synchronize any differences between the two. Properties stored in the index include the document name, the physical path of the document, the write time (i.e., the time that the file was last modified), and the document's ID, which is a unique identifier for every document in the internal index. After start-up, a tree traversal of the indexed volume is performed, which scans every file on the indexed volume. The tree traversal may use a well-known algorithm such as breadth-first or depth-first search. During the tree traversal, a process checks whether documents present in the file system are also present in the index. Documents which are not present in the indexes but are seen in the file system are new documents that were created after the shutdown of Index Server, and thus each such document is added to a list of documents to be indexed. Conversely, documents which are present in the indexes but are not seen in the file system are documents that were deleted after the shutdown of Index Server, and thus each such document is removed from the internal indexes. For documents that exist in the internal indexes and that are seen during the tree traversal, the current write time of the document in the file system is compared with the write time of that document in the internal indexes. Those documents having write times in the file system that are more recent than the write times in the indexes have been modified since the shutdown, and are thus re-indexed.
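The comparison described above can be illustrated with a minimal Python sketch. It is not Index Server's implementation: the `IndexEntry` record, the `reconcile` function, and the use of a dictionary keyed by physical path as the "internal index" are all assumptions made for illustration. The traversal uses the standard `os.walk`, and the file system's modification time stands in for the write time.

```python
import os
from dataclasses import dataclass


@dataclass
class IndexEntry:
    """Hypothetical index record: a document ID and the last indexed write time."""
    doc_id: int
    write_time: float


def reconcile(root, index):
    """Compare the files under `root` with `index` (a dict of path -> IndexEntry).

    Returns three lists of paths: new documents to index, modified documents
    to re-index, and deleted documents to remove from the index.
    """
    to_index, to_reindex = [], []
    seen = set()

    # Tree traversal: visit every file on the indexed volume.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                write_time = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is inaccessible; skip it
            seen.add(path)

            entry = index.get(path)
            if entry is None:
                # Seen in the file system but not in the index: new document.
                to_index.append(path)
            elif write_time > entry.write_time:
                # Written more recently than the index recorded: modified.
                to_reindex.append(path)

    # In the index but never seen during the traversal: deleted document.
    to_remove = [path for path in index if path not in seen]
    return to_index, to_reindex, to_remove
```

Note that even in this sketch every file under the root is visited and examined once, regardless of how few documents actually changed, so the cost grows with the total number of documents rather than with the number of changes.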
While the above-described tree-traversal solution is significantly faster than re-indexing all documents at start-up, the solution is still too slow for large indexes. For example, indexes having hundreds of thousands or even millions of documents may require a processing time on the order of hours to complete this process. As indexes grow larger and larger, the time becomes even more unacceptable.