Existing data processing systems can index the contents of files. These systems typically store indexing data in a data structure so that the data can be accessed and manipulated to answer user-generated queries efficiently. An index contains indexing data, derived from a file, that allows queries about the content of the file to be performed efficiently. For example, an index of a file may contain an alphabetical listing of each major word of the file along with the locations at which that word appears in the file. These major words are known as "keywords" and the locations are known as "occurrences".
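By way of illustration only, the keyword and occurrence structure described above can be sketched as follows. The function name `build_index`, the `STOP_WORDS` set used to separate "major" words from minor ones, and the use of word positions as locations are assumptions made for this sketch and are not drawn from any particular system.

```python
import re

# Hypothetical set of minor words that are not treated as keywords.
STOP_WORDS = {"a", "an", "and", "of", "the"}

def build_index(text):
    """Map each keyword to the word positions (occurrences) where it appears."""
    index = {}
    words = re.findall(r"[a-z]+", text.lower())
    for position, word in enumerate(words):
        if word in STOP_WORDS:
            continue  # minor words still consume a position but are not indexed
        index.setdefault(word, []).append(position)
    # Return the keywords in alphabetical order, as in the listing described above.
    return dict(sorted(index.items()))

index = build_index("The cat saw the other cat")
for keyword, occurrences in index.items():
    print(keyword, occurrences)
```

A query such as "where does the word cat appear" then reduces to a single lookup, `index["cat"]`, rather than a scan of the file.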
As the number of files requiring indexing in a data processing system increases, the system can no longer index the data in each file immediately. This inability to index files immediately creates a backlog of files awaiting indexing. The backlog is often maintained by storing, in a table, a file identifier that uniquely identifies each file in the backlog.
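A minimal sketch of such a backlog table is given below, assuming the table is held in memory and that files are indexed in the order in which they were added. The class name `Backlog` and its methods are hypothetical and serve only to illustrate the idea.

```python
from collections import deque

class Backlog:
    """In-memory table of file identifiers awaiting indexing (illustrative only)."""

    def __init__(self):
        self._pending = deque()  # identifiers in arrival order
        self._known = set()      # guards against listing a file twice

    def add(self, file_id):
        """Record that the file needs indexing, unless it is already listed."""
        if file_id not in self._known:
            self._known.add(file_id)
            self._pending.append(file_id)

    def next_file(self):
        """Remove and return the oldest pending identifier, or None if empty."""
        if not self._pending:
            return None
        file_id = self._pending.popleft()
        self._known.discard(file_id)
        return file_id

    def __len__(self):
        return len(self._pending)

backlog = Backlog()
for file_id in (7, 9, 7):  # file 7 changes twice before it is indexed
    backlog.add(file_id)
```

Because this sketch keeps the table only in memory, its contents do not survive a restart.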
However, a problem arises if the file identifiers listed in the table are lost, for example, due to a power failure, because the files associated with the lost file identifiers are never indexed by the data processing system. In order to track files in a manner that ensures that the files will be indexed, conventional systems store the table on a non-volatile store, such as a disk drive. However, storing the table in a non-volatile store has a negative impact on system performance because of the relatively large amount of time that it takes to write data to disk and to read data from disk each time the table is updated or consulted. Therefore, systems employing this method track files inefficiently.
Other conventional systems track files by storing file identifiers for the files in a table in a volatile store, storing a time stamp in each file, and storing the file on a non-volatile store. The time stamp typically includes the date and time of a file's creation or last revision. When the file identifiers stored in the table are lost, the system restores the file identifiers to the table by comparing the time stamp of the last processed file with the time stamp stored in each file. For each file whose time stamp is later than the time stamp of the last processed file, the system stores the file identifier for the file in the table. However, a problem arises with this method of tracking files because the method is sensitive to uniformity of time. For example, at the end of daylight saving time, when clocks "fall back" one hour, this method can fail to accurately track files after data in the table has been lost, because some files created or revised after the clock change will have an earlier time stamp than files created or revised before the clock change. Therefore, prior systems employing this method fail to accurately track files.
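The time stamp comparison described above, and the way it can miss a file when clocks fall back, can be sketched as follows. The function name `restore_backlog`, the file names, and the dates are hypothetical; the sketch assumes time stamps are recorded as local wall-clock time with no time zone offset.

```python
from datetime import datetime

def restore_backlog(file_stamps, last_processed_stamp):
    """Return the ids of files stamped later than the last processed file."""
    return sorted(
        file_id
        for file_id, stamp in file_stamps.items()
        if stamp > last_processed_stamp
    )

# Hypothetical wall-clock stamps around a change from 02:00 back to 01:00.
last_processed = datetime(2023, 11, 5, 1, 50)  # stamped before the clock change

file_stamps = {
    # Revised before the clock change, after the last processed file.
    "a.txt": datetime(2023, 11, 5, 1, 55),
    # Revised after the clock change; the wall clock now reads 01:10,
    # which compares as earlier than 01:50 even though it is later in real time.
    "b.txt": datetime(2023, 11, 5, 1, 10),
}

restored = restore_backlog(file_stamps, last_processed)
print(restored)
```

In this sketch only `a.txt` is restored to the table. The file `b.txt` was revised later in real time but carries an earlier wall-clock stamp, so the comparison skips it and the file is never processed.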
While the example set forth above tracks files in order to index the data in the files, it should be understood that the purpose behind tracking files is not so limited. Instead, files may be tracked so that any of a number of types of processing can be performed on the files.