One of the responsibilities of a file system is to map the relationship between the logical data in a file and the physical allocation units (e.g., clusters) located on a permanent storage volume wherein the data is stored. When the amount of useful data in a file is reduced in size, an application program dealing with that file notifies the file system of the reduced file size so that some of the disk space allocated to the file may be freed for reuse. If the data to be freed is at the front of the file, it is the responsibility of the application program to shift the remaining data to the start of the file and inform the file system of the new file size relative to the front of the file. The file system then reclaims space by returning the clusters mapped to the end of the file to the pool of free clusters, essentially deleting the unneeded contents from the end of the file.
However, many applications process data in a sequential, i.e., front-to-back, order. For example, in a merge application, two or more sorted source files are merged into a single sorted target file, at which time the source files are no longer needed. Such a merge is accomplished by sequentially processing data from each of the source files, combining the data according to the appropriate sort order and writing the combined data into a sorted target file. Because a merge application program often merges large files (e.g., 500 megabytes), the source file reads and target file writes are repeatedly performed on small amounts of data until all of the source data is processed. Upon completion of the source data processing, the source files are typically deleted.
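The sequential merge described above can be sketched in Python as follows. This is only a minimal illustration, assuming the source files are text files holding one newline-terminated record per line, already sorted in ascending order; the function name `merge_sorted_files` and its parameters are hypothetical. Note that the source files are deleted only after the target file is completely written.

```python
import heapq
import os
from contextlib import ExitStack


def merge_sorted_files(source_paths, target_path):
    """Merge sorted, newline-terminated text files into one sorted target file.

    Hypothetical helper for illustration only; assumes every source file is
    already sorted line by line and that every line ends with a newline.
    """
    with ExitStack() as stack:
        sources = [stack.enter_context(open(p, "r")) for p in source_paths]
        with open(target_path, "w") as target:
            # heapq.merge consumes its inputs lazily, so the merge proceeds
            # front to back on small amounts of data from each source.
            for line in heapq.merge(*sources):
                target.write(line)
    # The sources are removed only after the whole target exists, so the
    # source data and the target data occupy the disk at the same time.
    for path in source_paths:
        os.remove(path)
```

Because the deletion happens last, the sketch needs free space roughly equal to the combined size of the sources while it runs.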
While the above-described merging approach is very straightforward, it requires that a large amount of free disk space be available during the operation. For example, if the combined sizes of the source files total 500 megabytes, the target file may also be as much as 500 megabytes in size. To perform the merge, up to 500 megabytes of disk space needs to be free before the source files can be deleted. This is true even though the free disk space is essentially temporary, since once the source files are deleted the total occupied disk space will be generally unchanged. Of course, the target file may be smaller than the source files, if some duplicate data was removed. As can be appreciated, such a large amount of free space is not always available on a given disk volume. Moreover, it is highly inefficient to have the application program regularly shift large amounts of data to the front of each file so that a source file can shrink from the back as its data is consumed.
To solve the above-described temporary space problem, a second approach to merging files is to write the merge application program to manage multiple, smaller files which together constitute a large logical file. The application program tracks how the smaller files compose the larger file, and manages the deletion of certain files to free up disk space as the data is processed. However, there is substantial complexity in managing the multiple files which constitute the large logical file. For example, the program will have to separate a large source file into smaller ones, name each file and maintain the logical relationships therebetween, essentially acting as a file system within a file system. Moreover, most operating systems limit the number of simultaneous open files that an application can have, and there is a performance penalty with a high number of simultaneous open files. To avoid having too many open files, even more complexity has to be added to the application program.
Other applications that similarly process data in a front-to-back order are those dealing with first-in, first-out (FIFO) queue files. With such a queue, new items are added to the end of the queue while unneeded items are removed from the front of the queue. A FIFO queue thus supports an EnQueue operation which adds a new item to the end of a queue, and a DeQueue operation, which removes an item from the front of the queue if the queue is not empty. An IsEmpty operation is also provided which tests if the queue is empty.
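The three queue operations can be illustrated with a short in-memory Python sketch. The class and method names below are hypothetical, the queue is not persistent, and raising an exception when dequeueing from an empty queue is an assumption of this sketch rather than something the description above prescribes.

```python
from collections import deque


class FifoQueue:
    """In-memory illustration of the FIFO interface (not persistent)."""

    def __init__(self):
        self._items = deque()

    def enqueue(self, item):
        # New items are added to the end of the queue.
        self._items.append(item)

    def dequeue(self):
        # Items are removed from the front, and only if the queue is not empty.
        if self.is_empty():
            raise IndexError("dequeue from an empty queue")
        return self._items.popleft()

    def is_empty(self):
        return not self._items
```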
Dequeueing individual items is expensive with a persistent FIFO queue, that is, a FIFO queue stored on a permanent storage medium such as a disk. The expense arises because a substantial number of costly disk input-output operations need to be performed to clean a dequeued item from a file. Indeed, with persistent FIFO queues, rather than clean each item from the file immediately after it is dequeued, the program which cleans up the queue first accumulates a number of dequeued items by remembering the items, and later cleans those items from the file in bulk. Such batch cleaning of dequeued items amortizes the cost of dequeueing over a number of dequeueing operations.
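The batching idea may be sketched as follows. This is a simplified illustration under several assumptions that do not come from the description above: records have a fixed size (`RECORD_SIZE`), the count of dequeued-but-uncleaned records is remembered only in memory, and the cleanup step simply rewrites the file with the remaining records. The class name `BatchCleanedQueue` and the `batch_size` parameter are hypothetical.

```python
import os

RECORD_SIZE = 8  # assumed fixed record size in bytes


class BatchCleanedQueue:
    """File-backed queue that cleans dequeued records in batches."""

    def __init__(self, path, batch_size=4):
        self.path = path
        self.batch_size = batch_size
        self.pending = 0   # records dequeued but not yet cleaned from the file
        self.cleanups = 0  # number of bulk cleanups performed so far
        open(path, "ab").close()  # create the file if it does not exist

    def enqueue(self, record):
        if len(record) != RECORD_SIZE:
            raise ValueError("record has the wrong size")
        with open(self.path, "ab") as f:
            f.write(record)

    def is_empty(self):
        return os.path.getsize(self.path) == self.pending * RECORD_SIZE

    def dequeue(self):
        if self.is_empty():
            raise IndexError("dequeue from an empty queue")
        with open(self.path, "rb") as f:
            f.seek(self.pending * RECORD_SIZE)
            record = f.read(RECORD_SIZE)
        self.pending += 1  # remember the item instead of cleaning it now
        if self.pending >= self.batch_size:
            self._clean()
        return record

    def _clean(self):
        # One expensive rewrite covers batch_size dequeue operations.
        with open(self.path, "rb") as f:
            f.seek(self.pending * RECORD_SIZE)
            remaining = f.read()  # simplification: held in memory
        with open(self.path, "wb") as f:
            f.write(remaining)
        self.pending = 0
        self.cleanups += 1
```

With `batch_size=4`, the file is rewritten once per four dequeue operations rather than once per operation.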
A number of methods are known for cleaning up a persistent FIFO queue file having both dequeued (but not cleaned) items and remaining, non-dequeued items. A first method involves overwriting the dequeued data with the remaining data, i.e., shifting the remaining data to the front of the file, and then reducing the file size based on the size of the remaining data. This is accomplished by creating a temporary file equal to the size of the remaining data, copying the remaining data to the temporary file, and then copying the remaining data back to the original file starting at the front of the file. The temporary file is then deleted.
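A minimal Python sketch of this first method is given below. It assumes the byte offset at which the remaining (non-dequeued) data begins is already known and passed in as `head_offset`; the function name and the temporary file name are hypothetical.

```python
import os
import shutil


def clean_via_temp_copy_back(queue_path, head_offset):
    """Shift remaining data to the front of the file via a temporary file.

    head_offset is the assumed-known byte offset where non-dequeued data
    begins. Illustrative sketch only.
    """
    temp_path = queue_path + ".tmp"  # hypothetical temporary file name
    # Copy the remaining data out to the temporary file.
    with open(queue_path, "rb") as src, open(temp_path, "wb") as tmp:
        src.seek(head_offset)
        shutil.copyfileobj(src, tmp)
    # Copy it back, starting at the front of the original file, then
    # reduce the file size to the size of the remaining data.
    with open(temp_path, "rb") as tmp, open(queue_path, "r+b") as dst:
        shutil.copyfileobj(tmp, dst)
        dst.truncate()  # truncates at the current write position
    os.remove(temp_path)
```

The remaining data is copied twice, and the temporary file needs as much space as the remaining data.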
A second method is similar to the first, but instead of copying the temporary file data back to the original file, the temporary file becomes a new persistent FIFO queue file, and the old FIFO queue file is deleted. The file system renames or updates file header information with the name of the new FIFO queue file as necessary.
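The second method might be sketched in Python as shown below, again assuming that the offset of the remaining data (`head_offset`) is known. The function name and the `.new` suffix are hypothetical, and the use of `os.replace` to delete the old file and rename the new one in a single step is a choice made for this sketch.

```python
import os
import shutil


def clean_via_replacement(queue_path, head_offset):
    """Copy remaining data to a new file that then becomes the queue file.

    head_offset is the assumed-known byte offset where non-dequeued data
    begins. Illustrative sketch only.
    """
    new_path = queue_path + ".new"  # hypothetical name for the new file
    with open(queue_path, "rb") as src, open(new_path, "wb") as new:
        src.seek(head_offset)
        shutil.copyfileobj(src, new)
    # The new file takes over the queue file's name; the old file is dropped.
    os.replace(new_path, queue_path)
```

Compared with copying the data back, this saves one copy of the remaining data but still requires temporary space for it.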
However, in both the first and second methods, temporary disk space equal to the size of the non-dequeued data needs to be available. Moreover, the first and second methods involve copying potentially large amounts of data, and copying data is very expensive.
A third method involves overwriting the dequeued data with the non-dequeued data by moving the non-dequeued data to the front of the file within the file itself. However, although no temporary free space is needed with this approach, substantial data copying still takes place in order to move the data. Moreover, if a system failure occurs during the copying, the file may be in an inconsistent state.
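An in-place shift of this kind can be sketched as follows. The function name, the `head_offset` parameter (the assumed-known byte offset of the first non-dequeued item) and the chunk size are all hypothetical choices made for illustration.

```python
def compact_in_place(queue_path, head_offset, chunk_size=64 * 1024):
    """Move non-dequeued data to the front of the file without a temp file.

    Illustrative sketch only. A failure part-way through the loop leaves
    the file partly overwritten, i.e., in an inconsistent state.
    """
    with open(queue_path, "r+b") as f:
        read_pos = head_offset
        write_pos = 0
        while True:
            f.seek(read_pos)
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # write_pos never passes read_pos, so data that has not been
            # read yet is never overwritten.
            f.seek(write_pos)
            f.write(chunk)
            read_pos += len(chunk)
            write_pos += len(chunk)
        # Reduce the file size to the size of the remaining data.
        f.truncate(write_pos)
```

No temporary file is needed, but every remaining byte is still read and rewritten once.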
Lastly, the items may be maintained within a number of smaller, serially numbered files ranging from a first file to a last file. New items are appended to the last file until that file becomes filled, at which time a new file is created and becomes the last file, increasing the total number of files. When all of the items in the first file are dequeued, the first file is deleted, returning that file's space to the file system. As can be appreciated, this method requires the development and maintenance of an extra, complex layer of file management software.
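To give a sense of the extra management layer involved, the following Python sketch keeps items in serially numbered segment files. It is a simplified illustration: the class name, the file naming scheme and the `items_per_segment` limit are hypothetical, items are assumed to be single-line strings, and the bookkeeping is kept only in memory, whereas a real implementation would also have to persist it.

```python
import os


class SegmentedQueue:
    """FIFO queue spread over serially numbered segment files (sketch)."""

    def __init__(self, directory, items_per_segment=3):
        self.directory = directory
        self.items_per_segment = items_per_segment
        self.first = 0       # number of the oldest segment file
        self.last = 0        # number of the segment receiving new items
        self.last_count = 0  # items written to the last segment
        self.first_read = 0  # items already dequeued from the first segment
        os.makedirs(directory, exist_ok=True)
        open(self._path(0), "a").close()

    def _path(self, number):
        return os.path.join(self.directory, f"segment-{number:06d}.q")

    def enqueue(self, item):
        if self.last_count == self.items_per_segment:
            # The last file is filled: a new file becomes the last file.
            self.last += 1
            self.last_count = 0
        with open(self._path(self.last), "a") as f:
            f.write(item + "\n")
        self.last_count += 1

    def is_empty(self):
        return self.first == self.last and self.first_read == self.last_count

    def _drop_consumed_first(self):
        # Delete the first file once all of its items have been dequeued,
        # returning its space to the file system.
        while (self.first < self.last
               and self.first_read == self.items_per_segment):
            os.remove(self._path(self.first))
            self.first += 1
            self.first_read = 0

    def dequeue(self):
        if self.is_empty():
            raise IndexError("dequeue from an empty queue")
        self._drop_consumed_first()
        with open(self._path(self.first)) as f:
            item = f.readlines()[self.first_read].rstrip("\n")
        self.first_read += 1
        self._drop_consumed_first()
        return item
```

Even in this reduced form, the application has to name the files, track which one is first and last, and decide when a file can be deleted.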