One of the media utilized by computers to store data is a magnetic disk drive. Magnetic disk drives allow computers to alter the data that is stored thereon, such that data can be read from and written onto the magnetic disk. The data is usually stored on the disk in groups called files, with each file being independently accessible. The location on the disk where each file is stored is identified and stored in a data structure so that the computer can quickly access the necessary data when desired.
The storage area on the disk is divided into concentric circles called tracks. The number of tracks on a disk depends on the size of the disk. It is not uncommon for a disk to have over one thousand concentric tracks. Each track on the disk is divided into an equal number of sectors. Typically, the sectors are arranged in slices, such that the sectors at the outer edge, or beginning, of the disk take up more linear space along a track than the sectors near the inner edge, or end, of the disk. However, the data stored within each sector is arranged such that each sector contains an identical amount of data. The location of each sector is stored in a special data structure on the disk, thereby making each sector independently accessible.
The sectors of each track are grouped into clusters. The grouping of sectors into clusters is performed by the operating system and thus is not a physical delimitation. Every track on the disk contains the same number of clusters, and every cluster on the disk contains the same number of sectors. A cluster usually comprises from one to sixty-four sectors, depending on the size of the disk, with each sector storing 512 bytes of data. Because each sector is independently accessible, and clusters are mapped to groups of sectors, each cluster can be independently accessed.
The operating system of a computer stores files in one or more clusters on the disk. A cluster thus contains the contents, or a portion thereof, of a single file. Large files may require several clusters to hold all of the data associated with the file, but extremely small files can be stored in a single cluster. Because a cluster only stores data from a single file, several sectors of the cluster will not be used to store other data if the file does not contain enough data to fill all the sectors of the cluster.
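The unused space described above can be illustrated with a short calculation. The following is a minimal sketch, assuming the 512-byte sector size stated in the text; the function name `cluster_usage` and the example file sizes are hypothetical and chosen only for illustration.

```python
SECTOR_BYTES = 512  # sector size stated in the text


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def cluster_usage(file_bytes, sectors_per_cluster):
    """Return (clusters_allocated, unused_sectors) for one uncompressed file."""
    cluster_bytes = SECTOR_BYTES * sectors_per_cluster
    # Space is allocated in whole clusters, so round up to the next cluster.
    clusters = ceil_div(file_bytes, cluster_bytes)
    # Sectors that actually hold file data (the last one may be partly filled).
    sectors_needed = ceil_div(file_bytes, SECTOR_BYTES)
    unused_sectors = clusters * sectors_per_cluster - sectors_needed
    return clusters, unused_sectors


# A 1000-byte file on a disk with 8-sector clusters.
print(cluster_usage(1000, 8))  # (1, 6)
```

In this example the file needs only two sectors, yet a full eight-sector cluster is allocated, so six sectors remain unavailable to other files.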
Magnetic disk drives include a stack of several rigid aluminum disks, or platters, coated with magnetic material. Each platter stores data on both its top and bottom surfaces. Data is encoded on each of the disks by magnetizing areas of the disk surface. Data is retrieved from or added to the disk by a read/write head. Each head of the magnetic disk drive consists of a tiny electromagnet attached to a pivoting arm, with the electromagnet being positioned very close to the surface of the disks. One such head is provided for each recording surface on each platter. The arms pivot to move the heads back and forth over the surface of the disks in a generally radial path. Head actuators pivot the arms to control the movement of the heads. When data is written onto a disk, the head is positioned over the appropriate area of the disk where the data is to be written. An electrical current, supplied to the head, produces a magnetic field which magnetizes a small area of the disk near the head. This small magnetized area represents a digital bit. Similarly, when data is read from a disk drive the head is positioned over the appropriate magnetized area of the disk, which induces a current in the head. The current induced in the head is then decoded into digital data.
The rigid platters of the disk drive are connected to a single spindle. The spindle is connected to a motor, which spins the disks in unison at a constant rate. Although the sectors of the disk may take up different amounts of space, the amount of data stored within each sector is identical. This allows the disks to spin at a constant rate to retrieve equal amounts of data regardless of the location of the sector on the disk.
When a new cluster on the disk is accessed, two mechanical operations must occur before the head actually begins to read or write data. First, the head, attached to a pivotable arm, is moved radially from the location of the current cluster to the location of the destination cluster. A certain amount of time is required for the arm holding the head to overcome the effects of inertia and friction so as to effect its movement. Additional time is required to allow the head to settle in a stationary position after its movement. Second, the head must wait for the appropriate cluster of the disk, containing the desired data, to spin beneath the head. Because the disk spins at a constant rate, the maximum amount of time that the head must wait for the desired cluster to pass beneath it is the time required for the platter to complete one spin. Therefore, each access of a new cluster creates an inherent delay due to the mechanical requirements of accessing the correct area of the disk.
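The two mechanical delays described above can be put into rough numbers. The following is a minimal sketch; the seek and settle times and the rotation speeds are hypothetical inputs, not figures from the text, and the average case assumes the desired cluster is equally likely to be anywhere on the track.

```python
def max_rotational_delay_ms(rpm):
    """Worst-case wait for the desired cluster: one full revolution."""
    return 60_000.0 / rpm


def access_delay_ms(seek_ms, settle_ms, rpm, worst_case=True):
    """Head movement time plus settling time plus rotational wait."""
    rotation = max_rotational_delay_ms(rpm)
    # On average the head waits about half a revolution (uniform assumption).
    wait = rotation if worst_case else rotation / 2
    return seek_ms + settle_ms + wait


# Hypothetical drive: 10 ms seek, 2 ms settle, 6000 revolutions per minute.
print(access_delay_ms(10, 2, 6000))                    # 22.0
print(access_delay_ms(10, 2, 6000, worst_case=False))  # 17.0
```

Because every access to a new cluster pays this cost, a file scattered over many non-adjacent clusters pays it many times.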
The location of files stored on a disk must be maintained to allow the operating system of the computer to access files when desired. Several data structures exist on the disk that are used for file access. Each sector on the disk has a unique identifying number, based on the location of the sector on the disk, so that it can be easily accessed. Similarly, each cluster has a unique cluster number. Identifying numbers are assigned such that adjacent sectors and clusters, respectively, are consecutively numbered. The primary data structure used for determining what parts of the disk are being used for file data storage is the File Allocation Table (FAT). The FAT, stored near the beginning of the disk, contains an entry for each cluster on the disk. Clusters are listed in the FAT consecutively by their cluster number, beginning with the outermost clusters on the disk. The FAT entry for each cluster contains the number of the cluster in which the next part of that file is contained. The FAT entry for the cluster containing the last data of a file comprises an End of File (EOF) entry. Therefore, each file stored on the disk is represented in the FAT as a chain of one or more clusters. Thus, the FAT indicates whether each cluster is allocated to a file, but does not tell the operating system of the computer to which file a given cluster belongs.
Another data structure, called the Root Directory, hereinafter referred to as the Directory, includes a list of all files and subdirectories stored on the disk. The Directory differs from subdirectories in that the Directory is stored near the beginning of the disk prior to sectors used to store file data. Subdirectories are stored in sectors on the disk in the same manner as are files, and the location of each file and subdirectory is maintained in the Directory. The Directory has an entry, for each file, containing the cluster number in which the first part of that file is stored. By storing the starting cluster number of each file, the Directory ties each file to the FAT.
To access a file on the disk, the operating system of the computer reads the Directory entry for the file to determine the first cluster in which file data is stored. The operating system then reads the entire chain of FAT entries starting at the file's first cluster. Using the location of the first cluster and the chain of clusters from the FAT, the operating system can determine every cluster belonging to the file and access each cluster accordingly.
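The lookup just described can be sketched as a walk along a linked chain. This is a minimal illustration, not the on-disk format: the `EOF` and `FREE` marker values, the file names, and the use of a Python list and dictionary to stand in for the FAT and the Directory are all assumptions made for the example.

```python
EOF = -1   # hypothetical End of File marker
FREE = 0   # hypothetical marker for an unallocated cluster


def file_clusters(directory, fat, name):
    """Return every cluster of `name`, in file order, by walking its FAT chain."""
    cluster = directory[name]      # the Directory supplies the first cluster
    chain = []
    while cluster != EOF:
        if cluster == FREE or cluster in chain:
            raise ValueError("corrupt FAT chain")
        chain.append(cluster)
        cluster = fat[cluster]     # the FAT entry names the next cluster
    return chain


# Toy FAT indexed by cluster number; entries 0 and 1 are unused placeholders.
fat = [FREE, FREE, 3, 7, EOF, FREE, FREE, EOF]
directory = {"A.TXT": 2, "B.TXT": 4}

print(file_clusters(directory, fat, "A.TXT"))  # [2, 3, 7]
print(file_clusters(directory, fat, "B.TXT"))  # [4]
```

Note that the FAT alone cannot answer the reverse question of which file owns cluster 7; the chain must be entered from a Directory entry.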
The amount of data that one disk can store is limited by the size of the disk and the number of tracks thereon. However, a technique called data compression allows the data in a file to be stored in less space on the disk. Compression programs, as are well known in the art, typically identify frequently occurring strings of letters, words, or phrases and replace each string with a respective token. Because each token requires a much smaller space for storage than does the respective string, the file data requires less storage space. A file stored using such a technique is called a compressed file. When the compressed data is to be read, it is first accessed and then decompressed to return it to its original form. The compression of files not only saves disk storage space, but also decreases access time because it is often faster to read and decompress a compressed file than to read an uncompressed file.
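The token substitution idea can be shown with a toy example. This is a deliberately simplified sketch and not the algorithm of any actual compression program: the substitution table is fixed by hand, and it assumes the token characters never occur in the original text.

```python
def compress(text, table):
    """Replace each frequently occurring string with its shorter token."""
    for phrase, token in table:
        text = text.replace(phrase, token)
    return text


def decompress(text, table):
    """Undo the substitutions in reverse order to restore the original text."""
    for phrase, token in reversed(table):
        text = text.replace(token, phrase)
    return text


# Hypothetical table: one-character tokens stand in for common strings.
TABLE = [("the ", "\x01"), ("disk", "\x02")]

original = "the disk and the disk drive"
packed = compress(original, TABLE)

print(len(original), len(packed))
print(decompress(packed, TABLE) == original)  # True
```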
Unlike disk space allocation for uncompressed files, which are stored in cluster units, compression programs allocate disk storage space by sector units. Compression programs, therefore, support variable-length clusters for the storage of compressed data because only as many sectors as are necessary to store each cluster in compressed form are allocated for that purpose. Therefore, compression programs allow more than one cluster to be stored in the same sectors that would otherwise comprise a single, uncompressed cluster. Variable-length clusters, used to store compressed data, allow previously vacant sectors in a cluster to be allocated to store data from another compressed cluster. This allows each sector of storage space on the disk to be available for storing compressed files.
Disk compression programs increase the amount of data that can be stored on a disk by compressing the data stored in each cluster so that it can be stored in fewer sectors than it would normally require. The technique used by these compression programs is to compress all files on the disk and store them in a single, large file called the Compressed Volume File (CVF). Once the files are compressed and stored in the CVF, any attempt to access one of the files will be intercepted by compression device driver software. This device driver locates the file's compressed data within the CVF and executes the read or write request, decompressing or compressing the file data as necessary.
In order to locate the compressed data within the CVF, the device driver maintains several data structures. The FAT and Directory are maintained, and each contains the same information in the same manner as for an uncompressed drive. Another data structure, called the Sector Heap, occupies the majority of the space in the CVF on the disk. The Sector Heap is an array of sectors on the disk where the compressed data is actually stored.
Another data structure maintained is the Microsoft DoubleSpace File Allocation Table (MDFAT). Microsoft DoubleSpace is a data compression program included in MS-DOS. The MDFAT keeps track of where the uncompressed data for each cluster is stored in compressed form within the CVF. The MDFAT parallels the FAT and contains an entry for each respective entry in the FAT. For each FAT entry for a given cluster, the corresponding MDFAT entry indicates in what sector numbers within the Sector Heap that cluster's compressed data is stored.
A further data structure created for compressed drives is the BitFAT. The BitFAT is a table containing one bit for each sector on the disk. Each bit indicates whether the respective sector is being used to store data or whether the sector is unused, i.e., vacant. Therefore, available, vacant sectors can be found by scanning the BitFAT.
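Scanning the BitFAT for vacant sectors can be sketched as follows. This is a minimal illustration that assumes the table is held as a Python list with 1 meaning used and 0 meaning vacant; the function name `find_vacant_run` and the requirement that `length` be at least 1 are assumptions of the example, not details from the text.

```python
def find_vacant_run(bitfat, length):
    """Return the first sector of the first run of `length` vacant sectors, or None."""
    run_start, run_len = None, 0
    for sector, used in enumerate(bitfat):
        if used:
            run_len = 0                # a used sector ends the current run
        else:
            if run_len == 0:
                run_start = sector     # a new vacant run begins here
            run_len += 1
            if run_len == length:
                return run_start
    return None


# One bit per sector: 1 = in use, 0 = vacant.
bitfat = [1, 1, 0, 1, 0, 0, 0, 1]

print(find_vacant_run(bitfat, 1))  # 2
print(find_vacant_run(bitfat, 3))  # 4
print(find_vacant_run(bitfat, 4))  # None
```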
When data on a disk is deleted and new data added, fragmentation may occur. Fragmentation is the storing of parts of a single file in non-adjacent clusters on the disk. The operating system of the computer stores new file data in the first available vacant space. However, the first available vacant space may be too small to store all of the new file data to be stored. Therefore, a portion of the new file data will be stored in the first available vacant space and any leftover data will be stored in the next available vacant space. Unfortunately, the next available vacant space may be located at a distant portion of the disk. Thus, a file may be stored in several non-adjacent clusters at various locations on the disk. Such a file, stored in non-adjacent clusters, is called a fragmented file. FIG. 1a shows how files can be stored in fragmented clusters.
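The way first-available allocation produces a fragmented file can be sketched in a few lines. This is a minimal illustration under assumed conditions: the free-space map is a Python list of booleans indexed by cluster number, and the function names are hypothetical.

```python
def first_fit_allocate(free_map, clusters_needed):
    """Take vacant clusters in order of cluster number; return those allocated."""
    allocated = []
    for cluster, is_free in enumerate(free_map):
        if len(allocated) == clusters_needed:
            break
        if is_free:
            free_map[cluster] = False
            allocated.append(cluster)
    if len(allocated) < clusters_needed:
        for cluster in allocated:      # undo the partial allocation
            free_map[cluster] = True
        raise ValueError("not enough vacant clusters")
    return allocated


def fragment_count(clusters):
    """Number of contiguous runs that the clusters of one file fall into."""
    return sum(1 for i, c in enumerate(clusters)
               if i == 0 or c != clusters[i - 1] + 1)


# Clusters 1-2 and 5-7 are vacant after earlier deletions.
free_map = [False, True, True, False, False, True, True, True]

new_file = first_fit_allocate(free_map, 4)
print(new_file)                  # [1, 2, 5, 6]
print(fragment_count(new_file))  # 2
```

The four-cluster file would have fit contiguously in clusters 5 through 7 plus one more, but because the first vacant space is always used first, it is split into two fragments.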
The storage of fragmented file data in non-adjacent clusters is undesirable and degrades the overall performance of disk drive operations. When the fragmented file is accessed, the heads must make multiple movements to each of the clusters in which the file data is stored. The heads must also wait for the appropriate cluster to spin around to the location of the head before reading or writing of data can begin. Not only do multiple head movements create an undesirably long access time, they cause excessive wear of the heads, arms, and actuator. Therefore, it is desirable to defragment the disk such that all files are stored in contiguous clusters. Defragmentation speeds access times and lessens the mechanical wear of the head assembly because an entire read or write request can be executed without requiring multiple head movements between non-contiguous clusters, with the spin delay associated therewith.
The objective of defragmentation is to rearrange all files stored on the disk such that each file is stored in contiguous clusters on the disk. A further objective of defragmentation is to store all files in contiguous clusters at the beginning of the disk, thereby consolidating all vacant disk storage space at the end of the disk. Consolidation of vacant space at the end of the disk is beneficial because the operating system need not search as long in order to locate sufficient available space in which to store new data, and the new data is more likely to be stored in contiguous clusters, rather than in numerous smaller groups of clusters scattered throughout the disk.
Compressed file data can become fragmented in the same manner as uncompressed file data. The groups of sectors in the Sector Heap in which the compressed clusters of a file are stored may not be adjacent to one another. Additionally, the unused sectors within the Sector Heap may become scattered around the disk, rather than being consolidated at the end of the Sector Heap. Furthermore, because compression programs allocate disk space for storage of compressed data by sector rather than by cluster, a compressed disk drive is actually more prone to fragmentation than an uncompressed drive. This is demonstrated in FIG. 1b, where it is seen that compressed data may occupy only a few of the sectors allocated to each uncompressed cluster, thereby leaving several vacant sectors within each cluster. These vacant sectors, freed up by the compression of each cluster, create additional fragmentation because the compressed data of a file, even if stored in contiguous clusters, is not stored in contiguous sectors. Rather, the compressed file data is broken into groups of compressed sectors interspersed with vacant sectors.
The prior method of defragmentation, utilized to rearrange compressed file data in contiguous sectors, requires two stages, or passes. The first pass consolidates the FAT and corresponding MDFAT entries for each file into adjacent cluster entries near the beginning of the disk, thereby consolidating all vacant clusters at the end of the disk. The prior method transfers into adjacent clusters not only the FAT and MDFAT entries for each cluster, but also the actual compressed data stored in each respective cluster. However, this first pass merely ensures that the compressed data is located in adjacent clusters, but does not ensure that the compressed data is located in contiguous sectors.
The second pass utilizes variable-length clusters to store and rearrange the compressed data into adjacent sectors within the Sector Heap. The cluster length of each respective compressed cluster is defined as the number of sectors required to store the data of the uncompressed cluster in compressed form regardless of the number of sectors in an uncompressed cluster. For instance, if all clusters on a given disk have eight sectors, but the data of a particular cluster, stored in compressed form, requires only four sectors, then the compression program sets the length of that cluster at four sectors, rather than eight. FIG. 1c shows how the defragmentation of compressed clusters results in each file being stored contiguously in adjacent sectors.
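The layout that the second pass aims for can be sketched as a simple packing of variable-length clusters. This shows only the target arrangement, not the sequence of data movements the prior method performs to reach it; the function name, the dictionary standing in for MDFAT-style entries, and the cluster numbers and lengths are all hypothetical.

```python
def pack_sector_heap(cluster_lengths):
    """Place compressed clusters back to back.

    `cluster_lengths` lists (cluster_number, sectors_needed) in file order.
    Returns a mapping of cluster number to (first_sector, sector_count)
    together with the total number of sectors used.
    """
    layout = {}
    next_sector = 0
    for cluster, length in cluster_lengths:
        layout[cluster] = (next_sector, length)
        next_sector += length          # the next cluster starts immediately after
    return layout, next_sector


# Three clusters that would each occupy eight sectors uncompressed.
layout, used = pack_sector_heap([(2, 4), (3, 2), (7, 5)])

print(layout)  # {2: (0, 4), 3: (4, 2), 7: (6, 5)}
print(used)    # 11
```

Here eleven contiguous sectors hold data that would occupy twenty-four sectors as fixed eight-sector clusters, and no vacant sectors are left between the compressed clusters.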
However, a disadvantage of the prior method involves the unnecessary movements of data. The actual file data stored in the Sector Heap is moved during the first pass, and again during the second pass. During the first pass, where the FAT and MDFAT are rearranged into adjacent entries, the corresponding compressed data for each respective cluster is moved each time the FAT and corresponding MDFAT entries for that cluster are moved. Because each FAT and MDFAT entry may be moved numerous times during the first pass before they are placed in their final position, the corresponding file data must be moved numerous times as well. Such numerous movements of data, requiring multiple movements of the heads to first access the data in its current location, read the data, and then relocate to the new location to write the data, create very long delays and inefficient operation of the defragmentation program. Contributing further to the inefficiency of the current method is that every movement of compressed file data requires a decompress/compress sequence. When compressed data is moved, it is first read, then decompressed, then transferred to the new location, where it is then compressed again before being written into the new location. These operations associated with the transfer of compressed data add to the inefficiency of the current method.
Furthermore, the prior method of defragmentation makes inefficient choices on where to temporarily relocate compressed data during the second pass to make room for the placement of data being moved to its final, defragmented location. Prior methods usually relocate data to the first available vacant space on the disk. Because of this "first fit" relocation, the data placed in a temporary location may subsequently obstruct other data being moved to its final, defragmented location. Therefore, the same data may be moved many times to different temporary locations before it is ultimately placed in its final location. These multiple movements of data result in further inefficiency of the defragmentation program. Because defragmentation programs using the prior method can take several hours to complete, it is desirable to eliminate as much inefficiency as possible from the defragmentation operations.