Just as a chain can be no stronger than its weakest link, the throughput of a high performance, general purpose computer can be no greater than the throughput afforded by the secondary storage (typically magnetic disk) management subsystem. While magnetic disk drives (hereinafter often referred to simply as "disks") offer the virtues of mass storage and low cost, their operation is characterized by a time scale that is orders of magnitude slower than that of the CPU. Thus, unless the disk access is effectively optimized, the CPU is likely to spend much of its time waiting for more data to process.
Effective optimization of access to a disk requires consideration of the three degrees of freedom of a disk. Broadly, the hierarchy of organizational entities on a disk includes cylinders, tracks, and sectors (or pages). These will now be described.
A disk typically consists of multiple rotating recording surfaces, each with an associated recording head (often referred to simply as a "head"). Each head is connected to an arm which is able to move with great precision to any one of a plurality of radial positions between the edge of the disk surface and a point near the center of the disk. All of the arms of a disk move coincidently; that is, all of the heads are at the same distance from the edge of the disk at the same time. Each head position defines a disk track which is the circular region of disk surface that rotates under the recording head. A track is conventionally broken up into multiple sectors, or pages. The collection of tracks under all of the heads at a given instant in time is known as a cylinder. It is sometimes advantageous to subdivide a cylinder into a small number of logical cylinders, each of which contains up to some predetermined number of pages.
There are various delays associated with accessing any part of the disk other than the sector that is currently under the head. One type of delay is rotational delay, wherein transfer must wait until the right sector has come under the head. This delay is zero if the read/write circuitry is active and the desired sector is the next one on the current track. Thus there is a certain premium on adjacency of pages to be accessed. Another type of delay is head switching in the current cylinder. On most disks, only one head can be active at a time. Thus, if the desired sector is in the current cylinder but is on a different track, the currently active head must be deactivated and the other activated. By careful positioning of data, the effects of this delay can be minimized but not completely eliminated. The most expensive operation is physically moving the head to another cylinder. The cost of this is roughly proportional to the distance moved, so there are again advantages to adjacency. Therefore, the disk management subsystem can achieve significant performance gains if arm movement and head switching can be minimized.
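The three delays described above can be put into a toy cost model. The following Python sketch is illustrative only: the timing constants, the geometry, and the names `Address` and `access_delay_ms` are assumptions introduced here, and it assumes that a head switch is absorbed into a seek whenever the arm moves.

```python
from dataclasses import dataclass

# Hypothetical constants (milliseconds); not taken from any real drive.
SEEK_MS_PER_CYLINDER = 0.1   # seek cost roughly proportional to distance
HEAD_SWITCH_MS = 1.0         # cost of activating a different head
SECTOR_TIME_MS = 0.5         # time for one sector to pass under the head
SECTORS_PER_TRACK = 32


@dataclass(frozen=True)
class Address:
    cylinder: int
    track: int    # head number within the cylinder
    sector: int


def access_delay_ms(current: Address, target: Address) -> float:
    """Estimate the delay before a transfer at `target` can begin.

    `current` is the sector that has just finished passing under the
    active head, so the sector after it is available with no delay.
    """
    delay = 0.0
    if target.cylinder != current.cylinder:
        # Arm movement: the most expensive case.
        delay += SEEK_MS_PER_CYLINDER * abs(target.cylinder - current.cylinder)
    elif target.track != current.track:
        # Same cylinder, different surface: head switch only.
        delay += HEAD_SWITCH_MS

    # The disk keeps rotating while the arm moves or the head switches.
    position = current.sector + 1 + delay / SECTOR_TIME_MS
    wait_sectors = (target.sector - position) % SECTORS_PER_TRACK
    return delay + wait_sectors * SECTOR_TIME_MS
```

Under these assumed constants, reading the next sector on the same track costs nothing, while switching heads to the "same" next sector on another track misses it and pays almost a full rotation. Placing the data two sectors later on the second track reduces the cost to the head switch alone, which is the kind of careful positioning the text refers to.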
In modern operating systems it is the responsibility of the operating system to manage the mapping of files onto the underlying disk memories. This involves keeping tables which relate each file to the pieces of disk storage which make up the file and tables which keep track of the pieces of the disk storage which are currently not in use. These structures must allow for dynamic expansion of files and for the efficient allocation of free space. Managing these tables inevitably introduces overhead in the use of the files. A hidden cost is the cost of writing those tables back to disk each time they are modified. Hence, it would be an advantage to minimize the access time to the tables as well as to the data.
One common technique for managing the pieces of a file is to chain the pieces together. That is, each piece contains the address of the next piece (and sometimes the previous piece). This has the advantage that when scanning through files sequentially, the address of the next piece is already in memory, so there is no overhead in looking up the address of the next piece. The major disadvantage of the chained technique is that when accessing data randomly, the system must fetch all pieces between the current location in the file and the new one, even though the intermediate data will not be used.
Another common technique is to have a file map that collects into one place a list of all of the pieces of the file. The file map technique allows direct lookup of the address of the desired piece without requiring a scan of the data of the file. When accessing a file randomly, the map may be in memory after the first reference, reducing the average cost of random access. The only additional cost for sequential access is the memory cost of having the entire map in memory instead of just the address of the next piece.
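The difference between the chained and file map techniques for random access can be illustrated by counting disk reads. This Python sketch makes several assumptions not stated above: pieces carry forward links only, the address of the first piece is always known, and the whole file map fits in a single piece. The function names are hypothetical.

```python
def chained_reads(current_piece: int, target_piece: int) -> int:
    """Pieces fetched to reach `target_piece` in a forward-chained file.

    The piece at `current_piece` is assumed to be in memory already.
    """
    if current_piece < 0 or target_piece < 0:
        raise ValueError("piece numbers must be non-negative")
    if target_piece >= current_piece:
        # Follow the chain forward, fetching every intermediate piece.
        return target_piece - current_piece
    # No backward links: restart at piece 0 and walk forward.
    return target_piece + 1


def mapped_reads(map_in_memory: bool) -> int:
    """Pieces fetched to reach any piece of a file that has a file map."""
    # One read for the data, plus one for the map on the first reference.
    return 1 if map_in_memory else 2
```

For example, moving from piece 0 to piece 100 of a chained file costs 100 reads under this model, whereas a mapped file costs at most 2 reads regardless of distance.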
The main problem with file maps is that they tend to be wasteful of space or quite complicated in internal structure. This is due to the fact that the majority of files are quite small, but the most heavily used files tend to be large. If the file map is small, complicated indirection schemes must be used to support large files. This tends toward the problems of chained approaches. If the file map is large, much space will be wasted on the majority of files which require only one or two pieces of storage.
The UNIX time-sharing system represents a compromise approach wherein direct access structures are used for small files and indirection is used for relatively large files. To this end, the file map includes thirteen pointers: the first ten are direct; the eleventh is indirect; the twelfth is doubly indirect; the thirteenth is triply indirect. This approach, while an improvement, is still limited in that files are restricted to relatively small sizes and the growth increment is not very smooth. (See "The UNIX Time-Sharing System", Ritchie and Thompson, The Bell System Technical Journal, Vol. 57, No. 6, part 2).
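The thirteen-pointer scheme can be illustrated by classifying a logical block number according to the pointer that reaches it. In this Python sketch, the figure of 128 addresses per indirect block is an assumption chosen for illustration (it is not given above), and the function names are hypothetical.

```python
DIRECT_POINTERS = 10
INDIRECT_LEVELS = ("single indirect", "double indirect", "triple indirect")


def max_blocks(ptrs_per_block: int = 128) -> int:
    """Largest number of blocks addressable through the thirteen pointers."""
    return DIRECT_POINTERS + sum(
        ptrs_per_block ** depth for depth in range(1, len(INDIRECT_LEVELS) + 1)
    )


def pointer_level(block_index: int, ptrs_per_block: int = 128) -> str:
    """Return which kind of pointer reaches logical block `block_index`."""
    if block_index < 0:
        raise ValueError("block index must be non-negative")
    if block_index < DIRECT_POINTERS:
        return "direct"
    remaining = block_index - DIRECT_POINTERS
    for depth, name in enumerate(INDIRECT_LEVELS, start=1):
        span = ptrs_per_block ** depth  # blocks reachable at this depth
        if remaining < span:
            return name
        remaining -= span
    raise ValueError("block index exceeds the maximum file size")
```

Each level of indirection adds one extra disk read before the data block can be located, so the access cost jumps at blocks 10, 138, and 16,522 under the assumed 128 addresses per block. These abrupt steps are one way to see why growth under this scheme is not smooth.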
Disk space which is available for allocation to files is conventionally managed in a variety of ways. The most common is a bit map, with one bit per allocatable piece of storage. Another common technique is chained blocks or "extents". Whenever space is allocated it is removed from the free space table, and whenever it is deallocated it is added to the free space table. Free space tables are typically allocated as fixed sized, contiguous tables in a reserved part of the disk.
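A bit map of the kind described above can be sketched in a few lines. This Python example is a minimal illustration, not a description of any particular system: the class name `FreeSpaceBitmap` is hypothetical, and the policy of always allocating the lowest-numbered free piece is an assumption.

```python
class FreeSpaceBitmap:
    """One bit per allocatable piece of storage; a set bit means free."""

    def __init__(self, n_pieces: int) -> None:
        self.n_pieces = n_pieces
        self.free = (1 << n_pieces) - 1  # initially every piece is free

    def allocate(self) -> int:
        """Remove the lowest-numbered free piece from the map and return it."""
        if self.free == 0:
            raise RuntimeError("no free space")
        lowest = self.free & -self.free       # isolate the lowest set bit
        self.free &= ~lowest                  # mark the piece as in use
        return lowest.bit_length() - 1

    def deallocate(self, piece: int) -> None:
        """Return a previously allocated piece to the map."""
        if not 0 <= piece < self.n_pieces:
            raise ValueError("piece out of range")
        if (self.free >> piece) & 1:
            raise ValueError("piece is already free")
        self.free |= 1 << piece

    def free_count(self) -> int:
        return bin(self.free).count("1")
```

A real allocator would normally prefer a free piece physically close to the file being extended rather than the lowest-numbered one; the first-fit choice here only keeps the sketch short.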
The techniques above are described in more detail in Madnick & Donovan, Operating Systems, 1974.
A major problem with the usual techniques is that the structures used to manage files on disks are typically physically distant from the space managed. This results in much undesirable arm movement during allocation, deallocation, and file access. For example, in extending a file it is necessary first to detect that extension is required, then to locate a candidate piece physically close to the current end of the file, to update the free space table, to update the file map, to initialize the allocated space on disk, and finally to allow the user to access the newly allocated space. These operations must occur in a strictly choreographed way to protect the data structures from corruption in the event of operating system or hardware failure.
An invention used in the implementation of the DEMOS file system (Powell, "The Structure of the DEMOS File System", design specification) was the cylinder block. On DEMOS, the cylinder block contains file descriptors, including a file map, and a free space map for the cylinder. This approach is advantageous in that the free space map and file map can be updated simultaneously when the space to be allocated is on the same cylinder as the file descriptor. The DEMOS file system was designed for the Cray-1 at Los Alamos National Laboratory, and the base design is clearly tuned for an environment that is characterized by an unusually large number of very large files (largely data from nuclear weapons experiments). In addition, those files tend to be created once and never extended. Thus, the DEMOS system optimized allocation of files in large chunks.
The DEMOS cylinder block contains an allocation bit map and an array of 30 file descriptors as its major components. The allocation map is simply a bit map of the free pages on that particular cylinder. A file descriptor contains some control information about the given file and eight block group entries. A block group consists of some number of contiguous blocks (sectors) on the disk, determined at the time the block is allocated. Indirect block groups are possible, at some cost. There are 180 blocks (sectors) in each cylinder. It is not possible to have more files than there are file descriptors. Thus for the disk space to be fully utilized the average file size must be at least 6 blocks, or some 24,500 bytes. This is, of course, consistent with the concept of optimizing the system for allocating files in large chunks.
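The layout just described can be modeled roughly as follows. This Python sketch uses only the figures given above (180 blocks, 30 descriptors, eight block group entries); the class and method names, the first-fit search, and the omission of control information and indirect block groups are all simplifying assumptions, not the DEMOS implementation.

```python
from dataclasses import dataclass, field

BLOCKS_PER_CYLINDER = 180
DESCRIPTORS_PER_CYLINDER = 30
GROUPS_PER_DESCRIPTOR = 8


@dataclass
class BlockGroup:
    first_block: int   # first block of a contiguous run
    length: int        # number of blocks in the run


@dataclass
class FileDescriptor:
    in_use: bool = False
    groups: list = field(default_factory=list)


@dataclass
class CylinderBlock:
    # Allocation map: True means the block is free.
    free: list = field(default_factory=lambda: [True] * BLOCKS_PER_CYLINDER)
    descriptors: list = field(
        default_factory=lambda: [
            FileDescriptor() for _ in range(DESCRIPTORS_PER_CYLINDER)
        ]
    )

    def create_file(self) -> int:
        """Claim an unused descriptor and return its index."""
        for index, descriptor in enumerate(self.descriptors):
            if not descriptor.in_use:
                descriptor.in_use = True
                return index
        raise RuntimeError("no free file descriptor on this cylinder")

    def allocate_group(self, index: int, length: int) -> BlockGroup:
        """Give file `index` the first free run of `length` contiguous blocks."""
        if length < 1:
            raise ValueError("length must be at least 1")
        descriptor = self.descriptors[index]
        if len(descriptor.groups) >= GROUPS_PER_DESCRIPTOR:
            raise RuntimeError("descriptor full; an indirect group is needed")
        run = 0
        for block, is_free in enumerate(self.free):
            run = run + 1 if is_free else 0
            if run == length:
                first = block - length + 1
                # Map and descriptor live together, so both are updated here.
                for k in range(first, block + 1):
                    self.free[k] = False
                group = BlockGroup(first, length)
                descriptor.groups.append(group)
                return group
        raise RuntimeError("no contiguous run of the requested length")
```

Creating 30 one-block files in this model exhausts the descriptors while 150 blocks remain free, whereas 30 six-block files fill the cylinder exactly.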
However, the environment for which the DEMOS system was designed varies radically from the typical case in commercial applications. Most existing systems have a very large number of small files which, once created, rarely grow. The large files in those systems tend to be data bases which grow over time by small increments. Moreover, for commercial systems the average file size is something more like 2000-3000 bytes. Thus in a typical commercial application the DEMOS file system would only be able to use 10-20 percent of the available disk space. Similarly, the manner in which indirect blocks are used in the DEMOS system penalizes files which grow in small increments. Thus, while the DEMOS design has significant advantages in its context, it is clearly unsuitable for use in the more common commercial systems.
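The utilization figure can be checked with simple arithmetic. The Python sketch below assumes a block size of 4096 bytes, which is inferred from the statement that 6 blocks amount to some 24,500 bytes and is not given explicitly. It also ignores any space taken by the cylinder block itself, and the function names are hypothetical.

```python
BLOCK_BYTES = 4096               # assumption inferred from "6 blocks ~ 24,500 bytes"
BLOCKS_PER_CYLINDER = 180
DESCRIPTORS_PER_CYLINDER = 30


def blocks_occupied_fraction(avg_file_bytes: int) -> float:
    """Fraction of a cylinder's blocks in use once every descriptor is taken."""
    blocks_per_file = -(-avg_file_bytes // BLOCK_BYTES)  # ceiling division
    used = min(DESCRIPTORS_PER_CYLINDER * blocks_per_file, BLOCKS_PER_CYLINDER)
    return used / BLOCKS_PER_CYLINDER


def data_bytes_fraction(avg_file_bytes: int) -> float:
    """Fraction of a cylinder's bytes holding file data in the same situation."""
    capacity = BLOCKS_PER_CYLINDER * BLOCK_BYTES
    return min(DESCRIPTORS_PER_CYLINDER * avg_file_bytes, capacity) / capacity
```

With files of 2000 to 3000 bytes, each file occupies a single block, so only 30 of the 180 blocks (about 17 percent) can be used before the descriptors run out. Counted in bytes of actual data, the figure is roughly 8 to 12 percent. Both measures are in line with the 10-20 percent estimate above.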