Due to decreasing costs of semiconductor memory, it has become feasible to use relatively large amounts of main memory for file system or disk caches. In some computing environments, quite large read-hit ratios are attained when using such caches, in many cases well over 90%. A typical example of such an environment is interactive single-user computing: after a start-up period in which the programs and data necessary for the work such a user might be doing on a given project are loaded into memory (the "working set"), the programs and data may remain memory-resident for long periods of time. However, due to 1) the high cost of a reliable uninterruptible power supply of sufficient capacity to support a large main memory, and perhaps more importantly 2) the possibility of main memory contents becoming damaged in unpredictable ways due to user errors, operating system errors, application program errors, or computer viruses, it is necessary to write modified blocks out to a non-volatile secondary memory.
It is most cost-effective to use the same secondary memory for 1) loading of programs and data during a start-up period, 2) handling read misses, and 3) as the non-volatile store for writing modified data blocks, i.e., the disk storage as used by the file system. Note that after a start-up period in which a working set is loaded into main memory, the disk will be used primarily to handle random writes. Most current file system designs, however, which originated when main memory was much more expensive and consequently smaller, are not optimized for this kind of disk workload. Rather, they are optimized for sequential reads.
The problem of optimizing a file system for mostly random write disk access has previously been addressed in the design of log-structured file systems. See, e.g., Rosenblum and Ousterhout, "The design and implementation of a log-structured file system", Proc. Symp. Operating System Principles, 1991. A log-structured file system (LFS) has the problem that it is necessary to periodically reorganize the disk in order to generate new (free) log areas. In Rosenblum the write cost for a given storage utilization was defined as the average number of blocks accessed for each block written. For reorganization purposes a disk is divided into segments. Suppose segments are of size S blocks, and that all segments have storage utilization u. That is, a segment consists of a number S of physically contiguous blocks, and of these blocks, the number of in-use blocks is uS and the number of free blocks is (1-u)S. Then after (1-u)S blocks are written to the log area in the segment, reorganization of the segment will eventually be necessary. Under the "copy and compact" reorganization method, as described in Rosenblum, entire segments are read and then all in-use blocks are written back contiguously. In general, for each segment, this will require S reads (although some of the blocks on a given track in a segment may be free, if any are in-use it is necessary to read the entire track) and uS writes (where the numbers of reads and writes are given as numbers of blocks). Thus, for each (1-u)S new blocks written, a total of S + uS + (1-u)S blocks are accessed. It follows that the write cost (as defined in Rosenblum) under these simplifying assumptions is

$$\text{write cost} = \frac{S + uS + (1-u)S}{(1-u)S} = \frac{2}{1-u}$$
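The per-segment accounting described above for the "copy and compact" method can be checked numerically. The following is a minimal Python sketch, assuming exactly that accounting (S block reads, uS block rewrites, and (1-u)S new block writes per segment, with every segment at the same utilization u). The function name and the default segment size are illustrative assumptions and do not come from Rosenblum or from the present disclosure.

```python
def copy_and_compact_write_cost(u: float, segment_blocks: int = 512) -> float:
    """No-variance write cost for copy-and-compact reorganization.

    u              -- storage utilization of every segment, 0 <= u < 1
    segment_blocks -- segment size S in blocks (hypothetical default)
    """
    if not 0.0 <= u < 1.0:
        raise ValueError("utilization u must satisfy 0 <= u < 1")
    if segment_blocks <= 0:
        raise ValueError("segment size must be positive")

    reads = segment_blocks                    # the whole segment is read: S
    rewrites = u * segment_blocks             # in-use blocks written back: uS
    new_writes = (1.0 - u) * segment_blocks   # new blocks the free area absorbs

    # Blocks accessed per new block written; S cancels, leaving 2 / (1 - u).
    return (reads + rewrites + new_writes) / new_writes


if __name__ == "__main__":
    for u in (0.2, 0.5, 0.6, 0.8, 0.9):
        print(f"u = {u:.1f}  write cost = {copy_and_compact_write_cost(u):.2f}")
```

Because S cancels in the ratio, the result depends only on u, which is why the segment size can be chosen arbitrarily in this sketch.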
A somewhat less expensive reorganization method is to select a source area to clean and a target area to which in-use blocks from the source will be moved (this avoids reading and then writing the same tracks). The inventors of the present invention have discovered that a similar simplified analysis for this method gives a write cost of ##EQU2##
Simulations show that in practice these write costs imposed by reorganization are worst-case results: improvements can be obtained by, for example, selecting the least utilized segments to clean (the above write-costs are called "no variance" write costs since it is assumed that all segments have the same storage utilization). However, the resulting write cost curves still have the same shape, and tend to climb sharply above 60% storage utilization. The no-variance write costs, and write costs generated by simulations (developed by the present inventors) in which the least utilized segments are chosen for reorganization, are shown in FIG. 4.
The only related method for handling disk writes of which we are aware is the write-ahead-dataset (WADS) method used by the Information Management System (IMS); see Strickland, Uhrowczik, and Watts, "IMS/VS: an evolving system", 21 IBM Systems Journal No. 4 (1982), pp. 490-510. An IMS WADS is a temporary location in which log records are stored. It requires a specially formatted count-key-data architecture type disk in which tracks are set up with records in which all keys are zero. Special channel programs are used in which a "search for key=0" command precedes the write. Since all records have a zero key, this results in the write taking place to the first record to reach the disk head, minimizing rotational latency. However: 1) only one in-use record can be stored per track, 2) there does not seem to be any way to extend this to write to the first record on any track, and 3) records are only stored temporarily (since the method can only use a fraction of the disk space), and are re-written to other "ordinary" data sets as soon as it is convenient, which logically frees the corresponding WADS records (and associated tracks).
In contrast, the present invention, described below, always has a write cost of one, and for a typical small systems disk architecture in which there are 14 tracks per cylinder and 64 K-byte blocks per track, the disk write latency is less than one-sixth of a disk rotation for storage utilizations up to 90%.
Thus, the present invention provides a method and system for optimizing a file system for mostly random disk writes.