1. Field of the Invention
This invention relates generally to the I/O operations in a computer system, and more particularly to enhancing synchronous I/O operations to a peripheral storage device which is coupled to the computer system.
2. Description of the Prior Art
Many different computing systems are available today, but most of these systems are built around fundamental components such as those illustrated in FIG. 1. Typically, the fundamental components of a computer 20 include a central processing unit (CPU) 10 which is connected through an input bus 11 to an input module 12 and through an output bus 13 to an output module 14. CPU 10 is also connected through data buses 15, 16 to a memory unit 17.
CPU 10 provides control and computing functions. Input and output modules 12, 14 are used to communicate between the computer user and CPU 10. Input module 12 supplies information to CPU 10. Typical input devices are a keyboard and a mouse. Output module 14 displays information from central processing unit 10. Typical output modules include a video display monitor, a printer and other visual display means such as plotters. Input module 12 and output module 14 are frequently referred to as input/output (I/O) units.
Memory unit 17 typically contains two general types of information, computer programs and data. A computer program is a sequence of instructions that are executed by the computer to perform a specific function. Data in memory unit 17 are processed by CPU 10 in response to the instructions from a computer program which is executing in CPU 10. The elapsed time required for a computer to complete the sequence of instructions in a program is the execution time of the program.
Memory unit 17 typically includes mass memory 17A, sometimes called secondary memory, and main memory 17B. Main memory 17B is a relatively fast memory, i.e., a typical access time is in the range from 20 to approximately 400 nanoseconds. Access time is the time interval between when CPU 10 requests data from memory 17 and when memory 17 makes the requested data available to CPU 10. Main memory 17B is usually used to store at least a portion of the program currently being executed by CPU 10 and data required by this program. Mass memory 17A, such as disks and tapes, is used to store programs, data, or portions of either programs or data which are not needed immediately by the CPU or that cannot be accommodated in main memory 17B because of size limitations of main memory 17B. Since programs and/or data are transferred in and out of mass memory 17A at the direction of CPU 10, mass memory units are typically included in the generic term "I/O units."
Mass memory 17A is significantly slower than main memory 17B. Access time for mass memory is typically on the order of tens of milliseconds. Hence, if mass memory is accessed frequently by a program, the execution time for the program is limited by the mass memory access time.
In addition to speed, main memory 17B and mass memory 17A are typically differentiated by their response to a power failure. Main memory 17B usually consists of dynamic random access memory (RAM). RAM is volatile, which means the state of the RAM can be lost, for example if power is lost to the computer, or the computer crashes. Conversely, mass memory is usually non-volatile. Non-volatile memory, sometimes called "stable storage," is memory that maintains its state when power is interrupted. Ideally, computer programs and data are placed in mass memory 17A for permanent storage and moved into main memory 17B only for execution. As used herein, a computer crash means that the main memory is either corrupted or lost, independent of the crash initiator. Crash initiators include, for example, a power loss, a defective disk drive or a bug in the computer operating system.
Computer 20, as illustrated in FIG. 1, is typically a single unit. However, some I/O modules are quite expensive and often are not used sufficiently by a single user to justify their expense. In addition, when data is contained on a device in a single computer, the data is not directly accessible by other users who may need the data. Networks of computers are therefore formed, as shown in FIG. 2, wherein several computers 20-1 to 20-N, usually similar to computer 20 illustrated in FIG. 1, are coupled to a common computer 21, commonly called the server. Server 21 is in turn connected to peripheral devices which may be accessed through network 22 by computers 20-1 to 20-N, sometimes called workstations. Network 22 may be, for example, an Ethernet network, which is defined by the IEEE 802.3 specification. Such a network permits common usage of data files as well as I/O devices which are connected to server 21.
In general terms, a computer operating system is used to control the operation of CPU 10 (FIG. 1), main memory 17B and I/O modules 12, 14, 17A. In addition, the operating system provides an interface between computer user applications and the hardware. As used herein, hardware refers to the physical components of a computer system.
An operating system is ideally implemented as a software system, because a software system provides the greatest degree of flexibility, as discussed more completely below, for any additions and/or modifications to the operating system. As used herein, software refers to a computer program.
The operating system of a computer typically includes a kernel which (i) creates user processes, as described below, (ii) schedules execution of user processes, (iii) provides user processes with system services, and (iv) services hardware interrupts and exceptions.
The definition of a "user process" requires an understanding of the sequence of steps by which user source code, i.e., a computer program, is transformed into a series of instructions for CPU 10. User source code for an application program, which is to be run on a computer using a particular operating system, is typically compiled and linked into a binary file which is known as an executable image. A "user process" is the execution of an executable image by the kernel of the operating system. In the UNIX operating system (UNIX is a registered trademark of AT&T Bell Laboratories), for example, compilers, I/O utilities and terminal command interpreters are implemented as user processes.
An operating system for a network includes not only the operating system as described above for a single computer, but also additional software that coordinates the operation of the workstations and the server with respect to shared resources. The operating system for either a computer in a network or a standalone computer is usually loaded into main memory 17B when the computer is booted. As used herein, "booted" refers to the sequence of operations that are performed when either power is first applied to the computer or the computer is reset.
In either a network of computers or a standalone computer, the operating system must interface CPU 10 (FIG. 1) and main memory 17B with I/O modules 12, 14, 17A connected to CPU 10. To facilitate coupling of I/O modules 12, 14, 17A to CPU 10, the operating system typically includes at least one device driver for each type of I/O module that is normally connected to CPU 10. A device driver is a part of the kernel of an operating system, and provides a well-defined interface to the proprietary portion of the operating system. For example, a disk driver interfaces the proprietary portion of the operating system with a disk controller which in turn drives a disk. Note that a controller and a disk may be supplied by different vendors. Thus, the disk vendor designs the controller to interact with the disk driver in the operating system and as a result, the disk is easily connected to any computer using the operating system.
The operating system kernel may include device drivers for the secondary memory, the input modules and the output modules, for example. Many operating systems (such as UNIX; VMS, a registered trademark of Digital Equipment Corp.; MS-DOS, a registered trademark of Microsoft Corp.; and OS-2, a registered trademark of International Business Machines, Inc.) are written so that a user may interface new devices to the CPU by supplying a device driver, i.e., a computer program for the new device, which can be incorporated in the operating system executable image.
To install a new device driver in the operating system kernel, the user must often write the device driver in the computer language of the operating system kernel and compile the new device driver into an object module which is subsequently linked with the operating system kernel executable image. To implement the new kernel, the executable image must be loaded and booted in the computer. However, some systems allow a device driver to be loaded into a running kernel. Thus, a computer operating system can be considered as consisting of two parts: (i) a proprietary part, sometimes called the proprietary kernel, which constitutes the bulk of the operating system and in fact controls the CPU; and (ii) a configurable part, which allows users to add device drivers to the operating system kernel.
In a network, the proprietary kernel, as used herein, includes network operating software. Thus, the operating system kernel includes network control code as part of the proprietary kernel. For example, in a network of computers operating under a version of UNIX containing Sun Microsystems' Network File System (NFS), NFS supplies file access to many client machines on the network while UNIX controls the individual client machines. NFS is designed to provide access to files over the network. Client machines with no disk can access remote files transparently using NFS. Client machines with a local disk can use NFS in addition to their local disks to gain access to remote files. To the client machines, NFS files are accessed in the same way as local files.
Independent of the ability to develop customized device drivers for I/O modules, the execution speed of a program with a standalone computer or a network is usually limited by I/O operations because, as discussed above, the actual operational speed of mass memory 17A is significantly slower than the operational speed of CPU 10 and main memory 17B. Several different approaches have been attempted to enhance the perceived and/or operational speed of mass memory so as to enhance the overall performance of the computer system.
Disk accesses on the average appear faster by using main memory as a cache. For example, if a file residing on a disk is accessed, a copy of the file is kept temporarily in a main memory cache. Subsequent accesses to the file are made to the main memory cache. Thus, the subsequent accesses are performed at main memory 17B access speeds rather than at disk access speeds. However, main memory storage is much more expensive per byte than disk storage, and other demands will be placed on main memory so the file cannot remain permanently in main memory. Eventually, the file must be returned to the disk and the memory reclaimed.
This caching strategy works perfectly as long as no modifications are made to the file, because the copy in memory and the copy on the disk are the same. However, if the file is modified (written), the copies in memory and disk may differ. For example, assume that a file on disk contains the data shown below:
______________________________________
Disk                Memory
______________________________________
A
B
C
D
______________________________________
When the file is read, a copy of the file is brought into memory:
______________________________________
Disk                Memory
______________________________________
A                   A
B                   B
C                   C
D                   D
______________________________________
Thus, initially the copy on disk and the copy in memory are the same. The next time this file is read, the file is read from memory at a fast rate. If a change is subsequently made to the file, i.e., replacing the "B" with an "X", the copy of the file in memory and the copy of the file on disk are now different.
______________________________________
Disk                Memory
______________________________________
A                   A
B                   X
C                   C
D                   D
______________________________________
Subsequent reads of the file access the modified data in memory at a fast rate. However, if the computer crashes, main memory 17B is either corrupted or lost. When the computer is restarted after a crash, the contents of the disk are the only version of the data available. Thus, the file does not contain the "X" but rather the "B".
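The loss described above can be illustrated with a minimal sketch. The class name `CachedFile` and its methods are hypothetical and are introduced only for illustration; the sketch assumes a file of four one-character blocks and models a crash simply as discarding the main memory copy.

```python
class CachedFile:
    """Toy model (hypothetical): a disk copy plus a volatile main memory cache."""

    def __init__(self, blocks):
        self.disk = list(blocks)   # non-volatile copy
        self.memory = None         # volatile cache, empty until the first read

    def read(self):
        if self.memory is None:            # first read after boot goes to disk
            self.memory = list(self.disk)
        return list(self.memory)           # later reads are served from memory

    def write(self, index, value):
        self.read()                        # make sure the file is cached
        self.memory[index] = value         # only the cached copy is modified

    def crash(self):
        self.memory = None                 # main memory is lost


f = CachedFile(["A", "B", "C", "D"])
f.read()
f.write(1, "X")              # replace the "B" with an "X" in memory only
before_crash = f.read()      # the cached copy holds the "X"
f.crash()
after_crash = f.read()       # reloaded from disk, which still holds the "B"
```

In this sketch, `before_crash` holds the modified data, while `after_crash` holds the original contents of the disk.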
The UNIX file system tries to reduce the problem associated with a computer crash by periodically writing, i.e., flushing, the contents of the main memory cache back to disk. However, if a crash occurs between flushes, the data entered in the main memory cache after the last flush is lost.
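The effect of periodic flushing can be sketched as follows. This is an illustrative model only, not the UNIX implementation; the class `FlushingCache` and the use of a set of modified block indices are assumptions made for the example.

```python
class FlushingCache:
    """Toy model (hypothetical): a memory cache that is flushed to disk on demand."""

    def __init__(self, blocks):
        self.disk = list(blocks)     # non-volatile copy
        self.memory = list(blocks)   # volatile cached copy
        self.dirty = set()           # indices modified since the last flush

    def write(self, index, value):
        self.memory[index] = value
        self.dirty.add(index)

    def flush(self):
        # Copy every modified block back to disk, then mark the cache clean.
        for index in sorted(self.dirty):
            self.disk[index] = self.memory[index]
        self.dirty.clear()

    def crash(self):
        # Memory is lost; after restart the cache is rebuilt from disk.
        self.memory = list(self.disk)
        self.dirty.clear()


cache = FlushingCache(["A", "B", "C", "D"])
cache.write(1, "X")
cache.flush()            # the "X" reaches the disk
cache.write(2, "Y")      # entered after the last flush
cache.crash()            # the crash occurs before the next flush
surviving = list(cache.disk)
```

Here the "X" survives the crash because it was flushed, while the "Y", written after the last flush, is lost.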
In a network operating under UNIX and NFS, the server through NFS supplies file access to many client machines. Since a crash of the server (the machine providing the access to the files on disk) could destroy the work of many clients, NFS uses a stateless protocol to make recovery after a computer crash simple. A stateless protocol is one in which each update to a file is done synchronously. That is, when an update occurs, the update is written through to the disk, so there is no inconsistent copy in memory. If the server should crash before an update occurs, the client simply continues to send the data to the server. Once the server has been repaired, the data are accepted, the client is told that the data were accepted, and the client can continue. If the server should crash after an update occurs, but before the client hears that the update happened, the client still tries to send the data until it is assured that the data are safely on the disk. When the server is repaired, the client writes the same data again. This is not a problem, since writing the same data two or more times is equivalent to writing the data once.
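The retry behavior described above depends on repeated writes of the same data being harmless. A minimal sketch follows; it does not use the actual NFS protocol or any NFS interface. The names `Server`, `ServerDown`, `client_write` and the scripted list of failures are all hypothetical and serve only to simulate a crash before or after the update reaches the disk.

```python
class ServerDown(Exception):
    """Raised when the simulated server crashes while handling a request."""


class Server:
    def __init__(self, blocks, fail_plan):
        self.disk = list(blocks)
        self.fail_plan = list(fail_plan)   # scripted outcome for each request

    def write(self, index, value):
        outcome = self.fail_plan.pop(0) if self.fail_plan else "ok"
        if outcome == "crash_before":
            raise ServerDown()             # crash before the update occurs
        self.disk[index] = value           # synchronous write to disk
        if outcome == "crash_after":
            raise ServerDown()             # crash before the client is told
        return "ack"


def client_write(server, index, value):
    """Resend the same write until the server acknowledges it."""
    attempts = 0
    while True:
        attempts += 1
        try:
            server.write(index, value)
            return attempts
        except ServerDown:
            continue                       # server repaired; send the data again


server = Server(["A", "B", "C", "D"], ["crash_before", "crash_after"])
attempts = client_write(server, 1, "X")
```

In this run the client sends the write three times, and the second and third requests both store the "X". The disk ends in the same state as if the write had been performed once.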
To illustrate the operation of NFS, consider the previous example again. Initially, the data on disk and in memory after the initial access were:
______________________________________
Disk                Memory
______________________________________
A                   A
B                   B
C                   C
D                   D
______________________________________
As described above, subsequent reads of the data now take place at memory speeds, rather than disk speeds. A client replaces the "B" with an "X". When the client sends the "X", the "X" is copied to both memory and disk, as shown below, so that all copies of the data are consistent.
______________________________________
Disk                Memory
______________________________________
A                   A
X                   X
C                   C
D                   D
______________________________________
Hence, if there is a crash, the updated data is safely stored on the disk. When the crash is repaired, the correct copy of the data is on disk.
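A synchronous, write-through update of this kind can be sketched as follows. The class `WriteThroughFile` is a hypothetical name used only for illustration, and a crash is again modeled as the loss of the memory copy.

```python
class WriteThroughFile:
    """Toy model (hypothetical): every write goes to disk and memory together."""

    def __init__(self, blocks):
        self.disk = list(blocks)     # non-volatile copy
        self.memory = list(blocks)   # volatile cached copy

    def write(self, index, value):
        self.disk[index] = value     # written through to disk first
        self.memory[index] = value   # cache is kept consistent with disk

    def read(self):
        if self.memory is None:      # after a crash, reload from disk
            self.memory = list(self.disk)
        return list(self.memory)

    def crash(self):
        self.memory = None           # main memory is lost


wt = WriteThroughFile(["A", "B", "C", "D"])
wt.write(1, "X")
wt.crash()
recovered = wt.read()                # the "X" is read back from disk
```

Because the disk was updated at the time of the write, the data recovered after the crash include the "X".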
NFS is not the only application that benefits from this kind of caching. Database applications may also perform synchronous writes to disk to ensure this kind of reliability with respect to computer crashes. Unfortunately, the synchronous writes, which are used to preserve data integrity, degrade the speed performance of the computer system. In fact, if the data is updated frequently in comparison to the number of reads of the data, the synchronous writes cause the performance to approach that of a system without the cache.
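The degradation can be estimated with a simple model. The sketch below assumes that every read is satisfied from the main memory cache and every write goes synchronously to disk. The access times of 200 nanoseconds and 20 milliseconds are assumed values chosen from within the ranges given earlier, not measurements.

```python
MEMORY_ACCESS_S = 200e-9   # assumed main memory access time: 200 ns
DISK_ACCESS_S = 20e-3      # assumed disk access time: 20 ms


def average_access_time(write_fraction):
    """Average time per access when reads hit the cache and writes go to disk."""
    read_fraction = 1.0 - write_fraction
    return read_fraction * MEMORY_ACCESS_S + write_fraction * DISK_ACCESS_S


mostly_reads = average_access_time(0.01)    # 1 percent of accesses are writes
mostly_writes = average_access_time(0.90)   # 90 percent of accesses are writes
```

Under these assumptions, a workload in which 90 percent of the accesses are writes averages about 18 milliseconds per access, which is close to the 20 milliseconds of a system with no cache at all.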
Another approach for increasing the performance of secondary memory has been to replace the high density storage of a disk with less dense non-volatile memory. Non-volatile memory is typically RAM with an uninterruptible power supply, e.g., batteries. For example, in a RAM disk, non-volatile memory is substituted for the disk. The replacement RAM disk provides high performance, but the RAM disk is a very expensive disk. A RAM disk made with volatile memory will lose the stored information if power to the memory is inadvertently interrupted. Finally, the high cost of RAM limits the storage capacity of a RAM disk.
Another approach to enhancing the speed performance of secondary memory is to use a non-volatile memory cache on disk controllers. When writes are issued to a disk attached to such a controller, the data first go into the non-volatile memory cache on the disk controller. Later, the written data are flushed from the controller's non-volatile memory cache to the actual disk. Since the controller is not part of the computer operating system, the controller cannot use knowledge the operating system has about when caching is beneficial. For example, a read-only filesystem will not benefit by non-volatile write caching. However, since the controller cannot tell anything about the filesystem, the controller caches all requests for the entire disk, including those for a read-only filesystem. Thus, the controller non-volatile memory cache is used wastefully. Further, the controller only speeds up requests for disks attached to the controller and cannot help amortize the cost of the expensive stable storage among different disks attached to different controllers.
Finally, many subtle issues with respect to data integrity are difficult to address without interacting with the computer operating system. A disk controller that has no special interaction with the operating system cannot, for example, flush the non-volatile memory under operating system control. If the computer system is brought down to repair a disk attached to the controller, any valid data still remaining in the non-volatile memory must be flushed prior to the repair or that data will be lost. A manual interaction with the controller must be used to flush any such data rather than having the operating system shutdown procedure initiate the flush.
The IBM 3990 disk controller is another example of a controller which possesses non-volatile memory. However, this memory is not used as a cache, and the controller does not have software support for recovery after a crash. Further, the operation of the non-volatile memory is not transparent to a user because specific channel commands must be issued to access this memory.
Another approach to maintaining data integrity during a computer crash is to modify the operating system to include transaction support. Conceptually, in a transaction support system, the operating system stores a group of operations in a transaction log. The computer performs the group of operations in the transaction log only after all operations in the group are completely entered in the transaction log. The transaction log is maintained in stable storage, typically a non-volatile memory.
If the computer crashes before the group of operations is completely written from the log to the disk, the group of operations can be restarted when the computer is restarted. If the transaction log was not completed prior to the crash, no operations are done and the log will be reset. This atomic "all or none" property is crucial to transaction systems.
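The "all or none" recovery rule can be sketched as follows. This is a simplified illustration and not the mechanism of any particular operating system. The names `TransactionLog` and `recover` are hypothetical, and the log is assumed to survive the crash because it resides in stable storage.

```python
class TransactionLog:
    """Toy model (hypothetical) of a transaction log kept in stable storage."""

    def __init__(self):
        self.entries = []        # (index, value) operations in the group
        self.complete = False    # set only after the whole group is logged

    def add(self, index, value):
        self.entries.append((index, value))

    def commit(self):
        self.complete = True


def recover(log, disk):
    """After a restart, redo a completed group; discard an incomplete one."""
    if log.complete:
        for index, value in log.entries:
            disk[index] = value  # redoing an operation already done is harmless
    log.entries = []             # in either case the log is reset
    log.complete = False
    return disk


# Case 1: the log was completed, but the crash came after only one operation
# had been written from the log to the disk.
done_log = TransactionLog()
done_log.add(1, "X")
done_log.add(2, "Y")
done_log.commit()
disk_all = recover(done_log, ["A", "X", "C", "D"])

# Case 2: the crash came before the log was completed.
partial_log = TransactionLog()
partial_log.add(3, "Z")
disk_none = recover(partial_log, ["A", "B", "C", "D"])
```

In the first case the whole group is applied after the restart; in the second case none of it is applied and the log is reset.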
Operations with filesystems can also use transaction support to simplify error recovery. Since data in a file are stored as a description block plus data blocks, updating the description and data blocks atomically ensures that the file is maintained in a consistent state. The IBM RT computer uses this technique. While these methods provide for enhanced recovery after a crash, they do not function as a cache and may require special operations to use the non-volatile memory.
Thus, the problems associated with the slow access speed of secondary memory and data integrity following computer crashes have not been satisfactorily solved. A main memory cache provides enhanced performance for accesses to data that are normally maintained in mass memory, but when using synchronous writes which bypass the main memory cache to protect against crashes, the system performance suffers. RAM disks also provide enhanced performance, but the expense of RAM limits the applications of these substitute disks. A stable storage cache in a disk controller only benefits the performance of disks connected to the controller. Further, since the controller is not a part of the computer operating system, the cache is inefficient and special procedures must be developed to prevent loss of data from the cache upon removal of a disk connected to the controller. A means for enhancing the performance of accesses to secondary memory which includes the speed performance of a main memory cache and the protection of synchronous writes against crashes is currently not available.