1. Technical Field of the Invention
This invention relates to scheduling of input/output functions relative to input/output (I/O) devices in a multitasking data processing system. More particularly, it relates to a system and method for enabling applications to influence the order of service of an I/O request after the I/O request has been submitted.
2. Background Art
In contemporary multitasking data processing systems, such as that illustrated in FIG. 1, the operating system 32 of processor 20 schedules input/output operations with respect to input/output device 24 through device interface 34 over I/O bus 28 responsive to I/O requests from applications 30. Typically, I/O device 24 is a long term storage device, such as a magnetic disk. Main storage 22 includes program and data storage which is accessible to processor 20 over bus 26.
Referring to FIG. 2, in one contemporary multitasking data processing system, an operating system 32 schedules input/output tasks or operations in a first-come first-served manner. In such systems, if blocks A, B and C are written to disk, they complete in that specific order. A user application 30, which may or may not be running on the same system (for example, the file system could be a distributed file system), makes file access or update requests to the file system 40. The file system then determines the disk I/O needed to satisfy the file requests made by user application 30 and issues I/O requests to the I/O driver 42. The I/O driver converts the requests to the format required by the Media Manager 44 and sends the requests on to Media Manager 44 for processing. The Media Manager 44 performs no optimization or I/O ordering for efficiency; it handles requests first-come first-served. The Media Manager 44 creates the channel program 52 and sends the channel request to the device driver 46, which sends the request to the device 24 via the channel 28. The I/O driver 42 represents each I/O request by a control block 54 that can later be waited on by units of work (hereafter referred to as tasks) that are processing the file requests from the application. All I/O performed in the system is asynchronous; that is, I/O requests are made to the I/O driver 42 and must later be waited on to ensure completion, via a separate (WAIT) request to the I/O driver 42 that specifies the corresponding I/O control block 54.
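The submit-then-wait behavior described above can be sketched as a toy model. This is only an illustration under stated assumptions: the class and method names (`IOControlBlock`, `FcfsIODriver`, `submit`, `wait`) are hypothetical and do not come from the system described, and the device is simulated by simply marking queued requests complete in arrival order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class IOControlBlock:
    """Hypothetical stand-in for control block 54: tracks one I/O request."""
    block: str
    done: bool = False


class FcfsIODriver:
    """Toy driver: requests are queued and serviced strictly in arrival order."""

    def __init__(self):
        self.pending = deque()   # requests not yet serviced, oldest first
        self.completed = []      # order in which blocks reached the device

    def submit(self, block):
        # Asynchronous submit: enqueue the request and return immediately.
        cb = IOControlBlock(block)
        self.pending.append(cb)
        return cb

    def wait(self, cb):
        # WAIT request: service queued I/O in order until cb is complete.
        while not cb.done:
            head = self.pending.popleft()
            head.done = True
            self.completed.append(head.block)


driver = FcfsIODriver()
cb_a, cb_b, cb_c = (driver.submit(name) for name in "ABC")
driver.wait(cb_c)          # waiting on C forces A and B to complete first
print(driver.completed)    # ['A', 'B', 'C']
```

In this model a task that waits on block C cannot be released until A and B have also completed, which is the first-come first-served property the paragraph describes.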
In this system, which is a logging file system, in order to reduce I/O rates in general, a group of blocks of data is read or written with respect to I/O device 24 at a time, rather than one physical disk block at a time. Metadata updates are logged to a logfile on disk, and metadata is only written to disk at periodic sync intervals and when the logfile is getting full, so that committed transactions can be freed and space made available in the logfile. Referring to FIG. 3, file system layer 40 uses five buffers in memory 22 to contain the most recently written logfile pages 65-69. When all five buffers 65-69 are I/O pending, file system 40 waits for the first logfile page 65 to complete so it can re-use the memory 22 buffer for a different logfile page. When writing large files, the disk I/O access pattern illustrated in FIG. 3 occurs repeatedly. That is, when application layer 30 writes large (over 1M) files to file system layer 40, file system layer 40 buffers and writes user data in 64K increments 61-64, etc. Thus, a 1M file entails 16 I/O requests 50 for user data, and would repeat the access pattern 61-65 four times. As logfile pages 65-68 fill, they have to be written out to disk 24. Thus, writing a large file usually means writing some user data 61-64, and then writing a logfile page 65. Since, in this specific system, there are five buffers 65-69 for logfile information, the first five instances of pattern 61-65 complete very quickly. The first 1M file (which is four instances of pattern 61-65, involving logfile pages 65-68) completes with no I/O waits and achieves a rate of over 10M per second. The next 1M file, however, runs into the problem that the five buffers for the logfile I/O are pending I/O completion, and must wait for at least one of these I/Os to complete before proceeding.
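The figures in this example can be checked with a short calculation. The sketch below uses only the sizes stated above (1M file, 64K data writes, four data writes per logfile page, five logfile buffers); the variable names are illustrative, and the last line assumes, as a simplification, that no logfile write has completed by the time the first file is finished.

```python
KB = 1024

FILE_SIZE = 1024 * KB            # one 1M file
DATA_WRITE = 64 * KB             # user data is written in 64K increments
DATA_WRITES_PER_LOG_PAGE = 4     # blocks 61-64 are followed by logfile page 65
LOG_BUFFERS = 5                  # in-memory logfile buffers 65-69

data_requests = FILE_SIZE // DATA_WRITE
pattern_repeats = data_requests // DATA_WRITES_PER_LOG_PAGE
kb_per_pattern = DATA_WRITES_PER_LOG_PAGE * DATA_WRITE // KB

# Simplifying assumption: every logfile write of the first file is still pending.
free_log_buffers_after_first_file = LOG_BUFFERS - pattern_repeats

print(data_requests)                      # 16
print(pattern_repeats)                    # 4
print(kb_per_pattern)                     # 256
print(free_log_buffers_after_first_file)  # 1
```

Under that assumption only one logfile buffer remains free after the first file, so the second file exhausts the buffers after its first 256K and must then wait on pending logfile I/O.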
Thus, extra I/O wait time is incurred because the I/O requests for four data blocks 61-64, and one logfile block 65, must complete when writing the second file, and this repeats every 256K that is written to the second file. Note that when writing the second file, the user task is unnecessarily made to wait for the I/O of the first file to complete. This pattern repeats for subsequent files being written. This results in performance equivalent to synchronous file write performance instead of the asynchronous file performance the first file received. Any file after the first will get this synchronous write performance until it comes time to free up space in the logfile 60 by writing metadata, or until the sync daemon (which writes metadata) runs. By default, the sync daemon runs every 30 seconds. When this happens, metadata writes are added to the I/O queue in 256K increments and file system 40 waits for this I/O to complete before it can proceed. When log 60 is full, or a sync operation occurs, then because I/O is first-come first-served, the file system must wait for all preceding I/Os that have not completed, and performance for the file being written drops to less than 300 KB/sec.
The performance of I/O writes is therefore very erratic, depending on where the system is in the cycle when application 30 submits file write requests. Performance may be very fast because log buffers are available; at synchronous rate because a logfile buffer must become available before the next logfile page can be written; very slow because metadata is being written out and all I/O needs to be complete before continuing; or very fast because as soon as the metadata is written, the system has all logfile buffers available to it.
In some contemporary multitasking data processing systems, operating systems execute various approaches for scheduling input/output requests or operations in sequences based upon task priorities.
For example, in one such system, U.S. Pat. No. 5,220,653 by F. Miro for scheduling input/output operations in multitasking systems, of common assignee, the teachings of which are incorporated herein by this reference, an improvement on elevator I/O scheduling is described. Elevator I/O scheduling is scheduling based on which part of the disk the I/O will read or write. The Miro priority scheme allows a priority to be associated with the I/O request when it is scheduled, and that priority is used to order the I/O in importance. Miro also provides algorithms for dealing with starvation. Thus, Miro provides for sorting I/Os based on priority in such a manner as to prevent starvation. However, Miro does not provide a method or system for allowing an application to adjust I/O priority after the application has submitted the I/O request. Because an application does not necessarily know at the time of submission the priority required by its I/O request, there is a need in the art for a method and system which allows the application to adjust the priority of, or even cancel, the I/O request after its submission.
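The two ordering ideas named above, ordering by disk position and ordering by a priority fixed at submission, can be contrasted in a minimal sketch. This is not the algorithm of the Miro patent (in particular it includes no starvation handling); the `Request` type, the track numbers, and the convention that a lower priority value is more important are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    track: int      # disk position the request will read or write
    priority: int   # fixed at submission; lower value = more important (assumed)


def elevator_order(requests, head):
    """Sweep upward from the head position, then service the rest going down."""
    up = sorted((r for r in requests if r.track >= head), key=lambda r: r.track)
    down = sorted((r for r in requests if r.track < head),
                  key=lambda r: r.track, reverse=True)
    return up + down


def priority_order(requests):
    """Order by submission-time priority; ties keep arrival order (stable sort)."""
    return sorted(requests, key=lambda r: r.priority)


queue = [Request("A", 90, 2), Request("B", 10, 1),
         Request("C", 55, 2), Request("D", 40, 1)]

print([r.name for r in elevator_order(queue, head=50)])  # ['C', 'A', 'D', 'B']
print([r.name for r in priority_order(queue)])           # ['B', 'D', 'A', 'C']
```

In both functions the order is fully determined by information available when the request is queued; neither gives the submitter a way to change its mind afterward, which is the gap the paragraph identifies.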
Generally, contemporary systems require that a user submit priority indicia concurrently with an I/O request, after which the user is at the mercy of the I/O code which handles the request to do any performance optimization, such as writing out next the data that is closest to the current device head, or writing out the smallest file next, and so forth. In such systems, the user application has no control or influence over its I/O requests after the I/O request is submitted.
The above problem, and problems like it, cannot be solved by the prior art, whether by the elevator disk scheduling technique, by priority sorting based on tasks, or by both, since they do not guarantee that the scheduler waits first only on the I/O that is important. The above example assumes only large files are written, but in a file system what the future holds is not known: it is not known whether a user application will write more data to the same file, write another file, remove a file, and so on. Having no knowledge of the future, the scheduler cannot predict whether a log, metadata or user data page will need to be completed first. Thus priority and elevator solutions will not accomplish what is needed. Furthermore, a scheduling solution is needed which is general in the sense that it can be applied to any program that needs to manage its I/O in an order not related to initially assigned priority or disk head position.
It is an object of the invention to provide an improved system and method for I/O request scheduling.
It is a further object of the invention to provide an I/O system that minimizes the amount of time tasks need to wait for I/O, which for logging file systems will provide full asynchronous write performance, which in turn should yield throughput close to memory transfer speed rather than physical device speed.
It is a further object of the invention to provide a system and method which enables an application to adjust the priority of an I/O request after that request has been issued but before the I/O has been issued to the physical channel connected to the device.
It is a further object of the invention to provide a system and method which enables an application to cancel an I/O request after that request has been issued.
It is a further object of the invention to enable a user application to adjust I/O priority when information becomes available identifying which I/O requests are most important.
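The kind of capability these objects call for can be illustrated with a toy queue in which a request may be promoted or cancelled after submission, provided it has not yet been issued to the device. This is a sketch under stated assumptions only: the class `AdjustableIOQueue` and its methods are hypothetical names introduced here, and nothing in it should be read as the mechanism of the invention, which this section does not describe.

```python
class AdjustableIOQueue:
    """Toy queue: pending requests may be promoted or cancelled before dispatch."""

    def __init__(self):
        self._pending = []    # request names; index 0 is dispatched next
        self.dispatched = []  # requests already issued to the (simulated) device

    def submit(self, name):
        self._pending.append(name)

    def position(self, name):
        # Number of pending requests that would be issued ahead of this one.
        return self._pending.index(name)

    def promote(self, name):
        # Move a still-pending request to the front; False if already issued.
        if name not in self._pending:
            return False
        self._pending.remove(name)
        self._pending.insert(0, name)
        return True

    def cancel(self, name):
        # Drop a still-pending request; False if already issued.
        if name not in self._pending:
            return False
        self._pending.remove(name)
        return True

    def dispatch_next(self):
        if not self._pending:
            return None
        name = self._pending.pop(0)
        self.dispatched.append(name)
        return name


q = AdjustableIOQueue()
for name in ["data1", "data2", "data3", "data4", "log1", "meta1"]:
    q.submit(name)

print(q.position("log1"))   # 4: four data writes would be issued first
q.promote("log1")           # the task now knows the logfile page matters most
q.cancel("data4")           # a request that is no longer needed is withdrawn
while q.dispatch_next() is not None:
    pass
print(q.dispatched)         # ['log1', 'data1', 'data2', 'data3', 'meta1']
print(q.promote("data1"))   # False: already issued, too late to adjust
```

The point of the sketch is the timing of the decision: the ordering is changed when the submitter learns which request it must wait on, not when the request is first queued.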