In general, the operation of a data processing system involves two basic functions: arithmetic and logic operations performed on data contained within a host computer, and, stated broadly, input and output control. That is, the data to be processed must first be input to the computer, while the computer output typically results in further storage of the data together with results of the operation. While it is possible to perform such processing in a way that the data input is processed and the output is generated without long-term storage of either the data or of the results of the operation, it is far more common, particularly in large processing systems, that the data files operated upon are permanent files such as payroll files, employee files, customer lists and the like, which are updated periodically as well as being used to generate actual computer output such as payroll checks and the like. It is therefore important that means be provided for long-term storage of such data.
Various types of memory means for long-term storage of digital data have been provided in the prior art. These comprise magnetic tape memory, magnetic disk memory, magnetic drum memory, solid state random access memory, magnetic bubble memory, and charge coupled device (CCD) memory, as well as others. The choice of which sort of memory is to be used for a particular operation inevitably involves a cost/speed trade-off; that is, the faster the access time provided with respect to any given bit stored on a given type of memory, the more expensive it is to store the bit. There has developed in general a hierarchy of memory according to which the central processing unit (CPU) of the computer comprises solid state random access memory (RAM); an intermediate high speed "cache" or "virtual" memory used in conjunction with the host may comprise a less expensive, lower speed form of solid state RAM or CCD memory. The next step in the hierarchy may be a fixed head rotating magnetic disk memory external to the CPU of relatively lesser speed, but capable of storing a vastly greater quantity of data at significantly lower cost; further down the hierarchy are moving head, non-replaceable magnetic disks of high density, user-replaceable disks of lower density and finally tape drives.
The prior art has been extensively concerned with improvement in methods of utilizing the various forms of memory so as to achieve higher efficiency of use of the various types of media available, to reduce costs, and simultaneously to devise methods whereby the various time limitations of the less expensive memory means can be overcome, thus also improving efficiency. However, as yet no ultimately satisfactory solution has appeared.
It will be understood by those skilled in the art that control of access to a given tape or disk file has generally been accomplished by means of a command originating in the host central processing unit. In more basic systems, the user of the computer must inform the host of the addresses on storage media at which the data necessary to complete his job is stored. Upon initiation of the job, the host then passes the appropriate instructions on to the appropriate disk or tape controllers. In more advanced systems the user of the computer may need only specify the name of his data set; the host is capable of locating the file in which the data set is stored and, for example, instructing an operator to mount a particular reel of tape, or instructing a disk controller to access a given portion of a disk drive, as necessary. Some prior art systems provide memory and intelligence external to the host for relieving it of this command translation function--see, e.g., Millard et al U.S. Pat. No. 4,096,567. However, in both schemes, it is the host which is responsible for causing the controller to access the appropriate storage medium. The controller itself is passive and merely responds to the host's commands.
The present invention is an improvement on this practice which achieves better, more efficient use of the storage space available on disk media by functionally mimicking a tape drive. Inasmuch as a tape drive need only be addressed once at the beginning of each file and thereafter records can be written sequentially thereto without being interspersed with uniquely identifiable address marks (unlike disk-stored data), the amount of data stored within a given area of tape expressed as a percentage of the total area available is extremely high. By comparison, address marks must be provided for each record stored in a given sector of each disk of a magnetic disk storage unit; the address marks consume a large proportion of the space allotted. Moreover, it had been the prior practice to allot a particular portion or "file" of a given disk unit to a given data set and not to use this area of that disk for any other data set thereafter. Space is wasted unless by coincidence the capacity of the file is, in fact, equal to the size of the data set allotted to it, which is usually not the case, as users tend to expect data sets to grow and therefore set up unnecessarily large files. The net result is that on average disks tend to be used to something less than 50% of their capacity.
In accordance with the first embodiment of the invention, a virtual storage system is interposed between a host CPU and disk drives. The virtual storage system comprises an intelligent processor which can itself make decisions as to where on the associated disk drives data is to be stored. The virtual storage system responds to commands nominally issued to tape drives by the host and converts these to commands useful for control of disk drives. In this way, the virtual storage system allows disks to functionally mimic tape in order to achieve the addressing and formatting efficiencies mentioned above. The original invention thus includes the concept of a memory system external to a host CPU having intelligence for determining where on associated disk drives portions of a given data set are to be stored (as distinguished from merely converting a data set name to host-assigned addresses, as in the Millard et al patent referred to above), and further comprises the concept of individually allotting space on disks to individual subportions of a given user data set. In this way the prior art practice of allotting a portion of a disk, a "file", to a single data set is eliminated, and the disk can accordingly be used to far higher efficiency.
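The allotment of disk space to individual subportions of a data set, as opposed to reserving a fixed "file" in advance, can be illustrated by the following minimal sketch. It is not taken from the specification: the names VirtualStore and Extent, the fixed extent size, and the first-free allocation policy are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Extent:
    """One fixed-size unit of disk space (hypothetical granularity)."""
    drive: int
    index: int


@dataclass
class VirtualStore:
    """Toy model in which the storage system, not the host, picks locations."""
    drives: int
    extents_per_drive: int
    extent_size: int = 4096  # bytes; illustrative value only
    free: list = field(default_factory=list)
    directory: dict = field(default_factory=dict)  # data set name -> [Extent]

    def __post_init__(self):
        # Interleave the free list across drives; any policy would do here.
        self.free = [Extent(d, i)
                     for i in range(self.extents_per_drive)
                     for d in range(self.drives)]

    def write(self, name: str, data: bytes) -> list:
        """Allot only as many extents as this write needs and record them."""
        needed = -(-len(data) // self.extent_size)  # ceiling division
        if needed > len(self.free):
            raise MemoryError("no free extents")
        allotted = [self.free.pop(0) for _ in range(needed)]
        self.directory.setdefault(name, []).extend(allotted)
        return allotted

    def delete(self, name: str) -> None:
        """Return a data set's extents to the common pool for reuse."""
        self.free.extend(self.directory.pop(name, []))
```

In this sketch the host supplies only a data set name and the data; space is drawn from a common pool as each subportion arrives, so no area of disk is held in reserve for growth that may never occur.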
In an improved embodiment, data compression and decompression means are included in the virtual storage system of the invention. Data compression is a concept which had been well known for use in host CPUs but had not previously been performed external to the host. This distinction is subsumed under the fact that this was the first system external to a host to provide intelligent processor means for control of data storage functions.
In accordance with the improved embodiment of the present invention, the virtual storage system is not simply interposed between a host computer and disk drives. Instead, the virtual storage system of the present invention operates in conjunction with modifications to the operating system of the host computer. Typically this will involve some reprogramming of the host. Furthermore, the virtual storage system according to the present invention is not constrained to cause disk drives to functionally mimic tape drives or other defined storage devices, as in the first embodiment of the invention, but instead presents to the host the image of a tabula rasa; i.e., a blank sheet upon which the host, and thus the user, can write without constraint as to the format or disposal of the data. In a preferred embodiment, the virtual storage system of the invention "supports" (i.e., is adapted to store) only sequential data sets, such as those which might conventionally be stored on tape, but the invention is not so limited.
Another source of inefficiency in data storage operations is caused by the way in which systems operate according to IBM-defined protocols. For example, when a given user program causes a user file to be accessed, the host first calls for the tape or disk on which the file is stored to be mounted, if necessary. This operation can consume considerable time. The host then issues a SEEK command, causing the read/write head of a disk drive, for example, to be juxtaposed to a particular area on a disk. Thereafter, the host issues a READ command directed to a portion of the file. Neither the SEEK, the READ, nor the mount commands indicate whether the request is for a portion or all of a file. In the case of a sequential file, for example, there might be literally dozens of READ requests directed to varied portions of the same file. Each time there would be some delay involved in supplying the host with the subportion of the file it sought. On disk, for example, there is a "latency" delay caused by the time required for the particular portion of the disk sought to rotate until it is juxtaposed to the read/write head, as well as in most cases a seek time required for the head to move in or out radially with respect to the disk to reach the particular track sought. In the case of tape, even if the correct record is exposed to the head, the tape drive must still be brought up to its proper speed with respect to the read/write head before a read operation can be performed.
In order to reduce these access time delays, numerous prior art systems have been suggested in which the data either following or in the vicinity of a particular subportion of a file called for by a host is read into a faster access, semi-conductor memory, referred to as a "cache", in anticipation of later host READ requests directed at the same user data set. Such solid-state or semi-conductor memory systems offer substantially instantaneous access to any given bit stored thereon, but only at high storage cost per bit, so that they are unsuited for long-term storage of data infrequently accessed. Several such systems are disclosed in the prior art. The difficulty with these systems is that, as noted, they are not provided by the host with any indication whether a particular READ request is one of a series or not, so that "caching" of data in the vicinity of all data sought can produce a performance loss if, in fact, a significant percentage of the requests are not for portions of sequential data sets. Commonly assigned co-pending application Ser. No. 325,346 filed Nov. 27, 1981 in the name of P. David Dodd, relates to such a caching memory subsystem, but according to that invention means are provided whereby the memory subsystem itself is enabled to distinguish between sequential and random requests, so that only those requests which are determined to be sequential cause caching of other data. This results in a substantial performance advantage. The Dodd system must be able to distinguish between sequential and randomly accessed files because, as noted above, no distinction is provided between sequential and random data sets according to standard IBM protocol. It would, of course, be possible to modify the host to provide such an indication, but for the purposes of the Dodd application, this was deemed undesirable; "plug compatibility" was an extremely important object of the invention.
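The general idea of a memory subsystem deciding for itself whether a READ request is one of a sequential series can be sketched as follows. This is not the method of the Dodd application, the details of which are not given here; it is a simple stand-in heuristic, and the class name SequentialDetector, the run threshold, and the prestage count are assumptions chosen only for illustration.

```python
class SequentialDetector:
    """Stand-in heuristic: treat a run of adjacent block reads as sequential."""

    def __init__(self, run_threshold=2, prestage_count=4):
        self.run_threshold = run_threshold    # adjacent reads needed to qualify
        self.prestage_count = prestage_count  # blocks to stage once qualified
        self.last_block = None
        self.run_length = 0

    def on_read(self, block):
        """Return the block numbers to prestage in response to this READ."""
        if self.last_block is not None and block == self.last_block + 1:
            self.run_length += 1
        else:
            # A jump to a different area is taken to mean a random request.
            self.run_length = 0
        self.last_block = block
        if self.run_length >= self.run_threshold:
            return list(range(block + 1, block + 1 + self.prestage_count))
        return []
```

A heuristic of this kind stages nothing for random requests, which avoids the performance loss described above, but it also resets whenever the next request falls in a different area of the disk.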
However, according to the present invention, limited reprogramming of the host is performed. In a preferred embodiment, therefore, the virtual storage system of the present invention supports only sequential data sets, including, for example, those which might previously have been stored on tape. As in the case of the prior art caching subsystems referred to above, subportions of data sets once having been accessed by a host computer can then be cached in semi-conductor memory to provide much faster access times. The caching system of the Dodd invention, as it has no information describing the data set being accessed, in particular, whether it is sequential or not, must look at each READ request to see if it is likely to be one of a series of requests for sequentially stored data on disk. Hence, it is limited to prestaging portions of a data set stored together, in one area of a disk; any shifting to a different area on disk causes it to conclude that a particular request is not one of a series, and therefore no staging is performed. According to the present invention, only data sets known to be sequential are supported; therefore this question does not arise. Moreover, because the virtual storage system of the invention knows the locations of all the subportions of the entire record, it can continue to prestage subsequent records to the cache even though they come from varying areas on disk. Furthermore, of course, the Dodd invention does not include any means for assigning storage locations on disk to data, but only responds to conventional READ requests from the host, which include this addressing information; the memory subsystem of the present invention assigns storage locations to the various subportions of each data set.
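Prestaging that follows the logical order of a data set rather than its physical adjacency on disk can be sketched as below. The sketch is illustrative only: the class name PrestagingCache, the representation of disk locations as (drive, track) pairs, the dictionary standing in for the disk drives, and the prestage depth are all assumptions and are not drawn from the specification.

```python
class PrestagingCache:
    """Toy model of prestaging driven by a map of where each subportion lives."""

    def __init__(self, segment_map, disk, depth=2):
        self.segment_map = segment_map  # data set name -> ordered disk addresses
        self.disk = disk                # address -> bytes; stands in for drives
        self.depth = depth              # how many subportions to stage ahead
        self.cache = {}

    def read(self, name, n):
        """Return (data, hit) for the n-th subportion of the named data set."""
        addresses = self.segment_map[name]
        address = addresses[n]
        hit = address in self.cache
        data = self.cache.pop(address) if hit else self.disk[address]
        # Stage the next subportions in logical order, wherever they reside.
        for nxt in addresses[n + 1:n + 1 + self.depth]:
            self.cache.setdefault(nxt, self.disk[nxt])
        return data, hit


# Usage: three subportions scattered over different drives and tracks.
disk = {(0, 7): b"A", (2, 1): b"B", (1, 9): b"C"}
cache = PrestagingCache({"PAYROLL": [(0, 7), (2, 1), (1, 9)]}, disk)
first = cache.read("PAYROLL", 0)   # served from disk
second = cache.read("PAYROLL", 1)  # served from cache despite a new location
```

Because the subsystem itself holds the list of locations, a move to a different area of disk does not interrupt staging, in contrast to a scheme that must infer sequentiality from address adjacency.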
Other systems using solid-state memory external to a host computer to reduce access times are known as well. IBM Corporation has recently announced its Models 3880-11 and -13 disk control units, which combine solid-state memory with disk memory to reduce access times. The assignee of the present invention markets a Model 4305 solid state disk, which uses solid-state memory in a system which mimics a disk drive. Neither of these systems provides intelligence external to the host computer for assigning storage locations on magnetic media to subportions of a user data set as does the system of the invention; they merely respond to host commands. Reference to these systems herein should not be deemed to imply that they are prior art against the present application.