This invention relates to hierarchical, demand/response, disk storage subsystems, and more particularly to a method and means for reducing contention among one or more direct access storage devices (DASDs) in the presence of concurrent accessing of data formatted according to one addressing convention, but formatted and stored across one or more DASDs according to a second convention.
In this specification, the acronym DASD signifies a cyclic, multitrack, direct access storage device of large disk diameter and low recording density along any track. Also, HDD is the acronym for high-density disk drives having a relatively small disk diameter with high recording density along any track and a high radial number of tracks. Lastly, the terms “subsystem”, “storage control unit”, and the IBM 3990 SCU are used interchangeably.
Data Storage Models and Format Conversion at DASD Level
One early storage model of data was denominated CKD. CKD is an acronym for count, key, and data. This is a variable-length record formatting convention used by IBM for DASDs. This convention required a count field defining the length in bytes of the data recorded in a variable-length data field and a key field available for use as a record identifier. In practice, the count field is frequently also used to provide record identification. Each of the fields as recorded was spaced apart by a gap along the DASD track. The gap was designed as a pause interval on the continuously rotating DASD, permitting the system to adjust itself to process the next field. The gaps were occasionally dissimilar in length and also served as a place for inserting metadata. That is, the gap between the C and K fields differed from the gap between the K and D fields.
Each CKD-formatted record consisted of at least the fixed-length count field and a variable-length data field. The use of the key field was optional and relegated primarily to sort intensive applications. The records were stored or mapped onto a cylinder (track), head (disk), (sector) addressable group of synchronous and constant speed rotating magnetic disks.
Major operating systems such as the IBM MVS, access methods such as VSAM, and significant amounts of application programming became heavily invested with the CKD data model and the simple cylindrical, physical storage addressing of large diameter disk drives. While some records would be less than a track extent, theoretically other CKD records could span several tracks. However, the advent of virtual memory, demand paging, and page replacement operations between mainframe CPUs, such as the IBM S/370 with MVS OS, and large disk-based storage subsystems, such as the IBM 3390, tended to conform CKD records to approximate a 4-kilobyte page. Relatedly, the typical 3390 recording track could accommodate up to twelve pages or 48 Kbytes+5 Kbytes worth of gaps between the fields within a record and between records.
With the passage of time, the recording densities of disk drives substantially improved and it was economically desirable to map data recorded in one format (CKD) onto a disk drive programmed to record data in another format (fixed-block architecture or FBA). Under FBA, a string of extrinsically formatted information is blocked into a succession of equal-length blocks. One way of ensuring recording synchronism between the formats is to have the initial count field of each new CKD record start on an FBA block boundary. In such a scheme, the last FBA block of each record is padded out to its block boundary.
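By way of illustration only, the boundary alignment described above can be sketched as follows. The 512-byte block size and the function names are assumptions made for the example and are not taken from any referenced patent; each record image is treated as an opaque byte string.

```python
BLOCK_SIZE = 512  # assumed FBA block size in bytes; the text does not fix a value


def blocks_needed(record_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of whole FBA blocks needed to hold one CKD record image."""
    return -(-record_len // block_size)  # ceiling division


def pad_to_block(record: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Pad a record image with zero bytes out to the next block boundary."""
    padded_len = blocks_needed(len(record), block_size) * block_size
    return record + b"\x00" * (padded_len - len(record))


def layout(records: list[bytes], block_size: int = BLOCK_SIZE) -> list[tuple[int, int]]:
    """Return (starting block, block count) for records laid out back to back.

    Because every record is padded, each record begins on a block boundary.
    """
    placements = []
    next_block = 0
    for rec in records:
        count = blocks_needed(len(rec), block_size)
        placements.append((next_block, count))
        next_block += count
    return placements
```

In this sketch a 700-byte record occupies two blocks, the second of which is padded, so that the following record starts at block 2.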
Reference should be made to Menon, U.S. Pat. No. 5,301,304, “Emulating Records in One Record Format in Another Record Format”, issued Apr. 5, 1994. Menon exemplifies the state of the art in format conversion disclosing an emulation method for rapidly accessing CKD records in which the CKD records are stored on a disk drive in FBA format.
Menon maps CKD to FBA blocks by embedding one or two indicators in the mapped information. The term “mapped information” is consonant with the FBA image of the CKD track. In this regard, an “indicator” is coded information of location displacement or a data attribute with respect to a CKD record being accessed on an FBA-formatted device. The indicators permit a general orientation and then a precise location of the head with reference to a record of interest on a given FBA DASD track measured from the index or other benchmark. Thus, when CKD records were written out to the FBA-formatted device, the indicators were placed in the stream. Consequently, when the records had to be accessed and staged for both reading and write updating, the access time or disk latency is perceptibly shortened using the indicators.
Overview of Hierarchical Demand/Response DASD Storage Subsystems
In the period spanning 1970 through 1985, IBM developed large-scale multiprogramming, multitasking computers, S/360 and S/370 running under the MVS operating system. A description of the architecture and that of the attached storage subsystem may be found in Luiz et al., U.S. Pat. No. 4,207,609, “Method and Means for Path Independent Device Reservation and Reconnection in a Multi-CPU and Shared Device Access System”, issued Jun. 10, 1980. Such systems were of the hierarchical and demand/responsive type. That is, an application running on the CPU would initiate read and write calls to the operating system. These calls, in turn, were passed to an input/output processor or its virtual equivalent (called a channel) within the CPU. The read or write requests and related accessing information would be passed to an external storage subsystem. The subsystem would responsively give only status (availability, completion, and fault) and pass the requested data to or from the CPU.
The architecture of hierarchical demand/response storage subsystems, such as the IBM 3990/3390 Model 6 and the EMC Symmetrix 5500, is organized around a large cache with a DASD-based backing store. This means that read requests are satisfied from the cache. Where the data or records are not in the subsystem cache, the data satisfying those requests are staged up from the DASDs to the subsystem cache. Write updates result in data being sent from the CPU to the cache or to a separate nonvolatile store (NVS), or both. This is the case with the IBM 3990 Model 6. The cache-stored data is then destaged or written out to the DASDs on a batched basis asynchronous to processing the write requests. Records stored in NVS are destaged only if the modified tracks are not available in cache. In these subsystems, the term “demand/response” connotes that a new request will not be accepted from a higher echelon until the last request is satisfied by a lower echelon, and a positive indication is made by the lower to the higher echelon.
In order to minimize reprogramming costs, applications executing on a CPU (S/390) and the attendant operating system (MVS) would communicate with invariant external storage architecture even though some components may change. Relatedly, the invariant view of storage associated with an MVS operating system required that data be variable-length formatted (CKD) and stored in that CKD format on an external subsystem of attached disk drives (IBM 3390) at addresses identified by their disk drive cylinder, head, and sector location (CCHHSS). Significantly, requested CKD-formatted data is staged and destaged between the CPU and the storage subsystem as so many IBM 3390 disk drive tracks worth of information. One address modification is to use CCHHR, where R is the record number and CC and HH refer to the cylinder and head numbers, respectively.
It is well appreciated that an improved disk storage facility can be attached to a subsystem if the new facility is emulation compatible with the unit it has replaced. Thus, a RAID 5 storage array of small disk drives can be substituted for a large disk drive provided there is electrical and logical interface compatibility. Illustratively, the IBM 3990 Model 6 storage control unit can attach an IBM 9394 RAID 5 array DASD and interact with it as if it were several IBM 3390 large disk drives. Data is staged and destaged to and from the RAID 5 array formatted as CKD-formatted 3390 disk drive tracks. The RAID 5 array in turn will reformat the tracks as one or more fixed-block formatted strings and write them out to disk.
Fast Write and Quick Write
Another significant change was to separately tune the read and write paths to the subsystem-stored data to the patterns of sequential or random accessing. To this extent, the advent of inexpensive semiconductor RAM memory also encouraged the use of RAM for large subsystem buffers/caches. Also, the LRU cache discipline permitted using the caches for tuning of random read referencing. Furthermore, any loss or corruption of data in the subsystem cache could be resolved by merely restaging the CKD tracks containing the data from DASD devices.
The write path required operating the cache in a write-through manner and achieved reliability at the expense of data rate and concurrency. That is, a write operation was not deemed completed unless and until the track had been written out to the DASD backing store or device. In this regard, reference should be made to Beardsley et al., U.S. Pat. No. 4,916,605, “Fast Write Operations”, issued Apr. 10, 1990. Beardsley disclosed the use of a subsystem level nonvolatile store (NVS) for buffering the results of the write update processing, thereby permitting the subsystem to signal write completion to the host and to asynchronously schedule any destaging of the updated CKD records to the DASDs.
It has been recognized that each write update operation involves (a) reading one or more records from DASD into the subsystem buffer/cache, (b) logically combining or replacing some portion of the record with the update received from the host, and (c) writing one or more modified records out to the DASD as a track overwrite. Most schemes presuppose an update in place. That is, the modified record replaces the original at the same DASD location.
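The three steps (a) through (c) can be pictured with the following minimal sketch. The dictionary standing in for the DASD, the function name, and the byte counting are all hypothetical devices introduced for illustration; the sketch models only the full-track update in place, not any NVS buffering.

```python
def update_in_place(dasd: dict[int, list[bytes]], track_no: int,
                    record_no: int, update: bytes) -> int:
    """Apply one record update as a full-track read-modify-write.

    `dasd` is a hypothetical stand-in mapping a track number to its records.
    Returns the total bytes moved (staged plus destaged) for the one update.
    """
    # (a) stage the whole track from DASD into the subsystem cache
    cached = list(dasd[track_no])
    staged = sum(len(r) for r in cached)

    # (b) replace the addressed record with the update received from the host
    cached[record_no] = update

    # (c) write the modified track back to the same DASD location
    dasd[track_no] = cached
    destaged = sum(len(r) for r in cached)
    return staged + destaged
```

With a track of twelve 4-Kbyte records, updating a single record in this sketch moves the data of all twelve records in each direction.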
There are several problems. First, in the case of CKD-formatted records, the CKD track is the unit of staging and destaging. As previously mentioned, a CKD track nominally contains up to 12 CKD-formatted 4 Kbyte records for a length including gaps of 54 Kbytes. Such a unit of staging is arbitrary, especially where high-density FBA-formatted DASD tracks can hold several CKD-formatted tracks. Second, there are many instances where only one or a few records on the same or different CKD tracks are to be updated during a write operation. Notwithstanding, the entire track containing the record is staged. This occupies significant subsystem processing resource and time.
Reference is now made to Benhase et al., U.S. Pat. No. 5,535,372, “Method and Apparatus for Efficient Updating of CKD Data Stored on Fixed Block Architecture Devices”, issued Jul. 9, 1996. Benhase modified Beardsley's “fast write” and focused upon efficiency in the use of subsystem cache and NVS resources. That is, Benhase substituted descriptors of certain types of tracks in cache as a type index rather than keeping the tracks themselves subsystem cache resident. When the host required an update write, the subsystem determined whether the requested record was of the preferred type. If so, it signaled the host that the update had been completed. It then computed a partial track containing the record or records and staged them from DASD to the subsystem cache. Otherwise, the whole track would be staged.
The descriptors in Benhase covered predefined-type tracks and those tracks which were “well behaved”. Parenthetically, a “well behaved” CKD track was one containing equal-length CKD records and one in which the record IDs were monotonically numbered and nondiminishing. After the track or partial track was staged to subsystem cache, it was overlaid with the changed record or records. It was then placed in the NVS for asynchronous writing out to the DASD in place. As Benhase points out, cache space is saved, fast write operations are extended to tracks not physically in cache, and records can be located without having to stage the entire track to subsystem cache.
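The “well behaved” property stated above lends itself to a simple test, sketched here for illustration. The representation of a track as (record ID, data length) pairs and the treatment of an empty track as well behaved are assumptions of the example, not details drawn from Benhase.

```python
def is_well_behaved(records: list[tuple[int, int]]) -> bool:
    """Check the two stated conditions on a CKD track.

    `records` holds (record_id, data_length) pairs in track order.
    An empty track is treated as well behaved (an assumption of this sketch).
    """
    if not records:
        return True
    first_len = records[0][1]
    # Condition 1: every record on the track has the same data length.
    equal_length = all(length == first_len for _, length in records)
    # Condition 2: record IDs never decrease along the track.
    ids = [rid for rid, _ in records]
    nondiminishing = all(a <= b for a, b in zip(ids, ids[1:]))
    return equal_length and nondiminishing
```

A track of records numbered 1, 2, 3 with 4096 bytes each passes the test, whereas a track with one shorter record or with an out-of-order ID does not.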
Fixed-block Formatted RAID 5 DASD Array as a Fault-tolerant CKD DASD
Reference is made to Clark et al., U.S. Pat. No. 4,761,785, “Parity Spreading to Enhance Storage Access”, issued Aug. 2, 1988. Clark disclosed an array of N+1 disk drives accessed by way of a CPU acting as a subsystem storage control unit including cache and buffering. Data in the form of N+1 blocks per logical track was mapped onto the N+1 DASDs. Each logical track consisted of N fixed-length data blocks and their parity image block. The data were written to counterpart ones of the DASDs such that no single DASD contained two blocks from the same logical track, and no single DASD contained all the parity blocks. Indeed, Clark actually spread the parity images in round-robin fashion among the N+1 DASDs.
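A minimal sketch of parity spreading in the manner just described is given below. The particular rotation rule (parity drive equal to the logical track number modulo N+1), the use of bytewise XOR as the parity image, and the function names are assumptions chosen for illustration; they are one of many possible arrangements.

```python
from functools import reduce


def parity_block(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of equal-length blocks (the parity image)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def place_track(track_no: int, n_data: int) -> dict[int, str]:
    """Map one logical track onto N+1 drives, one block per drive.

    The parity block rotates round-robin: drive (track_no mod N+1) is
    assumed to hold parity, and the data blocks fill the other drives.
    """
    n_drives = n_data + 1
    parity_drive = track_no % n_drives
    placement = {}
    data_index = 0
    for drive in range(n_drives):
        if drive == parity_drive:
            placement[drive] = "P"
        else:
            placement[drive] = f"D{data_index}"
            data_index += 1
    return placement
```

Because the parity image is an XOR, any one missing block of a logical track can be regenerated by applying `parity_block` to the surviving blocks of that track.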
Of course, there are many ways to paint the devices with logical blocks. Suppose it was desired to write out a CKD cylinder of tracks consisting of some predetermined number of CKD tracks' worth of records upon an IBM 3390 DASD. Further, suppose that the 3390 DASD was being emulated by a RAID 5 array formed from four high-density disk drives (HDDs). If the tracks were written out in the manner of the Clark patent, then a CKD cylinder could be mapped to the RAID 5 array of HDDs as follows:
Contemporary RAID 5 arrays include a predetermined number of spares in the event that one of the active DASDs fails. When the subsystem control unit passes a staging request, it is in the form of so many CKD tracks and there is device contention in accessing blocks and staging them to the RAID 5 cache/buffer. When a device fails, the same information in whole or in part must be recreated from fewer devices to satisfy read and update write requests as well as to write a copy of the pertinent data to the spare HDD on either a scheduled or opportunistic basis. This exacerbates device contention where, as here, the tracks of several CKD volumes are written across several HDDs.
It is an object of the invention to devise a storage subsystem method and means for staging and destaging partial tracks of variable-length formatted (CKD) records from and to devices storing the records according to a fixed-block (FBA) convention, the staging being to a subsystem cache or buffer in satisfaction of read and write update requests.
It is another object to devise a storage subsystem method and means to stage only a partial CKD track spanning CKD-requested records without staging the remainder of the CKD track where the operating mode (full track or record) is determinable from the stream of access requests.
It is a related object to devise such a method or means where a cyclic, multitracked storage device or devices comprise one or more RAID 5 arrays of high-density disk drives storing information according to an FBA convention, but emulating one or more CKD-formatted disk drives or DASDs.
It is yet another related object that such method or means be operable even where a RAID 5 array of HDDs emulating a CKD DASD is operating in a fault-degraded mode.
It was unexpectedly observed that if the messages and commands between a storage subsystem and a RAID array emulating a CKD-formatted device for both read and write operations were evaluated to ascertain whether the record addressing was random and truly in record mode, then partial track staging by the array control from the fixed-block formatted HDDs to a subsystem cache or the like would reduce device contention by reading and staging less than a full track.
More particularly, the foregoing objects are satisfied by a method and means for reducing device contention in an array of fixed-block formatted disk drives (HDDs) coupling a storage subsystem. The subsystem includes a cache and logic responsive to external commands for writing onto or reading variable-length objects to or from the HDDs. The objects are expressed as cylindrically addressable, sector-organized tracks of variable-length formatted (CKD) records. The logic also forms parity images of predetermined ones of said CKD tracks and writes both the record and image tracks on the HDDs in round-robin order until the cylinder of addresses is exhausted.
Significantly, the method and means of the invention comprise the steps of ascertaining whether any tracks and parameters specified in any of the external access commands are indicative of a full CKD track operation, of an access spanning more than a single CKD track, or of a sequential operation. Next, each external command is interpreted as to whether or not it is so indicative. That is, if the command is neither a full track operation, treats records spanning more than a CKD track, nor forms part of a sequential referencing process, then the CKD sector address range of the command is converted into a fixed-block address range defining a partial CKD track inclusive of the first data byte of the starting CKD sector and the last data byte of the last CKD sector. However, if the command is one of the aforementioned types, then the CKD sector address range of the command is converted into a fixed-block address range defining full CKD tracks. Lastly, the subsystem accesses the fixed blocks in the converted range from the counterpart HDDs in the array and stages the accessed blocks as either a partial or full CKD track or tracks to the subsystem.
It has frequently been the practice to first condition a subsystem and a storage device by sending preliminary or conditioning commands in which the address range of subsequent commands is set out. Illustratively, in accessing CKD-formatted records, an IBM CPU running under MVS will send a Define Extent and Locate Record CCW to the IBM 3990/3390 storage subsystem. In turn, the 3990 storage control unit will send a Set Domain message to a RAMAC array emulating one or more IBM 3390 DASDs. If the Set Domain message sent by the 3990 to the RAMAC drawer logic recites a starting and ending CKD sector and track addresses, and if the parameters in that message also show the absence of all of the following:
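The interpretation step can be sketched as a simple predicate. The `AccessCommand` type and its three boolean fields are hypothetical names introduced for the example; how each attribute is actually derived from a command's parameters is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessCommand:
    """Hypothetical summary of the attributes ascertained from one command."""
    full_track: bool    # full CKD track operation
    spans_tracks: bool  # access spans more than a single CKD track
    sequential: bool    # part of a sequential referencing process


def staging_mode(cmd: AccessCommand) -> str:
    """Return "full" if any indicative attribute is present, else "partial"."""
    if cmd.full_track or cmd.spans_tracks or cmd.sequential:
        return "full"
    return "partial"
```

Only a command showing none of the three attributes is eligible for partial track staging; any one of them suffices to select full track staging.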
(1) format intent (full CKD track write operation),
(2) the access request spans more than one CKD track, and
(3) sequential operation,
then the RAMAC will convert the range, including starting and ending CKD sector addresses, into a range of FBA block addresses where the first fixed block contains at least the first data byte of the starting CKD sector and the last fixed block contains at least the last data byte of the last CKD sector and stage the partial track. Otherwise, the RAMAC stages full CKD tracks.
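The range conversion can be illustrated under a deliberately simplified geometry. The sketch assumes, purely for illustration, that each CKD sector corresponds to a fixed-size byte slice of the FBA image of its track; the constants below (sectors per track, bytes per sector, block size) and the function names are hypothetical values and are not the parameters of any actual device.

```python
SECTORS_PER_TRACK = 224  # hypothetical
SECTOR_BYTES = 256       # hypothetical bytes of track image per CKD sector
BLOCK_BYTES = 512        # hypothetical FBA block size
BLOCKS_PER_TRACK = SECTORS_PER_TRACK * SECTOR_BYTES // BLOCK_BYTES  # 112


def partial_range(track_no: int, start_sector: int, end_sector: int) -> tuple[int, int]:
    """Inclusive FBA block range covering the given CKD sector range.

    The first block holds the first byte of the starting sector and the
    last block holds the last byte of the ending sector.
    """
    if not 0 <= start_sector <= end_sector < SECTORS_PER_TRACK:
        raise ValueError("sector range outside the track")
    base = track_no * BLOCKS_PER_TRACK
    first_byte = start_sector * SECTOR_BYTES
    last_byte = (end_sector + 1) * SECTOR_BYTES - 1
    return base + first_byte // BLOCK_BYTES, base + last_byte // BLOCK_BYTES


def full_range(track_no: int) -> tuple[int, int]:
    """Inclusive FBA block range covering one whole CKD track image."""
    base = track_no * BLOCKS_PER_TRACK
    return base, base + BLOCKS_PER_TRACK - 1
```

Under these assumed values, a request for sectors 3 through 4 of track 0 resolves to two fixed blocks, whereas full track staging of the same track would read 112 blocks.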