Data processing centers are often burdened with having to perform various time-consuming computer processing operations in order to maintain data availability in the event of equipment failures or natural disasters.
For example, dump/restore operations back up large amounts of data stored on direct access storage devices (DASD) to tape devices. Operations to off-load data to tape are typically performed at off-prime shift hours because of the time requirement for the operation.
Data mirroring is another method used for maintaining high data availability, in which updated data is automatically copied to a backup DASD system. Many installations use on-the-fly creation of backup copies for critical databases during the prime shift operations. The backup copies can advantageously be located at a site physically remote from the primary storage device. This process is referred to as Extended Distance Dual Copy. With dual copy, as with the dump/restore operations, the time required to copy the data from the primary storage media to the backup volume can be critical. There is a need to minimize the time required for these operations.
A control unit, also referred to as a storage subsystem or storage unit, includes a controller and is connected to one or more storage devices, such as disk files or tape drives, as well as to a host central processing unit (CPU) system. An example of a disk control unit is the IBM 3990 Series controller. The disk files are also referred to as Head Disk Assemblies (HDAs). The HDAs contain the actual data storage hardware. The controller provides the external computer interface for the subsystem. Each HDA contains one or more platters or disks on which data is recorded. The data is written on the disks in concentric circles, which are called tracks. The host computer writes and reads the user data by issuing commands. The HDA and storage subsystem may be packaged together or separately.
The data on the tracks are organized according to a set of rules which are typically fixed in the design of the disk system. For example, the design of the disk system may require that the data be written in fixed length records or allow for variable length records. A well-known technique for writing and reading variable length records is referred to as Count Key Data (CKD) format. The tracks are grouped to form cylinders.
The CKD format is used on many computers such as the IBM System/390 CPUs and attached external storage devices. The CKD format operates under many operating systems such as the well-known IBM MVS operating environment. A CKD record consists of count, key and data fields. The count field defines the location of the record and the length of the key and data fields. The key field serves as a record identifier, when used. The data field contains the actual data stored in the record. A track can store zero, one or many records of various lengths.
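The count, key and data layout described above can be illustrated with a minimal sketch. The sketch assumes an 8-byte count field holding a cylinder number, a head number, a record number, a key length and a data length; the class name CKDRecord, the function parse_record and the exact byte widths are hypothetical choices made for illustration and are not taken from this description.

```python
import struct
from dataclasses import dataclass

# Assumed count-field layout (illustrative only): cylinder (2 bytes),
# head (2 bytes), record number (1 byte), key length (1 byte),
# data length (2 bytes), all big-endian.
COUNT_FORMAT = ">HHBBH"
COUNT_SIZE = struct.calcsize(COUNT_FORMAT)  # 8 bytes


@dataclass
class CKDRecord:
    cylinder: int
    head: int
    record: int
    key: bytes   # may be empty: the key field is optional
    data: bytes

    def count_field(self) -> bytes:
        """Build the count field: record location plus key and data lengths."""
        return struct.pack(COUNT_FORMAT, self.cylinder, self.head,
                           self.record, len(self.key), len(self.data))

    def to_bytes(self) -> bytes:
        """Serialize the record as count, then key, then data."""
        return self.count_field() + self.key + self.data


def parse_record(buf: bytes, offset: int = 0):
    """Read one variable length record; return it and the next offset."""
    cyl, head, rec, key_len, data_len = struct.unpack_from(
        COUNT_FORMAT, buf, offset)
    key_start = offset + COUNT_SIZE
    data_start = key_start + key_len
    end = data_start + data_len
    record = CKDRecord(cyl, head, rec,
                       buf[key_start:data_start], buf[data_start:end])
    return record, end
```

Because the count field carries both lengths, a reader in this sketch can step through a track holding records of different sizes without any fixed record length.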
In general, data records are transmitted between host and storage subsystems over communication attachment interface architectures such as IBM ESCON, IBM OEMI protocol channels, or Small Computer Systems Interface (SCSI). Peer-to-peer or channel-to-channel communication links well known in the field, such as the channel-to-channel interface on the IBM System/370 or SCSI interface, allow data to be transmitted between storage subsystems without the intervention of the host system.
The communication channels have inherent limitations on the rate at which data can be transmitted. Compressing the data decreases the size of the data stream being transferred, which increases the effective bandwidth of the data transmission between the host and storage subsystems. For a 3:1 compression ratio, the same data can be transmitted in one third of the time over the same communication link. Control units can compress data before transmitting the data to the external storage device in order to transmit data more quickly and to store more data on the device.
In Extended Distance Dual Copy, where data is sent from a primary control unit to a remote control unit for storage on a remote DASD, the data is typically stored on the primary DASD in a compressed form. The data is decompressed by a compressor before it is sent over a data channel to the remote control unit. The controller for the remote DASD has a compressor which again compresses the data before it is stored on the remote DASD.
In the dump/restore function, the data that was stored in a compressed form on the DASD is decompressed by the compressor and sent over the data channel to a remote control unit where it is again compressed by a compressor to be stored on the tape.
In an example where there is a three-to-one compression ratio, a 4.5 kilobyte (KB) record is compressed to 1.5 KB. The difference between sending 4.5 KB and sending 1.5 KB per record can make a major difference in the number of hours spent on data transmission. If the channel provides a transmission rate of 20 megabytes per second, a 4.5 KB record would take 225 microseconds to transfer, whereas the 1.5 KB compressed record would take 75 microseconds.
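The transfer times in this example can be checked with a short calculation. The sketch assumes decimal units (1 kilobyte = 1,000 bytes and 1 megabyte = 1,000,000 bytes), which is the convention under which the stated figures work out; the function name transfer_time_us is hypothetical.

```python
def transfer_time_us(size_bytes: int, rate_bytes_per_s: int) -> float:
    """Time in microseconds to send size_bytes at the given channel rate."""
    return size_bytes * 1_000_000 / rate_bytes_per_s


CHANNEL_RATE = 20_000_000   # 20 megabytes per second (decimal units assumed)
ORIGINAL_SIZE = 4_500       # 4.5 kilobyte record
COMPRESSED_SIZE = 1_500     # same record after 3:1 compression

original_us = transfer_time_us(ORIGINAL_SIZE, CHANNEL_RATE)
compressed_us = transfer_time_us(COMPRESSED_SIZE, CHANNEL_RATE)

print(original_us)                  # 225.0
print(compressed_us)                # 75.0
print(original_us / compressed_us)  # 3.0
```

The ratio of the two times equals the compression ratio, since the channel rate is the same in both cases.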
A compression algorithm is applied against a data record received from the host. If the compression algorithm results in a smaller sized record than the original record, the record is said to be compressed. If the compression algorithm results in a larger sized record than the original record, the record is said to be expanded. The preferred implementation is to discard the expanded records and use the original record. See co-pending commonly assigned patent application to Carreiro et al. which carries U.S. Ser. No. 08/322,441, filed Oct. 4, 1994, entitled Storage Management of Data Expansion Transparent to Host for a description of how the original record is used.
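The rule of discarding an expanded record and keeping the original can be sketched as follows. The sketch substitutes the zlib library for whatever compression algorithm a control unit actually uses, and the function names compress_record and restore_record, the returned flag, and the treatment of an equal sized result as not compressed are assumptions made for illustration.

```python
import zlib
from typing import Tuple


def compress_record(record: bytes) -> Tuple[bool, bytes]:
    """Return (is_compressed, payload) for one record.

    zlib stands in for the control unit's compression algorithm.
    The compressed form is kept only when it is strictly smaller;
    an expanded (or, by assumption, equal sized) result is discarded
    and the original record is used instead.
    """
    candidate = zlib.compress(record)
    if len(candidate) < len(record):
        return True, candidate
    return False, record


def restore_record(is_compressed: bool, payload: bytes) -> bytes:
    """Recover the original record from the stored payload."""
    return zlib.decompress(payload) if is_compressed else payload


# A highly repetitive record shrinks; a record with no repeated bytes
# grows under zlib, so its compressed form is discarded.
repetitive = b"A" * 4500
distinct = bytes(range(256))

for rec in (repetitive, distinct):
    flag, payload = compress_record(rec)
    assert restore_record(flag, payload) == rec
    print(flag, len(rec), len(payload))
```

Note that the receiver in this sketch must be told whether each payload is compressed, since the payload alone does not reliably indicate it.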
Storage subsystems do not currently utilize an efficient scheme for transmitting data streams of compressed and non-compressed data between the control units. The control unit needs to be able to identify whether the data it is receiving has already been compressed.
Compression of data streams is particularly significant for variable length records such as the IBM CKD format. However, other record formats can also benefit from compression.
A data stream which contains meta-data is defined to be a Composite Data Stream (CDS). Meta-data is used to describe records where compression may have been applied. A data stream which contains no meta-data generated by a storage unit is defined as an Original Data Stream (ODS). A data stream originating from a host system can also be referred to as an original data stream. When a CDS contains compressed records, it can be transferred in less time than its larger ODS counterpart. When there are both compressed and non-compressed data within the same data stream, transferring the data stream becomes more complicated. There is a need for the control unit to identify which records are compressed and a need to be able to decompress records that have been compressed. There are special problems in identifying and sharing data streams which contain both compressed and non-compressed data records.
There is a need to improve performance when transferring data from a primary storage device to secondary storage devices by enabling data to be transmitted in a compressed form between control units.