The present invention relates to data storage systems, and more specifically, this invention relates to parallel processing of a keyed index file system for improved performance.
Virtual storage access method (VSAM) is a disk file storage access method used in IBM z/OS environments. VSAM data sets include multiple records, which may be of fixed or variable length and are organized into fixed-size blocks called Control Intervals (CIs). The CIs are in turn organized into larger groups referred to as Control Areas (CAs). CIs are used as units of transfer between direct access storage devices (DASDs) and requesting systems, such that a read request will read one complete CI. CAs are used as units of allocation, such that when a VSAM data set is defined, an integral number of CAs will be allocated for that VSAM data set.
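The CI and CA arithmetic described above can be illustrated with a minimal sketch. The function names, the CI size, and the number of CIs per CA are hypothetical values chosen for illustration, and the sketch deliberately ignores any control information or free space kept inside a CI.

```python
def cas_needed(total_bytes: int, ci_size: int, cis_per_ca: int) -> int:
    """Return the whole number of CAs needed to hold total_bytes.

    Allocation is in integral CAs, so the result is rounded up.
    """
    ca_size = ci_size * cis_per_ca
    return -(-total_bytes // ca_size)  # integer ceiling division


def locate(offset: int, ci_size: int, cis_per_ca: int) -> tuple[int, int, int]:
    """Map a byte offset to (CA index, CI index within that CA, offset within the CI).

    A read of this offset would transfer the whole CI that contains it.
    """
    ci_global = offset // ci_size
    return ci_global // cis_per_ca, ci_global % cis_per_ca, offset % ci_size


# Hypothetical geometry: 4096-byte CIs, 180 CIs per CA.
print(cas_needed(1_000_000, 4096, 180))  # 2
print(locate(800_000, 4096, 180))        # (1, 15, 1280)
```

In this example a data set of 1,000,000 bytes does not fill two CAs, yet two full CAs are allocated, and reading byte 800,000 transfers the entire 4096-byte CI that holds it.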
An integrated catalog facility (ICF) is provided on a server or mainframe and includes two components: a basic catalog structure (BCS) and a VSAM volume data set (VVDS). The BCS, sometimes referred to generically as a catalog, is typically structured as a VSAM key-sequenced data set (KSDS), which is an indexed VSAM organization and the most structured form of data set. This organization allows VSAM to provide a majority of the access routines without substantial input or direction from the accessing system beyond the most rudimentary information. The BCS component is typically accessed via VSAM non-shared resource (NSR) interfaces, and includes information related to the location of user data sets and system data sets (whichever are stored on the corresponding disk, tape, or optical drive).
The VVDS is typically structured as a VSAM entry-sequenced data set (ESDS), which is less structured than the VSAM KSDS. ESDSs do not contain an index component and require access routines to track the location of the records stored in the ESDS. Pointers to VVDS records in the ESDS are stored in the associated BCS records. The VSAM ESDS is accessed via both VSAM NSR and media manager interfaces, and includes information about specific attributes of user data sets and system data sets (whichever are stored on the corresponding DASD). The ICF allows for cross-system sharing of the BCS and VVDS, and is entirely responsible for sharing serialization, caching, and buffer invalidation, among other functions.
With key-sequenced data sets (KSDSs), the contents consist of the user's data and a unique key (specified by the user) which is used to locate specific data records in the data set. Each record in a KSDS has one unique key. Entry-sequenced data sets (ESDSs), on the other hand, contain only user data, and the user provides the relative byte address (RBA) of each specific data record so that VSAM can locate it. The VSAM data sets containing the user data are referred to as the “base” data sets.
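The difference between keyed access and RBA access can be sketched with two toy in-memory classes. Both classes and all of their method names are hypothetical and model only the addressing behavior described above; they are not VSAM interfaces.

```python
import bisect


class ToyESDS:
    """Entry-sequenced: records are appended, and the caller keeps the RBA."""

    def __init__(self) -> None:
        self._records: dict[int, bytes] = {}  # RBA -> record
        self._next_rba = 0

    def append(self, data: bytes) -> int:
        rba = self._next_rba
        self._records[rba] = data
        self._next_rba += len(data)
        return rba  # the caller must remember this to read the record back

    def read(self, rba: int) -> bytes:
        return self._records[rba]


class ToyKSDS:
    """Key-sequenced: each record has one unique key, kept in key order."""

    def __init__(self) -> None:
        self._keys: list[bytes] = []          # stands in for the index
        self._data: dict[bytes, bytes] = {}   # key -> record

    def insert(self, key: bytes, data: bytes) -> None:
        if key in self._data:
            raise KeyError(f"duplicate key: {key!r}")
        bisect.insort(self._keys, key)
        self._data[key] = data

    def read(self, key: bytes) -> bytes:
        return self._data[key]

    def keys_in_order(self) -> list[bytes]:
        return list(self._keys)


esds = ToyESDS()
first = esds.append(b"alpha")
second = esds.append(b"beta")
print(first, second, esds.read(second))  # 0 5 b'beta'

ksds = ToyKSDS()
ksds.insert(b"B002", b"second record")
ksds.insert(b"A001", b"first record")
print(ksds.keys_in_order())  # [b'A001', b'B002']
```

The ESDS caller has to carry the returned RBA around, whereas the KSDS caller needs only the key, which is the sense in which the keyed organization provides more of the access logic itself.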
Typically, a keyed index file system consists of data records accessed via unique keys. In general, it is very difficult to know in advance which key ranges exist in a data set and how many keys fall within each range.
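Why the unknown key distribution matters can be shown with a small sketch. It assumes integer keys and a naive split of the key space into equal-width ranges; the function name and the sample keys are hypothetical and are used only to illustrate the resulting skew.

```python
def partition_counts(keys: list[int], lo: int, hi: int, parts: int) -> list[int]:
    """Count how many keys fall in each of `parts` equal-width ranges over [lo, hi)."""
    counts = [0] * parts
    for k in keys:
        if not lo <= k < hi:
            raise ValueError(f"key {k} outside [{lo}, {hi})")
        counts[(k - lo) * parts // (hi - lo)] += 1
    return counts


# Hypothetical data set: most keys are clustered at the low end of the key space.
keys = list(range(0, 90)) + [500, 999]
print(partition_counts(keys, 0, 1000, 4))  # [90, 0, 1, 1]
```

With these sample keys, one range would receive 90 of the 92 records while another receives none, so ranges chosen without knowledge of the actual keys give very uneven amounts of work.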
Typical mainframe batch environments process data kept in keyed indexed files sequentially using batch tasks. Utilizing parallel processing instead of sequential processing may dramatically reduce the batch window; however, there is currently no efficient method for accessing keyed indexed data in a parallel processing framework. Once an efficient access method is available, it becomes possible to employ different parallel processing frameworks (including, but not limited to, the Hadoop framework) to improve processing of large keyed indexed files.