1. The Field of the Invention
The present invention relates to the sorting of data records. More particularly, the present invention relates to methods and apparatus for identifying and outputting specified instances of duplicate data records within a sorting utility.
2. The Relevant Art
Record sorting is a necessary and useful utility within data processing systems. Sorting facilitates the ordering of data records in a manner useful for processing applications such as searching data records, generating billing statements, creating reports, compiling directories, and the like. The ability to sort data records using one or more selected fields as a sorting key facilitates intelligent processing of data records.
While sorting is a useful operation, sorting data records and conducting operations associated therewith can consume large amounts of computational capacity, particularly when dealing with files containing a large quantity of data records.
Pipelined sorting systems have been developed to meet the performance demands of applications that process large files. Pipelined sorting systems achieve increased performance by partitioning a problem into stages, where each stage focuses processing resources on specific tasks. Typically, the data records of a file are fed one record at a time into a processing pipeline. As each pipeline stage finishes processing a data record, the data record may be passed on to a subsequent stage for further processing.
Execution flexibility is one advantage of pipelined processing systems. The processing stages that are ready for execution may be distributed to available processors, while processing stages that are blocked may be suspended to provide processing capacity to other stages.
To increase processing efficiency, control statements associated with a set of data records may be parsed and packed into one or more control blocks prior to actual processing. The control blocks configure the various processing stages of a processing pipeline and facilitate efficient execution. Pipeline stages that are not referenced within control statements are preferably bypassed to eliminate unneeded record handling. The ability to control pipeline stages via control statements essentially provides a job-configurable virtual machine useful to developers and users of pipelined applications and utilities.
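The relationship between control statements, control blocks, and bypassed stages can be illustrated with a minimal sketch. Everything here is assumed for illustration: the `NAME=ARG` statement syntax, the stage names in `STAGE_ORDER`, and the helper functions are hypothetical and do not reflect the syntax or internals of any actual sort utility.

```python
from typing import Dict, List

# Hypothetical stage names, listed in the order the pipeline would run them.
STAGE_ORDER = ["skip", "filter", "stop", "sort"]


def parse_control_statements(statements: List[str]) -> Dict[str, str]:
    """Pack 'NAME=ARG' statements into a control block (a plain dict here)."""
    block: Dict[str, str] = {}
    for statement in statements:
        name, _, arg = statement.partition("=")
        block[name.strip().lower()] = arg.strip()
    return block


def active_stage_names(block: Dict[str, str]) -> List[str]:
    """Keep only the stages the control block references, in pipeline order.

    Stages that no control statement mentions are left out entirely, so no
    record is ever handed to them.
    """
    return [name for name in STAGE_ORDER if name in block]


# Statements are parsed once, before any record is processed.
control_block = parse_control_statements(["SORT=ascending", "SKIP=2"])
active_stages = active_stage_names(control_block)
```

In this sketch the filter and stop stages are never instantiated because no statement references them, which is one simple way to avoid unneeded record handling.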
In addition to sorting, pipelined sorting systems may include features for selectively processing data records as directed by control statements associated with a set of data records. For example, IBM®'s DFSORT™ utility supports control statements corresponding to a skip stage, a user supplied input stage, a filtering stage, a stop stage, a first reformatting stage, a sort, copy, or merge stage, a second reformatting stage, a user supplied output stage, and one or more supplemental stages.
The skip stage skips or discards a selected number of the data records before passing unskipped records on to the rest of the pipeline. The user supplied input stage facilitates customized processing on the unskipped records or data records provided from programmable sources. The filtering stage filters the data records such that selected records are passed on to the remainder of the pipeline, while other records are discarded or redirected.
The stop stage passes a specified number of the data records to the remainder of the pipeline and thereby limits subsequent processing, such as sorting, to a specified number of data records. The first reformatting stage reformats data records and passes the formatted records to the sort, copy or merge stage. The second reformatting stage may also be used to apply additional formatting operations to the data records subsequent to the sort, copy or merge stage.
The user supplied output stage facilitates customized processing and output of the processed data records. The supplemental stages may be used to conduct specialized formatting and reporting operations in order to generate multiple forms of output related to a set of data records.
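As a rough illustration of how the skip, filtering, stop, and sort stages described above compose, the following sketch chains them as Python generators. The function names, the integer records, and the even-number filter are assumptions made for the example; the user supplied and reformatting stages are omitted, and the actual utility is not implemented this way as far as the text states.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def skip_stage(records: Iterable[int], count: int) -> Iterator[int]:
    """Discard the first `count` records and pass the rest downstream."""
    return islice(records, count, None)


def filter_stage(records: Iterable[int], keep: Callable[[int], bool]) -> Iterator[int]:
    """Pass on only the records for which `keep` returns True."""
    return (record for record in records if keep(record))


def stop_stage(records: Iterable[int], limit: int) -> Iterator[int]:
    """Pass on at most `limit` records, limiting all later processing."""
    return islice(records, limit)


def sort_stage(records: Iterable[int]) -> Iterator[int]:
    """Sort the records that survived the earlier stages."""
    return iter(sorted(records))


def run_pipeline(records: Iterable[int]) -> List[int]:
    """Feed records through the stages one at a time, in pipeline order."""
    stream = skip_stage(records, 1)
    stream = filter_stage(stream, lambda record: record % 2 == 0)
    stream = stop_stage(stream, 3)
    return list(sort_stage(stream))


result = run_pipeline([9, 8, 6, 7, 4, 2, 10])
```

Because each stage is lazy, the stop stage bounds how many records the sort stage ever sees, which mirrors the description of limiting subsequent processing to a specified number of records.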
The various stages included in the aforementioned sort utility were developed in response to the needs of developers and users. Each pipeline stage executes in an efficient manner and adds to the flexibility and power of the sort utility. The ability to draw upon the power and efficiency of the utility via control statements greatly reduces the programming burden associated with creating customized applications such as generating billing statements, publishing directories, creating reports, and the like.
The files or record sets processed by a sorting utility such as the aforementioned pipelined sort utility often contain duplicate instances of data records. A system may require that only a limited subset of duplicate record instances be retained for further processing.
To address this need, sorting systems have included command statements to select all instances of duplicate data records, command statements to delete all instances of duplicate records, and command statements to select all instances of duplicate records that meet specified criteria. However, systems have been unable to directly select a particular instance of duplicate data records such as a first or last instance of a duplicate set of records. The ability to restrict processing to a particular instance would reduce the bandwidth, storage, and computational load associated with processing duplicate records.
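To make the desired behavior concrete, the sketch below shows what selecting the first or last instance of each set of duplicates could look like on a small in-memory record set. It is an illustration of the stated need only, not a description of any particular implementation; the function `select_instance`, the tuple records, and the choice of the first field as the sorting key are all assumptions.

```python
from itertools import groupby
from operator import itemgetter
from typing import Callable, List, Tuple

Record = Tuple[str, int]


def select_instance(
    records: List[Record],
    key: Callable[[Record], str],
    which: str = "first",
) -> List[Record]:
    """Keep one record per key: the first or the last instance of each set."""
    # sorted() is stable, so records with equal keys keep their input order.
    ordered = sorted(records, key=key)
    selected: List[Record] = []
    for _, group in groupby(ordered, key=key):
        members = list(group)
        selected.append(members[0] if which == "first" else members[-1])
    return selected


records = [("smith", 3), ("jones", 1), ("smith", 1), ("jones", 2), ("lee", 5)]
first = select_instance(records, key=itemgetter(0), which="first")
last = select_instance(records, key=itemgetter(0), which="last")
```

Here a record with a unique key, such as `("lee", 5)`, is treated as both the first and last instance of its set; whether non-duplicated records should be retained is a design choice the text does not settle.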
Thus, a need exists in the art for a method, apparatus, and system capable of selecting a particular instance of duplicate data records, such as a first or last instance, thereby improving the performance and effectiveness of data processing applications such as a pipelined sort utility.