The typical enterprise deploys a wide range of business systems. Such enterprises often use “data integration” technologies to pull together the information in these systems, ensuring that their decision-makers work with a unified, current, and consistent view of information from across the enterprise.
Data integration tools are often used to pool the data from these diverse systems to form “data warehouses,” from which the data may be accessed in coherent formats and compilations. The data integration tools may then be used to reorganize the data in the data warehouse to provide a plurality of “data marts,” each tailored to the needs of different classes of users in the enterprise (e.g., sales, finance, operations, human resources, etc.). Other operations in the enterprise, apart from data integration, may involve processing large volumes of data.
Data integration and other enterprise data processing often necessitate repeated operations involving “data transformation”—converting data from a source format and organization to a destination format and/or organization. Data transformation includes, for example, operations such as sorting data; aggregating the data by specified criteria; summing, averaging, or sampling the data; and compressing, decompressing, encoding, decoding, or otherwise manipulating the data. Data transformation operations are often performed on large and sometimes enormous volumes of data, and commonly consume a substantial portion of a computer's resources. Such operations can account for much of the overhead of a data integration or data processing process, and can be performance-critical. These requirements become more critical as enterprises develop larger systems and require tools to deal with what has become known as “big data”—the huge volumes of data that accumulate as a result of the automation of business processes by electronic commerce and telecommunications. Accordingly, there exists a growing need for more efficient data transformation tools in order to achieve suitable performance in data integration and other large-scale applications.
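As a concrete illustration of one such transformation (a minimal sketch for exposition only, not drawn from any patent or product named here), aggregating records by a specified criterion and summing a field might look like the following; the field names and the `aggregate_sum` helper are hypothetical:

```python
from collections import defaultdict

def aggregate_sum(records, group_field, value_field):
    """Group records by one field and sum another field within each
    group -- a minimal example of an aggregation-style transformation."""
    totals = defaultdict(float)
    for record in records:
        totals[record[group_field]] += record[value_field]
    return dict(totals)

# Hypothetical input records for illustration.
sales = [
    {"region": "east", "amount": 100.0},
    {"region": "west", "amount": 250.0},
    {"region": "east", "amount": 50.0},
]
print(aggregate_sum(sales, "region", "amount"))
```

In a production data integration tool the same operation would typically run over streamed or partitioned data rather than an in-memory list, but the logical transformation is the same.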
Sorting is one example, representative in some respects, of the processing demands imposed by data transformation operations. A large sort job will often involve sorting a data set that is larger than will fit in the computer's memory at one time. Such a sort process is referred to as an “external sort,” because work-in-process data developed during the sort job must be stored outside of the main memory of the computer, resulting in additional I/O. Techniques for reducing I/O and processing requirements during an external sort are described in references such as commonly assigned U.S. Pat. No. 4,210,961 to Whitlow, et al.
Conventionally, a sort process involves, in addition to other steps, a step of reading the data to be sorted from storage into the computer's memory and sorting the data, and a step of writing the sorted data to the designated output file on a storage unit. Further, in the case of an external sort, where the amount of data to be sorted is larger than the memory can hold, the reading and sorting steps must be performed repeatedly on successive portions of the data, and one or more merge steps may be required to combine the individually sorted portions into a single run in the correct order. Each of these steps, even when optimized in accordance with the current art, entails substantial CPU activity as well as the I/O of reading and writing the input file, the intermediate merge strings, and the output file.
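The steps described above can be sketched as follows. This is a simplified illustration of the general external-sort pattern, not an implementation from any cited patent: the data is sorted in memory-sized runs, each run is spilled to temporary storage (standing in for the intermediate merge strings), and the runs are then merged into a single sorted output. The `memory_limit` parameter is a hypothetical stand-in for available memory, measured here in record counts:

```python
import heapq
import os
import tempfile

def external_sort(values, memory_limit):
    """Sort a sequence larger than 'memory' allows by sorting runs of at
    most memory_limit items, spilling each run to a temporary file, and
    then merging the runs back into one fully sorted list."""
    run_files = []
    for start in range(0, len(values), memory_limit):
        # Read one memory-sized portion and sort it in memory.
        run = sorted(values[start:start + memory_limit])
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        f.write("\n".join(str(v) for v in run))
        f.seek(0)
        run_files.append(f)
    # Merge step: heapq.merge lazily combines the already-sorted runs.
    streams = [(int(line) for line in f) for f in run_files]
    merged = list(heapq.merge(*streams))
    for f in run_files:
        f.close()
        os.unlink(f.name)
    return merged
```

Note that every value is written to and read back from temporary storage in addition to the original read and the final write, which is precisely the extra I/O burden the passage above describes.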
Commonly assigned U.S. Pat. No. 5,519,860 disclosed a method of using the processing capabilities of intelligent secondary storage attached to a main computer in order to enhance the sorting process on the main computer. Rather than reading all of the data from storage and sorting it on the main computer, the '860 patent taught to read a sort key and a record storage location for each record to be sorted; sort a “skeleton” containing only the extracted data on the main computer; and then use the sorted record locators and the sort order derived on the main computer to instruct the intelligent controller to reorder the records in sorted order. In this way, the volume of data I/O between the main computer and the intelligent controller was reduced, and the work of physically reordering the data records was offloaded to the intelligent controller, thereby increasing the efficiency of the sorting process.
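The skeleton technique described above can be illustrated schematically. The sketch below is a loose paraphrase of the idea, not the '860 patent's actual method: only (sort key, record location) pairs are extracted and sorted on the main computer, and the resulting ordered list of record locators serves as the reordering instructions that would be handed to the intelligent controller. The function name `skeleton_sort_order` and the dictionary-based records are hypothetical:

```python
def skeleton_sort_order(records, key_of):
    """Build a 'skeleton' of (sort key, storage location) pairs, sort
    only that skeleton, and return the record locations in sorted
    order -- the reordering instructions for the storage controller."""
    skeleton = [(key_of(rec), loc) for loc, rec in enumerate(records)]
    skeleton.sort()                      # sort the small skeleton only
    return [loc for _, loc in skeleton]  # record locators, in sort order

# Hypothetical records; only the extracted keys travel to the sort.
records = [{"key": "delta"}, {"key": "alpha"}, {"key": "charlie"}]
order = skeleton_sort_order(records, lambda r: r["key"])
# The physical reordering, offloaded to the controller in the scheme
# described above, amounts to applying the locator list:
sorted_records = [records[loc] for loc in order]
```

Because only the keys and locators cross between storage and the main computer, the transferred volume is proportional to the key size rather than the full record size.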
Nevertheless, under the approach developed in the commonly assigned '860 patent, the sort operation was still carried out by the main computer, and the data I/O, though reduced, was not eliminated: the keys and record pointers still had to be transferred from the intelligent controller to the main computer, and the main computer still had to transfer the reordering instructions back to the intelligent controller, leaving considerable processing volume for the main computer and its communications channels.