1. Field of the Invention
The present invention relates in general to data processing and database management. Specifically, the present disclosure relates to real-time data processing in a database management system (DBMS), and particularly in Extract, Transform, and Load (ETL), enterprise application integration (EAI), and enterprise information integration (EII). More specifically, the present disclosure provides set-oriented data transformation based on transaction boundaries.
2. Description of the Related Art
Many types of data processing operations associated with a DBMS or performed through ETL are batch-oriented. Batch-mode processing is performed only after all data records in a data source have been read, and the output is produced only after the entire batch of data records has been processed. Batch-mode transformation therefore introduces intrinsic delays in data processing. Such delays become more problematic in cascades of multiple transformations that involve large numbers of data records or rows, as a delay at one step undermines the timeliness and responsiveness of the overall processing.
One example of batch-mode transformation is aggregation, which is typically performed across all input data in a batch. In a wholesale setting, for instance, a number of ORDERs arrive in a message queue. Each ORDER is included in one message, and each ORDER contains multiple LINEITEMs. To compute the sum of the dollar amounts of all LINEITEMs in each ORDER, a batch-oriented system would calculate the sum across all ORDERs in the system. If an ORDER_ID uniquely identifying each ORDER is available, the sum within each ORDER could be computed by using the ORDER_ID as the group-by key for the aggregation. Even in that case, the aggregation results may not be available until all the input data has been read, because batch-oriented systems typically presume that aggregation is to be performed across all input data. Similar situations arise in other multi-row transformations such as sorting, joining, and ranking. Batch-mode processing thus makes it nearly impossible to achieve real-time responsiveness.
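By way of illustration only, the batch-mode aggregation described above may be sketched as follows. The function name `batch_sum_by_order` and the representation of each LINEITEM as an `(order_id, amount_in_cents)` pair are assumptions made for this sketch and are not taken from any particular system. The point shown is that no per-ORDER total is returned until every input row has been consumed.

```python
from collections import defaultdict


def batch_sum_by_order(line_items):
    """Sum LINEITEM amounts per ORDER_ID over an entire batch.

    `line_items` is an iterable of (order_id, amount_in_cents) pairs
    (a hypothetical layout chosen for this sketch). The result is
    returned only after the whole input has been read.
    """
    totals = defaultdict(int)
    for order_id, amount in line_items:
        # ORDER_ID acts as the group-by key for the aggregation.
        totals[order_id] += amount
    return dict(totals)


if __name__ == "__main__":
    batch = [("A", 100), ("B", 50), ("A", 25)]
    print(batch_sum_by_order(batch))
```

Even though the totals are grouped by ORDER_ID, the total for the first ORDER cannot be observed by a caller until the last row of the batch has been processed.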
The need for data processing on a real-time basis is nonetheless conspicuous. For example, a retail store needs an up-to-the-minute summary of its inventories; a postal mail service needs to provide customers with up-to-the-minute reports on their shipments; a financial services institution needs to continuously update its investors on changes to their portfolios; and a healthcare institution needs to give physicians and insurance administrators access to updated patient information and the most current prognosis reports.
There is a further need to break away from batch-mode processing and to substitute therefor data processing based on smaller data sets. Such set-oriented processing would eliminate the delay of waiting for the entire batch to be read and processed. In the above example, the aggregation result for each data set (that is, each ORDER) may be made available immediately in a set-oriented real-time data processing system.
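As a minimal sketch of the set-oriented alternative, and assuming for illustration that each message carries exactly one ORDER as an `(order_id, [amounts_in_cents])` pair, the aggregation may be written so that each ORDER total is produced as soon as its message has been read. The names `stream_order_totals` and `logging_source` are hypothetical and serve only this example.

```python
def stream_order_totals(messages):
    """Yield (order_id, total) for each ORDER as its message arrives.

    Each message is treated as one data set, so the aggregation is
    scoped to a single ORDER and does not wait for later messages.
    """
    for order_id, amounts in messages:
        yield order_id, sum(amounts)


def logging_source(messages, events):
    """Hypothetical message source that records when each message is read."""
    for message in messages:
        events.append(("read", message[0]))
        yield message


if __name__ == "__main__":
    events = []
    queue = [("A", [100, 25]), ("B", [50])]
    for order_id, total in stream_order_totals(logging_source(queue, events)):
        events.append(("emit", order_id, total))
    # Reads and emits interleave: each total is available before the
    # next message is read.
    print(events)
```

In this sketch the message itself serves as the boundary of the data set; other rules for delimiting data sets could be substituted without changing the per-set aggregation.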
There is a still further need for a methodology to define transaction boundaries among the input data and thereby derive data sets, based on the applicable rules or business logic, for data transformation on a real-time basis.