The present invention relates to database management systems. More specifically, the present invention pertains to a method for real time processing of a dynamically increasing computer database used in an analytic application.
Computers are used to perform a wide variety of applications in such diverse fields as finance, traditional and electronic commercial transactions, manufacturing, health care, telecommunications, etc. Most of these applications typically involve inputting or electronically receiving data, processing the data according to a computer program, then storing the results in a database, and perhaps transmitting the processed data to another application, messaging system, or user in a computer network. As computers became more powerful, faster, and more versatile, the amount of data that could be processed correspondingly increased.
Furthermore, the expanding use of "messaging systems" enhances the capacity of networks to transmit current operational data and to provide interoperability between disparate database systems. Messaging systems are computer systems that allow logical elements of diverse applications to link seamlessly with one another. Messaging systems also provide for the delivery of data across a broad range of hardware and software platforms and allow applications to interoperate across network links despite differences in underlying communications protocols, system architectures, operating systems, and database services.
Prior Art FIG. 1 illustrates the characteristics of the various environments in which data processing can occur. The types of environments are characterized according to whether they operate on a batch basis or on a transactional basis (that is, whether data are operated on in bulk, or handled in smaller quantities such as a per transaction basis). The types of environments are also characterized according to whether the data need to be operated on in real time (e.g., essentially right away) or whether some latency in the processing can be tolerated.
Prior Art FIG. 1 shows ETL (extraction/transformation/loading) space 1, EAI (enterprise application integration) space 2, B2B (business-to-business) space 3, and process integration space 4. ETL space 1 is characterized by large amounts of data handled in bulk, with some degree of latency occurring between the time data are received and the time processing of the data is completed. EAI space 2 is characterized by smaller amounts of data handled essentially in real time. B2B space 3 is characterized as handling larger amounts of data than that of EAI space 2 in essentially real time. However, the amount of data handled in B2B space 3 is generally not as large as that handled in ETL space 1. Process integration space 4 primarily deals with the integration of business processes handling smaller amounts of data with some degree of associated latency. Of particular interest to the discussion herein are ETL space 1 and EAI space 2.
In ETL space 1, large amounts of data exist in operational databases. The raw data found in the operational databases often exist as rows and columns of numbers and codes which, when viewed by individuals, may appear bewildering and incomprehensible. Furthermore, the scope and vastness of the raw data stored in modern databases can be overwhelming. Hence, analytic applications were developed in an effort to help interpret, analyze, and compile the data so that they may be readily understood. This is accomplished by transforming (e.g., sifting, sorting, and summarizing) the raw data before they are presented for display, storage, or transmission. The transformed data are loaded into target databases in a data warehouse or data mart. Individuals can access the target databases, interpret the transformed data, and make key decisions based thereon.
An example of the type of company that would use data warehousing is an online Internet bookseller having millions of customers located worldwide whose book preferences and purchases are tracked. By processing and warehousing these data, top executives of the bookseller can access the processed data from the data warehouse, which can be used to perform sophisticated analyses and make key decisions on how to better serve the preferences of their customers throughout the world.
One problem generally associated with transforming data for a data mart or data warehouse is that, because of the huge amounts of data to be processed, it can take a long time to perform. For the purpose of efficient utilization of computer resources, the transformation of data is normally conducted in a "batch" mode. Operational data are collected for a period of time and then extracted, transformed, and loaded into data warehouses/marts by the analytic application.
For example, sales data may be collected in the operational database for an entire week, processed by the database application in one continuous session over the weekend, and then aggregated into a target database stored in the data warehouse. The target database may reflect, for example, summary year-to-date sales by geographic region. The data warehouse storing the year-to-date sales data is updated only when all individual data accumulated for the previous week have been extracted and transformed. Between updates or even during an update session, end-users accessing the data warehouse will be presented with data from the target database current only to the previous week's update. Data accumulating for the next session's processing batch will not be reflected in the target database.
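The batch mode described above can be sketched minimally as follows. This is an illustrative sketch only, not the system described herein: the record format, region names, and dollar amounts are hypothetical, and the "transform" step is reduced to a simple summation by region.

```python
from collections import defaultdict

# Hypothetical weekly operational records accumulated since the last
# batch run: (region, sale_amount) pairs.
weekly_sales = [
    ("EMEA", 120.0),
    ("APAC", 80.0),
    ("EMEA", 45.5),
]

# Existing year-to-date totals in the target database (data warehouse).
year_to_date = {"EMEA": 1000.0, "APAC": 500.0}

def run_batch_update(operational_rows, target):
    """Transform (summarize by region) the week's raw rows and load the
    aggregates into the target table in one batch session."""
    totals = defaultdict(float)
    for region, amount in operational_rows:
        totals[region] += amount  # transform: sum sales by region
    for region, subtotal in totals.items():
        target[region] = target.get(region, 0.0) + subtotal  # load
    return target

run_batch_update(weekly_sales, year_to_date)
# The target reflects the week's data only after the batch completes;
# rows arriving after the run wait until the next session.
```

Until `run_batch_update` is run again, any newly arriving rows remain invisible to end-users of the target database, which is the latency problem discussed below.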
Thus, the batch mode of operation for processing data in ETL space 1 of Prior Art FIG. 1 can be problematic because of the latency between the time raw data are received and the time at which transformed data are ready for evaluation by end-users. The latency issue is compounded as large amounts of new operational (raw) data are frequently received for input into the data mart or data warehouse, in particular with the advent of messaging systems. However, the new data are not considered until the next time the target databases are updated.
In EAI space 2, data are more transactional in nature and thus the quantities of data requiring processing are smaller than quantities of data processed in ETL space 1. Accordingly, in EAI space 2, data can be processed essentially in real time (in essence, as the transaction occurs).
The boundaries between ETL space 1 and EAI space 2 are blurring, as end-users indicate their preference for processing large amounts of data (as in ETL space 1) with real time speed (as in EAI space 2). In addition, some applications driven from a data warehouse require constant and frequent updates of the data warehouse. To satisfy these objectives, it is becoming more common to shorten the period of time between target database updates in ETL space 1. That is, update sessions in the batch mode are run on a more frequent basis in an attempt to simulate real time processing.
However, there is a large computational cost associated with running update sessions more frequently in the batch mode. To launch a session, data transformation pipelines generally need to be established, caches and other data structures need to be built, and relevant data need to be identified, retrieved, and used to prime (initialize) the data transformation pipelines and to populate the caches and other data structures. These tasks consume a portion of the user's time as well as a measurable portion of a computer system's available resources. The difficulty of simulating real time processing is increased by the need to complete these tasks within a short period of time. In essence, an update session must be initiated and executed within a time window that has been specified to be small enough to simulate real time processing.
Another problem with running update sessions more frequently is that, although in some aspects it may appear to simulate real time, in actuality processing is not occurring in real time. However, data sources (such as messaging systems) coupled to the ETL application may actually be running in real time. As such, running updates more frequently does not take full advantage of the real time capabilities of current messaging systems.
Accordingly, what is needed is a method and/or system that can process (transform) large amounts of operational (raw) data and store the transformed data in a target database (data warehouse/mart) essentially in real time, but without incurring the cost in computational resources and user time required by running update sessions more frequently, as in the prior art. The present invention provides a novel solution to this need.
The present invention provides a method and system that can process (transform) large amounts of operational (raw) data and store the transformed data in a target database (data warehouse/mart) essentially in real time, without incurring the cost in computational resources and user time required by running update sessions more frequently. The present invention solves the problem of inadequate timeliness of data stored in prior art database transformation systems by providing a method and system for incremental transformation of dynamically increasing database data sets essentially in real time.
A method and system thereof for performing real time transformations of dynamically increasing databases are described. A session, identified as a real time session, is initialized. The real time session repeatedly executes a persistent (e.g., continually running) data transport pipeline of the analytic application.
In the present embodiment, during the real time session, the data transport pipeline repeatedly extracts data from a changing database, transforms the data, and writes the transformed data to storage (e.g., a data warehouse or data mart). The data transport pipeline is executed at the end of each time interval in a plurality of contiguous time intervals occurring during the real time session.
More simply stated, in one embodiment, a latency time period is specified by a user. The real time session is essentially divided into a series of time intervals, each interval equal to the latency time period. At the end of each interval, the data transport pipeline is executed ("flushed"). Thus, in each interval, data are extracted from the operational database, transformed, and loaded into a target database. The data transport pipeline remains running, even after it is executed, until the real time session is completed.
Accordingly, new data are transformed in a timely manner, and processing resources and the user's time are not consumed by having to repeatedly re-establish (re-initialize) the data transport pipeline.
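The interval-driven flush described above can be sketched as follows. This is a minimal illustrative sketch, not the claimed implementation: a `queue.Queue` stands in for a real time data source such as a messaging system, the persistent pipeline is reduced to an in-memory buffer, the transform is a simple sum by key, and all names and parameter values are hypothetical.

```python
import queue
import time

def real_time_session(source, target, latency, duration):
    """Sketch of a real time session: a persistent pipeline buffers rows
    arriving on `source` and is flushed (transform + load into `target`)
    once per `latency`-second interval, until `duration` seconds elapse.
    The pipeline is initialized once and never torn down between flushes."""
    buffer = []  # stands in for the primed pipeline and its caches
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        interval_end = min(time.monotonic() + latency, deadline)
        # Extract: collect rows that arrive during this interval.
        while True:
            remaining = interval_end - time.monotonic()
            if remaining <= 0:
                break
            try:
                buffer.append(source.get(timeout=remaining))
            except queue.Empty:
                break  # source idle for the rest of the interval
        # Flush: transform (sum by key) and load the interval's rows.
        for key, amount in buffer:
            target[key] = target.get(key, 0.0) + amount
        buffer.clear()  # data move on; the pipeline itself stays running

# Hypothetical usage: rows already queued are picked up in the first
# interval, so the target reflects them after at most one latency period.
src = queue.Queue()
src.put(("EMEA", 10.0))
src.put(("APAC", 5.0))
warehouse = {}
real_time_session(src, warehouse, latency=0.05, duration=0.2)
```

In contrast to the batch approach, the setup cost (here, creating the buffer) is paid once per session rather than once per update, which is the point of keeping the pipeline persistent.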