1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular to the optimized remote log based data replication technique wherein Capture and Apply utilities are co-located on a target system, thereby eliminating communications overhead and providing an improved setting for building data warehouses and many other replication solutions.
2. Description of Related Art
Databases are computerized information storage and retrieval systems. A Relational Database Management System (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. RDBMS software using a Structured Query Language (SQL) interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Organization (ANSI) and the International Standards Organization (ISO).
In RDBMS software all data is externally structured into tables. The SQL interface allows users to formulate relational operations on the tables either interactively, in batch files, or embedded in host language, such as C, COBOL, etc. Operators are provided in SQL that allow the user to manipulate the data, wherein each operator operates on either one or two tables and produces a new table as a result. The power of SQL lies on its ability to link information from multiple tables or views together, to perform complex sets of procedures with a single statement.
A data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data suited to the decision support needed at a certain network location of the enterprise data environment. For this purpose the IBM company created several replication products. One such product is named DataPropagator Relational Version 5.1, useable for AS/400 systems, having such features as efficient architecture for automatic Capture and asynchronous propagation of data changes to DB2 databases, applicable for building data warehouses and creating client/server, distributed, or mobile applications. This product provides for automatic maintenance of consistent copies of relational data in the databases, and has a change-Capture component that Captures all application table changes. It utilizes subscription sets for transaction consistency, and supports full refresh and update change replication, update-anywhere replication, DB2 views-based replication, event-driven and continuous replication. Replication can help decrease batch workloads by replicating data in the background throughout the day.
Businesses today benefit from replicating data across a range of applications and business sites, to improve business cycle times and customer responsiveness. Frequently, these applications must share data with legacy applications on the host. Data replication can automatically deliver the shared data to the target platform, improving data availability and data access performance, and accommodating data restructuring and minimizing network load. This can improve employees"" decision making capabilities. The decision-support databases assist in day-to-day decision-making activities, from determining what items to stock in various stores, to identifying customer sets for new products.
Data replication can improve application deployment and the existing application backlog can be reduced, since the majority of these applications are based on the relational model and use standard SQL. Data replication enables use of these applications by mapping, reformatting, and delivering data from legacy environments to relational databases elsewhere. In order to increase online throughput, replication supports off-load query processing to make room for increasing transaction processing requirements. Off-loading query processing reduces contention that impacts online transaction processing (OLTP) response time. Businesses are migrating applications from legacy systems or replicating data between multivendor environments. This reduces application development time and reduces application maintenance costs.
DataPropagator Relational 5.1 provides read-only, update-anywhere, and on-demand replication between relational source and target processors. It consists of the following autonomous components and programs. Administration and replication control is performed by Control Center, Capture utility is used to Capture changes made to data on replication sources, and Apply utility reads previously Captured changed data and applies it to target tables. The use of Capture and Apply utilities is known in the art, as described, for example, in U.S. Pat. No. 5,995,980, incorporated herein by reference.
The Control Center is used to define tables as sources, called replication sources, define target table definitions, called replication subscriptions, clone replication subscriptions to other servers and remove replication sources or subscriptions no longer needed. Whenever a replication request from the Control Center is submitted, such as a command defining a replication source, the processing information is generated as SQL statements. The statements may be run immediately, or saved in a file, which can be edited and the statements can be run at a later time from a replication folder object in the Control Center. Deferred SQL files let a user customize the replication tasks for his application needs.
In a conventional system, the Capture component, located in a source computer, captures changes made to data in tables defined as replication sources by reading the database transaction log or journal, without making any changes to the sources. It is performed asynchronously to business applications using the same replication sources. The captured changes are placed in a staging area Change Data (CD) table, with transaction detail stored separately in a Unit of Work (UOW) table, both located in the source computer.
Ordinarily, the Apply component, located in a target computer, reads the changed data previously captured and stored in the staging area table and applies it to the target tables on the target computer. Apply components can also read data directly from the source tables, for example, for a full refresh. Supporting update and refresh copying provides greater flexibility and automation in a replication environment. The Apply component also massages the data to user specifications, as it copies data to the targets. SQL statements can be prepared to create new columns, summarize data, translate data, join data and do other data transactions.
The conventional Apply component allows the user to create read-only copies, user copy tables, which represent source data at a particular point in time, point-in-time tables, which represent source data at a particular point in time and some overhead columns, history tables, staging area tables (which can be used as a source for further copies without recapturing changes, thus supporting a consistent data environment and providing flexibility in data distribution across the network), updatable copies, and replica tables. Updates to a replica table are automatically applied to the original source table of the replica, provided no conflicts are detected.
The Apply component, running at the replica site, detects update conflicts after they occur during the subscription cycle. An Apply instance can process many subscriptions at a single site. Multiple Apply instances can run concurrently at different sites/platforms, each processing different numbers and types of subscriptions from the same source. Each subscription can have different definitions, refresh criteria, and timing.
One problem associated with data warehouses involves aggregating the extremely large amounts of data stored within application databases with the frequent input of large amounts of data. By aggregating the data stored within the data warehouse with newly inputted data, the raw data is translated into the most current meaningful information that can be relied upon by decision makers, which should have the most up-to-date and current information. However, the aggregation takes a long time to perform and overwhelms the application processor, because in the conventional systems all the pertinent data stored within the data warehouse is re-aggregated along with the newly inputted data.
The conventional replication products, such as IBM Data Propagator, MIMX, DATAMIRROR and others, are using a client/server approach. There, a Capture component is running on the source system to capture changes, and an Apply component runs on the target system to apply changes. Thus, when replicating from one system to another, two distinct components are running on each of the systems, using some amount of disk space, CPU time and other management overhead. This type of data replication technique is shown in FIG. 1.
In this conventional replication system, the Capture component captures base table changes from a journal, the changes are put into an internal staging area (xe2x80x9cChange Dataxe2x80x9d and xe2x80x9cUnit of Workxe2x80x9d table). Then, the Apply component from the remote system picks up changes from the staging area tables on the source site, thus replicating data from one system into another.
In contrast with some other log-base platforms, such as AIX, MVS and NT, the AS/400 system uses Journals as the vehicle for tracking database changes. In most cases, a named Journal will be created in an AS/400 library and work with one Journal Receiver at a time. Thus, a Journal can be viewed as a controller module of a Journal Receiver, which is a table. When operations (such as SQL insert, update, delete and other native AS/400 operations) are applied to database objects, the changes, known as Journal Entries in AS/400, on the database objects are sent to a Journal Receiver.
While there have been various techniques developed for optimizing the performance of SQL statements, there is a need in the art for optimization techniques involving data replication. Therefore, it would be advantageous to provide a method for the computer that accesses a source site database to be able to quickly replicate data on a target computer site. A new approach utilizing remote journal support, therefore, allows the user to offload the CPU usage and direct access storage device (DASD) consumption away from the main production system and directly to the operating system where the final destination of the data propagation is. The present invention provides this advantage by placing the Capture component and the tables with newly inputted and changed data on the same, target site. This method eliminates network overhead, and allows for more efficient and better CPU distribution control, ideal for building data warehouses and many other replication solutions.
One preferred embodiment of the present invention includes a method of optimizing a query in a computer system, in which data is retrieved from a relational database stored in an electronic storage device coupled to the computer system. The method utilizes a remote journaling solution wherein both Capture and Apply utilities are co-located on the target system. Unit of Work (UOW) and Change of Data (CD) tables also exist on the target system, and together serve as a staging area. The combination of these components on the target system eliminates the conventional Apply utility""s network management responsibility, substantially reduces communications overhead, and reduces CPU and DASD consumption for significantly increased efficiency.
Another preferred embodiment of the present invention is a computer system for optimizing a query, utilizing the above-described method embodiment of the present invention.
Yet another preferred embodiment of the present invention is a program storage device readable by a computer tangibly embodying a program of instructions executable by the computer to perform the above-mentioned method embodiment of the present invention.