1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular to the optimized data replication technique utilizing incrementally maintained base aggregations of a dynamically increasing computer database, usable in data warehousing scenarios.
2. Description of Related Art
Databases are computerized information storage and retrieval systems. A Relational Database Management System (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. RDBMS software using a Structured Query Language (SQL) interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Organization (ANSI) and the International Standards Organization (ISO).
In RDBMS software all data is externally structured into tables. The SQL interface allows users to formulate relational operations on the tables either interactively, in batch files, or embedded in host language, such as C, COBOL, etc. Operators are provided in SQL that allow the user to manipulate the data, wherein each operator operates on either one or two tables and produces a new table as a result. The power of SQL lies on its ability to link information from multiple tables or views together, to perform complex sets of procedures with a single statement.
One of the most common SQL queries executed by RDBMS software is the SELECT statement. In the SQL standard, the SELECT statement generally has the format: xe2x80x9cSELECT  less than clause greater than  FROM  less than clause greater than  WHERE  less than clause greater than  GROUP BY  less than clause greater than  HAVING  less than clause greater than  ORDER BY  less than clause greater than .xe2x80x9d The clauses generally must follow this sequence. Only the SELECT and FROM clauses are required and all other clauses are optional.
Generally, the result of a SELECT statement is a subset of data retrieved by the RDBMS software from one or more existing tables stored in the relational database, wherein the FROM clause identifies the name of the table or tables from which data is being selected. The subset of data is treated as a new table, termed the result table. The WHERE clause determines which rows should be returned in the result table. Generally, the WHERE clause contains a search condition that must be satisfied by each row returned in the result table. The rows that meet the search condition form an intermediate set, which is then processed further according to specifications in the SELECT clause. The search condition typically comprises one or more predicates, each of which specifies a comparison between two values from certain columns, constants or correlated values. Multiple predicates in the WHERE clause are typically connected by Boolean operators.
The SELECT statement may also include a grouping function indicated by the GROUP BY clause. The GROUP BY clause causes the rows in the intermediate result set to be grouped according to the values specified in the clause. A number of column or aggregate functions are also built into SQL, such as MAX (maximum value in column), MIN (minimum value in column), AVG (average of values in column), SUM (sum of values in column), and COUNT (number of rows).
Queries using aggregate expressions return as many result rows as there exist unique xe2x80x9cresult groupsxe2x80x9d in the source of the aggregation. A result group is defined as the set of values contained in the fields of a row corresponding to the list of columns or expressions specified in the GROUP BY clause of the query. The value of the aggregate function is the cumulative result of applying the function to the result of the contained expression, for each row having the same result group.
A data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data suited to the decision support needed at a certain network location of the enterprise data environment. For this purpose the IBM company created several replication products. One such product is named DataPropagator Relational Version 5.1, useable for AS/400 systems, having such features as efficient architecture for automatic capture and asynchronous propagation of data changes to DB2 databases, applicable for building data warehouses and creating client/server, distributed, or mobile applications. This product provides for automatic maintenance of consistent copies of relational data in the databases, and has a change-capture component that captures all application table changes. It utilizes subscription sets for transaction consistency, and supports full refresh and update change replication, update-anywhere replication, DB2 views-based replication, event-driven and continuous replication. Replication can help decrease batch workloads by replicating data in the background throughout the day.
Businesses today benefit from replicating data across a range of applications and business sites, to improve business cycle times and customer responsiveness. Frequently, these applications must share data with legacy applications on the host. Data replication can automatically deliver the shared data to the target platform, improving data availability and data access performance, and accommodating data restructuring and minimizing network load. This can improve employees"" decision making capabilities. The decision-support databases assist in day-to-day decision-making activities, from determining what items to stock in various stores, to identifying customer sets for new products.
Data replication can improve application deployment and the existing application backlog can be reduced, since the majority of these applications are based on the relational model and use standard SQL. Data replication enables use of these applications by mapping, reformatting, and delivering data from legacy environments to relational databases elsewhere. In order to increase online throughput, replication supports off-load query processing to make room for increasing transaction processing requirements. Off-loading query processing reduces contention that impacts online transaction processing (OLTP) response time. Businesses are migrating applications from legacy systems or replicating data between multivendor environments. This reduces application development time and reduces application maintenance costs.
DataPropagator Relational, 5.1 provides read-only, update-anywhere, and on-demand replication between relational source and target processors. It consists of the following autonomous components and programs. Administration and replication control is performed by Control Center, Capture utility is used to capture changes made to data on replication sources, and Apply utility reads previously captured changed data and applies it to target tables.
The Control Center is used to define tables as sources, called replication sources, define target table definitions, called replication subscriptions, clone replication subscriptions to other servers and remove replication sources or subscriptions no longer needed. Whenever a replication request from the Control Center is submitted, such as a command defining a replication source, the processing information is generated as SQL statements. The statements may be run immediately, or saved in a file, which can be edited and the statements can be run at a later time from a replication folder object in the Control Center. Deferred SQL files let a user customize the replication tasks for his application needs.
The Capture component captures changes made to data in tables defined as replication sources by reading the database transaction log or journal, without making any changes to the sources, and is performed asynchronously to business applications using the same replication sources. The captured changes are placed in staging tables, with transaction detail stored separately in a Unit of Work (UOW) table.
The Apply component reads the changed data, previously captured and stored in a staging table, and applies it to the target tables. Apply components can also read data directly from source tables, for example, for a full refresh. Supporting update and refresh copying provides greater flexibility and automation in a replication environment. The Apply component also massages the data to user specifications, as it copies data to the targets. SQL statements can be prepared to create new columns, summarize data, translate data, join data and do other data transactions.
The Apply component allows the user to create read-only copies, user copy tables, which represent source data at a particular point in time, point-in-time tables, which represent source data at a particular point in time and some overhead columns, history tables, staging tables (which can be used as a source for further copies without recapturing changes, thus supporting a consistent data environment and providing flexibility in data distribution across the network), updatable copies, and replica tables. Updates to a replica table are automatically applied to the original source table of the replica, provided no conflicts are detected.
The Apply component, running at the replica site, detects update conflicts after they occur during the subscription cycle. An Apply instance can process many subscriptions at a single site. Multiple Apply instances can run concurrently at different sites/platforms, each processing different numbers and types of subscriptions from the same source. Each subscription can have different definitions, refresh criteria, and timing.
The performance of aggregate or grouping functions is a key factor in any database system performance, since in many applications more than 70% of the queries perform these functions. Thus, improvements in the performance of aggregation or grouping instructions lead directly to significant improvements in application workloads.
One problem associated with data warehouses involves aggregating the extremely large amounts of data stored within application databases with the frequent input of large amounts of data. By aggregating the data stored within the data warehouse with newly inputted data, the raw data is translated into the most current meaningful information that can be relied upon by decision makers, which should have the most up-to-date and current information. However, the aggregation takes a long time to perform and overwhelms the application processor, because in the conventional systems all the pertinent data stored within the data warehouse is re-aggregated along with the newly inputted data.
While there have been various techniques developed for optimizing the performance of SQL statements, there is a need in the art for optimization techniques involving aggregate expressions and grouping clauses. Therefore, it would be advantageous to provide a method for the computer that accesses data warehouse databases to be able to easily aggregate sets of data stored within them. The present invention provides this advantage, by aggregating only the newly inputted and changed data into the aggregate table. The aggregate table thus has the most current information without having to continuously access and re-aggregate all the pertinent data, stored within the data warehouse, along with the newly inputted data.
The foregoing and other objects, features, and advantages of the present invention will be apparent from the following detailed description of the preferred embodiments, which makes reference to several drawing figures.
One preferred embodiment of the present invention includes a method of optimizing a query in a computer system, in which data is retrieved from a dynamically increasing relational database stored in an electronic storage device coupled to the computer system. The method includes the steps of: analyzing the query in the computer system to determine whether the query includes an aggregate operation; and performing aggregate operations in the computer system, by incrementally aggregating only the newly inputted relational database data into a base aggregates table, thereby updating the base aggregates table with the most current information without having to re-aggregate the relational database.
The aggregation operation may have one or more functions selected from AVERAGE (AVG), COUNT and SUM functions. The query may have at least one arithmetically adjustable SQL column function, such as AVERAGE (AVG), COUNT and SUM, and other columns which are referenced in a GROUP BY clause.
The step of incrementally performing aggregate operation includes initializing the base aggregates table by executing the query against the relational database. Then, the newly inputted relational database data may be aggregated into an intermediate table, and the base aggregates table may be re-aggregated using the intermediate table data.
The present invention may be performed in a network-based computer system, having at least a source site and a target site. In that case the step of incrementally performing aggregate operations is performed at the target site. Moreover, the source site has the relational database, and the target site has the base aggregates table, and may also have the intermediate table, if used.
The base aggregates table may include a row birth timestamp, set when a row is inserted into the base aggregate table, and an end timestamp, either having a commit time of the last transaction modifying the base aggregate table or a commit time of the last transaction modifying the base aggregate table row.
Another preferred embodiment of the present invention is a computer system for optimizing a query, utilizing the above-described method embodiment of the present invention.
Yet another preferred embodiment of the present invention is a program storage device readable by a computer tangibly embodying a program of instructions executable by the computer to perform the above-mentioned method embodiment of the present invention.