1. Field of the Invention
The present invention relates generally to data processing environments and, more particularly, to a system providing methodology for high volume, high speed adaptive data replication.
2. Background Art
Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical database is an organized collection of related information stored as “records” having “fields” of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system or DBMS is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about the underlying hardware-level details. Typically, all requests from users for access to the data are processed by the DBMS. For example, information may be added to or removed from data files, information may be retrieved from or updated in such files, and so forth, all without user knowledge of the underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level. The general construction and operation of database management systems is well known in the art. See, e.g., Date, C., “An Introduction to Database Systems, Seventh Edition”, Addison Wesley, 2000.
Increasingly, businesses base their operations on mission-critical systems which store information on server-based database systems, such as Sybase® Adaptive Server® Enterprise (ASE) (available from Sybase, Inc. of Dublin, Calif.). As a result, the operations of the business are dependent upon the availability of data stored in their databases. Because of the mission-critical nature of these systems, users of these systems need to protect themselves against loss of the data due to software or hardware problems, disasters such as floods, earthquakes, or electrical power loss, or temporary unavailability of systems resulting from the need to perform system maintenance.
One well-known approach that is used to guard against loss of critical business data maintained in a given database (the “primary database”) is to maintain one or more standby or replicate databases. A replicate database is a duplicate or mirror copy of the primary database (or a subset of the primary database) that is maintained either locally at the same site as the primary database, or remotely at a different location than the primary database. The availability of a replicate copy of the primary database enables a user (e.g., a corporation or other business) to reconstruct a copy of the database in the event of the loss, destruction, or unavailability of the primary database.
Replicate database(s) are also used to facilitate access and use of data maintained in the primary database (e.g., for decision support and other such purposes). For instance, a primary database may support a sales application and contain information regarding a company's sales transactions with its customers. The company may replicate data from the primary database to one or more replicate databases to enable users to analyze and use this data for other purposes (e.g., decision support purposes) without interfering with or increasing the workload on the primary database. The data that is replicated (or copied) to a replicate database may include all of the data of the primary database such that the replicate database is a mirror image of the primary database. Alternatively, only a subset of the data may be replicated to a given replicate database (e.g., because only a subset of the data is of interest in a particular application).
In recent years, the use of replication technologies has been increasing as users have discovered new ways of using copies of all sorts of data. Various types of systems, ranging from electronic mail systems and document management systems to data warehouse and decision support systems, rely on replication technologies for providing broader access to data. Over the years, database replication technologies have also become available in vendor products ranging from simple desktop replication (e.g., between two personal computers) to high-capacity, multi-site backup systems.
Database replication technologies comprise a mechanism or tool for replicating (duplicating) data from a primary source or “publisher” (e.g., a primary database) to one or more “subscribers” (e.g., replicate databases). The data may also be transformed during this process of replication (e.g., into a format consistent with that of a replicate database).
In many cases, a primary database may publish (i.e., replicate) items of data to a number of different subscribers. Also, in many cases, each of these subscribers is only interested in receiving a subset of the data maintained by the primary database. In this type of environment, each of the subscribers specifies particular types or items of data (“subscribed items”) that the subscriber wants to receive from the primary database.
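By way of illustration only, the publisher and subscriber relationship described above can be sketched as follows. This is a minimal sketch and not the interface of any particular replication product; the names Publisher, subscribe, and publish, the subscriber names, and the "region" field are all hypothetical and are introduced here solely for the example.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

class Publisher:
    """Hypothetical primary source that routes rows to interested subscribers."""

    def __init__(self) -> None:
        # Maps a subscriber name to a predicate describing its subscribed items.
        self._subscriptions: Dict[str, Callable[[Row], bool]] = {}

    def subscribe(self, name: str, predicate: Callable[[Row], bool]) -> None:
        """Register a subscriber together with the subset of data it wants."""
        self._subscriptions[name] = predicate

    def publish(self, row: Row) -> List[str]:
        """Return the subscribers whose subscription matches this row."""
        return [name for name, wants in self._subscriptions.items() if wants(row)]


pub = Publisher()
# One subscriber wants only rows for the west region; another wants everything.
pub.subscribe("west_reports", lambda r: r.get("region") == "west")
pub.subscribe("all_sales", lambda r: True)

west_targets = pub.publish({"region": "west", "amount": 10})
east_targets = pub.publish({"region": "east", "amount": 25})
```

In this sketch, a row for the west region is routed to both subscribers, while a row for the east region is routed only to the subscriber that requested all data.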
In performing replication, a known approach provides continuous replication, where data flows to the replicate database continuously with every log row change replicated. Transaction consistency is maintained. An alternate approach known as snapshot or ETL (extract-transform-load) replication replicates net row changes, where data flows to the replicate database in bursts (e.g., hourly, daily, etc.), and transaction consistency may not be maintained properly.
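The difference between replicating every log row change and replicating net row changes can be illustrated with a short sketch. The following is one possible way to collapse an ordered list of row changes into a single net change per row; the log format, the operation names, and the function net_row_changes are assumptions made for this example and do not describe any specific product.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical log entry: (operation, primary key, column values or None).
Change = Tuple[str, int, Optional[dict]]
NetChange = Tuple[str, Optional[dict]]

def net_row_changes(log: List[Change]) -> Dict[int, NetChange]:
    """Collapse an ordered change log into at most one net change per key."""
    net: Dict[int, NetChange] = {}
    for op, key, row in log:
        prev = net.get(key)
        if op == "insert":
            # A delete followed by an insert nets to an update of the row.
            if prev and prev[0] == "delete":
                net[key] = ("update", row)
            else:
                net[key] = ("insert", row)
        elif op == "update":
            if prev and prev[0] in ("insert", "update"):
                # Merge new column values into the pending insert or update.
                net[key] = (prev[0], {**prev[1], **row})
            else:
                net[key] = ("update", row)
        elif op == "delete":
            if prev and prev[0] == "insert":
                # Inserted and deleted within the same batch: no net change.
                del net[key]
            else:
                net[key] = ("delete", None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return net


log = [
    ("insert", 1, {"name": "Ann", "salary": 100}),
    ("update", 1, {"salary": 120}),
    ("insert", 2, {"name": "Bob", "salary": 80}),
    ("delete", 2, None),
    ("update", 3, {"salary": 90}),
    ("delete", 3, None),
]
net = net_row_changes(log)
```

Under these assumptions, continuous replication would apply all six logged changes to the replicate database, whereas net-change replication would apply only two: one insert for row 1 carrying its final values and one delete for row 3. The sketch also shows why the intermediate states of a row are not visible at the replicate when only net changes are applied.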
While each of these approaches may be satisfactory for certain environments, neither is satisfactory for OLTP (online transaction processing) archiving or reporting systems. Replicating every log row change continuously tends to be too slow and bogs down the reporting system operations. The snapshot approach does not provide transaction consistency or up-to-date data.
One approach to dealing with the replication needs of an OLTP reporting system is provided in the “IQ real time loading” solution for the Sybase IQ product from Sybase, Inc. of Dublin, Calif. As its name suggests, the approach provides more real-time loading of replication data for an OLTP reporting system. However, there are usability and manageability issues with the approach resulting from the need to use an external staging database, data modeling toolsets, and external refresh processes, as well as from the need for system administrators to modify multiple tables and function strings.
Besides usability and manageability issues, there are also performance issues. For example, the throughput for the replication is limited by what can be replicated to the staging database. Also, during a data refresh process, the data server interface (DSI) operation has to be suspended, data has to be populated from the staging database to the reporting system, and the staging database has to be cleaned up before DSI can be resumed. This hinders performance, and while there are alternatives to potentially parallelize the DSI and refresh process, they add more complexity to this already complicated solution.
Accordingly, a need exists for an approach to data replication for OLTP reporting systems that achieves better performance and provides transactional consistency with continuous data flow, without requiring an external staging database or the suspension of replication. The present invention addresses such a need.