I. Technical Field
The present invention generally relates to the field of data processing. More particularly, embodiments of the invention relate to systems and methods for monitoring database replication.
II. Background Information
The design of a relational database is typically based on a data model. A data model is a conceptual representation of the data structures that are required by a database. The data structures include the database tables, the relationships between data tables, and the rules that govern operations on the database tables. There are two major methodologies used to create a data model: the entity-relationship approach and the object model.
A relationship is an association between two or more database tables. Relationships are expressed by the data values of the primary and foreign keys of a database table. A primary key is a column or columns in a database table whose values uniquely identify each row in a table. A foreign key is a column or columns whose values are the same as the primary key of another table. The relationship is made between two relational database tables by matching the values of the foreign key of one database table with the values of the primary key in another. Keys are fundamental to the concept of relational databases because they enable tables in the database to be related with each other.
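The key matching described above can be illustrated with a minimal sketch using Python's standard sqlite3 module. The table and column names (office, employee, office_id) are hypothetical and chosen only for illustration; they are not part of any particular system.

```python
import sqlite3

# In-memory database; all table and column names here are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# office_id is the primary key: its values uniquely identify each row of 'office'.
conn.execute(
    "CREATE TABLE office (office_id INTEGER PRIMARY KEY, building TEXT NOT NULL)"
)
# employee.office_id is a foreign key: its values must match a primary key in 'office'.
conn.execute(
    "CREATE TABLE employee ("
    " employee_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " office_id INTEGER REFERENCES office(office_id))"
)

conn.execute("INSERT INTO office VALUES (1, 'B12')")
conn.execute("INSERT INTO employee VALUES (100, 'A. Example', 1)")

# The relationship is realized by matching foreign key values to primary key values.
rows = conn.execute(
    "SELECT e.name, o.building"
    " FROM employee e JOIN office o ON e.office_id = o.office_id"
).fetchall()


def insert_orphan():
    """Try to insert an employee whose foreign key matches no office row."""
    try:
        conn.execute("INSERT INTO employee VALUES (101, 'B. Example', 99)")
        return True
    except sqlite3.IntegrityError:
        return False
```

In this sketch the join returns the employee together with the building of the matching office, while the attempt to reference a non-existent office is rejected by the database.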
Traditionally, database tables are organized in the following way. For each entity, a set of tables stores information about the entity. These entities are to a large extent independent and typically there are maintenance transactions for each. For example, there can be entities like ‘office’ and ‘employee’. ‘Employee’ would include several tables containing information such as, for example, address, employee ID, salary, job description, and manager. The entity ‘office’ would also comprise several tables containing information such as, for example, address, number of seats, building number, etc. ‘Employee’ could also contain a list of offices, but ‘office’ and ‘employee’ would normally be maintained separately.
The entity model is the basis for a large variety of tools and processes, such as view cluster maintenance, central master data management, and XML data interchange. With the advent of object oriented programming, database table design centers more on objects than on self-contained entities.
In the above example, the address that appears both in the ‘employee’ and the ‘office’ entity is modeled as a separate object. However, this does not constitute a traditional entity as it is not self-contained. The address is modeled in the database as a separate object used both by ‘employee’ and ‘office’. However, from the point of view of an application program and the user interface, the address data needs to be provided as if it were an integrated part of the ‘employee’ or ‘office’ entities.
Due to this discrepancy between the logical view of the entities and the actual incorporation of the objects, the database tables designed to follow the new object oriented model cannot be used together with a large variety of tools developed for the traditional entity based model. For example, it is not possible to compile a set of tables for XML distribution of ‘employees’ as the tables used to store the addresses belong only partly to the ‘employees’ and, furthermore, do not have the table layout expected for tables belonging to ‘employees’. There is therefore a need to bridge the gap between the traditional entity based data processing approach and the object oriented database layout that is adapted to the object oriented programming model.
In addition, it is often desirable to store copies of relational database tables at multiple sites in a distributed data processing system. Data replication is the process of maintaining multiple copies of a database table in a distributed data processing system. Performance improvements can be achieved when data replication is employed, since multiple access locations exist for the access and modification of the replicated data. For example, if multiple copies of a data object are maintained, an application can access the logically “closest” copy of the data object to improve access times and minimize network traffic. Furthermore, data replication provides greater fault tolerance in the event of a server failure, since the multiple copies of the data object effectively become online backup copies if a failure occurs.
In general, there are two types of propagation methodologies for data replication, which are referred to as “synchronous” and “asynchronous” replication. Synchronous replication is the propagation of changes to all replicas of a data object within the same transaction as the original change to a copy of that data object. For example, if a change is made to a table at a first replication site by a transaction A, that change must be replicated to the corresponding tables at all other replication sites before the completion and commitment of transaction A. Thus, synchronous replication can be considered real-time data replication. In contrast, asynchronous replication can be considered “store-and-forward” data replication, in which changes made to a copy of a data object can be propagated to other replicas of that data object at a later time. The change to the replicas of the modified data object does not have to be performed within the same transaction as the original calling transaction.
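The difference between the two propagation methodologies can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: the class names (Replica, SynchronousPrimary, AsynchronousPrimary) are hypothetical, replicas are plain in-memory dictionaries, and failure handling and rollback are deliberately omitted.

```python
from collections import deque


class Replica:
    """A hypothetical replication site holding a copy of the data."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def apply(self, key, value):
        self.rows[key] = value


class SynchronousPrimary:
    """Propagates each change to all replicas before the write completes."""

    def __init__(self, replicas):
        self.rows = {}
        self.replicas = replicas

    def write(self, key, value):
        # The write is only considered complete once every replica has applied it.
        for replica in self.replicas:
            replica.apply(key, value)
        self.rows[key] = value


class AsynchronousPrimary:
    """Store-and-forward: changes are queued and propagated at a later time."""

    def __init__(self, replicas):
        self.rows = {}
        self.replicas = replicas
        self.pending = deque()

    def write(self, key, value):
        # The local change completes immediately; propagation is deferred.
        self.rows[key] = value
        self.pending.append((key, value))

    def forward(self):
        # Drain the queue, applying each stored change to every replica in order.
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.apply(key, value)


sync_replica = Replica("site-2")
sync_primary = SynchronousPrimary([sync_replica])
sync_primary.write("row-1", "new value")

async_replica = Replica("site-2")
async_primary = AsynchronousPrimary([async_replica])
async_primary.write("row-1", "new value")
stale_before_forward = "row-1" not in async_replica.rows
async_primary.forward()
```

In the synchronous case the replica holds the new value as soon as write returns. In the asynchronous case the replica remains stale until forward is called, which is the window in which replicas can diverge.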
Synchronous replication typically results in more overhead than asynchronous replication. For example, more time is required to perform synchronous replication since a transaction cannot complete until all replication sites have finished performing the requested changes to the replicated data object. Moreover, a replication system that uses real-time propagation of replication data is highly dependent upon system and network availability, and mechanisms must be in place to ensure this availability. Thus, asynchronous replication is generally favored for non-critical data replication activities. Synchronous replication is normally employed only when an application requires that replicated sites remain continuously synchronized.
One approach to data replication involves the exact duplication of database schemas and data objects across all participating nodes in the replication environment. If this approach is used in a relational database system, each participating site in the replication environment has the same schema organization for the replicated database tables and database objects that it maintains. If a change is made to one replica of a database table, that same change is propagated to all corresponding database tables to maintain the consistency of the replicated data. Since the same schema organization is used for the replicated data across all replication sites, the instructions used to implement the changes at all sites can be identical.
Generally, two types of change instructions have been employed in data replication systems. One approach involves the propagation of changed data values to each replication site. Under this approach, the new values for particular data objects are propagated to the remote replication sites. The corresponding data objects at the remote sites are thereafter replaced with the new values. A second approach is to use procedural replication. Under this approach, a database query language statement, such as a database statement in the Structured Query Language (“SQL”), is propagated instead of actual data values. The database statement is executed at the remote sites to replicate the changes to the data at the remote replication sites. Since all replication sites typically have the same schema organization and data objects, the same database statement can be used at both the original and remote sites to replicate any changes to the data.
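The two types of change instructions can be contrasted in a minimal sketch using Python's standard sqlite3 module. The account table, its columns, and the helper names (make_site, balances) are assumptions made for illustration only. Each in-memory database stands in for one replication site with an identical schema.

```python
import sqlite3


def make_site():
    """Create a hypothetical replication site with the shared schema and data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 200)])
    conn.commit()
    return conn


def balances(conn):
    return conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()


origin = make_site()
remote_values = make_site()      # receives changed data values
remote_procedural = make_site()  # receives the SQL statement itself

# The original change, made at the originating site.
statement = "UPDATE account SET balance = balance + 10 WHERE balance >= ?"
params = (150,)
before = dict(balances(origin))
origin.execute(statement, params)
origin.commit()

# Approach 1: propagate the new values of the rows that changed.
changed_rows = [
    (balance, row_id)
    for row_id, balance in balances(origin)
    if before[row_id] != balance
]
remote_values.executemany(
    "UPDATE account SET balance = ? WHERE id = ?", changed_rows
)
remote_values.commit()

# Approach 2: procedural replication, re-executing the same statement remotely.
remote_procedural.execute(statement, params)
remote_procedural.commit()
```

Both remote sites end up with the same contents as the originating site, but only because they started from identical data. If a remote site had already diverged, re-executing the statement could produce a different result than copying the values, which is one way inconsistencies can arise.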
U.S. Pat. No. 6,615,223 shows a method for data replication that includes procedures for adding, deleting and modifying replicated data, and for replicating conflict resolution.
U.S. Pat. No. 6,058,401 shows a method for data replication with conflict detection. The method aims to reduce overhead in data replication in a distributed system capable of detecting conflicts in replicated data.
U.S. Pat. No. 5,806,074 shows a method for configurable conflict resolution in a computer implemented distributed database. The method uses a conflict detection module for detecting a conflicting modification for corresponding portions of replicated data structures.
It is a common disadvantage of known replication solutions that they are not error free. Another problem regarding data consistency in a distributed data processing system is that conflicting changes to the data can be made at different sites. There is therefore a need for methods and systems for monitoring database replication for detecting such data inconsistencies.