1. Field of the Invention
The invention relates to the replication of data in a database system.
2. Background
Data replication is the process of maintaining multiple copies of a database object in a distributed database system. Performance improvements can be achieved when data replication is employed, since multiple access locations exist for the access and modification of the replicated data. For example, if multiple copies of a data object are maintained, an application can access the logically “closest” copy of the data object to improve access times and minimize network traffic. In addition, data replication provides greater fault tolerance in the event of a server failure, since the multiple copies of the data object effectively become online backup copies if a failure occurs.
In general, there are two types of propagation methodologies for data replication, referred to as “synchronous” and “asynchronous” replication. Synchronous replication is the propagation of changes to all replicas of a data object within the same transaction as the original change to a copy of that data object. For example, if a change is made to a table at a first replication site by a Transaction A, that change must be replicated to the corresponding tables at all other replication sites before the completion and commitment of Transaction A. Thus, synchronous replication can be considered real-time data replication. In contrast, asynchronous replication can be considered “store-and-forward” data replication, in which changes made to a copy of a data object can be propagated to other replicas of that data object at a later time. The change to the replicas of the modified data object does not have to be performed within the same transaction as the original calling transaction.
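The difference between the two propagation methodologies can be illustrated with a minimal in-memory sketch. The classes `Replica`, `SynchronousGroup`, and `AsynchronousGroup` are hypothetical names introduced only for illustration; the sketch assumes no failures and omits the commit protocol that a real synchronous system would need.

```python
from collections import deque


class Replica:
    """A toy replica holding one copy of each data object in a dict."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class SynchronousGroup:
    """Every replica is updated before the originating write returns."""

    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        # The write is not complete until all replicas have applied it.
        for replica in self.replicas:
            replica.apply(key, value)


class AsynchronousGroup:
    """The local replica is updated at once; remote replicas are updated later."""

    def __init__(self, local, remotes):
        self.local = local
        self.remotes = remotes
        self.pending = deque()  # stored changes awaiting forwarding

    def write(self, key, value):
        # Apply locally and store the change; do not wait for the remotes.
        self.local.apply(key, value)
        self.pending.append((key, value))

    def propagate(self):
        # Forward the stored changes to the remotes in their original order.
        while self.pending:
            key, value = self.pending.popleft()
            for remote in self.remotes:
                remote.apply(key, value)


a, b = Replica("a"), Replica("b")
SynchronousGroup([a, b]).write("phone", "555-0100")
sync_consistent = a.data == b.data  # replicas agree as soon as write() returns

c, d = Replica("c"), Replica("d")
group = AsynchronousGroup(c, [d])
group.write("phone", "555-0100")
lag_visible = "phone" not in d.data  # the remote replica has not seen the change yet
group.propagate()
async_consistent = c.data == d.data  # replicas agree only after forwarding
```

In the asynchronous case there is a window, between `write` and `propagate`, in which the replicas hold different values. That window is the cost of not blocking the originating transaction.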
Synchronous replication typically results in more overhead than asynchronous replication. More time is required to perform synchronous replication since a transaction cannot complete until all replication sites have finished performing the requested changes to the replicated data object. Moreover, a replication system that uses real-time propagation of replication data is highly dependent upon system and network availability, and mechanisms must be in place to ensure this availability. Thus, asynchronous replication is generally favored for noncritical data replication activities. Synchronous replication is normally employed only when an application requires that replicated sites remain continuously synchronized.
One approach to data replication involves the exact duplication of database schemas and data objects across all participating nodes in the replication environment. If this approach is used in a relational database system, each participating site in the replication environment has the same schema organization for the replicated database tables and database objects that it maintains. If a change is made to one replica of a database table, that same change is propagated to all corresponding database tables to maintain the consistency of the replicated data. Since the same schema organization is used for the replicated data across all replication sites, the instructions used to implement the changes at all sites can be identical.
Generally, two types of change instructions have been employed in data replication systems. One approach involves the propagation of changed data values to each replication site. Under this approach, the new values for particular data objects are propagated to the remote replication sites. The corresponding data objects at the remote sites are thereafter replaced with the new values. A second approach is to use procedural replication. Under this approach, a database query language statement, e.g., a database statement in the Structured Query Language (“SQL”), is propagated instead of actual data values. The database statement is executed at the remote sites to replicate the changes to the data at the remote replication sites. Since all replication sites typically have the same schema organization and data objects, the same database statement can be used at both the original and remote sites to replicate any changes to the data.
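The two types of change instructions can be sketched using Python's standard `sqlite3` module, with in-memory databases standing in for replication sites. The table `emp`, its columns, and the helper functions are hypothetical and chosen only for illustration. All three sites are assumed to share the same schema.

```python
import sqlite3


def make_site():
    """Create one replication site with the shared (assumed) schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, salary INTEGER)")
    conn.executemany("INSERT INTO emp VALUES (?, ?)", [(1, 100), (2, 200)])
    return conn


def rows(conn):
    return conn.execute("SELECT id, salary FROM emp ORDER BY id").fetchall()


origin, remote_values, remote_proc = make_site(), make_site(), make_site()

# The original change, made at the originating site.
statement = "UPDATE emp SET salary = salary + 10"
before = dict(rows(origin))
origin.execute(statement)

# Approach 1: propagate the changed data values themselves.
changed = [(emp_id, salary) for emp_id, salary in rows(origin)
           if before[emp_id] != salary]
for emp_id, salary in changed:
    remote_values.execute(
        "UPDATE emp SET salary = ? WHERE id = ?", (salary, emp_id)
    )

# Approach 2: procedural replication, re-executing the same statement remotely.
remote_proc.execute(statement)
```

Both remote sites end with the same rows as the originating site. Value propagation ships one new value per changed row, whereas procedural replication ships a single statement regardless of how many rows it touches.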
A significant drawback to these replication approaches is that they cannot be employed in a heterogeneous environment in which the remote replication sites have different, and possibly unknown, schema organizations for the replicated data. For example, consider if information located in a single database table at a first replication site is stored within two separate tables at a second replication site. The approach of only propagating changed values for a data object to a remote replication site presents great difficulties, since the data object to be changed at the first replication site may not exist in the same form at the second replication site (e.g., because the data object exists as two separate data items at the second replication site). Using procedural replication results in similar problems. Since each replication site may have a different schema organization for its data, a different database statement may have to be specifically written to make the required changes at the remote sites. Moreover, if the schema organization of the remote site is unknown, it is impossible to properly formulate a database statement to replicate the intended changes at the remote site.
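The difficulty with heterogeneous schemas can be made concrete with a small sketch using Python's standard `sqlite3` module. The table names `person`, `person_name`, and `person_phone` are hypothetical. The first site is assumed to keep the information in one table and the second site to split it across two tables, as in the example above.

```python
import sqlite3

# First site: all information in a single table.
first = sqlite3.connect(":memory:")
first.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")
first.execute("INSERT INTO person VALUES (1, 'Ann', '555-0100')")

# Second site: the same information split across two tables.
second = sqlite3.connect(":memory:")
second.execute("CREATE TABLE person_name (id INTEGER PRIMARY KEY, name TEXT)")
second.execute("CREATE TABLE person_phone (id INTEGER PRIMARY KEY, phone TEXT)")
second.execute("INSERT INTO person_name VALUES (1, 'Ann')")
second.execute("INSERT INTO person_phone VALUES (1, '555-0100')")

# The statement that works at the first site.
statement = "UPDATE person SET phone = '555-0199' WHERE id = 1"
first.execute(statement)

# Re-executing it at the second site fails: there is no table named person.
try:
    second.execute(statement)
    replicated = True
except sqlite3.OperationalError:
    replicated = False

# A statement written specifically for the second site's schema is required.
second.execute("UPDATE person_phone SET phone = '555-0199' WHERE id = 1")
remote_phone = second.execute(
    "SELECT phone FROM person_phone WHERE id = 1"
).fetchone()[0]
```

The site-specific statement can be written here only because the second site's schema is known. If that schema were unknown, no such statement could be formulated.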
Another drawback to these approaches, in which database schemas and objects are exactly duplicated across the replication environment, is that they require greater use of synchronous replication. If a schema change is made to a database table at one site, then that change must be synchronously propagated to all other sites. This is because the basic structure of the table itself is being changed. Any further changes to that database table without first synchronously changing the underlying schema for that table could result in conflicts in the data. Moreover, synchronous replication of the schema changes could require that the replication environment be quiesced during the schema change, affecting the availability of the system.
One type of database application for which data replication is particularly useful is the replication of data for directory information systems. Directory information systems provide a framework for the storage and retrieval of information that is used to identify and locate the details of individuals and organizations, such as telephone numbers, postal addresses, and email addresses.
One common directory system is a directory based on the Lightweight Directory Access Protocol (“LDAP”). LDAP is an object-oriented directory protocol that was developed at the University of Michigan, originally as a front end to access directory systems organized under the X.500 standard for open electronic directories (which was originally promulgated by the Comité Consultatif International Téléphonique et Télégraphique, “CCITT,” in 1988). Standalone LDAP server implementations are now commonly available to store and maintain directory information. Further details of the LDAP directory protocol can be found at the LDAP-devoted website maintained by the University of Michigan at http://www.umich.edu/~dirsvcs/ldap/doc/, including the following documents (which are hereby incorporated by reference in their entirety): RFC-1777 Lightweight Directory Access Protocol; RFC-1558 A String Representation of LDAP Search Filters; RFC-1778 The String Representation of Standard Attribute Syntaxes; RFC-1779 A String Representation of Distinguished Names; RFC-1798 Connectionless LDAP; RFC-1823 The LDAP Application Program Interface; and RFC-1959 An LDAP URL Format.
LDAP directory systems are normally organized in a hierarchical structure having entries organized in the form of a tree, which is referred to as a directory information tree (“DIT”). The DIT is often organized to reflect political, geographic, or organizational boundaries. A unique name or ID (which is commonly called a “distinguished name”) identifies each LDAP entry in the DIT. An LDAP entry is a collection of one or more entry attributes. Each entry attribute has a “type” and one or more “values.” Each entry belongs to a particular object class. Entries that are members of the same object class share a common composition of possible entry attribute types.
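The entry model described above can be sketched as a simple data structure. This is an illustrative model only, not an LDAP client or server implementation. The names `ObjectClass`, `Entry`, and `parent_dn`, and the sample distinguished name, are hypothetical. The `parent_dn` helper assumes that component values contain no escaped commas.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectClass:
    """An object class fixes the attribute types its member entries may hold."""
    name: str
    allowed_types: frozenset


@dataclass
class Entry:
    """An entry: a distinguished name plus typed, multi-valued attributes."""
    dn: str
    object_class: ObjectClass
    attributes: dict = field(default_factory=dict)  # type -> list of values

    def add_value(self, attr_type, value):
        # Reject attribute types that the entry's object class does not permit.
        if attr_type not in self.object_class.allowed_types:
            raise ValueError(f"{attr_type} not allowed in {self.object_class.name}")
        self.attributes.setdefault(attr_type, []).append(value)


def parent_dn(dn):
    """Return the DN of the parent entry in the tree, or "" at the top.

    Simplification: splits on the first comma and ignores escaped commas.
    """
    _, _, parent = dn.partition(",")
    return parent


person = ObjectClass("person", frozenset({"cn", "telephoneNumber", "mail"}))
entry = Entry("cn=Ann Lee,ou=Sales,o=Example", person)
entry.add_value("cn", "Ann Lee")
entry.add_value("telephoneNumber", "555-0100")
entry.add_value("telephoneNumber", "555-0101")  # one type, several values

try:
    entry.add_value("shoeSize", "9")
    rejected = False
except ValueError:
    rejected = True
```

Walking `parent_dn` repeatedly from an entry's distinguished name traces the path from that entry toward the top of the DIT.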
There are significant drawbacks to existing systems for performing replication of LDAP entries, objects, and attributes. Many conventional replication systems used for LDAP replication do not have robust procedures for adding or deleting replication nodes. For example, the addition or deletion of replication nodes in a conventional LDAP system often results in system downtime to implement configuration changes. Moreover, many existing systems for LDAP replication do not have robust procedures for adding, deleting, or modifying replicated data or handling replication conflicts.
Therefore, there is a need for an improved method and system for replicating data in a database system. There is a further need for a robust and efficient replication system for performing LDAP replication.
The present invention is directed to methods and mechanisms for data replication. According to an aspect of the invention, an efficient and effective replication system is disclosed using LDAP replication components. Another aspect of the invention pertains to a schema- and format-independent method and mechanism for data replication. Yet another aspect of the invention relates to procedures for adding, deleting, and modifying replicated data and for replication conflict resolution. Another aspect of the invention relates to improved methods and mechanisms for adding and removing nodes from a replication system.