The present invention relates to distributed database systems and more particularly to replication of data at distributed sites.
Modern data processing systems have evolved from a single, centralized mainframe to a network of independent processing systems. The efficiency of such distributed systems depends not only on the processing power of each computer site but on the ability of the system to efficiently access the information required by a user. Generally, a site can access data that resides on local storage much faster than it can access data at another site over a network connection. To take advantage of the efficiency of local access operations, some systems allow the same set of data to be maintained on multiple nodes. The mechanism for maintaining multiple copies of the same body of data at multiple sites is generally referred to as xe2x80x9cdata replication.xe2x80x9d In a distributed database system using data replication, multiple replicas of data exist in more than one database in the distributed database system.
One kind of data replication employs snapshots. A snapshot is a body of data constructed of data from one or more xe2x80x9cmasterxe2x80x9d tables, views, or even other snapshots, any of which can be stored locally or remotely relative to the snapshot. The data contained within the snapshot is defined by a query that references one or more master tables (and/or other database objects) and reflects the state of its master tables at a particular point in time. To bring the snapshot up-to-date with respect to the master tables, the snapshot is refreshed upon request, e.g. at a user""s command or automatically on a periodic, scheduled basis.
There are two basic approaches for refreshing a snapshot. xe2x80x9cComplete refreshingxe2x80x9d involves reissuing the defining query for the snapshot and replacing the previous snapshot with the results of the reissued query. xe2x80x9cIncremental refreshxe2x80x9d or xe2x80x9cfast refreshxe2x80x9d refers to identifying the changes that have happened to the master tables since the previous refresh (typically, by examining a log file of the changes) and transferring only the data for the rows in the snapshot that have been affected by the master table changes. An xe2x80x9cupdatable snapshotxe2x80x9d is a snapshot to which updates may be directly made at the snapshot site. These updates are propagated from the snapshot back to the master table before refreshing.
Traditionally, snapshots have been implemented for high-end computer systems, which are characterized by the use of high performance computers that are interconnected to one another by highly reliable and high bandwidth network links. Typically, highly experienced database administrators manage these high-end systems. Due to the expense of these high-end computers, high-end distributed systems tend to involve a small number of networked sites, whose users can be trusted at least in part because of the physical security of the computers.
FIG. 12 depicts an exemplary high-end distributed database system for a company""s sales department consisting of three sites, master site 1200, client site 1220, and client site 1240. Master site 1200, which may be located, for example, at the company""s headquarters, includes a full relational database server 1202 that is responsible for storing and retrieving data from a relational database 1204. In this example, relational database 1204 contains a customers master table 1212 and an orders master table 1214. The customers master table 1212 is illustrative of the data stored in rows for each customer of the company and includes columns for the customer number CUSTNO and the sales representative REP to whom the customer is assigned. For example, customer 13 is assigned to sales representative Smith, and customer 18 is assigned to sales representative Jones. As illustrated, orders master table 1214 holds the data stored in rows for each order that a customer has placed and includes a column ORDER that indicates the number of the order and a CUSTNO column that is correlated to a customer in the customer masters table 1212. For example, order 25 was placed by customer 13, and orders 40 and 41 were placed by customer 18.
In this high-end distributed database system, the client site 1220 is located at one sales office and client site 1240 is located at another sales office, for example in another city. Accordingly, it is desirable to have a copy of the customer and order information at the local site for the sales representatives who are located at the corresponding sales office. For example, if sales representative Smith is located at the sales office for client site 1220 or if sales representative Jones is located at the sales office for client site 1240, then it would be desirable to store the customer and order information for Smith (and other sales representatives at the same sales office) at client site 1220 and the information for Jones (and coworkers) at client site 1240.
Therefore, client site 1220, which also has a full relational database server 1222, stores snapshots of the customer master table 1212 and the order master table 1214 in local relational database 1224 as customer snapshot 1232 and order snapshot 1234, respectively. Since only some of the sales representatives are located at the sales office for the client site 1220, the customer snapshot 1232 and order snapshot 1234 only hold a subset of the data in the customer master table 1212 and the order master table 1214, respectively. In this example, the customer snapshot 1232 is shown to contain the rows for Smith""s customers and the order snapshot 1234 for the corresponding order information. All the information required to maintain and drive the refreshes for the local snapshots 1232, 1234, such as the defining queries for the snapshots 1232, 1234 and the latest refresh times, is kept in snapshot metadata 1226.
Similarly, client site 1240 also has a full relational database server 1242 and stores snapshots of the customer master table 1212 and the order master table 1214 in local relational database 1244 as customer snapshot 1252 and order snapshot 1254, respectively. Since different sales representatives are located at the sales office for the client site 1240, the customer snapshot 1252 and order snapshot 1254 maintain a different subset of the data in customer master table 1212 and order master table 1214, respectively. Shown in this example, customer snapshot 1252 contains the rows for Jones""s customers and order snapshot 1254 contains the corresponding order information. All the information required to maintain and drive the refreshes for the local snapshots 1252, 1254, such as the defining queries for the snapshots 1252, 1254 and the latest refresh times, is kept in snapshot metadata 1256.
For a more detailed description of how a snapshot is refreshed in one high-end snapshot replication environment, the reader is referred to the commonly assigned U.S. patent application Ser. No. 08/865,645, entitled xe2x80x9cFast Refresh of Snapshotsxe2x80x9d filed on May 30, 1997 by Harry Sun, Alan Downing, and Benny Souder, now U.S. Pat. No. 5,963,959 issued Oct. 4, 1999, the contents of which are incorporated by reference in their entirety herein. FIG. 13, however, is provided to briefly illustrate some of the operations involved in refreshing a snapshot in an exemplary high-end environment.
In response to one or more refresh requests, the client database server 1222 iterates through a series of doubly nested loops, first for each snapshot for which the requests were made and then for each base or master table used by the snapshot. In the doubly nested loop controlled by step 1300, the client database server 1222 sends a xe2x80x9cSet Upxe2x80x9d remote procedure call (RPC) to the master site 1200 (step 1302). When the master site 1200 receives the Set Up RPC call, the master database server 1202 performs the remotely called set up operation (step 1304). The set up operation, which is used because there can be multiple snapshots defined for the same table, processes the master log files corresponding to the master tables to set the refresh time of the most recently added changes to the master tables in the master logs to the current refresh time.
In the doubly nested loop controlled by step 1320, the client database server 1222 formulates Structured Query Language (SQL) select statements based on the local snapshot metadata 1226 and sends the SQL select statements as RPC calls to the master site 1200 (step 1322). In response, the master database server 1202 performs the SQL select statement, to select the rows in a particular master table that have changed based on entries in the master log and transmits the selected rows back to the client site 1220 (step 1324). The client database server 1222 uses the retrieved rows to update the snapshot (step 1326). In practice, however, steps 1322, 1324, and 1326 are performed two separate times, the first time to identify and delete the old rows from the snapshots 1232, 1234 and the second time to identify and get the new rows to update and/or insert in the snapshots 1232, 1234.
Finally, in the doubly nested loop controlled by step 1340, the client database server 1222 sends a xe2x80x9cWrap Upxe2x80x9d remote procedure call (RPC) to the master site 1200 (step 1342). When the master site 1200 receives the Wrap Up RPC call, the master database server 1202 performs the wrap up operation (step 1344). The wrap up operation, also used because there can be multiple snapshots defined for the same table, purges the master logs of the entries that are older than the least recently refreshed snapshot to prevent the master logs from growing unacceptably large.
Recently, there has been much interest in the marketplace for applications for front office automation. One example is sales force automation, where hundreds, if not thousands, of sales representatives in a company are given laptops or other portable computing such as palmtops to improve their productivity. The laptops are loaded with applications, for example, to help a sales representative sell the company""s products to a customer and take the customer""s order. Therefore, the laptops include a data store to keep the customer and order information handy for use by a specific sales representative.
Front office automation, however, challenges the operating assumptions behind the high-end snapshot implementations. For example, laptops are not high-performance computer systems and are only sporadically connected to a master site, typically for short periods of time. Moreover, laptops can get or stolen, raising security concerns. In addition, it is difficult to deploy a large number of front office applications with many different snapshots, particularly when the applications themselves are evolving. Therefore, implementing a high-end snapshot replication approach for front-office automation incurs a number of disadvantages that, if not addressed, render the use of snapshots problematic for front office automation.
For example, refreshing snapshots in the high-end approach is driven from the client site 1220, which is intended to help in load balancing among high-performance computers. As a result, there are many round-trips RPCs between the client site 1220 and the master site 1200. Since there is a market demand for front office applications to use, for example, 200 snapshots each with two master tables, a high-end implementation of snapshot replication would require, in this example, 400 RPCs for the Set Up operations, 400 RPCs for the SQL select statement to identify the deleted rows, 400 RPCs for the SQL select statement to identify the new or updated rows, and another 400 RPCs for the Wrap Up operations. Moreover, since each RPC for the SQL select is about one-half of a kilobyte, the two sets of RPCs for the SQL select statements together require a total of about 400 kilobytes of traffic. This overhead is always incurred despite the number of rows, if any, that have actually changed in the snapshots. Although high bandwidth T1 lines can handle this traffic at acceptable levels for many high-end implementations, the typical connection from a laptop is over a dial up line, which would take a long time and be very expensive if made over a cell phone.
In addition, high-end snapshot replication uses a full relational database system at each site to drive the snapshot refreshes, receive the row data in a SQL format, and apply the changes. Since laptops are computationally constrained, it is desirable to implement thin clients responsible for presenting a graphical user interface (GUI), for example JAVA(trademark) applications, rather than requiring a full relational database system. However, the high-end row transfer mechanism employs a thick, SQL application programming interface (API).
Replication in a front office automation environment must contend with the very real possibility that laptops get lost or stolen, for example, in airports. Although logins and passwords protect the connections between the laptop and the master site, this authentication mechanism cannot be fully trusted as secure because sales representatives often record their passwords near their laptops, for example, taped near the screen. The above-described high-end snapshot replication approach, however, relies on trusted snapshot users, granting them extensive privileges to select any table and execute any procedure in support of the snapshot refreshes being driven from the client site. If such a high-end approach is implemented for laptops, a malicious person could easily steal a sales representative""s laptop, connect to the master site using the password taped to the side of the laptop, and hack into the system, reading and destroying sensitive data.
Mass deployment of front office applications and the data to support them is another difficult issue when there are hundreds, if not thousands, of laptops functioning as client sites. Since the snapshot metadata is stored at the client site in the high-end approach, the snapshots for the front office applications have to be individually instantiated by a person at the laptop, when the laptop is connected to the master site. The typical sales representative, however, does not have the training to perform this operation. Moreover, instantiating these snapshots is especially time-consuming when done over a low bandwidth connection.
Front office applications evolve over time. As the applications change from version to version, typically new columns are added to the master tables or some columns are removed, rendering the instantiated snapshots at the laptops incompatible with the master site. Consequently, sales representatives must stop updating to their own copies of the data and bring their laptops into the master site for upgrading by an experience database administrator. This procedure is an administrative nightmare because it requires every laptop to be upgraded first.
There is a need for an implementation of snapshot replication that is suitable in a front office automation environment without incurring the above-described and other disadvantages incumbent in a high-end implement of snapshot replication.
This and other needs are addressed by the present invention in which snapshot refreshing is driven from the master site. For example, the master site stores all of the necessary snapshot metadata for performing the snapshot refreshes, so that the client site only needs to send a single refresh request. In response to the refresh request, the master site reconciles the differences between the master tables and the client""s snapshots and transmits the differences to the client site, reducing the necessary network traffic for updating a snapshot to a single round trip.
In another aspect of the invention, refresh groups are defined to group together many related snapshots that logically should be refreshed at the same time, for example, the snapshots used by a front office application suite. Thus, only a single refresh request need be sent from the client site to the master site to cause all the snapshots at the client site to be refreshed. In the above example of 200 snapshots with 2 master tables each, the 1600 round trip RPCs are replaced by a single round-trip RPC, because the master site is now responsible for refreshing every snapshot in the 200 snapshot refresh group.
With the snapshot refresh being driven at the master site, there is no longer a need for a full relational database system to be implemented at the client site. Therefore, another aspect of the invention relates to a lightweight row transfer protocol that can be easily handled, for example, by both high-end applications and thin clients, such as JAVA(trademark) applications using the Common Object Request Broker Architecture (CORBA). This lightweight row transfer protocol serves to reduce bandwidth requirements by applying compression techniques to reduce the number and sizeof column lengths in a xe2x80x9cresult setxe2x80x9d of rows that are transmitted from one computer to another.
In another aspect of the invention relating to improved security, untrusted users are granted only connect privileges and the ability to run a refresh program. The refresh program runs in the security domain of a trusted user, which first checks to see if the requesting user actually owns the snapshot. Thus, security is enhanced because knowing the password for a sales representative only gives an unauthorized user the ability to refresh the snapshot and little if nothing else.
Yet another aspect of the invention is directed to fostering mass deployment of snapshots in a refresh group by allowing a template to be defined. A template allows for a parameterized snapshot definition query or other DDL statement to be defined, so that user-specific or site-specific value can be substituted into a parameter to create different objects. Furthermore, off-line instantiation of snapshots is provided, so that the data for an entire suite of front office applications can be stored on a floppy disk, magnetic disk, CD-ROM, or other transportable computer-readable medium. This computer-readable medium is capable of being applied to a laptop, for example by insertion into a CD-ROM drive, so that an installation program can install the requisite snapshots without the intervention required of an experienced database administrator.
Still another aspect of the invention relates to upward migration of a client application and data to a new version of the application at the master site. Specifically, the related collections of master tables and other database objects at the master site can be defined as belonging to an object group. This object group can be given different xe2x80x9cflavorsxe2x80x9d that define different subsets of the objects and even different subsets of the columns in the master tables. When a laptop uploads the changes to an updatable snapshot, the current flavor of the laptop for the object group is also transmitted, so that the master site can make the necessary adjustments in the uploaded changes by dropping the values for obsolete columns and using default values for new columns.
Still other objects and advantages of the present invention will become readily apparent from the following detailed description, simply by way of illustration of the best mode contemplated of carrying out the invention. As will be realized, the invention is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the invention. Accordingly, the drawing and description are to be regarded as illustrative in nature, and not as restrictive.