1. Field of the Invention
The present invention relates to data replication and changed data propagation in distributed or enterprise computer systems, and particularly to a method for moving data and/or propagating changed data between similar and dissimilar environments with improved efficiency.
2. Description of the Related Art
Databases support the majority of business-critical applications in most major organizations. With the increasing complexity caused by mixed mainframe and client/server DBMS environments, users and DBAs are looking for better ways to move corporate data from centralized systems while maintaining centralized control. This process also moves data to areas in which it can be used for additional functions such as decision support or creating a local copy of a corporate application. Much of the data is operational, some of it is historical, and it is rarely accumulated or stored in the same place. Also, applications are distributed, data of different types is distributed on multiple DBMS platforms, and users need the data available in different subsets, different formats, and spanning different time frames. As a result, data is typically available and useful to only a fraction of the people who need it. To meet business objectives, data needs to be moved and transformed. Once moved, the data needs to be kept up to date.
Current data extraction, transformation and movement methods are generally very cumbersome. Most companies today write their own in-house routines for the above operations. While it is relatively easy to move data between identical environments, such as DB2 to DB2, it is far more complex and error prone to move data from centralized systems into multi-platform client/server environments. However, business requirements dictate that a given set of data must be available to and usable by employees with a variety of different roles within the organization.
In addition, in-house developed data extraction and movement routines are generally one-of-a-kind, highly customized to fit a specific organization, and not generally portable or adaptable to changing business requirements--a real liability in fast-paced climates where business conditions change continuously. Perhaps most damaging, these processes are inherently reactive. Database administrators (DBAs) are generally busy with various functions and thus are generally forced to simply react to change and cannot take a proactive approach. Because there is rarely time to take a proactive approach, important functions like database optimization are subordinated to task performance.
In summary, DBAs need a better set of tools and more efficient methods to replicate data across the enterprise, while users need faster ways to access data in centralized databases at the desktop where it is needed.
Data Movement Issues and Current Solutions
Database Administrators (DBAs) face issues in trying to move data throughout the enterprise to where it is needed. As large computing enterprises evolve ever more complex ways of acquiring and handling business-critical data, new and equally complex issues arise concerning how to transform that data into information that can be used by all facets of the organization. This section summarizes those issues and is followed by a discussion of the pros and cons of current, conventional solutions designed to address each.
A major issue facing most DBAs is distributing data to where it is needed. While data accumulates in several areas, usually specific to its application and increasingly tied to a specific RDBMS, organizations are finding it increasingly difficult to deliver that data into the right hands. For example, data that was entered for the purpose of cost accounting may be equally valuable for regional sales forecasting. However, this data cannot be used for decision support when stored in an operational location.
One solution is to give the various employees and departments access to a single database. However, this approach runs into constraints that are primarily technical. Decision support queries tend to be complex, CPU and I/O intensive, and difficult to optimize because of their ad hoc nature. Such queries can overwhelm, for example, an order entry system and create an unacceptable disruption of the most basic business function: taking requests from customers and shipping merchandise to generate revenue. Therefore, in order to counteract the drain on I/O and CPU resources, the data must be moved to a separate location where complex queries cannot affect normal business activity.
To make data useful beyond a narrow business function, DBAs are required to replicate and transform data or data subsets to support distributed applications. The issues involved with data movement or replication include timeliness and synchronization of application data (origin and target need to have the same information); physical separation of data due to distributed systems; and the business requirements that led to distributing applications and data in the first place. In general, timeliness of data is key, as is the ability to move the data from centralized storage to other locations and the ability to transform the data into formats usable by a variety of desktop systems.
Source and target databases are typically very different, with the differences being in physical location, platform and data structure. The source database typically resides on a mainframe computer system. Because mainframes are incredibly expensive, users do not have them on their desks, which is why distributed systems were designed in the first place. In a typical enterprise system, operational and historical data resides on the mainframe, DBMS applications are lodged on UNIX servers, and desktop PCs are used to view the information. Data users may also need data that is housed in different DBMS environments and viewed through different desktop database applications.
During data movement or replication operations, data will need to be transformed into a variety of formats. This transformation of data is necessary to enable the data to become useful to a variety of people in a given organization and/or to accommodate different target DBMS environments. The more RDBMS environments, the more hardware, operating system and DBMS platforms present, and the more uses for data, the more complex and continuous the task of data transformation becomes. At a minimum, the process requires data type conversion. In addition, if the information is to be used for decision support, the data often requires "scrubbing," or redefinition. In general, the more targets that exist within an organization--including PCs, Macintoshes and, in some cases, UNIX workstations--the more varieties of data transformation are required. For example, a company with three divisions may have three different ways of representing revenue--as 10-place characters; as integers; and as decimal fields. To file a quarterly report, the company needs to develop a single way to reconcile and represent that data.
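The revenue example above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the 10-place character field is zero- or space-padded and holds whole currency units, that the integer field also holds whole units, and that the decimal field carries fractional units. The function name normalize_revenue and the sample values are hypothetical and are not taken from any particular system.

```python
from decimal import Decimal

def normalize_revenue(value):
    """Convert a revenue value from one of three assumed divisional
    formats (character, integer, decimal) into a single Decimal type."""
    if isinstance(value, Decimal):
        # Division already reports a decimal field; pass it through.
        return value
    if isinstance(value, int):
        # Integer field, assumed to hold whole currency units.
        return Decimal(value)
    if isinstance(value, str):
        # 10-place character field, assumed zero- or space-padded.
        return Decimal(value.strip())
    raise TypeError(f"unsupported revenue format: {type(value).__name__}")

# Hypothetical quarterly figures, one per division.
division_values = ["0001250000", 980000, Decimal("1100000.50")]
total = sum(normalize_revenue(v) for v in division_values)
```

Converting every source format to one exact decimal type before summing avoids mixing string, integer, and floating-point arithmetic in the reconciled report.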
The operational cost of moving data is significant. DBAs today have limited windows of opportunity in which to perform a host of critical operations such as backups, performance optimizations, application development or tuning, and change management. In a 24×7 world, these vital and basic operations already consume more time than users or management would consider ideal. DBAs risk user revolt if they propose to bring down the database to move data around the enterprise, no matter how important such an operation may be.
Therefore, DBAs require tools and utilities that allow them to perform moves and transformations without incurring additional administrative overhead, and to take advantage of the limited time they have for prescheduled maintenance, such as data unloads for reorganizations.
Current data movement solutions are more accurately characterized as quick fixes and partial remedies. The most common methods that in-house developers and database vendors are offering to help DBAs move and transform data in enterprise environments include customized code, customized tools, and tools from database vendors and third parties.
Customized code is typically written in-house and is specific to a single application or DBMS environment. On the positive side, such solutions are generally economical, since such routines are geared toward providing exactly what is needed and no more, and address requirements for which there are no off-the-shelf products. In-house development, testing and debugging also narrows the focus, and tends to produce a workable, if non-versatile, solution. On the other hand, such customized routines require that programmers have extensive knowledge of how the business works, since each move and transformation must coincide with business objectives and processes. Because these routines are usually specific to a source or target database, they are difficult to port to other environments. These routines are also difficult to repeat because the routines are unique to each situation and because there is no infrastructure in place to manage the processes. Finally, building custom routines robs in-house DBAs of time better spent on their core jobs: database design, maintenance and optimization.
Consultants and customized tools are also used by businesses with increasing frequency today. Outside consultants typically have acquired extensive experience in building data models, designing movement and transformation methodologies and developing conversion tools. Such tools tend to be more portable, since they have been developed with multi-platform DBMS environments in mind. Because database consultants have had to become knowledgeable about business operations as well, these tools also tend to address business processes adequately. On the negative side, all application expertise leaves along with the consultant. In addition, because these routines are specific to single aspects of the business, they are difficult to recreate for other branches or divisions.
Tools from database vendors and third parties are also sometimes used. These tools offer a mix of copy management and data extraction/transformation capabilities. Database vendor tools are pre-packaged routines, and thus there is less code to debug. Also, in an environment where a single DBMS runs all business functions, tools built by the respective database vendor provide an acceptable solution. On the other hand, database vendor tools tend to be driven more by replication processes than by business issues. As a result, DBAs are often required to write specific code to address those business issues since the tools themselves do not address or solve these problems.
Pre-packaged tools create an infrastructure capable of handling processes. However, pre-packaged tools can replicate errors and magnify small mistakes because they do not deal with data models and business rules and because they do not enforce rigid metadata management standards. Also, these tools often do not scale well. Most are geared to generating bulk copies of the entire database and cannot divide the database into smaller increments. As the database grows, such an operation takes more and more time; in fact, it is possible to reach the absurd point at which daily data cannot be loaded in 24 hours. Finally, such tools typically focus on only part of the replication process, and are not geared to solving other constraints such as bandwidth limitations.
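The contrast between a bulk copy and a smaller increment can be sketched as follows. This is one common approach (a timestamp "high-water mark"), shown only to illustrate the scaling concern above; it is not asserted to be how any particular vendor tool or the present invention operates. The orders table, its updated_at column, and the function copy_changed_rows are hypothetical, and in-memory SQLite databases stand in for the source and target.

```python
import sqlite3

def copy_changed_rows(source, target, last_sync):
    """Copy only rows modified after last_sync and return the new
    high-water mark, instead of recopying the whole table."""
    rows = source.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_sync,),
    ).fetchall()
    target.executemany(
        "INSERT OR REPLACE INTO orders (id, amount, updated_at) "
        "VALUES (?, ?, ?)",
        rows,
    )
    target.commit()
    # If nothing changed, keep the previous mark.
    return rows[-1][2] if rows else last_sync

schema = ("CREATE TABLE orders "
          "(id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER)")
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute(schema)
target.execute(schema)

source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 10.0, 100), (2, 20.0, 200)])
mark = copy_changed_rows(source, target, 0)     # first pass: both rows

source.execute(
    "UPDATE orders SET amount = 25.0, updated_at = 300 WHERE id = 2")
mark = copy_changed_rows(source, target, mark)  # second pass: row 2 only
```

With this kind of increment, the work per run grows with the number of changed rows rather than with the size of the database. A timestamp comparison alone does not detect deleted rows, which is one reason it is only a partial remedy.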
To summarize, in enterprise computer systems, the processing and storage components are distributed geographically and interconnected by means of communication networks. Data is often distributed among the components and stored in relational databases. In large enterprises, each computer in the network will likely need to access identical information, such as address or phone data of employees, customer information, etc. Distributing copies of commonly accessed data aids efficiency by providing immediate accessibility at each network location and avoiding the delays and additional network traffic from transferring data from a single source database.
One problem in such a distributed environment is ensuring that any changes made to one database are propagated to the other databases in the system so that common data remains consistent. This problem is exacerbated in a network that uses dissimilar (heterogeneous) relational database management systems (DBMS). Data must not only be propagated, but it also must be transformed from one database format to another. For instance, a DB2 database in one location of the network may need to be transformed to an Oracle format at another location, or data in non-DBMS files (such as VSAM files) may need to be transformed into a relational database format. In addition, different hardware configurations at the different locations on the computer network may require additional transformations.
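The propagation problem described above, in which changes must be both carried to another database and transformed on the way, can be illustrated with a minimal sketch. The change-record layout (operation, key, row), the function names apply_changes and digits_only_phone, and the sample employee data are all assumptions made for illustration; a plain dictionary stands in for the target database.

```python
def digits_only_phone(row):
    """Hypothetical transformation: the target stores phone numbers
    as digits only, while the source keeps punctuation."""
    converted = dict(row)
    converted["phone"] = "".join(
        ch for ch in row["phone"] if ch.isdigit())
    return converted

def apply_changes(target, changes, transform):
    """Replay ordered change records against a target keyed by
    primary key, transforming each row into the target's format."""
    for op, key, row in changes:
        if op in ("insert", "update"):
            target[key] = transform(row)
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target

# Hypothetical change records captured at the source, in commit order.
changes = [
    ("insert", 101, {"name": "A. Smith", "phone": "555-0100"}),
    ("update", 101, {"name": "A. Smith", "phone": "555-0199"}),
    ("insert", 102, {"name": "B. Jones", "phone": "555-0142"}),
    ("delete", 102, None),
]

replica = apply_changes({}, changes, digits_only_phone)
```

Replaying the records in commit order is what keeps the copy consistent with the source; applying the same records out of order could leave a deleted row in place or overwrite a newer value with an older one.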
Today, when organizations of all sizes are utterly dependent on the information stored in databases to conduct their most fundamental processes, businesses need better ways of extracting, transforming, moving and loading data across the enterprise. Therefore, a new set of tools is desired that provides improved methods for extracting, transforming, moving and loading data across the enterprise.