The present invention relates to information management systems and, more particularly, to a method and system for performing metadata reconciliation in a data warehousing environment.
A data warehouse is a centralized collection of data. Data warehouses are ideally suited for supporting management decision-making in business organizations since data from disparate and/or distributed sources may be stored and analyzed at a central location. For example, a financial services organization may store and aggregate in a data warehouse large amounts of financial data obtained from its regional office databases around the world. Various analytical and reporting tools may then be used to process the aggregated data to present a coherent picture of business conditions at a particular point in time, and thereby support management decision making of the financial services organization.
Data warehouses are typically implemented on a Database Management System (DBMS) that includes a large database for storing the data, a database server for processing queries against the database and one or more database applications for accessing the DBMS. The types of applications that are provided for a data warehouse vary widely, depending upon the requirements of a particular implementation. For example, a data warehouse may include an application for configuring the database schema used for the data warehouse database. As another example, a data warehouse may include an application for extracting data from source databases and then storing the extracted data in the data warehouse. A data warehouse may also include an application for generating reports based upon data contained in the data warehouse.
Building a data warehouse is not an easy task. And, often times, after a data warehouse is well designed and built, it is desirable to introduce changes to the warehouse for any of a variety of reasons. A few examples of changes that impact a data warehouse design include a new column in a source table utilized in the warehouse, a new requirement on the information stored in the warehouse or a new query on the data warehouse to name a few. When such changes are present, the data warehouse design becomes out of sync and requires fixes. The process of fixing or modifying a data warehouse design so it is in sync with the new changes is called reconciliation and focuses on reconciling metadata.
As used herein, the term “metadata” refers generally to data that defines other data. In the context of data warehousing, the term “metadata” refers to data that defines data that is stored in a source database or in a data warehouse. For example, in the context of data warehousing, metadata may include the database schema used in a source database or in a data warehouse. As described in more detail hereinafter, metadata may define not only the final data that is stored in the data warehouse, but also intermediate data and structures, such as staging tables that are used to determine the final data.
The term “metadata reconciliation” generally refers to updating the metadata of an object to reflect changes made to the metadata of another object. The need for metadata reconciliation arises when there is a relationship between the two objects and the relationship is affected by the change to the metadata of one of the objects. For example, suppose that the definition (metadata) of object B depends upon the definition (metadata) of object A. A change to the metadata of object A disrupts the relationship between objects A and B. Metadata reconciliation may be used to update the metadata for object B to reflect the change made to the metadata for object A.
In the context of data warehousing, metadata reconciliation refers to updating the metadata for a data warehouse to reflect changes made to the definition of an object that one or more other objects in the data warehouse rely upon. The object may be external to the data warehouse, e.g., a source database object, or internal to the data warehouse. For example, changing the design of a data warehouse may change the definition of an object within the data warehouse, such as an intermediate table. When the definition of an object changes, the metadata for a data warehouse must be updated to reflect the changes made to the definition of the object before additional data can be imported into the data warehouse. Otherwise, attempting to store new data that conforms to the updated definition will cause errors in the data warehouse.
For example, the database schema used in a data warehouse is designed based upon the database schemas that define the corresponding source databases. Each source database may have its own (different) database schema and all of the source data must be aggregated onto a single database in the data warehouse. To achieve this result, a data warehouse typically includes a transformation mechanism, usually in the form of a data warehouse software application or module that is configured to transform source data that conforms to the database schema used in the source databases, into data that conforms to the database schema used in the data warehouse. The extent of transformation that must be performed generally varies depending upon size of the data warehousing deployment and differences between the source database schemas and the database schema used in the data warehouse. For example, in data warehousing deployments with large numbers of disparate source databases, the transformation process can be very complex.
Data warehouses are conventionally maintained manually by warehouse designers who, in response to a change made to a source database schema or the design of the data warehouse, must first identify the data warehouse objects, such as tables, transformation mechanisms and applications affected by the change, including any temporary objects and tables, and then update these temporary objects, tables, applications and the data warehouse schema to reflect the change made to the source database schema or the design of the data warehouse. Updating the database schema of the data warehouse can be very complex and require a significant amount of human resources, even just to identify the changes in metadata that must be made. For example, a warehouse designer may have to manually inspect a large number of database schema objects, including objects involved in the transformations, to determine which of these objects rely upon the objects that have been modified. This problem may be compounded when, because of complex data dependencies in the data warehouse, changes made to the database schema of the data warehouse affect other database schema objects that then must also be updated. Thus, changes to a database schema of a source database, or to the design of a data warehouse, that initially appear to directly affect only a few database schema objects in a data warehouse may in practice indirectly affect large numbers of database schema objects in the data warehouse. Hence, even seemingly small changes to the schema of source databases can require a significant amount of human resources to reconcile data warehouse metadata.
While a variety of tools have been created to facilitate the data warehouse reconciliation task, there is still a need for improvements in the reconciliation process.