1. Field of the Invention
The present invention generally relates to correlating different data sources and, more particularly, to correlating data from at least two different data sources in a data warehouse.
2. Description of the Related Art
Computerized information storage and retrieval systems made up of a (possibly large) number of files or tables are available in form of databases. The most prevalent type of database is the relational database, a tabular database in which data is defined so that it can be reorganized and accessed in a number of different ways. Storing and retrieving data in a database is performed by means of a computer database management system (DBMS).
Regardless of the particular architecture, in a DBMS, a requesting entity (e.g., an application, the operating system or a user) demands access to a specified database by issuing a database access request. Such requests may include, for instance, simple catalog lookup requests or transactions and combinations of transactions that operate to read, change and add specified records in the database. These requests are made using high-level query languages such as the Structured Query Language (SQL).
A relational database management system (RDBMS) is a DBMS that uses relational techniques for storing and retrieving data. RDBMS software using a SQL interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Standards Organization (ISO).
Illustratively, SQL is used to make interactive queries for getting information from and updating a database such as International Business Machines' (IBM®) DB2®, Microsoft's SQL Server®, and database products from Oracle®, Sybase®, and Computer Associates®. The term “query” denominates a set of commands for retrieving and processing data from a stored database. Queries take the form of a command language that lets programmers and programs select, insert, update, find out the location of data, and so forth.
Data records in a computerized RDBMS are maintained in tables, which are a collection of rows all having the same columns. Each row represents a data record and each column maintains information on a particular type of data for the data records that comprise the rows. Data records may be indexed using unique indices or keys to join different data records in different related tables together. A preferred model is for the keys to be internal numeric identifiers that can be assigned incrementally, rather than directly mapping the data records to some external identifier.
An internal identifier is an identifier that is generated by means of the RDBMS or application and is used to correlate information across tables of the application. Internal identifiers are not externalized outside the application. An external identifier is an identifier that may be generated by means of the RDBMS or application or supplied by the end-user and uniquely identifies a corresponding entity in the real world. Assume that in a credit card transaction system of a financial institution, a data record is generated for each client in a table comprising information related to the clients, each data record representing a row of the table. Assume further that the table comprises a plurality of columns, i.e. last name, middle initial, first name, street, city, state, ZIP code and credit card number. Each data record thus contains information related to one client, i.e., last name, middle initial, first name, street, city, state, ZIP code and credit card number of the client. To uniquely identify each data record, in general, a 32-bit integer is generated by means of the RDBMS or application as internal identifier and stored in an additional column of the table. However, this internal identifier does not comprise relevant information regarding the corresponding client and may only be used to uniquely identify the data record related to this client. In contrast thereto, the credit card number, which has not been generated by means of the RDBMS, and which is located in each data record, also uniquely identifies the corresponding data record and client and represents an external identifier in that the client is aware of this number and uses it to identify him or herself.
Where internal identifiers are used, a mapping table may be provided to map internal identifiers to their respective external identifier. Accessing tables associated with an external identifier is then accomplished by using the external identifier to index into the mapping table and locate the corresponding internal identifier. The internal identifier is then used as a key to access the relevant tables. Assignment of a new external identifier for a data record in a table (e.g., TABLE 1) can be handled simply by changing the relevant row of the mapping table to map the new external identifier to the corresponding internal identifier of TABLE 1. A history of external identifiers can be kept in a history mapping table which maps past external identifiers to the internal identifier. In this manner, a single unchanging internal identifier can be used to access all data associated with any number of related external identifiers.
To illustrate the advantages of using an internal identifier as a key, consider in the above described example the assignment of a new credit card number (a new external identifier) to replace the old credit card number (the old external identifier) of a card that was lost or stolen. The data for the card account, such as credit history and balances, must still be associated with the same entity (e.g., client). As a result, if the old external identifier, i.e., the old credit card number was used as the key across the tables, all tables using the old external identifier to uniquely identify corresponding data records and thus containing references to the old credit card number would have to be updated to reflect the new external identifier, i.e., the new credit card number. Dependent on the number of tables to be updated, this may, however, be complex and cumbersome. Additionally, by simply replacing the old credit card number with the new credit card number, history about the old credit card number(s) for the entity's account is lost, as no link between the old and the corresponding new credit card number could be established. In order to avoid loss of history of the external identifiers, a history mapping table or a new column in one of the updated tables must be created, which maps the old external identifier to the new external identifier, i.e., the old credit card number to the new credit card number. In contrast thereto, there is no need to change an internal identifier used as the key across the tables, as the internal identifier is used within a corresponding database for uniquely identifying and joining data records, so that, for example, only one entry in a corresponding history mapping table needs to be updated to map the old credit card number to the new credit card number.
Although using internal identifiers in the foregoing manner simplifies maintenance of relational databases, it complicates construction of a data warehouse. By definition, data warehouses pull together data from different data sources and try to establish correlations between them. Thus, queries may directly be run on the data stored in the data warehouse so that the performance of the computer system that issues the data required to satisfy a corresponding query will not be affected.
In constructing a data warehouse, data from many different databases, each using their own method for generating unique internal identifiers, have to be correlated. Continuing with the credit card example and assume that the financial institution offering the credit card also has a rewards program for using the card which uses the credit card number as the external identifier for the rewards program. Assume further that the credit card transaction system is written by a different software vendor than the rewards tracking program. Since the external identifier, i.e., the credit card number is common to both data sources, it would seem logical to use it to correlate the data from the two data sources when constructing a data warehouse on the basis of these data sources. However, such correlation would require adding the credit card number as a key to the tables copied to the data warehouse or always joining the tables in the data warehouse via some mapping table.
Adding the external identifier as a key also introduces the problem described earlier where changes to the credit card number can require changes across many tables in the data warehouse. The problems with using external identifiers as keys are compounded because, in the data warehouse, data from one data source or application may be using a newer external identifier than another application(s) during the window of time that the applications are updated. Further, performing a join with some mapping table can severely impact the performance of queries against the database. To complicate matters further, many times data in a data warehouse must also be correlated back to the original data source to allow capturing of changes of the original data source.
A recent trend is building data warehouses with real-time data. In some cases, such data warehouses consist of some or all of the data being replicated to the data warehouse rather than more traditional extract-transform-load processes. In general, replicating data represents moving data from the original data source to the data warehouse, wherein only changes are copied, but not the whole data. In the extract-transform-load processes, extracted data is typically entirely copied, i.e., converted/transformed and processed during the extract process according to the requirements of a corresponding data warehouse and has more freedom to change values as long as some method to map the data back to the original data source exists to allow incremental updates. In contrast, replicated data is merely copied without transformation. As a result, the internal identifier values for replicated data cannot be changed when moved to the data warehouse.
Accordingly, there is a need for a more efficient and effective technique for correlating different data sources in a data warehouse.