The invention relates to data management and, in particular, to a method for mapping a data source to a data target.
Data analysts or data warehouse developers often have to solve data mapping problems when working on a data warehouse or when defining data transformation processes. In a typical scenario, new data is received from new data sources for loading into a data warehouse. The process requires that a data flow be defined specifying how the data sources are to be transformed and loaded into the target warehouse.
FIG. 1 illustrates a data flow 10 which can be used to illustrate the problem. A user obtains data from a first new source 12 (SCR1), a second new source 14 (SCR2), and a third new source 16 (SCR3) to a data target 18 (TAR1), such as a data warehouse. The semantics of the data target 18 is known and contains some data. The columns of the data target 18 are denoted, for example, with parameters FIRST_NAME, LAST_NAME, PROFESSION, SALARY, ADDRESS, PHONE, FAX, EMAIL. The first new source 12 includes columns with the headings A1, A2, A3, A4, and A5.
Similarly, the second new source 14 includes columns with the headings B1, B2, B3, and B4, and the third new source 16 includes columns with headings C1, C2, and C3. The semantics of the new sources 12, 14, and 16 is not fully known. A set of links 20 between the sources 12, 14, and 16 and the data target 18 exemplify one possible mapping: In this mapping, the column A1 from the first new table 12 contains the LAST_NAME information to be inserted in the corresponding column of the data target 18, as denoted by a link 22. The column C1 from the third new table 16 contains fax numbers which can be inserted into the column FAX of the data target 18, as denoted by a link 24.
Although the semantics of tables in the data warehouse may be known, and the data warehouse may already contain some data, but the documentation of the new data sources to load is often incomplete or may not exist. The schema, table and column names of the sources may not necessarily match those of the data warehouse, or the semantics of the columns of the sources may not be fully known, or the sources may contain a great number of tables and columns, making the search for the appropriate column difficult. The integration work in that case means that for each column in the target warehouse, the matching column providing the right information has to be found among all the columns contained in the new data sources. Or, alternatively, for each column in the data sources, the matching target column in the data warehouse has to be found.
Even if the semantics of both source and target are known, a large number of potential sources and targets and non-obvious or different naming convention can make the mapping work a very tedious task for the user. In that case, if a tool used to define the mapping can provide some help by suggesting the most probable source for each target or the most probable targets for each source, the time needed for this task can be reduced significantly.
To overcome this problem, some tools already try to provide some help to the user to find potential matching candidates. These prior art tools usually perform a syntax analysis of the metadata to find potential sources and targets having “similar” names and data signatures. In these analyses, only the metadata, that is, the name and definition of each column, are used to guess the mapping candidates. However, the name of the columns of the data sources, such as, for example, production data, can be very different from the names used in the target schema, such as in a data warehouse). The naming convention and the model used in a warehouse is usually designed to be easily understood by a human, while production data often use some schema and naming conventions which are not primarily designed to be comprehensible to a human. In that case, where the names used in the sources and targets have no similarity, an method analyzing the table and column names will fail finding matching candidates.
The names of the sources and targets may both use a similar naming convention, but these names may be in different languages or may follow different naming conventions. This is a common problem where data coming from different countries have to be integrated. The column names used in the different source countries may be expressed in the language of the respective country. In that case a method which only tries to find similar names between sources and target will fail in most cases. To overcome this problem, such methods may try to use a dictionary and check for synonyms and possible translations. The terms used as table and column names are rarely complete words, but are often shortened words or else may contain special characters or digits. Additionally, the data sources may not have complete metadata. A ‘flat’ file, for example, may have no column names defined. In such cases, the prior art methods will typically fail.
What is needed is a method that provides for the identification of mapping candidates even if the metadata or documentation are not complete, or even if the naming conventions used between sources or targets are different.