An important feature of the modern world is information: obtaining it; organizing it; storing it; accessing it, etc. Indeed, much of the world's computing power is dedicated to maintaining and efficiently using information, typically stored in databases. A database is a logical collection of data, in the form of text, numbers, or encoded graphics, organized for storage as a unit, typically in a computer. Databases are commonly organized into tables that are simple, row and column arrangements of related data that characterize persons, products companies, electronic mail, contact information, financial data, records of sales, performance data of processing units—anything about which data is collected. In a typical database, the rows of a table represent records, or collections of information about separate items. Each horizontal record contains one or more fields, representing individual data elements. Each vertical column of the table represents one field that is stored for each row in the table. The database records may contain any type of data and that data can be searched, accessed and modified by the user. Businesses and other organizations use databases to manage information about clients, orders, client accounts, etc. Businesses often need to manage and merge the data from many different sources, among which are “legacy” sources that use obsolete software or data structures.
Realizing the importance of meaningful storage of data, in 1970 Dr. E. F. Codd, developed the relational model of databases based on how users perceive data and a mathematical theory of relations. The relational model represents data as two-dimensional logical entities in which each logical entity represents some real-world person, place, thing, or event about which information is collected. A relational database is a set of tables derived from logical entities and manipulated in accordance with the relational model of data. The relational database uses objects to store, manage, and access data; the basic objects in the database are tables, columns, views, indexes, constraints (relationships), and triggers. Articles by Dr. E. F. Codd throughout the 1970s and 80s such as TWELVE RULES FOR RELATIONAL DATABASES and IS YOUR DBMS REALLY RELATIONAL? published in COMPUTERWORLD on Oct. 14, 1985 and DOES YOUR DBMS RUN BY THE RULES? published in COMPUTERWORLD on Oct. 11, 1985 are still referenced for implementation of relational databases. The twelve rules now number 333 rules and are published in The Relational Model for Database Management, Version 1 (Addison-Wesley, 1990).
A relational database stores information in tables as rows and columns of related data, and allows searches by using data in specified columns of one table to find additional data in another table. In searches, a relational database matches information from a field in one table with information in a corresponding field of another table and combines them to generate a third table that contains the requested data from the two tables. As an example of a relational database, if one database table contains the fields name, serial, address, phone and another table contains the fields serial, salary, and bonus, a relational database can match the serial fields in the two database tables to find such information as the names, and bonus of all people whose salary is above or below a certain amount. Thus, a relational database matches values in two tables to relate information in one table to information in the other table. Computer databases are typically relational databases.
If all the world's data were kept in relational databases, the world would be much simpler. Today, however, many of the small and large businesses that grew with the use of computers would simply hire programmers who would create new databases or data schemas from scratch, without considering the principles of relational databases. Before the advent of information technology as a discipline and a college major, these programmers would simply add fields that would serve a new purpose rather than redesign the whole business data from the ground up. So, in the world today, databases systems are collections of files stored on computers that may or may not be linked to other collections or data in the same system or other linked systems such as the Internet. Indeed, the World Wide Web (www) is essentially a large database comprised of an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web pages. The volume of data available on the Internet is increasing daily, but the ability of users to understand and transform data on the Web, let alone their own data, has not kept pace. People need the ability to capture, use, manipulate and integrate data available on the Internet from multiple sources for such applications as data mining, data warehousing, global information systems, and electronic commerce. To transform data from one structure or schema to another, mappings must be created between the data source or a set of heterogeneous data sources and a target or integrated schema.
So in today's world, there is a critical need to be able to transform data efficiently in large amounts without requiring people to retype or reenter the data manually. To accomplish efficient data transformation, data from each field of the database containing information to be transformed is directed to the appropriate field of the receiving database. Great care is required in the transformation of data from one database to another as the data is easily misdirected to the wrong place in the receiving database, causing errors and confusion. In addition to placing the wrong values in fields, a poorly constructed transformation scheme can overwrite the existing good data in the receiving database.
To achieve the most accuracy, data transforms are often performed one record at a time. First, each field in the source table, from which data will be transformed, is mapped to a corresponding field in a destination table, which will receive the data, and then each record is transformed. In order to do this properly, the destination data will have the same overall table and field structure as the source data. In particular, the source fields that are to be transformed are typically from the same table and all the data values represented by the source data are typically stored in individual fields rather than embedded within one or more text fields. Further, there is a direct, one-to-one correspondence between individual source fields and individual destination fields; ideally, the corresponding fields would have the same names. In addition, the data types of the source and destination fields are compatible. For mappings to destination fields with value domains, which are enumerated sets of valid values for a destination text field, there is a direct, one-to-one correspondence between the values within the mapped source and destination field values, and the corresponding values are the same.
If the conditions discussed above exist, data transformation can proceed automatically without user intervention. The field mappings and the value mappings can both be completed automatically and easily. Unfortunately, these conditions rarely exist, leading to some or all of the following database transformation problems. The source fields to be mapped may not all belong to a single source table, so that the mapping tool must provide a mechanism to first join the tables together. Then, the mapping tool must provide a mechanism to lookup fields from each source table corresponding to fields in the primary source table in order to create virtual extended records for transform. One or more source data values may be embedded within longer descriptive text fields and these embedded values cannot be conveniently mapped to destination fields and data values without first being parsed, or separated, from the source field. Corresponding source and destination fields may not have the same name, requiring manual field mapping before proceeding with data transformation. For mappings to destination fields with value domains, the corresponding source and destination field values may not be the same, and destination values may not exist. During a record-at-a-time transformation, a mapping tool identifies and flags each discrepancy for exception handling, and the user must either manually identify the corresponding destination value or indicate that a new destination value should be created to correspond to the source value. Even when value mapping is not necessary, source data values might require manual adjustment or correction. Also, it may not be possible to map values directly because the appropriate mapping might be between combinations of two or more source values and combinations of two or more destination values, rather than between individual source and destination values. For mappings to destination fields without value domains, where value domains are an enumerated set of valid values for a destination text field, the data types of the corresponding source and destination fields may be different, requiring data conversion to the proper data type. For mapping to destination fields that represent physical measurements, the units of measure in the source data may be: (a) missing entirely, such as when all the values in a source field have the same implicit unit; (b) inconsistent, such as when different text strings are used to represent the same physical unit (e.g., inches, inch, in, or ″); or (c) different for each source value. In each case, the unit of measure strings must be normalized and appended to each numeric source value as part of the import process. The hierarchy of the source data might have been flattened into multiple fields in the source table. This flattening can occur when a category hierarchy is stored as three distinct fields in each source record, the first two indicating internal nodes and the third indicating the branch or leaf node in the category hierarchy. Values from two or more fields in each source record may require merger into a single value for transformation. This can occur when a numeric value and an associated unit of measure are stored in two distinct fields in each source record but are stored in a single unit of measure field in each destination record. A direct, one-to-one mapping may not exist between the values within individual source and destination fields, and instead, there may be mapping between values or value combinations in one or more source fields and values or value combinations in one or more destination fields. In this situation, a direct, one-to-one mapping does not exist between the values within individual source and destination fields because a direct, one-to-one mapping does not exist between the source and destination fields themselves. For mappings to a unique field, or field combination, that uniquely identifies each destination record, the mapping tool matches the existing destination records that correspond to each source record based on the value of the unique field. During a record-at-a-time transform, the transform tool flags each source record that corresponds to an existing destination record for manual exception handling, and the user indicates that the transform record will typically either: (a) be skipped, (b) replace the existing record or (c) update the existing record. Moreover, there may be more than one unique field in the destination records, and more than one match between one but not all unique fields are also resolved.
The number of possible, reasonable mappings between a data source and a target schema can be enormous and users may be unable to conceive of all the possible alternatives and have difficulty specifying the correct mapping for a specific application. Database schema mappings, moreover, are typically complex queries and subtle changes to the mapping such as changing an inner join to an outer join may dramatically change the target data results. Often enough, the user performing the mapping may not fully understand the source data or schema fully, especially when dealing with legacy data sources. Even so, given the complexity of the mappings and the subtle differences between alternative mappings, even an expert user may require assistance. The ultimate goal of schema mapping is not to build the correct query, but to extract the correct data from the source to populate the target schema but most data transformation (ETL—Extract, Transform, and Load) tools and query formation tools focus on building queries or transformation programs but provide limited support in verifying that the derived data set is correct.
Systems currently exist which allow a user to map data from one schema to another often require the user to be familiar with both the source and target schema. A typical user interface of such a system, moreover, uses a visual depiction of the database structure but not of its data content.
What is needed is a tool for databases that facilitates the transformation of data from other databases. Once the connections between the database table receiving data and the database table providing data are established, the tool facilitates the transformation of data with minimal user review. It is essential that the data transformation proceed with all of the incoming data being placed in the proper fields of the receiving database automatically.
When a user attempts to map the source data fields to the target data fields he/she may be presented with a target field name that appears multiple times. As such, the user may be unable ascertain which target data field should be mapped to a particular source data field. Thus, mapping data from one database to another is a time-consuming process that requires a programmer to know the details of a source database and then write a program or algorithm that extracts the data and correctly inserts the data into a target database. The process is replete with opportunities for errors. First, the fields between any two unrelated databases rarely have the same names, by way of example only, in a source database one field may be labeled NAME and in the target database, the analogous field may be labeled EMPLOYEE. Often, the documentation of the source database may be lost because the data is very old and so it may not be possible to determine the name of the field upon inspection. In the current environment there are no means that integrate disparate data sources without significant manual effort. When no familiarity exists with the source data base there is no universal technology that automatically maps source data to a target database.
Thus, what is required is a dynamic and automated method and service that discovers the names of fields in a source database and then extracts the data from the fields and maps those fields to a target database. These needs and other that will become apparent are solved by the invention as stated below: