Database systems are collections of files stored on computers or other linked systems such as the Internet. The files together contain all the information about a topic or related topics. Generally, a file system is used to “file away” information which a user will later retrieve for processing. Normally, a file resides in directly accessible storage and can be manipulated as a unit by file system operations. A file system allows a user the means for storing data in files, accessing the data in those files, managing direct access to storage space where the files are stored, and guaranteeing the integrity of those files. Businesses and other organizations use databases to manage information about clients, orders, client accounts, etc. Businesses often need to manage and merge the data from many different sources, among which are “legacy” sources that use obsolete software or data structures.
The World Wide Web (WWW) is essentially a large database. The WWW, or Internet, is comprised of an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web pages. The volume of data available on the Internet is increasing daily, but the ability of users to understand and transform this data has not kept pace. Businesses need the ability to capture and manipulate data available on the Internet for such applications as data warehousing, global information systems, and electronic commerce.
E-commerce and other data-intensive applications rely on the ability to capture, use, and integrate data from multiple sources. To transform data from one structure or schema to another, mappings must be created between the data source (or set of heterogeneous data sources) and a target or integrated schema. While important advances have been made in the ability to create and manage these mappings, a number of important issues remain.
First, the number of possible, reasonable mappings between a data source and a target schema can be enormous. Users are unable to conceive of all the possible alternatives and may have difficulty specifying the correct mapping for a specific application. A system is needed that considers, manages, and ranks alternate mappings while allowing the final choice of mapping by a user who understands the semantics of the target schema.
Second, schema mappings are typically complex queries. Subtle changes to the mapping such as changing an inner join to an outer join may dramatically change the target data results. In other cases, the same mapping change may have no effect due to constraints held on the source schema. A system is needed that embodies the complex query and constraint reasoning needed to understand subtleties while effectively communicating these subtleties to the user.
Third, the user performing the mapping may not fully understand the source data or schema fully. This is especially true when dealing with legacy data sources. A system is needed that facilitates both schema and data exploration. In addition, the system must leverage the parts of the data and schema that the user does understand to gain maximum advantage in forming the schema mapping.
Fourth, given the complexity of the mappings and the subtle differences between alternative mappings, even an expert user may require assistance. To select a specific mapping, the user must understand the mapping plus how the specific mapping differs from other possible mappings. A system is needed that allows the user to map portions of the source data to the target without exposing the user to tangled SQL (Structured Query Language) or complex QBE (Query By Example) queries. The system should also illustrate the effects of alternate mappings, helping the user differentiate between subtle changes.
Fifth, data merging queries require the use of complex, non-associative operators. Reasoning about such operators can be extremely difficult for both users and query management tools. Because the operators may not be associative, even managing compositions of queries can be a difficult task. However, to be scalable to large schemas, mapping tools must permit users to incrementally create, evolve, and compose such complex queries. A system is needed that develops a mapping representation and set of mapping operators permitting the incremental creation and management of large, complex mappings.
The ultimate goal of schema mapping is not to build the correct query, but to extract the correct data from the source to populate the target schema. Current data transformation (ETL—Extract, Transform, and Load) tools and query formation tools focus on building queries or transformation programs, but provide limited support in verifying that the derived data set is correct. If a query or transformation is incomplete or incorrect, there is typically no support for refining and correcting it. The user is expected to have a thorough understanding of the data source and to debug complicated SQL queries or procedural transformation programs by hand.
Systems currently exist which allow a user to map data from one schema to another, as in System and Method for Schema Mapping, ARC9-2000-0125. This system allows the user to quickly transfer data from one schema to another, but requires the user to be familiar with both the source and target schema. In addition, the user interface with the system uses a visual depiction of the database structure but not of its data content. Another system, “User-Defined Relationships For Diagramming User-Defined Database Relations” RO999-167, provides a graphical interface for the user. This interface allows the user to visualize the structure of a schema, but does not provide for transfer of data from one schema to another and does not use data to guide the user.
Some ad hoc query tools focus on helping users access data using natural language or through a point-and-click graphic user interface (GUI). These requests are processed by a meta-data layer that translates them into SQL queries. In these systems, the user does not have to know SQL, understand the schema, or know how attributes are decomposed among relations. The translation process is hard-coded using (often procedural) transformation programs. These programs are provided by an expert data administrator with complete knowledge of the data. Most of these tools are tightly integrated with a report generating facility so they can readily display the query result. However, the tools do not allow the users to verify or rectify queries by working with the displayed data. Visual query builders focus on helping users compose SQL queries faster and with fewer errors, rather than focusing on understanding the data source using data and allowing users to refine their queries using data examples.
Thus, there is need for a system that provides a graphical interface for transferring data from one complex schema to another. This system should create, manage, and rank alternative mappings. In addition, the system should provide facilities for schema and data exploration to help the user understand the source structure and data. The system should also permit users to incrementally create, evolve, and compose complex queries that are scalable to large schemas. Most importantly, the system should support the understanding and verification of the correctness of complex data-transformation queries. The need for such a system has heretofore remained unsatisfied.