The World Wide Web (or simply, the “Web”) provides users with networked access to large amounts of information from a large number of information sites. However, much of that information remains technically and/or practically inaccessible due to being stored in database systems of varying forms. Also, it is difficult for users to collate information from many different data sources, where the desired data may be stored on the Web in some combination of database systems or structured documents.
Although interfaces to the Web can provide users with access to information stored within specific databases, each database typically uses its own vocabulary to describe the stored information. In addition, similar information is often stored in many disparate ways (eg. monetary information can be stored as $US in one database, and Japanese Yen or $AU in other databases). These problems make it difficult for users to collate related information from different data sources. There exist systems that attempt to provide interoperation and varying degrees of integration among multiple databases. These systems have been called multi-database systems, federated databases, and more generally, heterogeneous distributed database systems.
One form of multi-database system is where there is a global schema for the integrated system. A schema is a specification of the structure and interrelationships of a class of data. In such a case, individual schema of the constituent data sources are tightly integrated into the global schema. In this regard, the schema must either use the same vocabulary and semantics as the global schema or there must be a clear mapping to components of the global schema. This case is relatively easy to implement if the databases have a similar nature or purpose, and most easy to implement if the databases all select to use a single vocabulary. However, this is difficult to achieve when the information to be integrated is varied in nature or purpose, or the information sources have been compiled without a view to being integrated. For example, a database system designed to hold metadata for media objects would most likely use very different terminology to a database containing schematics for motor vehicle assembly. Multi-database systems, which are based on a global schema generally result in a loss of autonomy by the information owners and lack extensibility, two features which are undesirable and unnecessarily restrictive in a networked environment such as the Web.
Consequently, less tightly coupled database federations have become more popular in more recent times. In these systems, the extent of schema integration depends on the needs of users. Typically in these systems, each data source has its own local schema, also known as the “export schema”. An export schema can be integrated to different extents with a common data model for the federation, and such integration may involve mapping the local schemas of the constituent data sources into a common component schema, in order to alleviate data model heterogeneity. An example of this is found in the Object Exchange Model of the TSIMMIS project at Stanford University. In some cases, the mappings are provided by the local schema administrator, while in other cases the mappings are provided and stored centrally (for the federation) with the facility for further mappings to be added when new data sources need to be accessed. The mappings are typically represented using procedural code and are only understood by the particular federated system. Examples of the use of procedural code to effect mappings include the repository wrappers in the Garlic project of IBM Corp., and TSIMMIS mentioned above. Users of the federated system typically interact with the global schema, which is a result of the applied mappings. Therefore the existence of the mappings is effectively hidden from the user.
The next stage in the transition to even more loosely-coupled systems, is where users can effectively create their own personalised federations. In these cases the responsibility of maintaining a consistent view of the information falls upon users of the information and may involve users creating a view of the information against the export schemas of information sources that they want to access. Such user-defined federations can be more extensible and therefore are more suited to the web environment. Web standards, such as the XML Query (XQuery) language and XML Schema, provide an opportunity for data sources to become accessible using the HTTP protocol. For example, Tamino (XML) database systems can be queried using XQuery over HTTP. These web standards can unify access to databases and structured data (eg. XML). User-defined federations over the Web can therefore be more extensible. However, they rely on users being more knowledgeable about the contained data and tools used to manipulate the data, for example to transform the data to a common form for the purposes of collation or comparison.
User-defined federations also suffer greatly from the problems implicit in the use of multiple vocabularies by the creators or owners of the individual data sources. For example, one data source may store names in the form Surname, GivenName whereas another data source may store names as two separate fields, Surname and GivenName, or SecondName and FirstName. This heterogeneity of vocabulary makes it difficult for users to pose queries across the multiple data sources. In other words, if a user was searching for references to a person with a surname of “SMITH”, the user would typically need to enter a constraint for each data source using the terminology of that source. For example, where Name starts with “SMITH”, OR Surname=“SMITH” OR Second Name=“SMITH”. The user may also need to consider whether the data in the data source is stored in upper or lower case. These problems were also experienced by the previous generation of loosely-coupled federations but were solved by having expert-defined mappings that resolved the problem for a particular federation. For example, U.S. Pat. No. 6,009,428 describes a method of mapping between data types that utilises functions that are local to particular databases and that implement conversions. However, like much of the art in this area, these mappings are pre-defined and performed by a data-translation procedure that is hidden from the user.
In order for users to gain seamless access to heterogeneous data sources it is necessary to be able to create and apply mappings to the export schemas of the data sources that are of interest to the users. Such mappings preferably define how data from the individual sources are to be combined, transformed or renamed into terms that the user understands. This collection of mappings results in a user-generated view of the selected information sources and represents a re-mapping of the live data rather than the creation of a new layer of data, which describes in some form the original (live) data. Such views therefore differ from techniques such as the ISO standard Topic Maps, which define a further layer upon the data where the subjects and relationships of the data are modeled as exchangeable collections of topics and associations.
In the past, the creation of mappings has been the task of a system administrator, or someone who understands the principles of mathematical and functional transformations. In most cases, as described above, the mappings are implemented using scripts or procedural code and therefore hidden from the user of the information. Typically, an administrator may set up a set of mappings for a collection of databases that may be required by an organization and everyone in that organization would use the created mappings to access the data.
Another related area in which mapping transformations are currently used is the area of data warehousing. In such instances, the Extract/Transform/Load (ETL) methodology is typically used to select information from data sources, transform the information in some way and then load the transformed information into a data warehouse. Typically, these tools are also used by database administrators or data warehouse managers and involve defining transformation expressions which include functions, statements, operators and data fields. For example, Data Junction data warehousing software contains a map designer module, which allows users to define mapping transformations using an expression builder tool. This tool requires users to enter expressions using functional notation. For example, the expression “DateConvert(“mmddyy”, “mm/dd/yyyy”, field1)” converts a date of the form 070801, that is stored in variable “field1” to form 07/08/2001. Clearly, although the building of expressions of this form may be acceptable to a person knowledgeable with respect to functional notation, the average user of the Web may not be inclined to use such tools.
Further, information that is desired to be collated from a number of different sources can be referred to as a “view” of the data. Commonly-used relational database management systems (RDBMS), such as provided by Oracle™ or Microsoft™, often provide users with a graphical user interface (GUI) to design views across tables in the database system. These GUIs are designed to remove the need for the user to create views by directly writing queries to the database system, such as SQL queries of SQL databases. However these views are typically limited to tables stored in the RDBMS.
More general reporting systems, such as Brio Intelligence (Brio Software™, Inc.), Business Objects (Business Objects™) and Crystal Reports (Crystal Decisions™) allow users to design views or reports across known sets of data sources, where the data sources are generally accessed via proprietary wrappers. Users can design reports by selecting data sources and viewing all the data components (or fields) that can be used in the reports. Data components of interest can then be selected for inclusion in the report. However, this method requires that the user be a report designer, and understand the relationships between data components of different data sources. Such a method is typically suited to smaller corporate environments, where the people who generate the reports are usually very familiar with the different corporate data sets.
A common desire of users of information is to be able to easily create views of data across heterogeneous data sources (ie. data sources of different form and structure) without having to “personally” know or understand the relationships between data components of the different data sources. These data sources may be distributed over an Intranet or, as is so often the case recently, over the Internet. In most prior art arrangements the procedures that must be followed for users to create new reports or views are typically designed for users having a good understanding of database terminology and procedures frequently used for report generation. This class of user is often satisfied with an approach where the data is first collected from a list of tables and their contained fields because they understand what the field names mean and can recall the conventions (eg. style, units, etc.) used in compiling the data. The average collator of information (eg. a business/information analyst) may not be so familiar with the data and the specifications of the data in terms of the tables and fields (ie. schema). This class of user typically accesses a large number of different data sources on an irregular basis, and is often dependent on a database administrator or experienced report writer to obtain the data required.
This problem is exacerbated in situations where the data must be collated from multiple heterogeneous data sources. In these cases, the way the schema of each data source is presented to the user could vary. For example, one data source might be hierarchical and so the schema may be represented in an hierarchical form. Another data source could be relational and its schema could be represented using a list of tables and relationships between tables, such as used by products such as Microsoft Access™. Neither of these schema formats inform the user about the “form” of the data. For example, is an “Amount” column in a relational table a value in US dollars, Japanese Yen or Euro? Similarly, the only way the user can find out whether a “Name” column includes both given and surnames is to actually select that column for inclusion in the report and then inspect the data. Some prior art arrangements do provide a data preview functionality, which alleviates this situation to some extent.
Another limitation of the prior art is that the creators of new reports and views cannot easily use information deduced by others in the formation of their reports.
A further limitation of the prior art is that new reports or views are typically created by first collecting the required data and then deciding on a graphical form of the report (eg. table, line graph, scatter plot). However, users increasingly expect to move more directly to their desired end result. For example, if a user already knows that he/she wants the report presented as a line graph, he/she should not need to first collect the data in a table then create a graph from the table.