Information sources such as databases, spreadsheets, and tables are well known in the art. As used herein, “information sources” is intended to refer broadly to data sets that allow some form of querying, either directly or indirectly; an example of indirect querying is querying through a suitable “wrapper” layer that functions as a converter or interpreter. Information source examples include, but are not limited to, tables, databases, spreadsheets, web pages with or without forms, flat files, software with APIs (application program interfaces), and the like.
When interpreting such a data set, there are different levels of understanding involved. For example, any data can be viewed at the bitstream level (zeros and ones) and at the character level (e.g., ASCII or Unicode). Apart from this very low-level, close-to-physical representation of information, higher level structures such as records, sets, lists, trees, and graphs are employed to provide better abstractions and handles for data and information manipulation. For example, a relational database hides its physical data organization from the user and only exposes a logical view of the modeled “mini world” (every database can be seen as a representation of some aspects of the world, hence the term “mini world”).
This logical view is captured in the relational database schema. The schema comprises, for each relation, the relation name (=table) and the names and data types of the relation's attributes (=table columns). In addition to this already high-level logical view, there is a higher “conceptual level view” of the database, which is often not made available to the user. This may be because no formal (machine-readable) representation of that conceptual level view exists, or because, even if one exists, e.g., in the form of an entity relationship (“ER”) or Unified Modeling Language (“UML”) diagram, the representation is not linked to the database query mechanism in a systematic way.
This highest conceptual level representation of databases may be characterized in a conceptual model, often in a language such as ER or UML. A conceptual model represents knowledge that is not discernable from the face of the data source. For example, a data source's conceptual model can represent implicit “domain rules” (or “domain semantics”) that capture additional aspects of the source's modeled mini world.
By way of a simple example, assume a car manufacturer X is interested in answering a question of the form “which parts of the 1998 ‘Hector SUV’ were purchased or serviced most between Jan. 1, 1999 and Dec. 31, 2000”. A prior art database approach can answer such questions based on tables of the form:
Table Sold Parts
Transaction ID | Customer ID | Date | Vehicle ID | Model | Year | Part No | Qty | Price
Table Services
Transaction ID | Customer ID | Date | Vehicle ID | Service ID | Service Description | Price

Next, consider that car company X has a database at its production site that keeps track of the jobs performed by different machines on the shop floor. In addition to the job performed at each machine, the database keeps track of a “machine's health” by recording its service dates and the errors it produces. This information is maintained using the following simplified schema:
Table Jobs
Machine ID | Job ID | Date | Time | Part No | Error | Comments
Table Machines
Machine ID | Machine Type | Operation | Last Service | Next Service | Maximum Error | Comments

With this schema, a query such as “which parts were produced with machines whose cumulative error exceeded the maximum error before its service date, such that the date of the part production by the machine is after the date the machine exceeded the maximum error?” can be processed. The query result will represent “defective parts” that might have been produced by “defective machines”.
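By way of illustration only, the “defective parts” query described above might be sketched as follows. The sketch is in Python, uses hypothetical rows and identifiers (M1, J1, P100, and so on), keeps only the columns the query needs, and assumes that jobs on or after a machine's next service date are out of scope; none of these details come from the schema itself.

```python
from datetime import date

# Hypothetical rows following the simplified Machines and Jobs schemas.
machines = {
    "M1": {"next_service": date(2000, 6, 1), "maximum_error": 1.0},
    "M2": {"next_service": date(2000, 6, 1), "maximum_error": 1.0},
}
jobs = [
    # (machine_id, job_id, date, part_no, error)
    ("M1", "J1", date(2000, 1, 10), "P100", 0.6),
    ("M1", "J2", date(2000, 2, 10), "P200", 0.7),  # cumulative error now above 1.0
    ("M1", "J3", date(2000, 3, 10), "P300", 0.1),
    ("M2", "J4", date(2000, 1, 15), "P100", 0.2),
    ("M2", "J5", date(2000, 2, 15), "P400", 0.3),
]


def suspect_parts(machines, jobs):
    """Return part numbers produced after a machine's cumulative error
    exceeded its maximum error and before that machine's next service."""
    cumulative = {}   # machine_id -> running error total
    exceeded_on = {}  # machine_id -> date the maximum was first exceeded
    suspects = set()
    for machine_id, _job_id, day, part_no, error in sorted(jobs, key=lambda j: j[2]):
        limits = machines[machine_id]
        if day >= limits["next_service"]:
            continue  # assumption: jobs after the service date are out of scope
        if machine_id in exceeded_on and day > exceeded_on[machine_id]:
            suspects.add(part_no)
        cumulative[machine_id] = cumulative.get(machine_id, 0.0) + error
        if machine_id not in exceeded_on and cumulative[machine_id] > limits["maximum_error"]:
            exceeded_on[machine_id] = day
    return suspects


print(suspect_parts(machines, jobs))
```

In this sample data only machine M1 exceeds its maximum error, so only the part it produced afterwards is flagged.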
However, these two data sources, even if integrated, may not be useful in processing other queries. For example, consider queries directed to determining whether the defective parts produced by defective machines have any relationship with the parts that are purchased or serviced most in the parts shops. At first glance it may appear that the two information sources could be joined with reference to their part numbers to process such queries. Such an approach, however, will produce only incomplete results. In particular, only those parts that were both defective and serviced or purchased would be identified. Intuitively, this result is incomplete in that a specific defective part, say in the transmission, may not need any service at all but may instead cause other parts it interacts with to require service or replacement.
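The incompleteness of a join on part number alone can be seen in a small sketch. The part numbers, counts, and in particular the interacts_with mapping below are hypothetical; the mapping stands in for domain knowledge that, as noted above, is recorded in neither source.

```python
# Hypothetical data; neither source records how parts interact.
defective_parts = {"P300"}                # e.g., a transmission part flagged on the shop floor
service_counts = {"P500": 40, "P100": 5}  # parts serviced or purchased, with counts


def naive_join(defective, serviced):
    """Join the two sources on part number only."""
    return {p: serviced[p] for p in defective if p in serviced}


# Domain knowledge that lives in neither table (assumed for illustration).
interacts_with = {"P300": {"P500"}}


def affected_by_defects(defective, serviced, interactions):
    """Serviced parts that are defective or interact with a defective part."""
    related = set(defective)
    for part in defective:
        related |= interactions.get(part, set())
    return {p: serviced[p] for p in related if p in serviced}


print(naive_join(defective_parts, service_counts))
print(affected_by_defects(defective_parts, service_counts, interacts_with))
```

The plain join finds nothing because the defective part itself was never serviced, whereas the heavily serviced part P500 only surfaces once the assumed interaction between the two parts is taken into account.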
These example car-related data sources and queries are fairly simple. Data sources may be much more complex, depending on the complexity of the mini world they represent. Also, sophisticated data sources often have sophisticated query capabilities. Such sophisticated data sources may be found in the area of biological research, for example, where a genomic database may have the ability to search large amounts of genomic data to report similar gene sequences using complex and specialized string matching algorithms. As another example, macromolecular databases compare the 3D structure of molecules to determine their possible structural relationships.
As databases and other data sources have become more powerful and widely used, users are often faced with the task of obtaining information from a plurality of sources. Once again referring to the art of biological research by way of example, a biologist may assess different animal models to study different aspects of the same biological function. Thus, a biologist may wish to integrate, for instance, information from a first database regarding the rodent brain, from a second database regarding the primate brain, and from a third database regarding portions of primate and rodent brains that deal with vision. All three of these databases may have been created at different times by different researchers using different models (or they may in fact come from one common database/design process, as in the car manufacturer example above). In particular, each individual database may have different semantics resulting in different structures, and may have different query capabilities. As a result, there are numerous difficulties associated with attempting to query the databases in a uniform way.
Solutions to these difficulties have been proposed. For example, so-called “mediator systems” have been offered to integrate data from different data sources. FIG. 1 is a schematic generally illustrating a prior art mediator system architecture. The mediator generally accesses data from the various databases by means of “wrappers” which sit “on top” of the sources and create a uniform access mechanism to them. The wrappers generally export data from the database in a common, often so-called “semistructured” language, so that any data from the various sources (whether highly structured, like data from databases, or less structured, like certain HTML documents) can be presented to the mediator in a uniform data language. A prominent semistructured data language is the Extensible Markup Language (XML).
The mediator generally translates the user query into an XML query when issuing requests downward to the wrappers, and receives XML result elements when data is sent back upward. The wrapper layer at each database translates the incoming XML query into a language native to the database query capabilities. Results obtained from each database may then be conveyed to the mediator in XML and presented to the user through the user interface.
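The general mediator and wrapper arrangement described above can be illustrated with a minimal sketch. It is not the architecture of any particular prior art system: the class names, the single-attribute query element, and the two toy sources are all assumptions made for illustration, and Python's standard xml.etree.ElementTree module stands in for the XML layer.

```python
import xml.etree.ElementTree as ET


class DictSourceWrapper:
    """Hypothetical wrapper over a source whose native rows are dicts."""

    def __init__(self, name, rows):
        self.name, self.rows = name, rows

    def query(self, xml_query):
        part_no = ET.fromstring(xml_query).get("part_no")
        # "Native" query for this source: filter the dict rows.
        hits = [row for row in self.rows if row["part_no"] == part_no]
        return [self._to_xml(row) for row in hits]

    def _to_xml(self, row):
        elem = ET.Element("result", source=self.name)
        for key, value in row.items():
            ET.SubElement(elem, key).text = str(value)
        return elem


class TupleSourceWrapper:
    """Hypothetical wrapper over a source whose native rows are positional tuples."""

    def __init__(self, name, columns, rows):
        self.name, self.columns, self.rows = name, columns, rows

    def query(self, xml_query):
        part_no = ET.fromstring(xml_query).get("part_no")
        idx = self.columns.index("part_no")
        results = []
        for row in self.rows:
            if row[idx] == part_no:
                elem = ET.Element("result", source=self.name)
                for col, value in zip(self.columns, row):
                    ET.SubElement(elem, col).text = str(value)
                results.append(elem)
        return results


class Mediator:
    """Sends one XML query to every wrapper and merges the XML results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def ask(self, part_no):
        query = ET.tostring(ET.Element("query", part_no=part_no), encoding="unicode")
        answer = ET.Element("results")
        for wrapper in self.wrappers:
            answer.extend(wrapper.query(query))
        return answer


sold = DictSourceWrapper("sold_parts", [{"part_no": "P100", "qty": 2},
                                        {"part_no": "P200", "qty": 1}])
shop = TupleSourceWrapper("jobs", ["machine_id", "part_no", "error"],
                          [("M1", "P100", 0.6), ("M2", "P400", 0.3)])
results = Mediator([sold, shop]).ask("P100")
print(ET.tostring(results, encoding="unicode"))
```

Each wrapper hides its source's native structure behind the same query method, so the mediator only ever handles XML on both the request and the result path.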
For more information regarding mediator systems, reference may be made to D. Florescu, L. Raschid, and P. Valduriez, “A Methodology for Query Reformulation in CIS Using Semantic Knowledge”, Intl. Journal of Cooperative Information Systems, vol. 5, no. 4, pp. 431-468, 1996, World Scientific Company; H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, V. Vassalos, and J. Widom, “The TSIMMIS Approach to Mediation: Data Models and Languages”, Journal of Intelligent Information Systems, vol. 8, no. 2, 1997, Kluwer Academic Publishers; V. Kashyap and A. Sheth, “Semantic and Schematic Similarities between Database Objects: A Context-based Approach”, VLDB Journal, vol. 5, no. 4, pp. 276-304, 1996, VLDB Endowment, Saratoga, Calif., and Springer-Verlag; and L. Haas, D. Kossmann, E. Wimmers, and J. Yang, “Optimizing Queries across Diverse Data Sources”, in Proc. International Conference on Very Large Data Bases, Athens, Greece, pp. 276-285, 1997, VLDB Endowment, Saratoga, Calif.
Such prior art systems have proven useful when combining different data sources whose classes relevant to the integration, or “inter-source couplings” (“ISCs”), are more or less evident from the native source schema. For example, common, similar, or very closely related attribute names may indicate “joinable” columns, which are a very common ISC (e.g., in the above relational database schemas, “part number” may provide a relevant ISC). By way of further example, known mediator systems have proven useful for applications such as comparison shopping for a particular appliance model on the World Wide Web, where different vendors may use databases that have different structures, yet whose semantics make it fairly simple to integrate the sources to process a query using simple ISCs. In this example, it is straightforward to search for the appliance model number in the different databases and combine data from several databases into a single set.
For more complex integrations, however, prior art systems have achieved only limited usefulness. For example, different data sources may be difficult or impossible to integrate with systems and methods of the prior art if the data sources have relations that are not “apparent” to the mediator tool and that have not been encoded in them.
For these and other reasons, unresolved needs in the art exist.