There is a generally recognised problem often referred to as data overload and information poverty. This refers to the fact that, although a vast amount of data is currently stored in databases throughout the world, accessing and processing the data from various different databases in order to obtain useful information is not straightforward, even where the databases are linked together by an appropriate data network. This is because the data tends to be stored in many different formats, and considerable knowledge about the format of the data is required in order to process it appropriately (e.g. to combine data from different databases). (Note that the term database is used here loosely to refer to any type of electronically accessible storage of data, whether it be in a well structured format such as a relational database or an object-oriented database, a semi-structured format such as a store of eXtensible Markup Language (XML) documents, or an unstructured form such as a plurality of electronic text document files, image files, video files, Hyper Text Markup Language (HTML) files or other types of computer files. The term database may be used interchangeably with the term “data source” throughout this document.)
There has been much research into this area. A paper by Patrick Ziegler and Klaus A. Dittrich (2004) entitled “Three Decades of Data Integration - All Problems Solved?”, published in the proceedings of the World Computer Congress 2004 (WCC 2004), 3-12, provides a good overview of research into this field and explains how there are many different architectural levels at which integration between heterogeneous data sources may be attempted. For example, at the lowest level it may be attempted by combining the data at the data storage level; this involves migrating the data from a plurality of separate data sources to a single database with a single interface for querying the database. Towards the other extreme, a user could be provided with a common user interface, but the underlying data remains transparently located in separate databases and the user must combine the information from the different databases him/herself.
The present applicant has previously developed a data integration system and methodology described in published International patent application WO 02/080028. In this system, the heterogeneous data sources to be combined are maintained as separate databases and a series of wrappers is used to interface between the databases themselves and the system. The wrappers also translate or map queries expressed in a “resource” ontology to the query language/schema supported by the underlying resource (i.e. the underlying database). The system then maps between the resource ontology and a global ontology or an application specific ontology which the user uses to formulate global queries. Note that in this system, as in other systems of which the applicant is aware, the system always seeks to integrate as many of the useful underlying heterogeneous data sources as possible.
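By way of illustration only, the two mapping stages described above (global ontology to resource ontology, then resource ontology to the schema of the underlying resource) might be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from WO 02/080028: the class name `Wrapper`, the function `global_query`, the table `GLOBAL_TO_RESOURCE`, and all ontology terms, field names and records are hypothetical, and each underlying database is stood in for by an in-memory list of records.

```python
from dataclasses import dataclass


@dataclass
class Wrapper:
    """Hypothetical wrapper around one underlying data source."""
    name: str
    term_to_field: dict  # resource-ontology term -> native field name
    rows: list           # stand-in for the underlying database (list of dicts)

    def query(self, terms):
        # Translate each resource-ontology term to the native field name,
        # then read that field from every record of the underlying source.
        fields = {t: self.term_to_field[t] for t in terms}
        return [{t: row[f] for t, f in fields.items()} for row in self.rows]


# Hypothetical mapping: global-ontology term -> resource-ontology term,
# keyed by wrapper name.
GLOBAL_TO_RESOURCE = {
    "crm": {"CustomerName": "crm:name", "City": "crm:town"},
    "billing": {"CustomerName": "bill:client", "City": "bill:city"},
}


def global_query(wrappers, global_terms):
    """Answer a query posed in global-ontology terms across all wrappers."""
    results = []
    for w in wrappers:
        mapping = GLOBAL_TO_RESOURCE[w.name]
        resource_terms = [mapping[g] for g in global_terms]
        # Reverse mapping, used to relabel results with global terms.
        back = {mapping[g]: g for g in global_terms}
        for record in w.query(resource_terms):
            results.append({back[t]: v for t, v in record.items()})
    return results


# Illustrative data only: two sources storing the same kind of
# information under different native field names.
crm = Wrapper("crm", {"crm:name": "cust_nm", "crm:town": "town"},
              [{"cust_nm": "Ada", "town": "Leeds"}])
billing = Wrapper("billing", {"bill:client": "client", "bill:city": "city"},
                  [{"client": "Bob", "city": "York"}])

print(global_query([crm, billing], ["CustomerName", "City"]))
```

In this sketch the user formulates the query only in global terms (`CustomerName`, `City`); knowledge of each source's native format is confined to the corresponding wrapper, and every available source is queried, reflecting the integrate-as-much-as-possible behaviour noted above.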