Data overload coupled with information poverty is a widely acknowledged problem. It arises because electronic data from different sources are stored on different computing systems, in different formats, and described using different vocabularies. This makes it very difficult, for computers in particular, to process electronic data from these sources in a way that allows the data from one source to be integrated with the data from another. Differences in the vocabulary used within formal descriptions of data (e.g. ontologies and database schemas) are often referred to as semantic heterogeneity, which causes particular difficulty when attempting to generate mappings between different formal descriptions of data.
Many solutions have been proposed for automatically generating mappings between descriptions of data such as ontologies and database schemas. In general, however, all such methods ultimately require a large amount of human time and effort. Given the enormous amount of electronic data stored in semantically heterogeneous sources, progress towards integrating data from these sources is very slow.
One particularly promising method of integrating data from semantically heterogeneous sources is to use “ontologies”. A popular definition of an ontology is that it is an explicit formal specification of a conceptualisation. “Formal” here means expressed in some logical formalism. A conceptualisation includes: concepts, which may denote real or abstract entities such as Person, Animal, Dog, Mood and Condition; attributes of those concepts (for example, a concept Person may have attributes such as name, sex, date-of-birth and eye-colour); and relationships, such as “Person is a subcategory of Animal”, “Person has Dog as pets”, and “Person has Mood”. Attributes and relationships add information to concepts, and can be thought of as giving a sort of formal meaning to these concepts. Furthermore, an ontology normally includes axioms which further constrain the interpretation of concepts. Example axioms are “Person and Dog are disjoint” and “Mood cannot apply to Condition”. The specification of an ontology for a domain depends on the point of view of its author, and a domain may be modelled differently for different purposes. Because models always simplify reality, different ontologies often exist even for the same domain.
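The elements of a conceptualisation described above (concepts, attributes, relationships and axioms) can be pictured with a small data structure. The following is only an illustrative sketch: the class names `Concept` and `Ontology`, the predicate labels `is_a`, `has_pet` and `has_mood`, and the helper `can_be_both` are assumptions introduced here, not part of any particular ontology language.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A named concept with its attribute names (hypothetical structure)."""
    name: str
    attributes: tuple = ()


@dataclass
class Ontology:
    concepts: dict = field(default_factory=dict)
    # Relationships are stored as (subject, predicate, object) triples.
    relationships: set = field(default_factory=set)
    # Disjointness axioms are stored as unordered pairs of concept names.
    disjoint: set = field(default_factory=set)

    def add_concept(self, name, *attributes):
        self.concepts[name] = Concept(name, attributes)

    def relate(self, subject, predicate, obj):
        self.relationships.add((subject, predicate, obj))

    def declare_disjoint(self, a, b):
        self.disjoint.add(frozenset((a, b)))

    def ancestors(self, name):
        """Return every concept reachable from `name` via 'is_a' links."""
        found, stack = set(), [name]
        while stack:
            current = stack.pop()
            for s, p, o in self.relationships:
                if s == current and p == "is_a" and o not in found:
                    found.add(o)
                    stack.append(o)
        return found

    def can_be_both(self, a, b):
        """False if an axiom makes `a` and `b` (or their ancestors) disjoint."""
        left = {a} | self.ancestors(a)
        right = {b} | self.ancestors(b)
        return not any(
            frozenset((x, y)) in self.disjoint for x in left for y in right
        )


# Build the example from the text.
onto = Ontology()
onto.add_concept("Animal")
onto.add_concept("Person", "name", "sex", "date-of-birth", "eye-colour")
onto.add_concept("Dog")
onto.add_concept("Mood")
onto.relate("Person", "is_a", "Animal")
onto.relate("Person", "has_pet", "Dog")
onto.relate("Person", "has_mood", "Mood")
onto.declare_disjoint("Person", "Dog")
```

In this sketch the axiom “Person and Dog are disjoint” constrains interpretation in the sense that `onto.can_be_both("Person", "Dog")` evaluates to false, whereas no axiom prevents something from being both a Person and an Animal.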
Ontologies can assist in integrating data from semantically heterogeneous databases by providing precise, machine “understandable” definitions of what the differing terminologies used in the data sources to be integrated actually mean. For example, a first database might refer to “model ID” and a second database might refer to the same category as “product No.”. Each of these could be mapped to a corresponding concept in a common ontology (e.g. “Product.Identification_Code”), and a user or software application which refers to the concept “Product” can then obtain actual data from both of the integrated data sources by using the respective mappings between the common ontology and the descriptions (e.g. database schemas) of the first and second databases.
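The way such mappings allow a single query against the common ontology to reach both databases can be sketched as follows. The row contents, the source names `first` and `second`, and the function `query_concept` are hypothetical and chosen only for illustration; the ontology concept is spelled here as `Product.Identification_Code`.

```python
# Hypothetical rows held by the two databases, each using its own column name.
SOURCES = {
    "first": [{"model ID": "A-100"}, {"model ID": "A-200"}],
    "second": [{"product No.": "B-7"}],
}

# Mappings from each database's column names to a concept in the common ontology.
MAPPINGS = {
    "first": {"model ID": "Product.Identification_Code"},
    "second": {"product No.": "Product.Identification_Code"},
}


def query_concept(concept, sources, mappings):
    """Collect (source, value) pairs for every column mapped to `concept`."""
    results = []
    for source_name, rows in sources.items():
        # Find which local columns are mapped to the requested concept.
        columns = [
            column
            for column, mapped in mappings[source_name].items()
            if mapped == concept
        ]
        for row in rows:
            for column in columns:
                if column in row:
                    results.append((source_name, row[column]))
    return results


identifiers = query_concept("Product.Identification_Code", SOURCES, MAPPINGS)
```

The caller never needs to know that one source says “model ID” and the other “product No.”; only the mappings carry that knowledge.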
Difficulties that an automated mapping process must overcome include the fact that the same or very similar terms may be used in different data descriptions to mean very different things, while very different terms may be used in different data descriptions to mean the same thing. This is a consequence of natural languages such as English behaving in the same way: the same or similar words in English may have more than one, possibly very different, meaning, and very different words may have the same or very similar meanings. When human beings converse, they can usually discern the intended meaning of a given word from the context in which it is used, but that context is often not available to a computer in a machine-understandable format when it is attempting to perform a mapping operation between an ontology and a formal data description such as a database schema. Additionally, data descriptions (especially database schemas) often use abbreviated terms rather than real words (e.g. “CName” instead of “Customer Name”).
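The abbreviation problem can be made concrete with a short sketch that splits a schema term into tokens and expands each token from a lookup table. The table `ABBREVIATIONS` and its entries are invented for illustration, and this is not a method proposed in the text; the point is that, without context, an abbreviated term such as “CName” yields more than one plausible reading.

```python
import re
from itertools import product

# Hypothetical abbreviation table; a single letter may stand for several words.
ABBREVIATIONS = {
    "c": ["customer", "company"],
    "no": ["number"],
    "id": ["identifier"],
}

# Split on case changes: runs of capitals, capitalised words, lower-case words, digits.
TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")


def tokenize(term):
    """Break a schema term such as 'CName' into lower-case tokens."""
    return [token.lower() for token in TOKEN.findall(term)]


def candidate_expansions(term):
    """Return every reading obtained by expanding each abbreviated token."""
    options = [ABBREVIATIONS.get(token, [token]) for token in tokenize(term)]
    return [" ".join(combination) for combination in product(*options)]


readings = candidate_expansions("CName")
```

With this table, `readings` contains both “customer name” and “company name”, and nothing in the term itself indicates which one the schema author intended.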