Today, it is recognized that knowledge is one of the most important assets of organizations. It is a challenge to be able to manage these knowledge assets. Advanced knowledge management requires thorough analysis and interpretation of all available data, whether technical or non-technical in nature, pertaining to one or more application domains and of any type, such as a linguistic data type, an image data type, a video data type, a sound data type, a control data type, a measurement data type, and olfactory and tactile data types. Knowledge regarding processes, products, markets, technologies and the organization likewise has to be processed. This ultimately enables organizations to make a profit.
The required data analysis, integration and exploitation technology has to meet a number of functional requirements for the end-user, the domain expert and the knowledge engineer.
For the end-user, a knowledge management system should offer a multi-disciplinary view on the application domain which incorporates the different kinds of data and knowledge and their concomitant inference techniques to perform the data and knowledge intensive tasks in the organization. The knowledge management system should be able to make the inference transparent to the end-user. Further, the system should be able to interoperate with other software components available to the user.
For the domain expert it must be easy to create and to maintain a knowledge base (both declarative and procedural knowledge) and to interact with the knowledge base in his own terminology, in order to incorporate easily the continuously evolving knowledge.
For the knowledge engineer, the effort of updating the system should be reduced to a minimum, both when new knowledge becomes available and when new inference paths are introduced. Domain specific data and knowledge should be shared by different users, processes and reasoning strategies to solve completely different tasks, and reasoning components should be reused across divergent application domains. This saves development effort when building new application domains or advancing existing knowledge management systems.
Most information technology (IT) employed to enable knowledge work appears to target data and information, as opposed to knowledge itself. Present IT systems used to support knowledge management are limited primarily to conventional database management systems (DBMS), data warehouses and data mining tools (DW/DM), intranet/extranet and groupware.
In these existing systems the underlying representation of the reality domain that is used as a starting and reference point for the supported knowledge related activity is not sophisticated enough to model all the different data types, levels and aspects within the reality domain.
In some cases (e.g. most automatic translation tools or text summarization tools) such a representation or ontology is simply lacking. These tools partially try to solve the lack of an adequate representation of the real world by trying to deduce semantic information (which can be simply described as the meaning of words and sentences) directly from an analysis of the syntactic structure of an utterance. Other systems, such as the more advanced Information Extraction Environments (IEE) or Description Logics (DL), are developed as a specific application designed for a specific operational scope such as natural language understanding (KRSS), medical information modeling (GRAIL) or improved data representation (KLIPSCH).
The basic characteristic of these environments or Description Logics is that, even if they make the relevant distinction between procedural and factual knowledge, their formal representation of the reality domain is mono-dimensional and oriented to only one data type, namely the linguistic one.
This means that most information extraction environments (IEE) and description logics (DL) focus on the representation of factual knowledge (i.e. data) in a framework of formally defined concepts and (explicit) relations between them. In order to make the representation formally as correct as possible, so that it complies in the best possible way with the needs of the inference languages, linguistic phenomena or even the notion of language are discarded from the representation as much as possible.
Instead, they try to deal with the reality that concepts can be linked at different levels (e.g. linguistic and logical) by different kinds of relations, relying on the fuzzy and not formally provable distinction between functional (grammatical) and sensible (semantic) relations. This distinction between different relation types is also used to introduce causality and temporal relations within the same mono-dimensional framework.
Within this mono-dimensional framework, which has to fulfil representational tasks at different levels, abstract sanctioning is foreseen for judging the “correctness” of representations or analyses. This sanctioning is abstract because it does not take into account the characteristics of all the actors (concepts, states) in a specific relation, but is based on the different types of relations, generally following the principle that sensible relations cannot contradict functional relations.
Even if for global modeling purposes this abstract mechanism can be useful to speed up the modeling process, it inevitably leads to accuracy problems. Over-generation occurs when the relations are defined at too high a level of abstraction, where the distinction between functional and sensible relations is characterized by a high degree of fuzziness. In the opposite case, the modeling does not reach a sufficient degree of abstraction, so that a significant generalization is almost impossible. The consequences for information extraction applications are clear: in the first case the system is not able to discard “incorrect” information, and in the second case correct information is easily rejected as “incorrect”.
Another problem of most description logics (DL) and information extraction environments (IEE) is the fact that there is no automatic link between the descriptive modeling of the factual or logical knowledge and the inference mechanism that is used to exploit the modeled information. In other words, since information extraction environments (IEE) and description logics (DL) are mainly descriptive environments, they do not provide the adequate infrastructure to exploit the represented knowledge. Therefore, other environments or inference languages such as OIL or Protégé have been developed. However, the problem is that description logics essentially are logical data representations, whereas inference languages mostly rely on a frame based data representation, and the two representations are not completely compatible, which means that they cannot fully exploit each other.
Accordingly, the technology required for an intelligent knowledge management and processing system cannot be built in terms of existing database technology, because such technology does not support the rich representational structure and inference mechanisms required for knowledge based systems and often has problems with efficiently storing different data types within the same environment.
Most of the current state-of-the-art knowledge based systems are rule based systems in which the knowledge is represented at a single level of abstraction and which implicitly combine knowledge about ‘how’ to perform a task, ‘what’ is in the domain and ‘why’ things work. While it was initially thought that this would make systems fairly easy to develop, in fact it leads to several problems related to rule based inference.
These known expert systems represent the declarative knowledge about the application domain and the procedural or problem solving knowledge about how to organize the reasoning process in a mixed representation, i.e. rules. In this manner, the knowledge which is incorporated in the rules of these expert systems cannot easily be shared or reused for other application domains, which makes the development of such systems a time consuming task. Further, the validation of their knowledge base is difficult because knowledge about the application domain is scattered throughout the rule base. A practical limitation of rule based systems is the complexity of maintaining these systems with a large number of rules.
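The difference between a mixed and a separated representation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the rule name `rule_fever`, the follow-up rule name `rule_check_infection`, the `DOMAIN` table and the threshold value are hypothetical and are not taken from any of the systems discussed.

```python
# Mixed representation (hypothetical rule): the domain fact
# "temperature above 38.0 means fever" and the control decision
# "check for infection next" are tangled together in one rule.
def rule_fever(patient):
    if patient["temperature"] > 38.0:
        patient["finding"] = "fever"
        return "rule_check_infection"  # name of the next rule to fire
    return None


# Separated representation (hypothetical): the domain knowledge is a
# declarative table, and the reasoning procedure is generic.
DOMAIN = {"fever": ("temperature", 38.0)}


def classify(record, domain):
    """Return every finding whose threshold is exceeded by the record."""
    return [
        name
        for name, (field, limit) in domain.items()
        if record[field] > limit
    ]
```

In the separated form, `classify` could be reused unchanged with a table for another application domain, whereas the mixed rule would have to be rewritten.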
A major shortcoming of the existing technologies discussed above is generally caused by the “semantic interoperability of data problem”, described by Heflin and Hendler in “Semantic Interoperability on the Web”, Extreme Markup Languages 2000, pp. 1-15, as “the difficulty in integrating resources that were developed using different vocabularies and different perspectives on the data”. However, in this definition the semantic interoperability problem is limited to the fact that almost any data storage environment uses its own storage scheme and its own unique set of keywords to structure the data.
But according to many other authors, the “semantic interoperability problem” is even more complex because it relates to the difficulty of integrating data of types as different as text data, sound data, video data, image data, measurement data, control data and olfactory and tactile data in such a way that the knowledge they carry can be represented, used and brought together in a uniform way without a significant loss of meaning.
The solutions to the semantic interoperability problem developed so far can be categorized into two main types.
A first set of essentially linguistic solutions tries to resolve the semantic interoperability problem as it is defined by Heflin and Hendler. These solutions, like DAML, OML and CKML, essentially focus on resolving the linguistic problem of integrating different keyword sets into a usable framework. In these solutions concepts are defined as a superset of “abstract” keywords under which the individual keywords out of different sets can be organized. The concepts can then be used to execute search operations or document categorization and clustering operations. The best-performing solutions in this set use concepts defined in the way described above and provide a set of “ontological relations” in order to link the concepts together into a structured ontology. These structured one-dimensional ontologies are then used to conduct the search operations or document categorization and clustering operations and to extract information out of text information sources.
Information extraction driven by these structured ontologies consists in most cases of a technique that tries to identify concepts in a given text source by scanning the text for the presence of keywords defining these concepts. In a second step, an attempt is made to establish links between the concepts. Often this is done by analyzing the syntactic structures of the sentences in the text source in an attempt to detect syntactic dependencies that can be mapped onto the ontological relations established between the concepts, or by using probability calculations to establish which of the ontological relations is most likely to link the concepts co-occurring in the text. The result of these operations is supposed to represent the linguistically expressed knowledge in the text in a formal ontology based structure that can be used to drive search operations or document categorization and clustering operations in a more efficient way.
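The two-step technique described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the concept definitions in `CONCEPTS`, the relation frequencies in `RELATION_COUNTS` and all function names are hypothetical, the counts are invented for the example, and the probabilistic variant of the second step is used rather than syntactic analysis.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical concepts: each one groups keywords from different vocabularies.
CONCEPTS = {
    "Drug": {"aspirin", "ibuprofen"},
    "Symptom": {"headache", "fever"},
}

# Hypothetical relation frequencies between ordered concept pairs,
# standing in for counts gathered from some annotated corpus.
RELATION_COUNTS = {
    ("Drug", "Symptom"): Counter({"treats": 8, "causes": 2}),
}


def find_concepts(sentence):
    """Step 1: scan the text for keywords that define a concept."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [
        (concept, token)
        for token in tokens
        for concept, keywords in CONCEPTS.items()
        if token in keywords
    ]


def most_likely_relation(concept_a, concept_b):
    """Step 2: pick the most frequent relation for a concept pair."""
    counts = RELATION_COUNTS.get((concept_a, concept_b))
    if not counts:
        return None
    relation, n = counts.most_common(1)[0]
    return relation, n / sum(counts.values())


def extract(sentence):
    """Return (keyword, relation, keyword, probability) tuples."""
    triples = []
    for (ca, ka), (cb, kb) in combinations(find_concepts(sentence), 2):
        link = most_likely_relation(ca, cb)
        if link:
            triples.append((ka, link[0], kb, link[1]))
    return triples
```

For the sentence "Aspirin relieves a headache." the sketch links "aspirin" to "headache" by the relation "treats", solely because that relation is the most frequent one for the concept pair. The verb in the sentence plays no role, which illustrates how coarse such a purely keyword driven link can be.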
Because these systems try to drive free text searches and to yield a formal result, they are often said to solve the “language to knowledge bottleneck”, which is another way to describe the semantic interoperability problem as defined by Heflin and Hendler. A drawback of these systems is that they are not suited to analyzing and integrating data other than linguistic or textual data, and that they do not provide internal mechanisms to fully exploit the formally represented information they contain. For this full exploitation, for example delivering decision advice or formulating complex queries, the ontologies have to be coupled to external, not fully compatible inference languages such as OIL.
A second set of solutions essentially focuses on resolving the problem of integrating data of different types with each other. Most often these solutions concentrate on a flexible storage environment, often called a multimedia database, wherein the different data types can be stored. The link between the different data is then made by the fact that the data are organized according to a common set of (manually assigned) keywords. The best-performing of these systems provide an ontology-like keyword set without offering the possibility of linking data with relations, because non-linguistic data do not present syntactical information.
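A keyword organized store of this second kind can be sketched in a few lines. The sketch is an assumption made for illustration and does not reproduce any particular multimedia database: the class names `Item` and `KeywordStore`, the `uri` field and the method names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    uri: str              # hypothetical locator of the stored data
    data_type: str        # e.g. "text", "image", "measurement"
    keywords: frozenset   # manually assigned keywords


class KeywordStore:
    """Links items of any data type through their shared keywords."""

    def __init__(self):
        self._index = defaultdict(set)  # keyword -> items carrying it

    def add(self, item):
        for keyword in item.keywords:
            self._index[keyword].add(item)

    def related(self, item):
        """Return all other items sharing at least one keyword."""
        found = set()
        for keyword in item.keywords:
            found |= self._index[keyword]
        found.discard(item)
        return found
```

The only link such a store can express is "shares a keyword with". It cannot say how a measurement relates to a report, which is the limitation noted above.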