Metadata is widely used to describe data volumes across their different structures and sources for the purpose of meaningful processing. Metadata is understood here to comprise all kinds of formal description, including concept and rule languages (such as UML, ORM, logic, RuleML, etc.), that express knowledge about the plurality of data elements and that, at least in part, can be mapped to an ontology language such as, but not limited to, the Resource Description Framework Schema (RDFS) or the profiles of the Web Ontology Language (OWL). Meaningful processing relates to all tasks that take into account the entailments implied by the accompanying metadata (also known as reasoning) when computing, for example but not exclusively, the answers to type queries.
In a typical business setting, data is stored in databases, and any semantic information system has to use some form of reasoning to enable content-aware data processing. Reasoning systems for large volumes of data essentially follow one of three approaches:
Query rewriting or backward-chaining approaches answer a query by compiling metadata (query-relevant knowledge) into a query (typically SQL) for execution by the database engine. This technique is commonly called ontology-based data access (OBDA). A suitable metadata language that supports this approach is OWL 2 QL.
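The query rewriting idea can be illustrated with a minimal sketch. It assumes, purely for illustration, a single relational table named concept_assertion(individual, concept) and a TBox consisting only of subclass axioms; the table name, the function names and the restriction to subclass axioms are assumptions and cover only a small part of what OWL 2 QL allows.

```python
import sqlite3
from typing import Iterable, List, Set, Tuple

SubclassAxiom = Tuple[str, str]  # (subconcept, superconcept)


def subconcepts(concept: str, axioms: Iterable[SubclassAxiom]) -> Set[str]:
    """Return the concept together with all its direct and indirect subconcepts."""
    axioms = list(axioms)
    found = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for sub, sup in axioms:
            if sup == current and sub not in found:
                found.add(sub)
                frontier.append(sub)
    return found


def rewrite_type_query(concept: str, axioms: Iterable[SubclassAxiom]) -> Tuple[str, List[str]]:
    """Compile the subclass axioms into one SQL query over the asserted data."""
    concepts = sorted(subconcepts(concept, axioms))
    placeholders = ", ".join("?" for _ in concepts)
    sql = (
        "SELECT DISTINCT individual FROM concept_assertion "  # hypothetical table
        f"WHERE concept IN ({placeholders}) ORDER BY individual"
    )
    return sql, concepts


def answer_type_query(conn: sqlite3.Connection, concept: str,
                      axioms: Iterable[SubclassAxiom]) -> List[str]:
    """Answer a type query; entailments are resolved at query time, not stored."""
    sql, params = rewrite_type_query(concept, axioms)
    return [row[0] for row in conn.execute(sql, params)]
```

In this sketch the stored data is never modified; the cost of reasoning is moved into the size of the rewritten query, which grows with the number of relevant subconcepts.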
Materialization or forward-chaining techniques pre-compute all entailments upfront, independent of any queries. After pre-computation it is sufficient to evaluate queries over the materialized data to obtain all entailed results. Hence in this context, materialized data means precomputed data and materialization refers to the precomputation of all entailments. OWL 2 RL is an ontology language that is sound and complete in this respect.
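As an illustration of materialization, the following minimal sketch applies three simple rules (subclass, role domain, role range) until no new concept assertion can be derived. The rule set, the data layout and all names are assumptions chosen for brevity; OWL 2 RL comprises considerably more rules than shown here.

```python
from typing import Dict, Set, Tuple

RoleAssertion = Tuple[str, str, str]  # (source, role, target)


def materialize_types(
    types: Dict[str, Set[str]],
    roles: Set[RoleAssertion],
    subclass_of: Set[Tuple[str, str]],
    domain: Dict[str, str],
    range_: Dict[str, str],
) -> Dict[str, Set[str]]:
    """Pre-compute all entailed concept assertions by forward chaining."""
    inferred = {ind: set(concepts) for ind, concepts in types.items()}
    changed = True
    while changed:  # repeat until a fixpoint is reached
        changed = False
        # Domain and range rules: a role assertion types its source and target.
        for source, role, target in roles:
            for ind, table in ((source, domain), (target, range_)):
                concept = table.get(role)
                if concept is not None and concept not in inferred.setdefault(ind, set()):
                    inferred[ind].add(concept)
                    changed = True
        # Subclass rule: every instance of a subconcept is an instance of the superconcept.
        for sub, sup in subclass_of:
            for concepts in inferred.values():
                if sub in concepts and sup not in concepts:
                    concepts.add(sup)
                    changed = True
    return inferred
```

After this pre-computation, a type query reduces to a lookup in the returned mapping, at the price of storing every entailed assertion and of repeating the computation when the data changes.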
Combined approaches follow a mixed strategy where some of the entailments are materialized in advance or on demand while others are triggered by queries and computed just for the purpose of a particular query.
With increasing data volumes, existing technical solutions based on the aforementioned approaches are unable to meet practical requirements in terms of metadata expressivity, performance, storage space, or memory. To name a few limitations: Query rewriting is limited to rewritable, less expressive metadata languages and requires sophisticated query optimizations to work well in practice. Straightforward full materialization easily ends up in time- and space-consuming pre-processing that typically has to be repeated when the data changes. Combined approaches need to be adjusted and tuned in advance for each application.
The present technology relates to the field of Description Logics (DL), so common DL terminology is used throughout the description as follows: ABox (the data) refers to the aggregate of data elements, called individuals, in terms of asserted or inferred concept assertions (also called types) as well as role assertions. A role assertion expresses a directed relationship that relates one individual (the source) to a second individual (the target) with respect to a particular role. TBox (the metadata) refers to the aggregate of schema axioms about concepts and roles.
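For readers less familiar with this terminology, the following sketch shows one possible in-memory representation of an ABox and a TBox. The class and field names are illustrative assumptions, and the TBox is reduced to subclass axioms only.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class ABox:
    """The data: concept assertions (types) and role assertions over individuals."""
    concept_assertions: Set[Tuple[str, str]] = field(default_factory=set)    # (individual, concept)
    role_assertions: Set[Tuple[str, str, str]] = field(default_factory=set)  # (source, role, target)

    def individuals(self) -> Set[str]:
        """All individuals mentioned in any assertion."""
        named = {ind for ind, _ in self.concept_assertions}
        named |= {source for source, _, _ in self.role_assertions}
        named |= {target for _, _, target in self.role_assertions}
        return named

    def types_of(self, individual: str) -> Set[str]:
        """The concepts asserted for one individual."""
        return {c for ind, c in self.concept_assertions if ind == individual}


@dataclass
class TBox:
    """The metadata: schema axioms, here simplified to subclass axioms."""
    subclass_of: Set[Tuple[str, str]] = field(default_factory=set)  # (subconcept, superconcept)
```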
U.S. Pat. No. 7,904,401 presents a method and apparatus, including computer program products, for scalable ontology reasoning. A method of generating a summarized ontology includes, according to U.S. Pat. No. 7,904,401, loading an ontology from a store, eliminating relationships in the ontology, the eliminating relationships including an insertion of new relationships that simplify the ontology, eliminating individuals in the ontology, the eliminating individuals including insertion of new individuals to simplify the ontology, eliminating concepts in the ontology including insertion of new concepts to simplify the ontology, and generating the summarized ontology from the eliminating relationships, eliminating individuals and eliminating concepts. U.S. Pat. No. 7,904,401 does not perform materialization.
Fokoue, Kershenbaum, Ma, Schonberg, and Srinivas: “The summary ABox: Cutting ontologies down to size”, in Proc. of the 5th Int. Semantic Web Conference (ISWC 2006), Vol. 4273 LNCS, p. 343-356, Springer, 2006, presents an approach that merges similar individuals to obtain a compressed, so-called summary ABox, which is then used for (refutation-based) consistency checking. The approach is closely related to that of U.S. Pat. No. 7,904,401. The technique is based on the observation that individuals with the same asserted types are likely to have the same entailed types. Since merging in Fokoue et al. is based only on asserted concepts, the resulting summary ABox might be inconsistent even if the original ABox is consistent w.r.t. the TBox. To remedy this, justifications, according to Kalyanpur et al.: “Finding all justifications of OWL DL entailments”, in Proc. of the 6th Int. Semantic Web Conference (ISWC 2007), Vol. 4825 LNCS, p. 267-280, Springer, 2007, are used to decide which merges caused the inconsistency and to refine the summary accordingly. Justification-based refinements are also necessary for query answering, since Fokoue et al. and U.S. Pat. No. 7,904,401 do not perform query answering based on materialization but perform reasoning at query time. Such computation of justifications is very resource intensive, requiring significant processing and memory resources, and furthermore slows down the process of query answering. The computation of all justifications is typically part of the exponential Reiter's search according to U.S. Pat. No. 7,904,401. For large ABoxes, such as those used in the evaluation section below, the calculation of justifications according to U.S. Pat. No. 7,904,401 and Fokoue et al. may, despite possible optimizations, even be impossible on ordinary computer hardware, such as that used in the evaluation section below, due to resource shortage.
It is thus desirable to avoid the creation of a possibly inconsistent summary ABox in the first place and thus also to avoid the step of computing justifications altogether.
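The merging step underlying a summary ABox can be sketched as follows, assuming that individuals are merged exactly when their sets of asserted types coincide. This is a simplified illustration with hypothetical names, not the procedure of Fokoue et al. or of U.S. Pat. No. 7,904,401; in particular, it omits consistency checking and justification-based refinement.

```python
from typing import Dict, FrozenSet, Set, Tuple

SummaryNode = FrozenSet[str]  # a summary node is identified by a set of asserted types


def build_summary(
    types: Dict[str, Set[str]],
    roles: Set[Tuple[str, str, str]],
) -> Tuple[Dict[str, SummaryNode], Set[Tuple[SummaryNode, str, SummaryNode]]]:
    """Merge individuals with identical asserted types and lift the role assertions."""
    individuals = set(types)
    individuals |= {source for source, _, _ in roles}
    individuals |= {target for _, _, target in roles}
    # Each individual is mapped to the summary node given by its asserted types.
    node_of = {ind: frozenset(types.get(ind, set())) for ind in individuals}
    # Role assertions are carried over to the summary nodes; duplicates collapse.
    summary_roles = {(node_of[s], role, node_of[t]) for s, role, t in roles}
    return node_of, summary_roles
```

Because the merge looks only at asserted types, two individuals that differ in their role assertions may share a summary node, so the node can accumulate role assertions that no single original individual has. This is the reason a summary built in this way may be inconsistent even when the original ABox is not.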
Wandelt and Möller present in “Towards ABox modularization of semi-expressive description logics”, Journal of Applied Ontology, 7(2):133-167, 2012, a technique for refutation-based instance retrieval over SHI ontologies based on modularization. As an optimization, this approach groups individuals into equivalence classes based on the asserted types of an individual, its successors, its predecessors, and the asserted types of the successors and predecessors. The assertions that define the equivalence class of an individual are used for finding sound entailments. For checking entailments that cannot be read off from these assertions, it might be necessary to fall back to (refutation-based) reasoning over the (possibly large) ABox module for the individual. This fallback is, however, undesirable, since in certain cases it requires processing on the basis of the original ABox, which would nullify any resource savings achieved by the grouping of individuals.
Wandelt and Möller present in “Sound and Complete SHI Instance Retrieval for 1 Billion ABox Assertions”, Workshop on Scalable Semantic Web Systems, pp. 75-89, 2011, a technique for refutation-based instance checking over SHI ontologies based on modularization. The method relies on ABox modules called individual islands that are built for each individual using a syntactical splittability check based on the TBox information. Each individual island is a subset of the original ABox that contains at least all ABox facts necessary to compute all entailed concept memberships for this individual. It is sound and complete to use the island of an individual, instead of the original ABox, to check all its concept memberships. However, for exhaustive concept materialization of the ABox, the islands of all individuals must be built and checked separately. Therefore, individual islands provide no advantage in terms of a size reduction with respect to individuals or assertions. Yet, if two or more individuals have similar (isomorphic) islands, one could apply the results of the concept materialization for the first individual to the others instead of processing the individual islands of the other individuals. Unfortunately, the similarity test between individual islands, which may consist of substantial parts of the ABox, can be too computationally intensive to be practicable, since it is a graph isomorphism problem for which no polynomial algorithm is known for the general case as of today. Wandelt and Möller in “Sound and Complete SHI Instance Retrieval for 1 Billion ABox Assertions” therefore defined an approximation of the individual islands called one-step nodes, grouping individuals into equivalence classes based on the asserted types of an individual, its successors, its predecessors, and the asserted types of the successors and predecessors. The assertions that define these equivalence classes are then used for computing sound entailments for their member individuals.
However, this approximation is only complete if the one-step node is splittable, wherein splittability implies that the island of this individual is included in the one-step node, which can only be the case for very small islands. In each case where the one-step node is not splittable, the equivalence classes have to be discarded and the bigger island of each individual must be used. Moreover, the approach of Wandelt and Möller is not compatible with the use of nominals, which are widely used in real-world ontologies. Individual islands and splittability depend on the TBox. As soon as a concept assertion for an individual can possibly be used to infer new assertions for a second individual that is not a “neighbor”, the one-step node for the second individual is not splittable. Hence, the bigger the TBox, i.e. the more complex the ontology, the lower the chance that the one-step node is splittable.
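The grouping of individuals into one-step nodes described above can be sketched as follows. The exact form of the grouping key (in particular, that role names are part of it and that several identical neighbors collapse into one entry) is an assumption made for illustration and is not taken from Wandelt and Möller; the splittability check is omitted entirely.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

RoleAssertion = Tuple[str, str, str]  # (source, role, target)
Neighbors = FrozenSet[Tuple[str, FrozenSet[str]]]  # set of (role, asserted types of neighbor)


def one_step_key(
    individual: str,
    types: Dict[str, Set[str]],
    roles: Set[RoleAssertion],
) -> Tuple[FrozenSet[str], Neighbors, Neighbors]:
    """Describe an individual by its own types and those of its direct neighbors."""
    own = frozenset(types.get(individual, set()))
    successors = frozenset(
        (role, frozenset(types.get(t, set()))) for s, role, t in roles if s == individual
    )
    predecessors = frozenset(
        (role, frozenset(types.get(s, set()))) for s, role, t in roles if t == individual
    )
    return own, successors, predecessors


def group_one_step_nodes(
    types: Dict[str, Set[str]],
    roles: Set[RoleAssertion],
) -> List[List[str]]:
    """Group individuals whose one-step descriptions are identical."""
    individuals = set(types)
    individuals |= {s for s, _, _ in roles} | {t for _, _, t in roles}
    groups = defaultdict(list)
    for individual in sorted(individuals):
        groups[one_step_key(individual, types, roles)].append(individual)
    return sorted(groups.values())
```

Computing such a key only requires a local look at each individual, which avoids the graph isomorphism test between whole islands. The sketch also makes the limitation visible: any entailment that depends on assertions more than one step away is not captured by the key.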