1. Field of the Invention
The present invention generally relates to database systems and, more particularly, to a process for creating and maintaining ontologies and processes for semi-automatically generating deductive databases, especially for biological information systems.
2. Background Description
Database engineering practices and technologies of the last two decades have proven a poor match for the complex information handling and integration needs of modern enterprises. These techniques and technologies center on the construction of databases which are manually designed by a team of developers. This not only includes the database code itself (schemas, views, and integrity constraints) but also includes the peripheral software needed to run such a system: data loaders or “cleaners”, application software, and other computer resources.
For very simple or highly standardized domains, this approach is sufficient. Simple domains require a simple database schema and few integrity constraints on the data. Such domains change very slowly over time, making it easy for developers to keep up with the design requirements. However, many problems faced by modern database customers don't fit these criteria. For instance, systems involving any sort of analytic component typically require extremely complex and fluctuating rules reflecting real-world situations. Using current techniques, the process of keeping such information systems current is error-prone and prohibitively expensive, costing millions of dollars in developer salaries alone over the system life cycle.
Moreover, current systems have a fundamental and more severe problem: integrating data from any two systems requires custom-made middleware, because it is impossible for the system to “understand” the content of the participating databases well enough to perform the required integration automatically. The use of a shared ontology to enable semantic interoperability of existing databases and other software is gaining acceptance. It is possible to enable communications between two systems by mapping the semantics of independently developed components to concepts in an ontology. In the computer sciences, an “ontology” refers to a conceptual model describing the things in some application domain (e.g., chemistry) encoded in a formal, mathematical language.
In the context of the invention, an ontology is a formal (concretely specified) description of a business domain. It contains a taxonomy of concepts (“a person is a type of mammal”; “a corporation is a type of legal entity”), and also contains a set of rules relating those concepts to each other (“flight numbers are unique within airlines over time”). Data element standards and metadata repositories and their associated tools formalize some (but not all) system behavior, leaving the rest to be specified in free-form English text which cannot be “understood” automatically. Ontologies, on the other hand, represent these concepts and rules in a completely formal language; their meanings are meant to be accessible to the computer. Unfortunately, until development of the present technology, ontologies were specified using languages far too powerful to be used in a straightforward manner to build practical information systems.
Generating application-focused databases from large ontologies is described by Brian J. Peterson, William A. Anderson and Joshua Engel in Knowledge Bus: Generating Application-focused Databases from Large Ontologies, Proceedings of the 5th KRDB Workshop, May 1998 (hereinafter, Peterson et al.) and herein incorporated by reference in its entirety. In their paper, Peterson et al. propose to generate the databases (including application program interfaces (APIs)) directly from focused subsets of a large, general purpose ontology. By extracting only a subset of the ontology needed to support representation and reasoning in a focused application domain, the resulting systems are smaller, more efficient and manageable than if the entire ontology were present in each system.
The subject invention builds on the work of Peterson et al. According to the invention, there is provided a process for creating and maintaining ontologies and a process for semi-automatically generating deductive databases (DDBs). The ontology is an Ontology Works Language (OWL) ontology managed by the Ontology Management System (OMS). An OMS ontology has a hierarchy of categories, which denote classes of objects (note that this is different from the object-oriented notion of class). This hierarchy is partitioned by the type and attribute hierarchies. The type hierarchy includes the categories that can participate in predicate signatures, and corresponds to symbols that become types within a generated database. The OMS ontology consists of a set of OWL sentences, each of which has an associated conjunctive normal form (CNF) version. The deductive database generator (DDBG) applies a series of conversion and review steps on the CNF of the OWL sentences within the input ontology. It generates a pre-DDB, which defines the schema of the deductive database and provides the rules required for reasoning and integrity-constraint checks. A Strongly-Typed API Generator (STAG) takes the pre-DDB and generates a Java-based API for the resulting DDB. This API is a strongly typed, object-oriented view of the elements defined in the pre-DDB. The DDB consists of a pre-DDB with a Java server and a backing store.
The following sections describe the process used to generate databases:
The extraction phase starts with those entities and relationships immediately relevant to the problem at hand, and identifies those parts of the ontology necessary to support them. For example, a dinosaur taxonomy is not relevant to a database supporting financial analysis of automobile exports, but concepts relating to products and international economics are. The set of immediately relevant concepts may be already present in the ontology, entered by hand by the database designer, or automatically derived from existing database schemas.
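The extraction phase described above can be pictured as a reachability computation. The following is a minimal sketch only, assuming the ontology is available as a mapping from each concept to the concepts its definition depends on. The function name `extract_subset`, the mapping `TOY_ONTOLOGY`, and all concept names are hypothetical and are not taken from the OMS.

```python
from collections import deque


def extract_subset(depends_on, seeds):
    """Return the seed concepts plus every concept they transitively depend on.

    depends_on: dict mapping a concept name to the concepts it needs (assumed format).
    seeds: the concepts immediately relevant to the application.
    """
    needed = set()
    queue = deque(seeds)
    while queue:
        concept = queue.popleft()
        if concept in needed:
            continue  # already collected
        needed.add(concept)
        queue.extend(depends_on.get(concept, ()))
    return needed


# Hypothetical toy ontology: automobile exports do not need the dinosaur taxonomy.
TOY_ONTOLOGY = {
    "AutomobileExport": ["Product", "InternationalTrade"],
    "Product": ["PhysicalObject"],
    "InternationalTrade": ["Country", "Currency"],
    "Dinosaur": ["Animal"],
    "Animal": ["PhysicalObject"],
}

subset = extract_subset(TOY_ONTOLOGY, ["AutomobileExport"])
```

In this toy example the extracted subset contains the product and trade concepts and leaves out `Dinosaur` and `Animal`, mirroring the financial-analysis example in the text.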
The translator builds a database whose schema implements the structure given by the extracted portions of the ontology. In addition, it generates view and constraint definitions which implement the semantics of concepts in the ontology with perfect fidelity and high efficiency. The user can guide the translator to omit some details for improved performance.
The database is exposed through a Java API. The API provides a simple object-oriented view of the ontology which will be familiar to all Java programmers. The API also provides a relation-based view for more sophisticated queries. Both enforce strong typing rules, which improves program correctness, makes programs easier to reuse, and speeds program development.
A deductive database (DDB) is about as close as databases are ever likely to get to ontologies, and translating from (part of) an ontology to a DDB requires, in general, the least loss of information. This is why it was decided to develop a translator for a DDB first, before a relational or object-oriented data model. The core of the deductive database is XSB, a main memory deductive database system developed at the State University of New York, Stony Brook. XSB itself lacks many features found in traditional database systems. To compensate for this, OW provides a database server built on XSB which provides transaction and recovery services, while taking advantage of the query processing efficiency of the DDB.
The system and method of the present invention include various improvements and variations over the system described by Peterson et al. In particular, the system according to the invention has both conceptual and implementation improvements over the Peterson et al. system including, but not limited to, those improvements described below.
Conceptual improvements were made to the ontology used in the processes of the present invention, as well as in the databases and APIs generated:
1. Specialized Ontology
The present invention uses a specialized ontology in its generation processes. The Ontology Management System (OMS) ontology has the expressive power of a general purpose ontology, but has mechanisms and methodologies oriented towards using the ontology for the automatic generation of databases.
Using such a specialized ontology makes the translation processes simpler, more maintainable, more reliable, and results in better, more efficient databases (both deductive and non-deductive databases).
2. Uses Well-Founded Semantics (WFS) for the Ontology
The WFS was developed as a natural declarative semantics for general logic programs (those that allow for negated subgoals). Its use as an ontological semantics is novel and has many advantages. Because WFS has a very intuitive interpretation for negation and recursion (unlike classical semantics, even with non-monotonic extensions), it is much easier to use and to reason over. A considerable simplification is that a non-monotonic extension is not necessary, since WFS is equivalent to the major non-monotonic formalisms like Belief Logic and Circumscription. WFS is a very good semantics to use for deductive databases. Using WFS for the ontology that generates such databases makes the translation process much more effective and efficient. There is much less difference between the generated database and the originating specification (the ontology).
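To illustrate how WFS treats negation and recursion, the following sketch computes the well-founded model of a small propositional program using the standard alternating-fixpoint construction. It is an illustration under stated assumptions, not the OMS or XSB implementation: the rule encoding `(head, positive_body, negative_body)` and the names `least_model`, `well_founded_model`, and `RULES` are hypothetical.

```python
def least_model(rules, assumed_true):
    """Least model of the reduct of `rules` with respect to `assumed_true`.

    A rule is dropped if any negated atom is in `assumed_true`;
    otherwise its negated atoms are ignored.
    """
    derived = set()
    changed = True
    while changed:
        changed = False
        for head, pos, neg in rules:
            if head in derived:
                continue
            if any(atom in assumed_true for atom in neg):
                continue  # rule removed by the reduct
            if all(atom in derived for atom in pos):
                derived.add(head)
                changed = True
    return derived


def well_founded_model(rules):
    """Return (true, false, undefined) atom sets via the alternating fixpoint."""
    atoms = set()
    for head, pos, neg in rules:
        atoms.add(head)
        atoms.update(pos)
        atoms.update(neg)
    true = set()
    while True:
        possible = least_model(rules, true)      # overestimate of the true atoms
        new_true = least_model(rules, possible)  # underestimate of the true atoms
        if new_true == true:
            break
        true = new_true
    return true, atoms - possible, possible - true


# p :- not q.   q :- not p.   r :- not s.   t :- t.
RULES = [
    ("p", [], ["q"]),
    ("q", [], ["p"]),
    ("r", [], ["s"]),
    ("t", ["t"], []),
]

true_atoms, false_atoms, undefined_atoms = well_founded_model(RULES)
```

Here `r` is true because `s` has no supporting rule, `t` is false because its only support is itself, and the mutually negating `p` and `q` are left undefined rather than forcing an arbitrary choice.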
3. Specialized Type Hierarchy
The OMS restricts its notion of a type from the notion used in other general purpose ontologies (like Cyc): A type is a property that holds for an object for all time, i.e. a necessary property versus a contingent one. This distinction allows for a better correlation between the types in a generated database and the types in the originating ontological specification.
The set of types associated with objects in object-oriented databases and programming languages is usually static, meaning that an object neither loses nor gains types. Also, method signatures consist entirely of these types. The OMS has a similar notion of type and (functor) signatures, and so there is a much better correlation between the specification (the ontology) and the generated database.
4. Unary Type-checking Predicates
Unlike ontological systems like Cyc, the OMS in accordance with the present invention adds a unary predicate for each declared type that is used to check for that type. In the OMS there are two ways to check if the symbol b has the type person:
(isa b person)
(person b)
The first way, isa/2, is the same as in the Cyc system; the OMS allows for the second way as well, the person/1 predicate, in order to prevent locking bottlenecks in generated databases. If isa/2 were the only way of checking for a type, then most predicates would depend on isa/2. When the rules are manifested in a database, this dependency will result in most of the database being locked whenever a new symbol is added to the database (because this update requires asserting a type for the new symbol, which is a modification of the “isa” table).
In the OMS, the isa/2 predicate is used for type-checking only when the type being checked for is not statically known, and the unary predicates are used when the type is statically known. For example, the rule
(=> (and (p ?X) (isa ?Y ?X)) (q ?Y))
uses isa/2 because the target type is only known at run-time and not at the time that the rule is added to the OMS.
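The choice between the two forms of type check can be sketched as follows. This is an illustrative assumption about how a translator might select the literal, not the OMS code: the function `type_check_literal`, the tuple encoding of literals, and the convention that variables start with `?` are hypothetical.

```python
def type_check_literal(obj, type_term):
    """Pick the type-checking literal for `obj` against `type_term`.

    If the type is a variable (only known at run time), fall back to isa/2.
    If the type is statically known, use the unary predicate named after it,
    so the rule does not depend on the shared isa table.
    """
    if type_term.startswith("?"):
        return ("isa", obj, type_term)
    return (type_term, obj)


static_check = type_check_literal("b", "person")   # corresponds to (person b)
dynamic_check = type_check_literal("?Y", "?X")     # corresponds to (isa ?Y ?X)
```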
5. Temporal/non-temporal Predicates
The OMS of the present invention differentiates between predicates that are time dependent and those that are not. This distinction allows the translator to programmatically restrict the application of the temporal-model in generated databases to those predicates that truly depend on it.
6. Dropped Dynamic Class Creation
The generated API (generated by the STAG) does not allow for dynamic class creation, making the API simpler, much more efficient, and easier to use.
7. Added Equality Reasoning
The OMS and KBDBs can now perform equality reasoning. This implements the intended behavior of the OWL ‘=’ symbol. This allows users to assert equality facts and have the system use them when processing queries. For example, asserting that
(= fred (fatherOf joe))
allows the system to correctly answer the question
(likes ted fred)
if it knows that
(likes ted (fatherOf joe)).
The factual assertions retain their original information, so that if the equality information were later retracted, the original information is not lost.
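The behavior described above, where equalities are consulted at query time while the stored facts stay untouched, can be sketched as follows. This is a simplified illustration and not the OMS implementation: the class `EqualityStore`, its method names, and the nested-tuple term encoding are hypothetical, and the sketch normalizes terms through equivalence classes without performing full congruence closure.

```python
class EqualityStore:
    """Keeps facts and equalities separately so retraction loses nothing."""

    def __init__(self):
        self.facts = set()       # facts exactly as asserted
        self.equalities = set()  # pairs (a, b) asserted equal

    def assert_fact(self, fact):
        self.facts.add(fact)

    def assert_equal(self, a, b):
        self.equalities.add((a, b))

    def retract_equal(self, a, b):
        self.equalities.discard((a, b))

    def _find_function(self):
        """Build a union-find over the asserted equalities and return its find()."""
        parent = {}

        def find(term):
            parent.setdefault(term, term)
            while parent[term] != term:
                parent[term] = parent[parent[term]]  # path halving
                term = parent[term]
            return term

        for a, b in self.equalities:
            parent[find(a)] = find(b)
        return find

    def holds(self, query):
        """True if the query matches a stored fact modulo the equalities."""
        find = self._find_function()

        def canon(term):
            if isinstance(term, tuple):
                # Normalize arguments first, then the compound term itself.
                term = (term[0],) + tuple(canon(arg) for arg in term[1:])
            return find(term)

        return canon(query) in {canon(fact) for fact in self.facts}


store = EqualityStore()
store.assert_fact(("likes", "ted", ("fatherOf", "joe")))
store.assert_equal("fred", ("fatherOf", "joe"))
answer = store.holds(("likes", "ted", "fred"))
```

Because the fact is stored in its original form, retracting the equality simply makes the query fail again; nothing has to be restored.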
Along with the conceptual improvements, there are many significant implementational improvements in accordance with the present invention that make the generation processes more effective and the generated databases more efficient and more powerful.
1. WFS Temporal Model
A WFS implementation of the temporal model was developed. This gives users the option of generating deductive databases with a non-stratified rule set.
2. Subgoal Reordering
A module for reordering subgoals was developed. This module can handle non-recursive rule sets as well as recursive ones. The addition of this module makes the generation processes reliable and repeatable, decreasing the time required to generate a usable database from weeks, as required by the system described in Peterson et al., to hours. The resulting database is also more efficient since the entire rule set is optimized with respect to subgoal reordering (versus the sample-query approach taken in the Peterson et al. system).
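The general idea of subgoal reordering can be illustrated with a simple greedy heuristic: evaluate first the subgoal with the fewest unbound variables, then treat its variables as bound. This is a generic textbook-style sketch and is not claimed to be the algorithm used by the module described above. The names `is_var` and `reorder_subgoals` and the tuple encoding of subgoals are hypothetical.

```python
def is_var(term):
    """Variables are strings starting with '?' (assumed convention)."""
    return isinstance(term, str) and term.startswith("?")


def reorder_subgoals(subgoals, bound_vars=()):
    """Greedily order rule-body subgoals so the most-bound subgoal runs first.

    subgoals: list of tuples (predicate, arg1, arg2, ...).
    bound_vars: variables already bound by the rule head's call pattern.
    """
    bound = set(bound_vars)
    remaining = list(subgoals)
    ordered = []
    while remaining:
        def unbound_count(goal):
            return sum(1 for arg in goal[1:] if is_var(arg) and arg not in bound)

        best = min(remaining, key=unbound_count)  # ties keep original order
        remaining.remove(best)
        ordered.append(best)
        bound.update(arg for arg in best[1:] if is_var(arg))
    return ordered


body = [("q", "?Y", "?Z"), ("r", "?Z", "?W"), ("p", "?X", "?Y")]
ordered_body = reorder_subgoals(body, bound_vars=["?X"])
```

With `?X` bound on entry, the subgoal over `p` is scheduled first because it has only one free variable, and each later subgoal then starts with one argument already bound.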
3. Rule Optimization
The modules that optimize recursive and non-recursive rule sets (in addition to the subgoal reordering) are a very significant improvement over the Peterson et al. system. These components result in much more efficient databases.
4. Function Symbols
The Peterson et al. system could not handle function symbols, whereas the system according to the present invention can.
5. Integrity Constraints
5.1 IC Module
The system according to the present invention adds an integrity constraint module to generated databases, whereas the Peterson et al. system had none whatsoever.
5.2 IC Dependencies
Each integrity constraint (IC) will have a set of updates that it depends on, where if such an update occurs, then the IC needs to be checked. The dependent bindings can be propagated along the dependency graph when computing these update dependencies, which can be used to partially instantiate the IC calls required for that update.
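The update dependencies described above can be sketched as a traversal of the predicate dependency graph. The sketch below is an assumption-laden illustration: it computes only which ICs need checking after an update and omits the propagation of bindings. The function `ics_to_check`, the input formats, and the predicate and constraint names are hypothetical.

```python
def ics_to_check(rule_deps, ic_refs, updated_pred):
    """Return the names of the ICs that must be re-checked after an update.

    rule_deps: dict mapping a body predicate to the head predicates derived from it.
    ic_refs: dict mapping an IC name to the set of predicates the IC mentions.
    updated_pred: the predicate whose extent was just changed.
    """
    affected = set()
    stack = [updated_pred]
    while stack:
        pred = stack.pop()
        if pred in affected:
            continue  # also guards against recursive rules
        affected.add(pred)
        stack.extend(rule_deps.get(pred, ()))
    return sorted(ic for ic, preds in ic_refs.items() if affected & set(preds))


# Hypothetical rule set: ancestor is defined from parent and from itself.
RULE_DEPS = {"parent": {"ancestor"}, "ancestor": {"ancestor"}}
IC_REFS = {
    "no_self_ancestor": {"ancestor"},
    "unique_flight_number": {"flight"},
}

after_parent_update = ics_to_check(RULE_DEPS, IC_REFS, "parent")
after_flight_update = ics_to_check(RULE_DEPS, IC_REFS, "flight")
```

An update to `parent` triggers only the constraint over `ancestor`, while an update to `flight` triggers only the flight-number constraint, so unrelated ICs are never evaluated.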
6. Extensional Database (EDB) Rules
The system according to the present invention does not have to add the many extra binding-pattern analysis rules that the Peterson et al. system required. Such analysis was pushed down to special rules that were added for each predicate that could have an asserted extent (one per extensional predicate). This reduced the factor-of-eight increase in the number of rules incurred by the Peterson et al. system to less than a factor of two (because not every predicate could have an asserted extent).
7. Uses DDB for OMS
The OMS is implemented as a KBDB deductive database application, using the KBDB for persistent storage and to perform inferences and integrity-constraint checks (these checks are over the OMS rules themselves as well as asserted facts). The OMS uses a variant of the DDBG to update the rule set. Using a KBDB for the OMS gives the OMS many characteristics of a database, such as transactional update. It also gives the OMS the efficiency of a KBDB, allowing it to be used in operational settings.
In a specific application, the invention has been used to construct a database for biochemical pathway information. This model included information ranging from the genome through transcription, translation to proteins, through the roles of those proteins in reactions and pathways. The database was populated with information on the chemical pathways from the bacterium Mycoplasma pneumoniae and exhibited unprecedented capabilities in supporting analysis and visualization applications.
Some examples of application domains for which the invention provides specific advantages include drug delivery, combinatorial chemistry and automated database curation. In the application domain of drug delivery, the biochemical pathway system can be enhanced to build complex human-guided and automated analysis tools for drug discovery. By providing detailed information on the function of pathways in terms of their genomic origins, spatial and chemical properties, programs can be used to automatically identify likely compounds for further analysis. In the application domain of combinatorial chemistry, the ability of ontological models to express complex chemical properties and incorporate results from the philosophy of chemistry can aid in the discovery and specification of powerful constraints that help predict the outcomes of complex reactions and aid in analysis of results. This capability will be particularly important when working with large molecules (the kind typically found in biochemistry) that exhibit emergent properties that are not obviously reducible to the basic properties of physical chemistry. In the application domain of automated database curation, unlike conventional database models, the ontology forms a logical basis for the curation of database entries. It can provide explanations for why conflicting entries actually conflict, and provide guidance to database curators to identify and correct sources of error.
Ontologies allow more complete understanding of chemical reactions because they facilitate integration of important contextual information into the representation of experimental results. For example, if one starts with a high-level ontology that includes a theory of topological relations (such as “inside”, “outside”, “connected”, “contained in”, “impermeable boundary”, “semipermeable boundary”, etc.), it becomes possible to represent the locations of chemicals within a cell and to express statements such as:
In most animal cells, sodium ions are present in higher concentration in the medium exterior to the cell than interior to the cell.
This ionic gradient across the membrane is maintained by a transport system whose components are located within the membrane.
This gradient can be maintained only if sufficient levels of ATP are present to drive the transport system.
The transport system is specifically inhibited by cardiotonic steroids; therefore, the gradient cannot be maintained if these steroids are co-located with the transport system components.
In most databases, information such as this can only be represented as textual comments, which are difficult or impossible to interpret consistently and analyze automatically. By formalizing such contextual information as is usually found in comment fields (e.g., location of reaction within a cell, type of cell, tissue type, species, age, phenotype, temperature, protocol followed), much more automatic analysis and comparison of experimental results is possible.
Formalization of molecular structure can also lead to insights on function. Turning knowledge of a biomolecular sequence into a formal representation of that molecule's structure is a major challenge for bioinformatics, and is a precursor to achieving full understanding of molecular function in the context of complex organisms. Ontologies enable the formal representation of the structural and functional characteristics of molecules, leading to improved evaluation of molecules for target structures. For example, queries such as “Find all molecules with a predicted heme-binding region located on the interior of the protein's predicted shape, where the interior consists mostly of nonpolar residues” become possible.