1. Field of the Invention
The present invention generally relates to the field of Logic Programming and Database Processing Engines and, more particularly, to the processing of complex conjunction-of-constraint queries when large factbases are involved.
As the amount of data that computers can store and generate has increased, the need to perform complex, flexible queries has become vitally important. Large databases, by the very nature of their size, are limited in their usefulness unless there is some way to home in on the required data. Flexible queries can narrow the results produced to what is necessary for a specific purpose, thereby increasing the value of the underlying data. Moreover, flexible queries can be utilized to bring together information from various separate database files to produce new information. These flexible queries can be expressed by defining the data and the constraints to which this data must adhere in the form of a conjunction of constraints. This conjunction of constraints, together with the data it queries, can be written in many different ways, of which the most general comes from the field of Logic Programming.
2. Description of the Related Technology
The basic approach in Logic Programming is to describe a given domain of knowledge through the assertion of facts and rules. The asserted facts and rules can be queried, whereupon a logic engine deduces each answer that can be proven from the facts and rules asserted. The user may request any number of answers from one to all possible answers. The result of the query is new information based on the facts, rules and the specific query. This is in contrast to conventional approaches where "data" is processed via preformulated and programmed algorithms. The focus of the invention is on querying facts. Therefore, further discussion will concentrate on queries not involving rules. The inclusion of rules, which are generally of a limited number, can be handled through other methods in combination with the invention.
In order to better describe the invention, the following terms are described at the outset:
Predications--A logic program's facts and queries are built from simple sentences called predications. A predication is made up of a predicate followed by a subject having n terms. At minimum, a given term can be an individual constant or an unknown (also called a variable). The form of a predication is: (OWES x Mary 1000) (P1)
In predication (P1), the predicate OWES has a 3-tuple subject, whose first term is an unknown, named x, whose second term is the individual constant Mary, and whose third term is the constant 1000. In this case the predication (P1) is read "Someone owes Mary 1000 dollars."
A predication that contains one or more unknowns is open; otherwise it is closed. An open predication is closed by instantiating each of its unknowns with an individual constant. Only closed predications can be logically evaluated to True or False.
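The open/closed distinction can be illustrated with a minimal Python sketch. The tuple representation, the convention that an unknown is a string beginning with a lower case letter, the helper names `is_unknown`, `is_closed`, and `instantiate`, and the constant John used to close (P1) are all assumptions made for illustration only.

```python
def is_unknown(term):
    """Assumed convention: an unknown is a string starting with a lower case letter."""
    return isinstance(term, str) and term[:1].islower()


def is_closed(predication):
    """A predication is closed when none of its subject terms is an unknown."""
    _predicate, *terms = predication
    return not any(is_unknown(t) for t in terms)


def instantiate(predication, bindings):
    """Replace every occurrence of each bound unknown with its constant."""
    predicate, *terms = predication
    return (predicate, *(bindings.get(t, t) if is_unknown(t) else t for t in terms))


# Predication (P1): open, because its first term is the unknown x.
p1 = ("OWES", "x", "Mary", 1000)

# Closing (P1) with a hypothetical constant; only now can it be evaluated.
closed_p1 = instantiate(p1, {"x": "John"})
```

Here `is_closed(p1)` is false while `is_closed(closed_p1)` is true, which mirrors the rule that only closed predications can be evaluated to True or False.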
Facts--Facts are predications that contain no unknowns, i.e., they are closed predications. Facts are made up of a predicate followed by n individual constants, where n is the arity. For example, the fact (F1), with arity 2, states that John is the parent of Mary: (PARENT John Mary) (F1)
Queries--Queries are formulated by specifying a desired answer template of unknowns and a conjunction of constraints. A conjunction of constraints, or more simply a conjunction, is composed of one or more predications which constrain the possible values of the answer template. The skeleton of a query is presented below: (ALL (x y . . . n) A&B&C)
where A, B, C are open predications that contain the unknowns x, y, . . . , n and possibly other unknowns. This query is read: return all sets of values for (x y . . . n) which cause A to be true, B to be true, and C to be true. The ALL requests the machine to find all query answers that can be deduced from the asserted facts. It is also possible to request only the first answer(s) found. The predicate of every predication in the constraints is assumed to be the predicate of some set of asserted facts.
The unknowns in a query act as place holders for individual constants. Each unknown is designated by some name (here beginning with a lower case letter). The names themselves play no role, except that when the same name is used more than once within the constraints of a query, each of those places is to be instantiated with the same individual constant. That is, to instantiate an unknown is to replace each and every occurrence of that unknown in a given query with the same individual constant. Different unknowns may be instantiated with different (or, for that matter, the same) constants.
As an example:
(Q2) (ALL (w x y) (PARENT w z) (PARENT z x) (OWES x w y) (≥ y 1000))
The query (Q2) seeks to find all grandchildren, "x," who owe their grandparent, "w," a sum, "y," greater than or equal to 1000 dollars. It can be read, "Find all values (w x y) such that there exists some z where w is a PARENT of z, z is a PARENT of x, x OWES w y dollars, and y is greater than or equal to 1000." The query has four constraints: two with the predicate PARENT, one with the predicate OWES, and one with the predicate ≥.
The ability to assert specific facts and pose flexible (and previously unanticipated) queries is very useful. The Logic Programming field, and to a lesser degree Relational Database Processing, have attempted to achieve this. Logic Programming uses conjunctions of constraints both to express direct queries on the asserted facts and within rules, which allow a hierarchy of queries. Most Logic Programming implementations are based on an approach called the Warren Abstract Machine (WAM), which was described by D. H. D. Warren. Relational Database languages have constructs that can express a conjunction, which is carried out via a combination of multiple join, selection, projection, and intersection operations. The solution of a conjunction-of-constraints query in relational database languages relies most heavily on, and is limited by, the processing of multiple joins. Prior art given below is drawn from these two areas: the WAM and database join techniques.
Warren teaches that conjunction-of-constraint queries may be processed through the use of "unification" (pattern matching) and "backtracking" search strategies of an AND-OR tree. The Warren method can be used to solve any conjunction of constraints, including one with rules, but the time required to do so can be prohibitively high. In the worst case, the time required to process a query grows exponentially with the number of constraint predications. The exact nature of this growth is quite complicated, since the degree of the exponential explosion depends on both the query and the stored predicate facts. This growth increases with the size of the fact files and the percentage of facts which pass the constraints.
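The unification-and-backtracking strategy can be illustrated on query (Q2) with a short Python sketch. This is a simplified depth-first search over facts only, not the WAM itself; the tuple encoding, the function names `unify` and `solve`, the built-in handling of `>=`, and the five sample facts are assumptions made for illustration.

```python
def is_unknown(term):
    """Assumed convention: an unknown is a string starting with a lower case letter."""
    return isinstance(term, str) and term[:1].islower()


def unify(constraint, fact, bindings):
    """Match one constraint against one fact; return extended bindings or None."""
    if constraint[0] != fact[0] or len(constraint) != len(fact):
        return None
    new = dict(bindings)
    for term, value in zip(constraint[1:], fact[1:]):
        if is_unknown(term):
            # Bind the unknown, or check it against its earlier binding.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def solve(constraints, facts, bindings=None):
    """Depth-first backtracking over the conjunction, left to right."""
    bindings = {} if bindings is None else bindings
    if not constraints:
        yield bindings
        return
    first, rest = constraints[0], constraints[1:]
    if first[0] == ">=":
        # Built-in comparison; assumes both arguments are bound by this point.
        left, right = (bindings.get(t, t) if is_unknown(t) else t for t in first[1:])
        if left >= right:
            yield from solve(rest, facts, bindings)
        return
    for fact in facts:  # every stored fact is a candidate at every level
        extended = unify(first, fact, bindings)
        if extended is not None:
            yield from solve(rest, facts, extended)


# Hypothetical factbase.
facts = [
    ("PARENT", "Ann", "Bob"),
    ("PARENT", "Bob", "Cy"),
    ("PARENT", "Bob", "Dee"),
    ("OWES", "Cy", "Ann", 1500),
    ("OWES", "Dee", "Ann", 200),
]

# Query (Q2) with answer template (w x y).
q2 = [("PARENT", "w", "z"), ("PARENT", "z", "x"), ("OWES", "x", "w", "y"), (">=", "y", 1000)]
answers = [(b["w"], b["x"], b["y"]) for b in solve(q2, facts)]
```

Because each constraint level scans the whole factbase for every partial binding produced by the levels before it, the number of unification attempts can multiply with each added constraint, which is the source of the worst-case growth described above.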
It has also been proposed that indexes on the terms in a fact be used both to increase Relational Database query processing speeds and to improve the WAM's efficiency. Indexing is a technique frequently used to speed processing. An index is an auxiliary table that pairs key values with unique fact identifiers. The index provides direct rapid access to individual facts of a predicate. The shortcoming of this technique is that, in order to provide rapid processing of arbitrary queries, all terms of a predicate would require an index. This would require excessive storage and would increase the time needed to add or modify facts.
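A minimal Python sketch of such an index follows. The dictionary layout, the function name `build_index`, and the facts other than (F1) are hypothetical and serve only to show how key values are paired with fact identifiers.

```python
from collections import defaultdict

# Facts keyed by a unique fact identifier; fact 1 is (F1), the others are hypothetical.
facts = {
    1: ("PARENT", "John", "Mary"),
    2: ("PARENT", "Mary", "Sue"),
    3: ("PARENT", "John", "Tom"),
}


def build_index(facts, position):
    """Map each key value at the given term position to the ids of facts holding it."""
    index = defaultdict(set)
    for fact_id, fact in facts.items():
        index[fact[position]].add(fact_id)
    return index


# One index per term position: storage and update cost grow with the arity.
parent_by_first = build_index(facts, 1)
parent_by_second = build_index(facts, 2)

# Direct access to the facts whose first term is John, without scanning all facts.
children_of_john = sorted(facts[i][2] for i in parent_by_first["John"])
```

Supporting arbitrary queries on PARENT already needs two such tables, and every added or modified fact must update each of them, which is the storage and maintenance cost noted above.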
Knuth teaches that a Superimposed Code Word technique (SCW) be used to aid query processing. SCW is a special index which is formed by hashing each term of each fact to produce a fixed length binary code word. The binary code words of the terms of a fact are then ORed together to create the SCW for the fact which, together with the fact's unique identifier, forms the index table entry. This SCW index can aid query processing. The SCW index also requires excessive storage and increased time to add or modify facts. More importantly, however, the effectiveness of SCW decreases, producing excessive false drops, if flexible queries are to be allowed, since every term permitted in the queries must be used to form the SCW. The false drop problem requires that answers be verified in the actual database. This causes rapid degradation of performance as the number of potential answers grows.
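The SCW idea and the need for verification can be sketched in Python as follows. The code word width, the number of bits set per term, the use of SHA-256 as the hash, the function names, and the sample facts are all assumptions for illustration; a practical SCW scheme would tune these parameters.

```python
import hashlib

WIDTH = 16         # assumed code word length in bits
BITS_PER_TERM = 2  # assumed number of bits set per term


def code_word(term):
    """Hash a term to a fixed length binary code word with a few bits set."""
    digest = hashlib.sha256(str(term).encode()).digest()
    word = 0
    for byte in digest[:BITS_PER_TERM]:
        word |= 1 << (byte % WIDTH)
    return word


def scw(terms):
    """OR the code words of the given terms into one superimposed code word."""
    word = 0
    for term in terms:
        word |= code_word(term)
    return word


def candidates(index, pattern):
    """Ids whose SCW covers the query's code word; may include false drops."""
    query_word = scw(t for t in pattern if t is not None)
    return [fid for fid, word in index if (word & query_word) == query_word]


def verify(facts, ids, pattern):
    """Check candidates against the stored facts to eliminate false drops."""
    return [
        fid for fid in ids
        if all(q is None or q == f for q, f in zip(pattern, facts[fid]))
    ]


# Hypothetical subjects of OWES facts, keyed by fact identifier.
facts = {1: ("John", "Mary", 1000), 2: ("Sue", "Mary", 1000), 3: ("John", "Tom", 50)}
index = [(fid, scw(terms)) for fid, terms in facts.items()]

pattern = (None, "Mary", 1000)  # None marks an unknown, as in (P1)
maybe = candidates(index, pattern)
answers = verify(facts, maybe, pattern)
```

The filter never misses a true answer, because every bit set for a query term is also set in the SCW of any fact containing that term. It can, however, admit facts whose other terms happen to set the same bits, which is why the `verify` pass against the stored facts is required.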
Berra teaches that a Concatenated Code Word technique (CCW) be used to aid query processing. CCW is similar to SCW except that the binary code word of each term now varies in length depending on the redundancy of the term. These varying length binary code words for each term in a fact are then appended to each other, rather than ORed, to produce the index. The CCW index also provides an aid in query processing. It suffers from increased storage requirements and from the false drop problem.
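The contrast with SCW can be sketched in Python as follows. How the per-term code lengths are derived from term redundancy is not specified here, so the fixed `WIDTHS` tuple is a stand-in assumption, as are the SHA-256 hash, the function names, and the sample facts.

```python
import hashlib

WIDTHS = (4, 4, 8)  # assumed code length in bits for each term position


def field_code(term, width):
    """Hash a term to a code of the given bit width."""
    digest = hashlib.sha256(str(term).encode()).digest()
    return int.from_bytes(digest[:4], "big") % (1 << width)


def ccw(terms, widths=WIDTHS):
    """Append the per-term codes (rather than OR them) into one code word."""
    word = 0
    for term, width in zip(terms, widths):
        word = (word << width) | field_code(term, width)
    return word


def matches(word, pattern, widths=WIDTHS):
    """Compare only the fields of known terms; None marks an unknown."""
    shift = sum(widths)
    for term, width in zip(pattern, widths):
        shift -= width
        if term is not None:
            field = (word >> shift) & ((1 << width) - 1)
            if field != field_code(term, width):
                return False
    return True


# Hypothetical subjects of OWES facts, keyed by fact identifier.
facts = {1: ("John", "Mary", 1000), 2: ("Sue", "Mary", 1000), 3: ("John", "Tom", 50)}
index = {fid: ccw(terms) for fid, terms in facts.items()}

pattern = (None, "Mary", 1000)
maybe = [fid for fid, word in index.items() if matches(word, pattern)]
```

Because each term keeps its own field, a query term is compared only against the code at its own position. Two different terms can still hash to the same short code, so candidates in `maybe` must still be verified against the stored facts.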
Babb has proposed that a Content Addressable File Store (CAFS) be used to process queries. CAFS uses a hashing technique, hardware selection filtering, and single bit maps to record intermediate results to aid in the join and projection operations. The CAFS system does not consider the processing of conjunctions involving joins on more than one variable nor the possible filtering that can be performed based on variable interactions.