This specification relates to data processing, in particular, to processing recursive statements.
Some query languages support recursive statements, which are statements that reference their own output. Example query languages that support recursive statements include Datalog and SQL.
Query languages operate on relations. A relation is a set of tuples (t1, . . . , tn), each tuple having n data elements ti. Each element ti represents a corresponding value, which may represent a value of a corresponding attribute having an attribute name. Relations are commonly thought of as, represented as, and referred to as tables in which each row is a tuple and each column is an attribute. However, a relation need not be implemented in tabular form, and the tuples belonging to a relation can be stored in any appropriate form.
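As a minimal sketch of this definition, a relation can be modeled in Python as a set of tuples. The sample names and the helper `project` are hypothetical and serve only as illustration; they are not part of any particular query language.

```python
# A relation modeled as a set of tuples (hypothetical sample data).
# Each tuple is (parent_name, child_name).
parent = {
    ("Alice", "Bob"),   # Alice is a parent of Bob
    ("Bob", "Carol"),
    ("Carol", "Ted"),
}

def project(relation, index):
    """Return the set of values found at one attribute position."""
    return {row[index] for row in relation}

# The first attribute of every tuple, i.e. everyone who is a parent.
parents_only = project(parent, 0)
```

Because the relation is a set, adding a tuple that is already present leaves it unchanged, which matches the set semantics described above.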
The following query language expression, expressed in Datalog-like pseudocode, is an expression having a recursive term:
ancestor(x,y):-parent(x,y) OR exists(z: parent(x,z) AND ancestor(z,y))
This statement recursively defines what it means for someone to be an ancestor according to tuples belonging to a “parent” relation that defines parent/child relationships.
The statement indicates that a person x is an ancestor of a person y if (1) the person x is a parent of the person y, according to the “parent” relation, or if (2) there is an intervening person z such that the person x is the parent of the person z according to the “parent” relation and the person z is an ancestor of the person y according to the “ancestor” relation. This statement recursively defines tuples belonging to the “ancestor” relation because the definition of “ancestor” depends on its own output.
As used in this description, an expression is a query language statement having one or more terms. A term includes parts of a program that can be joined by conjunctions and disjunctions. Terms therefore can include predicate calls, existentials, and universals, to name just a few examples.
The :- symbol is an assignment operator that assigns the predicate expression ancestor(x,y) to have the body “parent(x,y) OR exists(z: parent(x,z) AND ancestor(z,y)).” The second term in this definition has an existential quantifier, represented by the “exists” operator. A term having an existential quantifier may be referred to as an existential term. This existential term asserts that there is a person z such that a person x is the parent of the person z and that the person z is an ancestor of a person y.
The semantics of this statement in a query language having fixed point semantics is that the body of ancestor(x,y) is evaluated to compute an associated “ancestor” relation for the predicate. The associated ancestor relation can be defined by computing the least fixed point of the statement. A fixed point for a statement is a set of tuples that, when provided as input to the statement, reproduces the set of input tuples. In other words, a fixed point exists when a set of output tuples generated from a set of input tuples is identical to the set of input tuples.
The least fixed point is a fixed point having tuples that are contained within all other fixed points for the statement. In this example, the least fixed point ends up being a set of tuples (x,y) that make the definition of ancestor(x,y) true for all available input data. Evaluation engines that evaluate recursive predicates can compute a least fixed point for a statement by providing the empty relation as input to the statement and then repeatedly providing the result to the statement. When no additional tuples are generated, the least fixed point has been reached.
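The iteration described above can be sketched in Python as follows. This is an illustrative sketch, not the implementation of any particular evaluation engine; the function names `ancestor_body` and `least_fixed_point` and the sample data in `PARENT` are assumptions introduced here.

```python
def ancestor_body(parent, ancestor):
    """Apply the body of the ancestor statement once.

    Returns parent(x,y) OR exists(z: parent(x,z) AND ancestor(z,y)),
    where `ancestor` is the relation provided as input.
    """
    via_intermediate = {
        (x, y)
        for (x, z) in parent
        for (z2, y) in ancestor
        if z == z2
    }
    return set(parent) | via_intermediate

def least_fixed_point(step):
    """Start from the empty relation and reapply `step` until nothing changes."""
    current = set()
    while True:
        result = step(current)
        if result == current:  # output reproduces the input: a fixed point
            return current
        current = result

# Hypothetical sample data.
PARENT = {("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "Ted")}
ANCESTOR = least_fixed_point(lambda relation: ancestor_body(PARENT, relation))
```

Feeding `ANCESTOR` back into `ancestor_body` returns the same set of tuples, which is the defining property of a fixed point.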
A related result is the greatest fixed point. The greatest fixed point is a fixed point whose tuples include the tuples of all other fixed points for the statement. A greatest fixed point can be computed by providing the set of all tuples as input to the statement and then repeatedly providing the result to the statement until the result no longer changes.
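To illustrate how the two fixed points can differ, the sketch below uses a hypothetical statement that does not appear in this description: live(x):-exists(y: edge(x,y) AND live(y)). The `edge` data, the finite `UNIVERSE`, and all function names are assumptions made for the example.

```python
def live_body(edge, live):
    """Apply the body of the hypothetical statement once:
    live(x) :- exists(y: edge(x,y) AND live(y))."""
    return {x for (x, y) in edge if y in live}

def iterate_to_fixed_point(step, start):
    """Reapply `step` beginning at `start` until the result stops changing."""
    current = set(start)
    while True:
        result = step(current)
        if result == current:
            return current
        current = result

# Hypothetical data: nodes 1 and 2 form a cycle, 3 leads into it, 4 leads to a dead end.
EDGE = {(1, 2), (2, 1), (3, 1), (4, 5)}
UNIVERSE = {1, 2, 3, 4, 5}

def step(live):
    return live_body(EDGE, live)

LEAST = iterate_to_fixed_point(step, set())        # start from the empty relation
GREATEST = iterate_to_fixed_point(step, UNIVERSE)  # start from all tuples
```

Under these assumptions the least fixed point is empty, while the greatest fixed point retains the nodes that can reach the cycle.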
The semantics of query languages often means that after computing tuples that belong to a particular relation for a statement, e.g., the “ancestor” relation, when actual tuple values are provided as input to the statement, the statement evaluates to true if the input tuple occurs in the relation and evaluates to false otherwise. In other words, “ancestor(x,y)” means the set of all ancestor pairs, while “ancestor(“Bob”, “Ted”)” means that the tuple (“Bob”, “Ted”) is in the “ancestor” relation. For example, after computing tuples that belong to the “ancestor” relation, when an actual tuple value having values for x and y is provided as input to the “ancestor” predicate, the statement evaluates to true if the ancestor relation has a tuple with the values for x and y. The statement evaluates to false otherwise. Evaluation of predicates is typically performed by an evaluation engine for the query language implemented by software installed on one or more computers.
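The membership semantics can be sketched as a simple lookup. The relation contents and the helper `holds` below are hypothetical; a real evaluation engine would compute the relation rather than list it by hand.

```python
# A previously computed "ancestor" relation (hypothetical contents).
ancestor = {("Bob", "Ann"), ("Ann", "Ted"), ("Bob", "Ted")}

def holds(relation, *values):
    """Return True if the tuple of values occurs in the relation."""
    return tuple(values) in relation

bob_ted = holds(ancestor, "Bob", "Ted")   # the tuple is in the relation
ted_bob = holds(ancestor, "Ted", "Bob")   # the tuple is not in the relation
```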
The relation for which a predicate is evaluated may be specified within the body of the predicate, by tuples in another relation, or some combination of these. In this example, the body of ancestor(x,y) itself indicates that some tuples will be specified by the “parent” relation. However, other tuples will be deduced by evaluating the recursive statement from the body of the predicate.
Some methods for finding fixed points of a recursive statement recast the statement into a number of nonrecursive evaluation statements. The evaluation statements are then evaluated in sequence until a least fixed point is reached. In general, recasting a recursive statement into a number of nonrecursive evaluation statements may be referred to as “flattening” the recursion.
An evaluation engine for a particular query language can recast a recursive statement as follows. A first nonrecursive statement is defined as being the empty relation. A sequence of subsequent nonrecursive evaluation statements is defined according to the body of the recursive statement. In doing so, the evaluation engine can replace each recursive term with a reference to a previous nonrecursive evaluation statement. Logically, the number of nonrecursive evaluation statements that can be generated is unbounded. However, the evaluation engine will halt evaluation when a fixed point is reached.
The evaluation engine then evaluates the nonrecursive evaluation statements in order and adds any resulting tuples to the associated relation for the statement. The evaluation engine stops when a nonrecursive evaluation statement is reached whose evaluation adds no additional tuples to the associated relation. The final result is the associated relation for the recursively defined statement.
When using this strategy for recursive evaluation, evaluating each successive evaluation statement regenerates all of the results that have already been generated. Because of this inherent duplication, this approach is sometimes referred to as “naive evaluation.” As the associated relation for a recursive statement grows larger, each iteration requires more time and effort to compute because every iteration recomputes all of the previously computed results.
To implement naive evaluation for the “ancestor” predicate above, an evaluation engine can recast the “ancestor” predicate into the following nonrecursive evaluation predicates:
ancestor0(x,y):-{ }
ancestor1(x,y):-parent(x,y) OR exists(z: parent(x,z) AND ancestor0(z,y))
ancestor2(x,y):-parent(x,y) OR exists(z: parent(x,z) AND ancestor1(z,y))
. . .
Or, for brevity, the evaluation predicates may be represented as:
ancestor0(x,y):-{ }
ancestorn+1(x,y):-parent(x,y) OR exists(z: parent(x,z) AND ancestorn(z,y))
At first glance, this notation may look like a recursive definition, but it is not. This is because the subscripts of the predicates denote different nonrecursive predicates occurring in a potentially unbounded sequence of evaluation predicates. In other words, the predicate ancestorn+1 is not recursive because it references ancestorn, but not itself.
The evaluation engine then evaluates the nonrecursive predicates in sequence to find the least fixed point. One primary drawback with naive evaluation is that each time ancestorn(z, y) is evaluated, the system reproduces all the tuples that have ever been generated. This effect is a notorious computational bottleneck in both time and space.
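The duplication can be made visible with a small sketch of naive evaluation for the “ancestor” predicate. The function name `naive_evaluation`, the bookkeeping list `sizes`, and the sample data are assumptions added for illustration.

```python
def naive_evaluation(parent):
    """Evaluate ancestor0, ancestor1, ... in sequence until nothing changes.

    Returns the final relation and the number of tuples produced by each
    evaluation predicate, to show how much work is repeated.
    """
    ancestor = set()  # ancestor0 is the empty relation
    sizes = []
    while True:
        # ancestor(n+1) is computed from scratch using only ancestor(n).
        next_ancestor = set(parent) | {
            (x, y)
            for (x, z) in parent
            for (z2, y) in ancestor
            if z == z2
        }
        sizes.append(len(next_ancestor))
        if next_ancestor == ancestor:  # no additional tuples: fixed point
            return ancestor, sizes
        ancestor = next_ancestor

# Hypothetical sample data: a chain of four people.
PARENT = {("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "Ted")}
RELATION, SIZES = naive_evaluation(PARENT)
```

For this sample chain the successive predicates produce 3, 5, 6, and 6 tuples, so every iteration after the first regenerates all tuples found earlier before contributing anything new.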
Another prior art procedure for evaluating recursive statements is often referred to as “semi-naive evaluation.” When using semi-naive evaluation, an evaluation engine flattens the recursion of a recursive statement in a different way than for naive evaluation. In particular, the evaluation engine defines a “delta predicate” whose associated relation is defined to include new tuples generated on each iteration. The least fixed point is then found when an iteration is reached in which the delta predicate's associated relation is empty. In other words, when the delta predicate is empty, no new tuples have been generated, and the least fixed point has been reached.
To evaluate a recursive statement having N recursive calls, an evaluation engine generates N different evaluation predicates. Each of the N evaluation predicates replaces a different recursive call with a call to the delta relation. The overall delta relation is then defined as a disjunction of all N evaluation predicates.
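As a sketch of the case N = 2, consider a hypothetical variant of the ancestor statement with two recursive calls, which is not taken from this description: ancestor(x,y):-parent(x,y) OR exists(z: ancestor(x,z) AND ancestor(z,y)). The helper `compose`, the function name, and the sample data are assumptions.

```python
def compose(left, right):
    """Return {(x, y) | exists z: left(x, z) AND right(z, y)}."""
    return {(x, y) for (x, z) in left for (z2, y) in right if z == z2}

def semi_naive_two_calls(parent):
    """Semi-naive evaluation of the hypothetical two-call variant."""
    ancestor = set()
    delta = set()
    while True:
        first = compose(delta, ancestor)    # first recursive call replaced by delta
        second = compose(ancestor, delta)   # second recursive call replaced by delta
        # The overall delta is the disjunction of the evaluation predicates
        # (together with the nonrecursive term), minus tuples already known.
        delta = (set(parent) | first | second) - ancestor
        if not delta:  # empty delta: least fixed point reached
            return ancestor
        ancestor |= delta

# Hypothetical sample data.
PARENT = {("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "Ted")}
RESULT = semi_naive_two_calls(PARENT)
```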
For example, an evaluation engine can use semi-naive evaluation to recast the ancestor definition into the following evaluation predicates:
Δancestor0(x,y):-{ }
ancestor0(x,y):-{ }
Δancestorn+1(x,y):-(parent(x,y) OR exists(z: parent(x,z) AND Δancestorn(z,y))) AND NOT ancestorn(x,y)
ancestorn+1(x,y):-ancestorn(x,y) OR Δancestorn+1(x,y)
In this example, the delta predicate also includes an explicit “AND NOT” term that defines the delta relation to exclude any previously generated tuples.
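The evaluation predicates above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name `semi_naive_evaluation`, the list `delta_sizes`, and the sample data are introduced here, and set difference stands in for the “AND NOT” term.

```python
def semi_naive_evaluation(parent):
    """Evaluate the delta and ancestor predicates until the delta is empty.

    Returns the final relation and the size of each nonempty delta.
    """
    ancestor = set()  # ancestor0
    delta = set()     # delta ancestor0
    delta_sizes = []
    while True:
        # parent(x,y) OR exists(z: parent(x,z) AND delta_n(z,y))
        candidates = set(parent) | {
            (x, y)
            for (x, z) in parent
            for (z2, y) in delta
            if z == z2
        }
        delta = candidates - ancestor  # the AND NOT ancestor_n(x,y) term
        if not delta:                  # no new tuples: least fixed point
            return ancestor, delta_sizes
        delta_sizes.append(len(delta))
        ancestor |= delta              # ancestor_n OR delta_(n+1)

# Hypothetical sample data: a chain of four people.
PARENT = {("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "Ted")}
RELATION, DELTA_SIZES = semi_naive_evaluation(PARENT)
```

For this sample chain the deltas contain 3, 2, and 1 tuples, so the recursive term is joined only against the tuples that were new in the previous iteration.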
Therefore, semi-naive evaluation tends to produce fewer duplicate tuples compared to naive evaluation. However, semi-naive evaluation is defined only for a very restricted class of programs, in particular, programs that contain only disjunctions of conjunctions of predicate calls.
In addition, in practice, computing the “AND NOT” term in Δancestorn+1(x,y) for semi-naive evaluation can be extremely expensive for any data set of considerable size. As the associated relation grows, computing what is not in the relation becomes a more computationally expensive task because each newly generated tuple must be checked against every other tuple in the relation to determine whether or not it has already been generated.
Worse, semi-naive evaluation is not defined for more complex programs, so for some expressions it cannot be used at all. In contrast, because naive evaluation simply recomputes all the previously generated tuples on every iteration, that approach is computationally sound: all the required information is always available to compute the correct result.
In sum, naive evaluation is very computationally inefficient in time and space. While semi-naive evaluation provides some computational benefits, it is not defined for all expressions.