This invention relates to database management systems and methods and, more particularly, to intelligent database systems and methods that can deduce new information from available data.
Database management systems have been traditionally used in data processing applications such as payroll processing, financial accounting, and transaction processing. Now, database management systems are being increasingly put to use in new applications such as in expert database systems, and in response thereto new database management systems and machines are being designed.
Commonly, relational databases comprise a large number of entries and each entry is composed of a number of fields. The entries are sometimes referred to as records or tuples. Each tuple in a database generally consists of the same fields arranged in the same order, with the difference between the tuples being in the values of the fields. Information is retrieved from such databases through a relatively limited number of basic, or primitive, operations which may be combined to form very powerful queries.
One such primitive operation, for example, is "select" on a value of a prechosen field. This operation ignores all fields other than the selected field, searches in the selected field (through the tuples) for values that meet the specified condition, and retrieves the tuples that meet the search criterion (or a chosen portion thereof).
Another primitive operation is "union", where two databases which are defined over the same fields are concatenated to form a larger database. When the catenation is effected, some tuples may appear more than once, so part of the "union" operation deletes the redundant tuples. A "difference" primitive operation also considers two separate databases, but the result of the difference operation is a database that contains only the tuples that appear in the first database but no in the second database.
Still another primitive operation is known as the "join" operation. It permits a joinder of those portions of two databases that meet a chosen criterion. For example, when a join is specified on a particular field in each of two databases under consideration, those tuples in the two databases that match (in value) in the two chosen fields are concatenated to form new tuples which form a new database. Naturally, the two chosen fields should have values over a common range; otherwise no matches would be found.
A careful consideration of the above primitive operations reveals that much information can be gained by judiciously querying databases to extract the information that was explicitly inserted therein. But, databases also possess a lot of information that has not been directly, or explicitly, inserted. That information can also be extracted with the known primitive operations, but not very efficiently. For example, given a database with three fields where two of them represent cities that have a direct freeway linking them, and a third field giving the distance between the two cities along this freeway, one may wish to find a route to get from one city to another that is not directly linked to it by a freeway. Given only the conventional primitives, such a query can be answered only through a program that contains a large number of the primitive operations. If, however, a primitive operation were available that was capable of efficiently manipulating the database so that the query could be answered directly, many heretofore difficult tasks could be easily accomplished. The capability needed for the above example is generally referred to as "transitive closure". It is an often needed capability.
Many artisans have recognized the need to extend the set of primitive operations on databases, and some have suggested primitives that, in essence, are the transitive closure operation. Examples of these proposals are the "alpha" operator proposed by R. Agrawal in "Alpha: An Extension of Relational Algebra to Express a Class of Recursive Queries", Proc. 3rd Int'l Conf. on Data Engineering, Los Angeles, California, Feb. 1987; the "traversal recursion" operator proposed by A. Rosenthal et al. in "Traversal Recursion: A Practical Approach to Supporting Recursive Applications", Proc. ACM-SIGMOD 1986 Int'l Conf. on Management of Data, Washington D.C., May 1986; and the "*" operator proposed by R. Kung et al. in "Heuristic Search in Data Base Systems", Proc. 1st Int'l Workshop Expert Database Systems, Kiawah Island, South Carolina, Oct. 1984.
Before describing the methods and techniques heretofore employed in the art to achieve the transitive closure operation, it may be useful to present a pictorial representation of transitive closure.
As stated above, a database relation consists of a set of tuples, each of which consists of a number of fields. The values in each field are defined over a certain range. A compound field in a database consists of two or more fields considered together. Often, two fields (or two compound fields) in a database relation will be defined over the same range. When each value present in such two fields is represented by a node, and each tuple is represented as a directed link (or an arrow) that connects the two values found in the fields of the tuple, then the resulting graph (collection of nodes and connecting links) represents the database. The additional fields associated with each tuple in the database relation can be used to label the links in the graph. As an aside, the reason for using directed rather than undirected links is to distinguish between the two fields participating in the relation that were used to define the graph.
Turning attention to the graph itself, a "path" in such a graph consists of a set of consecutive links that can be followed to reach a selected destination node of the graph from a chosen source node of the graph. By our definition, the "transitive closure" of a graph is a new graph with a link between every pair of nodes between which there exists a path. Labels existing on the links of the original graph are transformed in accordance with a chosen function to form new labels for the links created by the transitive closure. Sometimes, even if a link already exists between a pair of nodes, the discovery of a different path could cause the label on such a link to be transformed according to some prescribed function. The transitive closure of a database relation is the database relation corresponding to the transitive closure of the graph of the original database relation given.
Much of the success of the past database systems can be attributed to the discovery of efficient methods for implementing various primitive operators for databases. Following the same lead, designers have attempted to design methods for computing transitive closure of a database relation. Most of these methods can be classified as iterative methods.
The semi-naive method, described by A. F. Bancilhon in "Naive Evaluation of Recursively Defined Relations", Tech. Rept. DB-004-85, MCC, Austin, Texas, starts with paths of length one and in each iteration finds paths that are one longer. The logarithmic method is described by P. Valduriez et al. in "Evaluation of Recursive Queries Using Join Indices", Proc. 1st Int'l Conf. Expert Database Systems, Charleston, South Carolina, April 1986, and by Y. E. Ioannidis in "On the Computation of the Transitive Closure of Relational Operators", Proc. 12th Int'l Conf. Very Large Data Bases, Kyoto, Japan, Aug. 1986. This method computes in each iteration all paths of lenght up to twice the lenght of paths already known. Variations on the above methods have been described by Y. E. Ioannidis in "On the Computation of the Transitive Closure of Relational Operators", Proc. 12th Int'l Conf. Very Large Data Bases, Kyoto, Japan, Aug. 1986, U. Guntzer et al. in "On the Evaluation of Recursion in Deductive Database Systems by Efficient Differential Fixpoint Iteration", Proc. IEEE 3rd Int'l Conf. Data Engineering, Los Angeles, California, Feb. 1987, and H. Lu "New Strategies for Computing the Transitive Closure of a Database Relation", Proc. 13th Int'l Conf. Very Large Data Bases, Brighton, England, Sept. 1987.
There are two major problems with the above prior art methods. First, the termination of the iterative methods depends on the longest path in the graph, and if the graph has long paths, many iterations are required before the results is obtained. Second, the iterative methods end up creating an enormous number of duplicates (because, when a path between a pair of nodes is discovered, it is not easy to verify whether a path between this pair of nodes already exists), particularly if the graph contains cycles, and incurs a heavy performance penalty in removing them.
Methods have also been devised for computing transitive closure of boolean matrices (that is, matrices with only 1 or 0 as their elements). Some well known methods of this type are the method described by S. Warshall in "A Theorem on Boolean Matrices", Journal of ACM, Vol. 9, No. 1, Jan. 1962, by H. S. Warren in "A Modification of Warshall's Algorithm for the Transitive Closure of Binary Relations", Communications of ACM, Vol. 18, No. 4, April 1975, and by C. P. Schnorr in "An Algorithm for Transitive Closure with Linear Expected Time", SIAM Journal of Computing, Vol. 7, No. 2, May 1978.
In accordance with the Warshall algorithm, given an initial v by v Boolean matrix of elements a.sub.ij over a v node graph, with a.sub.ij being 1 if there is an arc from node i to node j and 0 otherwise, one can obtain the transitive closure of the given boolean matrix as:
______________________________________ For k = 1 to v For i = 1 to v For j = 1 to v a.sub.ij = a.sub.ij .orgate. (a.sub.ik .andgate. a.sub.kj) ______________________________________
In graph terms, one can understand the algorithm as follows:
______________________________________ For every node k For every predecessor i of k For every successor j of k Make j a successor of i, and make i a predecessor of j. ______________________________________
If the graph is represented as a relation with each tuple representing an arc, one has to maintain the relation sorted two ways, to obtain the effect of having both predecessor and successor lists. The implementation of the Warshall algorithm then involves, for each node, fetching its successor and predecessor lists, and then for each successor (predecessor) fetching its predecessor (successor) list for possible update. It is also possible to keep only one list, say a successor list, and scan through all the lists each time to determine the predecessor of a node.
Warren noted that the Warshall algorithm involved fetching random bits from the Boolean matrix, and proposed a modification that would permit direct operation upon Boolean vectors without the overhead of bit extraction; but in two passes:
______________________________________ For i = 1 to v For k = 1 to i-1 For j = 1 to v a.sub.ij = a.sub.ij .orgate. (a.sub.ik .andgate. a.sub.kj) For i = 1 to v For k = i + 1 to v For j = 1 to v a.sub.ij = a.sub.ij .orgate. (a.sub.ik .andgate. a.sub.kj) ______________________________________
The only change in the Warren algorithm is that the i and k loops have been interchanged. However, this interchange could result in some paths being missed out and so the algorithm now requires two passes before it completes. The modification in the range of the second loop index is an optimization that reduces the cost of two passes. In each pass the computation is essentially:
______________________________________ For every node i For every successor k of i within some range For every successor j of k Make j a successor of i. ______________________________________
The problem with these methods is that they cannot be applied to efficiently compute the transitive closure of database relations unless the whole database and the result of the computation of the transitive closure can fit in the main memory of the computer. Unfortunately, most of the real life databases are too big for that.