1. Field of the Invention
The present invention relates to computer databases, and in particular to an improved method and system for storing and retrieving data having multiple symbolic or categorical dimensions, through use of a partial-order to support data mining and OLAP queries.
2. Description of Relevant Art
To model data on a computer one needs a simple data structure; to model information one needs a more complex schema, or schematic, of the types of information one intends to store. In a business database, schemas tend to explicitly name all the data fields (organized into tables) and to describe relationships between sets of fields (tables). In an AI knowledge database one needs first an ontology, or hierarchy of types, and then a schema based more or less on frames. As explained by Marvin Minsky in 1975: “We can think of a frame as a network of nodes and relationships.” Frames contain meta-information and also, at the lower level, slots to be filled with specific data instances. If one adds careful typing to the kinds of links which may exist between nodes, the result is a clearly defined semantic network such as was used in Ronald Brachman's KL-ONE project. Conceptual graphs are a fusion of semantic networks with a graphical form of logic developed by Charles Peirce at the end of the 19th century. CGs are suited to reasoning, especially reasoning with contexts.
The relational database model, introduced in 1970 by Ted Codd, is based on the notion that accuracy is best served by decomposing information into separate tables so that the only redundant data stored are keys. Most relationships exist only in potential, then, until a query is used to restore them via a JOIN operation. The mathematics for the model are formal and elegant, yet in practice the operations are computationally expensive, hence much effort is made to create and store indices to help speed them up. The model is not well-suited to handling whole-part or hierarchical relations and neither did SQL, its prevalent query language, support them well until recently.
A table may be viewed as a two-dimensional array, where one dimension distinguishes data records and the other selects different fields within a record. Fields may have a special value called “null” for cases where this datum is unknown or doesn't exist. Null values cause difficulties in both the formal relational model and in actual relational database systems. But in a pure CG database, not all graphs need be the same size, so null values do not appear to be a problem.
The multi-dimensional database (MDD), which provides the model or computational foundation upon which online analytical processing (OLAP) is based, derives from econometric research projects done at MIT in the 1960's. Its data structures store pre-computed statistical data to provide quick response for a real-time stream of queries from many simultaneous users. Typically these users are trying to make business decisions in areas such as marketing, budget analysis and quality control. In this model, data are broken into “meaures” (fields with quantitative, numerical values) and “dimensions” (fields with symbolic, categorical values). Arranging symbolic fields into the cross-tabular form seen in spreadsheets yields a 2-dimensional matrix from which charts may be plotted. One can also examine more than two dimensions—hence the term “data cube”. Each dimension is associated with a type hierarchy: typically a tree with all leaves at the final level. Whether the underlying data source is relational or multi-dimensional, much research into efficient computation of the CUBE operator has been done.
A feature of an MDD system is the ability to aggregate or “roll up” data from lower dimensions when creating summarizations at the higher levels. An example would be a trends report that gave dollars revenue for State and Vehicle Class on a Quarter by Quarter sequence, even though the data elements are stored by Dealership, Model and Month. Whereas relational systems are designed to handle business transactions which continually update the database, MDD and OLAP systems are optimized for querying data which is updated only at infrequent intervals. Often an MDD system for business applications is loaded with de-normalized (pre-joined) data from a relational database.
OLAP systems generally take the approach of pre-computing summary data for all possible cubes. The effect is to limit the practical number of dimensions (symbolically-valued fields) to less than ten. Even with few dimensions, they must compress their cubes to save storage costs by using sparse-matrix techniques.
3. Details of Prior Art
An essential issue with cube methods, and related partitioning and indexing techniques, is the property called scalability. Many solutions which work well with, say, five dimensions in each of which are ten categories, would not do well if presented with a data set having twenty dimensions in each of which are one hundred categories. Therefore “combinatorial explosion” is a key factor in evaluating how well a method scales in terms of its execution time and memory requirements. Other factors which affect scaling are the number of data points stored, and their distribution (clustering). In the general field of analysis methods for multidimensional data with hierarchical categories, there are two main areas of focus. Data Mining seeks correlations inside a data set. Online Analytical Processing (OLAP) derives summations or other calculations from a data set according to a specific query submitted by a user or an automated report. When attempting to discover correlations or answer queries about multiple parameters (also called “measures” and “dimensions”) in the data set, two fundamental techniques are useful:    1) the ability to quickly retrieve those nodes (data points) which are pertinent;    2) the ability to have some retrieving be from pre-computed summary nodes.
These techniques become more important as the size and complexity of the data grow. When a parameter takes on symbolic (categorical) values, complexity will increase sharply if the “types” associated with it belong to a multi-level hierarchy. When more of these parameters are added to the data, the number of distinct summary nodes possible increases exponentially, so care should be taken lest the cost of having summary nodes outweighs their benefit. The prior state of the art did not manage this trade-off well.
U.S. Pat. No. 5,359,724 (1994) to Earle provides access to sparse data by having a two-level index; one level has those dimensions which the user believes will result in dense data, and another level has those dimensions he believes will result in sparse data. The index for dense data trades wasted space for faster access, but any query that involves dimensions in the sparse index will incur a slower search. This compromise does little to assure efficiency, and what efficiency there is depends on information supplied by the user as to which dimensions are designated for faster indexing.
U.S. Pat. No. 5,511,190 (1996) to Sharma et al. improves performance of SQL queries involving groups, through use of a hash-based lookup for pre-computed partial aggregate data which may be combined to produce the answer. The amount of improvement claimed due to this hashing method is modest, and decreases as the number of groups included increases. Limitations include its reliance on an RDBMS and SQL, and its lack of support for hierarchical levels within columns.
Other methods, including those in U.S. Pat. No. 6,154,746 (2000) to Berchold et al., U.S. Pat. No. 6,003,036 (1999) to Martin, and U.S. Pat. No. 6,212,515 (2001) to Rogers, deal with partitioning data on secondary storage and do not pre-aggregate. The technique described in U.S. Pat. No. 5,890,151 (1999) to Agrawal et al., which uses covering codes to help retrieve pre-aggregated values for combining into the answer to a query, seems to have drawbacks in its scalability and flexibility.
Another type of analytical processing for multi-dimensional data is the operation called “clustering”, which can group objects or data points according to their similarity (proximity in N-dimensional space). This can be used to return the “nearest neighbors”, or most similar data records, in response to a query pattern, but is not well-suited for the hierarchical aggregation operations required for business data analysis. U.S. Pat. No. 6,134,541 (2000) to Castelli et al. describes such a system.
Although U.S. Pat. No. 5,978,788 (1999) to Castelli et al. describes a technique for generating data cubes capable of providing approximately correct answers using aggregation and OLAP data cubes, neither it nor any other of the techniques discussed so far claim to support hierarchical fuzzy pattern matching for answering aggregation queries. Nor do they offer a method to support aggregation where types in a categorical dimension exhibit the property of multiple inheritance. So many prior art techniques are inflexible in the sense that they are not easily adapted for fuzzy pattern matching, multiple inheritance, or the “children-of” function described in the following section.
More recent inventions include U.S. Pat. No. 6,487,546 (2002) to Witkowski which seems to ask, ‘since a database system almost certainly has a B-tree index why not add a little aggregation data to it?’ The answer is just as straightforward: because it is generally of no use whatsoever in answering multi-dimensional OLAP queries. More significant is the work of Proctor (2001 U.S. Pat. No. 6,289,352) with his ‘virtual storage rack’, and of Bakalasb et al. (U.S. RPA 2002/0194167) with their segmented aggregation and ‘slice-storage method.’ Although these offer certain incremental advantages over the prior art, they are limited by the row/column mindset that comes out of the relational database tradition.
Johnson and Shasha in U.S. Pat. No. 6,424,967 (2002) take an alternative ‘cube-forest’ approach, but the benefits are limited due to their reliance on trees rather than DAGs (directed-acyclic graphs). In particular, they suggest the application must‘update-early-and-in-large-batches’ due to the high cost for updates, which is also largely true of aforementioned methods including Agrawal (1999), Proctor (2001) and Bakalash. Fortunately, this is not true of the present invention which, unlike the prior art, teaches the use of partial-order indexing and traversal for high efficiency with sparse data. This method and its advantages are not obvious nor easily appreciated upon first reading. This is partly due to there being no prior use in this field of art, and partly due to the surprising benefits. Note also that the technique bears no direct correspondence to the relational database-owing more to multi-dimensional databases, Information Retrieval and mathematics (set theory).
Clarification of Terminology
A topologically-ordered search is a breadth-first search which is guaranteed to finish with all levels≦k before comparing the key for any node at level k+1. (In this context, a node's level is its maximum path length from Top.) Associative retrieval uses a data pattern to search a database. Even if the query pattern has no duplicate in the database, retrieval may still occur once the pattern's proper position in the database is found. In the analytical field, the typical response is to return the specializations of that pattern. (That is, to return those data nodes which satisfy the query by being specific instances of it.) This is an essential point that prior art missed: given sparse multi-dimensional data, great efficiencies in time and space are achieved using associative search (in a unified partial-order data structure) combined with on-the-fly aggregation.
As an example, consider the following query loosely taken from the industry standard OLAP APB-1. Benchmark:
get UNITS SOLD, DOLLAR SALES, AVERAGE PRICE
by PRODUCT=“Truck Tires”
by CUSTOMER=“GM”
by CHANNEL=attribute (“level”, “TOP”)
by TIME=“1995”
The same query translated into two different pattern forms could be:
    Bare array form: Truck Tires, GM, Top, 1995    Concept graph form: Product: Truck Tires, Customer: GM, Time: 1995These are both equivalent to the query above. But the actual query from the benchmark would have looked more like:
get UNITS SOLD, DOLLAR SALES, AVERAGE PRICE
by PRODUCT=ChildrenOf (“Truck Tires”)
by CUSTOMER=“GM”
by CHANNEL=attribute (“level”, “TOP”)
by TIME=ChildrenOf (“1995)
So instead of a single line of output data for units, dollars and average price, the query calls for M×N lines of output, where M is the number of product types in the hierarchy level directly below “Truck Tires”, and N is the number of time types directly below “1995”. For example, one line might read:
ATX 75R15 GM, Top, 95Q2, 2116, $53196.25, $25.14
This output line would represent the aggregation of many data points which might vary in each of the four dimensions, yet satisfy the description. For example, consider:
A: ATX 75R15, GM Plant#12, WarehouseA, 9505
B: ATX 75R15, GM Plant#18, WarehouseA, 9506
These patterns are both children of “ATX 75R15, GM, Top, 95Q2” yet A is incomparable to B. This is what we mean by a partial-order: that just because A is not strictly-less-than B and B is not strictly-less-than A, we cannot infer that A=B. (In a full-ordering we could make such an inference.) Formally, given a relation for preceeds, or strictly-less-than, a set is a partial-order if it satisfies three properties: transitivity, asymmetry and irreflexivity.
In a typical OLAP schema, each dimension has its own type hierarchy, usually in the form of a tree. A tree is a form of directed graph wherein no node can have more than one parent; that is, it always branches out, never joining back. The present invention is based on a data structure more suited to partial-orders: the directed-acyclic graph. If the root of each dimension's type hierarchy is called Top, then the root of our OLAP is the Top pattern, which consists of the Top value for each dimension. The data structure will consist of nodes each having a pattern part and a measures part, starting with the Top node and proceeding down through summary nodes and eventually terminating in leaf nodes. Each node is contained in the nodes above it, traceable through the graph arcs or edges, in the sense that its parents' patterns are less specific than its pattern and its parents' measures encompass its measures. Thus our structure for aggregating N-dimensional data is organized primarily on the basis of N-dimensional patterns.
Each node in the partial-order database will be associated with one of these patterns and linked to those nodes (patterns) to which it most directly relates. From the point of view of one of these nodes, the direct link(s) upwards connect to its parent(s), also called “immediate predecessors”, while the direct link(s) downwards connect to its children, also called “immediate successors.”The result of searching with a query pattern will be to locate the position of that pattern within the database. If the pattern is exactly matched to an existing node in the partial-order, that node can easily provide the answer. However, should the query match no existing node, then the answer will derive from the successors to the query pattern. These nodes and links together form a graph to fully embody all stored fact data, all the summary/index nodes, and the relationship between them. This enables a single topologically-ordered search, following these links, to locate the answer to any query relying only upon said graph structure, the pattern-comparison test, the type hierarchy, and simple arithmetic.
The basic method of partial-order search has been known for some time. Robert Levinson, in his PhD Thesis, Technical Report AI-85-05, University of Texas at Austin, 1985, presents these methods as applied to the problem of classification. However, prior art methods are slow and must make many comparison tests to navigate through the nodes at the top of the partial-order. What is needed is some means to accelerate search in the average case.
Objects and Advantages
Accordingly, several objects and advantages of the present invention are:    1. to provide a method, in support of analysis of multi-dimensional data, that is efficient in time and in the amount of data storage required;    2. to scale well: to be efficient across a broad variation in the number of dimensions, number of categorical types, and number of data points;    3. to efficiently support the “children-of” operator required by the OLAP Council's APB-1 benchmark;    4. to be able to accommodate a form of fuzzy pattern matching;    5. to be able, with some constraints, to operate correctly and efficiently even when the type hierarchy contains instances of multiple-inheritance;    6. to be adaptable in that it can either work in conjunction with an existing data warehouse system, or it can work in conjunction with an existing analysis tool, or it can be used as the basis for building a new system or analysis tool.
In general, the object of this invention is not merely to perform better than prior art multi-dimensional database systems for the typical application of today, but to perform well while providing the ability to expand into new areas of symbolic data analysis. Additional objects and advantages will be set forth in the description which follows, and in part will be obvious from the description and with the accompanying drawings, or may be learned from the practice of this invention.