1. Field of the Invention
The present invention relates to the multidimensional data modeling, and more particularly, to detecting exceptions in multidimensional data.
2. Description of the Prior Art
On-Line Analytical Processing (OLAP) is a computing technique for summarizing, consolidating, viewing, applying formulae to, and synthesizing data according to multiple dimensions. OLAP software enables users, such as analysts, managers and executives, to gain insight into performance of an enterprise through rapid access to a wide variety of data views that are organized to reflect the multidimensional nature of the enterprise performance data. An increasingly popular data model for OLAP applications is the multidimensional database (MDDB), which is also known as the data cube. OLAP data cubes are predominantly used for interactive exploration of performance data for finding regions of anomalies in the data, which are also referred to as exceptions or deviations. Problem areas and/or new opportunities are often identified when an anomaly is located.
An exception is defined by first considering a two-dimensional data cube having p values along a first dimension A and q values along a second dimension B. The element or quantity corresponding to the ith value of dimension A and jth value of dimension B is denoted as yij. To estimate the degree of surprise yij holds in this data cube, an expected value ŷij of yij. is calculated as a function f of three terms: (1) a term xcexc that denotes a trend that is common to all y values of the cube, (2) a term xcex1i that denotes special trends along the ith row with respect to the rest of the cube, and (3) a term xcex2j that denotes special trends along the jth column with respect to the rest of the cube. The residual difference rij between the expected value ŷij=f(xcexc,xcex1i,xcex2j) and the actual value yij represents the degree of surprise of element yij based on its position in the cube.
When the data cube has three dimensions, for example, with dimension C being the third dimension, the expected value ŷijk is calculated by taking into account not only the kth value of the third dimension, but also the three values corresponding to the pairs (i,j) in the AB plane, (i,k) in the AC plane and (j,k) in the BC plane. The expected value ŷijk is then expressed as a function of seven terms as:
ŷijk=f(xcexc,xcex1i,xcex2j,xcex3k,(xcex1xcex2)ij,(xcex1xcex3)ik,(xcex3xcex2)kj),xe2x80x83xe2x80x83(1)
where (xcex1xcex2)ij denotes the contribution of the ijth value in the AB plane, (xcex1xcex3)ik denotes the contribution of jkth value in the AC plane, and (xcex3xcex2)kj) denotes the contribution of the kjth value in the BC plane. In general, for any k dimensional cube, the y value can be expressed as the sum of the coefficients corresponding to each of the 2k-1 levels of aggregations or group-bys of the cube. To illustrate, a 3-dimensional cube will be considered.
The function f( ) can take several forms or models. Two particularly useful forms are an additive form, where function f( ) is a simple addition of all its arguments, and a multiplicative form, where function f( ) is a product of all its arguments. The multiplicative form can be transformed to the additive form by performing a logarithm on the original data values. Thus, the final form of Eq. (1) is,
yijk=ŷijk+rijk=xcexc+xcex1i+xcex2j+xcex3k+(xcex1xcex2)ij+(xcex1xcex3)ik+(xcex3xcex2)kj,xe2x80x83xe2x80x83(2)
where rijk is the residual difference between the expected value ŷij=f(xcexc,xcex1i,xcex2j) and the actual value yij. The relative importance of an exception is based on the relative value of its residual, that is, the higher the value of the residual, the higher the importance of the exception.
For a multiplicative model, the yijk values denote the log of the original y-values of the cube. The choice of the best form of the function depends on the particular class of data, and is preferably selected by a user having the understanding and experience with the data at hand.
There are several ways for deriving values of the coefficients of Eq. (2). One approach is by a mean-based method where the coefficients are estimated as follows:
xcexc=y. . . =overall mean or averagexe2x80x83xe2x80x83(3)
xcex1i=yj . . . xe2x88x92xcexc,xe2x80x83xe2x80x83(4)
where yi . . . is the mean over all numbers with the ith value of A. Thus, for a two-way table, xcex1i denotes the magnitude of the difference of the average of the numbers along the ith row from the overall average xcexc.
xcex2j=y.j.xe2x88x92xcexc,xe2x80x83xe2x80x83(5)
where y.j. is the mean over all numbers with the jth value of B.
xcex3k=y. . . kxe2x88x92xcexc,xe2x80x83xe2x80x83(6)
where y. . . k is the mean over all numbers with the kth value of C. Lastly,
(xcex1xcex2)ij=yijxe2x88x92xcex1ixe2x88x92xcex2jxe2x88x92xcexcxe2x80x83xe2x80x83(7)
The remaining terms are defined analogously.
In general, the coefficient corresponding to any group-by G is recursively calculated by subtracting all coefficients from group-bys that are at a smaller level of detail than group G from the average y value at group-by G.
The mean-based approach for calculating the coefficients is not particularly robust in the presence of extremely large outliers. Consequently, a number of well-known alternative approaches for handling large outliers can be used, such as the median polish method and the square combining method, disclosed by D. Hoaglin et al., Exploring data tables, trends and shapes, Wiley series in probability, 1988, and incorporated by reference herein. These two alternative approaches are based on using a xe2x80x9cmedianxe2x80x9d instead of xe2x80x9cmeanxe2x80x9d for calculating the coefficients. Nevertheless, these alternative approaches have an associated high computational cost. Consequently, the mean-based approach is preferred for most OLAP data sets because significantly large outliers are uncommon in most data sets.
The method for determining residual and coefficients can be extended to handle hierarchies along one or more dimensions of a data cube. The basic idea is to define the expected value of a data cube element, not only based on its row and column position, but also on its hierarchical parents. For instance, consider values yij in a data cube consisting of two dimensions A and B, where dimension A has two levels of hierarchies: A1xe2x86x92A2xe2x86x92ALL. To calculate an expected value ŷij at the A1B level, the row coefficient xcex1i at level A1, the column coefficient xcex2j at level B and overall coefficient xcexc at level ALL, two new terms corresponding to the two new aggregations A2 and A2B along the hierarchy on A are used. Equation (2) thus becomes:
ŷij=xcexc+xcex1i+xcex2jxcex1xe2x80x2i+(xcex1xe2x80x2xcex2)ijxe2x80x83xe2x80x83(8)
where ixe2x80x2 denotes the parent of i at hierarchy level A2, xcex1xe2x80x2i denotes the contribution of the ith value at level A2, and (xcex1xe2x80x2xcex2)ij denotes the contribution due to the ijth value at level A2B.
The general formula for handling hierarchies is to express a y value in a cube in terms of coefficients obtained from all higher level aggregations of the cube. For example, for the y-values at A1B in Eq. (8), coefficients from the five higher level aggregates are used, that is, A1, A2, B, A2B, and ALL. The same recursive rule of the previous subsection is followed for estimating the coefficients, where xcexc is first estimated as the overall average, and then for terms corresponding to each group-by G. The average at group-by G is computed and then the coefficients from each child of G are subtracted from the computed average.
The current xe2x80x9cdeviation detectionxe2x80x9d module of the DB2 OLAP Server uses a log-linear form of function xe2x80x98fxe2x80x99 described in Eq.(1). A multiplicative or log-linear function is well justified for many of the OLAP cubes that have distributive aggregation functions, such as sum and count, associated with the dimensions of the cube. However, a log-linear function may not be appropriate in cases where the OLAP cube is built with dimensions that do not exhibit distributive aggregation. In such a case, a selection of a log-linear function could be inappropriate or insufficient and can lead to only partially useful results. If the aggregation function is non-distributive, i.e., if it can be expressed as an algebraic function of some finite distributive arguments, then an additive or linear function may be an appropriate choice.
It is an object of the present invention to provide an improved method for selecting a function and for exploring exceptions in multidimensional data.
It is another object of the present invention to provide such a method where the function is a member of a set of functions including a log-linear function and a linear function.
These and other objects of the present invention are achieved by a method for selecting a function for use in detecting a presence of an exception in multidimensional data. The method comprises the steps of (a) quantifying a characteristic of the multidimensional data, (b) determining a weighting factor of a function based on the quantified characteristic, and (c) selecting the function from a plurality of functions based on the weighting factor. The function is used for distinguishing the presence of an exception.
Another embodiment of the present invention is method for selecting a function for use in detecting a presence of an exception in multidimensional data, comprising the steps of (a) quantifying a data distribution characteristic of the multidimensional data, (b) quantifying a distributive characteristic of an aggregation function of the multidimensional data, (c) determining a weighting factor of a function based on the quantified data distribution characteristic and the quantified distributive characteristic of the aggregation function, and (d) selecting the function from a plurality of functions based on the weighting factor. The function from the plurality of functions is used for distinguishing the presence of an exception.