1. Field of the Invention
The present invention relates to multidimensional data, and more particularly, to a method and system for providing a function for use in finding exceptions in the multidimensional data.
2. Description of the Prior Art
On-Line Analytical Processing (OLAP) is a computing technique for summarizing, consolidating, viewing, applying formulae to, and synthesizing data according to multiple dimensions. OLAP software enables users, such as analysts, managers and executives, to gain insight into performance of an enterprise through rapid access to a wide variety of data views that are organized to reflect the multidimensional nature of the enterprise performance data. An increasingly popular data model for OLAP applications is the multidimensional database (MDDB), which is also known as the data cube. OLAP data cubes are predominantly used for interactive exploration of performance data for finding regions of anomalies in the data, which are also referred to as exceptions or deviations. Problem areas and/or new opportunities are often identified when an anomaly is located.
An exception is defined by first considering a two-dimensional data cube having p values along a first dimension A and q values along a second dimension B. The element or quantity corresponding to the ith value of dimension A and jth value of dimension B is denoted as yij. To estimate the degree of surprise yij holds in this data cube, an expected value ŷij of yij is calculated as a function f of three terms: (1) a term xcexc that denotes a trend that is common to all y values of the cube, (2) a term xcex1i that denotes special trends along the ith row with respect to the rest of the cube, and (3) a term xcex2j that denotes special trends along the jth column with respect to the rest of the cube. The residual difference rij between the expected value ŷij=f(xcexc, xcex1i, xcex2j), and the actual value yij represents the degree of surprise of element yij based on its position in the cube.
When the data cube has three dimensions, for example, with dimension C being the third dimension, the expected value ŷijk is calculated by taking into account not only the kth value of the third dimension, but also the three values corresponding to the pairs (i,j) in the AB plane, (i,k) in the AC plane and (j,k) in the BC plane. The expected value ŷijk is then expressed as a function of seven terms as:
ŷijk=f(xcexc, xcex1i, xcex2j, xcex3k,(xcex1xcex2)ij,(xcex2xcex3)ik),(xcex3xcex2)kj),xe2x80x83xe2x80x83(1) 
where (xcex1xcex2)ij denotes the contribution of the ijth value in the AB plane, (xcex1xcex3)ik denotes the contribution of jkth value in the AC plane, and (xcex3xcex2)kj) denotes the contribution of the kjth value in the BC plane. In general, for any k dimensional cube, the y value can be expressed as the sum of the coefficients corresponding to each of the 2kxe2x88x921 levels of aggregations or group-bys of the cube. To illustrate, a 3-dimensional cube will be considered.
The function f( ) can take several forms or models. Two particularly useful forms are an additive form, where function f( ) is a simple addition of all its arguments, and a multiplicative form, where function f( ) is a product of all its arguments. The multiplicative form can be transformed to the additive form by performing a logarithm on the original data values. Thus, the final form of Eq. (1) is,
yijk=ŷijk+rijk=xcexc+xcex1i+xcex2j+xcex3k+(xcex1xcex2)ij+(xcex1xcex3)ik+(xcex3xcex2)jk,xe2x80x83xe2x80x83(2) 
where rijk is the residual difference between the expected value ŷij=f(xcexc, xcex1i, xcex2j) and the actual value yij. The relative importance of an exception is based on the relative value of its residual, that is, the higher the value of the residual, the higher the importance of the exception.
For a multiplicative model, the yijk values denote the log of the original y-values of the cube. The choice of the best form of the function depends on the particular class of data, and is preferably selected by a user having the understanding and experience with the data at hand.
There are several ways for deriving values of the coefficients of Eq. (2). One approach is by a mean-based method where the coefficients are estimated as follows:
xcexc=y. . . =overall mean or averagexe2x80x83xe2x80x83(3) 
xcex1ixe2x88x92yi . . . xe2x88x92xcexc,xe2x80x83xe2x80x83(4) 
where yi . . . is the mean over all numbers with the ith value of A. Thus, for a two-way table, xcex1i denotes the magnitude of the difference of the average of the numbers along the ith row from the overall average xcexc.
xcex2j=y.j.xe2x88x92xcexc,xe2x80x83xe2x80x83(5) 
where y.j. is the mean over all numbers with the jth value of B.
xcex3k=y. . . kxe2x88x92xcexc,xe2x80x83xe2x80x83(6) 
where y. . . k is the mean over all numbers with the kth value of C. Lastly,
(xcex1xcex2)ij=yijxe2x88x92xcex1ixe2x88x92xcex2jxe2x88x92xcexcxe2x80x83xe2x80x83(7) 
The remaining terms are defined analogously.
In general, the coefficient corresponding to any group-by G is recursively calculated by subtracting all coefficients from group-bys that are at a smaller level of detail than group G from the average y value at group-by G.
The mean-based approach for calculating the coefficients is not particularly robust in the presence of extremely large outliers. Consequently, a number of well-known alternative approaches for handling large outliers can be used, such as the median polish method and the square combining method, disclosed by D. Hoaglin et al., Exploring data tables, trends and shapes, Wiley series in probability, 1988, and incorporated by reference herein. These two alternative approaches are based on using a xe2x80x9cmedianxe2x80x9d instead of xe2x80x9cmeanxe2x80x9d for calculating the coefficients. Nevertheless, these alternative approaches have an associated high computational cost. Consequently, the mean-based approach is preferred for most OLAP data sets because significantly large outliers are uncommon in most data sets.
The technique for determining residual and coefficients can be extended to handle hierarchies along one or more dimensions of a data cube. The basic idea is to define the expected value of a data cube element, not only based on its row and column position, but also on its hierarchical parents. For instance, consider values yij in a data cube consisting of two dimensions A and B, where dimension A has two levels of hierarchies: A1xe2x86x92A2xe2x86x92ALL. To calculate an expected value ŷij at the A1B level, the row coefficient xcex1i at level A1, the column coefficient xcex2j at level B and overall coefficient xcexc at level ALL, two new terms corresponding to the two new aggregations A2 and A2B along the hierarchy on A are used. Equation (2) thus becomes:
ŷij=xcexc+xcex1i+xcex2j+xcex1xe2x80x2ixe2x80x2+(xcex1xe2x80x2xcex2)ixe2x80x2jxe2x80x83xe2x80x83(8) 
where ixe2x80x2 denotes the parent of i at hierarchy level A2, xcex1xe2x80x2ixe2x80x2 denotes the contribution of the ith value at level A2, and (xcex1xe2x80x2xcex2)ixe2x80x2j denotes the contribution due to the ijth value at level A2B.
The general formula for handling hierarchies is to express a y value in a cube in terms of coefficients obtained from all higher level aggregations of the cube. For example, for the y-values at A1B in Eq. (8), coefficients from the five higher level aggregates are used, that is, A1, A2, B, A2B, and ALL. The same recursive rule of the previous subsection is followed for estimating the coefficients, where xcexc is first estimated as the overall average, and then for terms corresponding to each group-by G. The average at group-by G is computed and then the coefficients from each child of G are subtracted from the computed average.
xe2x80x9cANalysis Of VAriancexe2x80x9d(ANOVA) is a well-known statistical technique described by D. G. Montgomery, in xe2x80x98Design and Analysis of experimentsxe2x80x99, Chapter 13, John Wiley and Sons, Third edition, 1991. Current models are adapted versions of the general ANOVA model.
Using the standard ANOVA based model using a linear function, and adapting it by extending Eq. (1) for n-dimensional data and for function xe2x80x98fxe2x80x99 as log-linear, for a value yi1,i2, . . . ,in in a cube C, e.g., a multidimensional data model, at position ir of the rth dimension dr (1xe2x89xa6rxe2x89xa6n), the anticipated value ŷi1,i2, . . . in as a function f of contributions from various higher group-bys as:                                           y            ^                                i1            ,            i2            ,                          …              ⁢                              xe2x80x83                            ⁢              i              ⁢                              xe2x80x83                            ⁢              n                                      =                  xe2x80x83                ⁢                  f          ⁡                      (                                                            γ                  G                                ⁡                                  (                                                            i                      r                                        ❘                                          d                                              r                        ∈                        G                                                                              )                                            ❘                              G                ⋐                                  {                                                            d                      1                                        ,                                          d                      2                                        ,                    …                    ⁢                                          xe2x80x83                                        ,                                          d                      n                                                        }                                                      )                                                                        y            ^                                i1            ,            i2            ,                          …              ⁢                              xe2x80x83                            ⁢              i              ⁢                              xe2x80x83                            ⁢              n                                      =                  xe2x80x83                ⁢                              ∏                          G              ⋐                              {                                                      d                    1                                    ,                                      d                    2                                    ,                  …                  ⁢                                      xe2x80x83                                    ,                                      d                    n                                                  }                                              ⁢                                    γ              G                        ⁡                          (                                                i                  r                                ❘                                  d                                      r                    ∈                    G                                                              )                                          
The xcex3 terms as the coefficients of the model equation are:             y      ^              i1      ,      i2      ,              …        ⁢                  xe2x80x83                ⁢        i        ⁢                  xe2x80x83                ⁢        n              =            ∑              G        ⋐                  {                                    d              1                        ,                          d              2                        ,            …            ⁢                          xe2x80x83                        ,                          d              n                                }                      ⁢                  Φ        G            ⁡              (                              i            r                    ❘                      d                          r              ∈              G                                      )                                                      Φ            0                    =                      xe2x80x83                    ⁢                                    Φ                              +                                  ,                                      +                                          ,                      …                      ,                      +                                                                                            =                          Overall mean  of  all  cells  of  a  certain  aggregation                                                                                xe2x80x83                    ⁢                                                    level where                            ⁢                              xe2x80x83                            ⁢                              Φ                                  i1                  ,                  i2                  ,                                      …                    ⁢                                          xe2x80x83                                        ⁢                    i                    ⁢                                          xe2x80x83                                        ⁢                    n                                                                        =                          log              ⁡                              (                                  y                                      i1                    ,                    i2                    ,                                          …                      ⁢                                              xe2x80x83                                            ⁢                      i                      ⁢                                              xe2x80x83                                            ⁢                      n                                                                      )                                                                    Φ      ir      Ar        =                  Φ                  i1          ,                      +                          ,                              …                +                            ,              ir              ,              +                                          -                        Φ          0                ⁢                  xe2x80x83                ⁢                  and  so  on …                    
Then the residual of the model is defined as:
ri1,i2, . . . in=|yi1,i2, . . . inxe2x88x92ŷi1,i2, . . . in|
The greater the value of ri1,i2, . . . in, in the more likely that the tuple (i.e., the cell in the multidimensional data for which an expected value is being calculated) is an exception in the data model. However, the residual value must be standardized for meaningful comparison. A xe2x80x9cstandardized residual valuexe2x80x9d is calculated as follows:
si1,i2, . . . in=|yi1,i2, . . . inxe2x88x92ŷi1,i2, . . . in|/"sgr"i1,i2, . . . in 
The step of standardization is performed because the magnitude of the residual might appear to be large if the values in the multidimensional data are large. Considering the magnitude of the residual alone can be misleading because the residual should be evaluated in relative terms of the neighborhood of the cell in order for it to be termed as an exception. This is achieved by a normalization process. The value of "sgr"i1,i2, . . . in is the normalizer. One way to calculate it is as associated standard deviation for that tuple yi1,i2, . . . in. The standardized residual can then be used to rank the exceptions that are found. The higher the value of the standardized residual, the higher is the rank. The first exception in a decreasingly sorted array of exceptions will have the highest value of the standardized residual.
In classical ANOVA techniques, the standard deviation for all the cells is assumed to be identical. The variance, i.e., square of standard deviation, is estimated as the sum of squares of the residuals divided by the number of entries. Unfortunately, this technique provides poor fits on OLAP data. In the analysis of contingency tables, where cell entries represent counts, the Poisson distribution is assumed. This assumption implies that the variance is equal to the mean. When the entries are not counts, e.g., large dollar values, this typically leads to an underestimate of the variance.
The variance is modeled as a power xcfx81 of the mean value ŷil. . . in as:
"sgr"i1i2. . . in2=(ŷi1i2. . . in)xcfx81.xe2x80x83xe2x80x83(9) 
To calculate xcfx81 one can use the maximum likelihood principle on data assumed to be distributed normally with the mean value ŷi1i2. . . in. The maximum likelihood principle is described by Cooley et. al., Multivariate Data Analysis, Robert E. Krieger Publishers, 1986, which is herein incorporated by reference. One can derive that the estimated value of xcfx81 that must satisfy:                                           ∑                                                                                                      (                                                                        y                                                                                    i                              1                                                        ⁢                                                          i                              2                                                        ⁢                            …                            ⁢                                                          xe2x80x83                                                        ⁢                                                          i                              n                                                                                                      -                                                                              y                            ^                                                                                                              i                              1                                                        ⁢                                                          i                              2                                                        ⁢                            …                            ⁢                                                          xe2x80x83                                                        ⁢                                                          i                              n                                                                                                                          )                                        2                                                                              (                                                                        y                          ^                                                                                                      i                            1                                                    ⁢                                                      i                            2                                                    ⁢                          …                          ⁢                                                      xe2x80x83                                                    ⁢                                                      i                            n                                                                                              )                                        ρ                                                  ·                log                            ⁢                                                y                  ^                                                                      i                    1                                    ⁢                                      i                    2                                    ⁢                  …                  ⁢                                      xe2x80x83                                    ⁢                                      i                    n                                                                                -                      ∑                          log              ⁢                                                y                  ^                                                                      i                    1                                    ⁢                                      i                    2                                    ⁢                  …                  ⁢                                      xe2x80x83                                    ⁢                                      i                    n                                                                                      =        0                            (        10        )            
The estimation of variance requires us to solve Eq. (10). A technique for solving Eq. (10) to find xcfx81 is discussed below. Since a closed form solution of this equation is not available, it is solved iteratively by evaluating Eq. (10) for a fixed set of values of xcfx81, for example 10 values equally spaced between 0 and 3. The final value is a point between the two points where the sign of the left hand side of Eq. (10) changes. If the value of xcfx81 happens to fall outside the chosen range, e.g., [0,3], then a re-scan of the data with an expanded range, e.g., [0,5], is performed.
Exceptions in a multidimensional data model, such as an OLAP cube, are deviations from underlying trends in the data model. The current xe2x80x9cdeviation detectionxe2x80x9d module of the DB2 OLAP Server uses a log-linear form of function xe2x80x98fxe2x80x99 described in Eq.(1). However, the log-linear function is not effective when the multidimensional data, e.g. an OLAP cube, includes negative values and/or zeros. That is, the log-linear function is effective only when the values in the multidimensional data are all positive. Because of this shortcoming, the log-linear function is inappropriate for many practical applications, such as commercial and financial multidimensional data, that involve zero or negative numbers as values in the multidimensional data, e.g., percentage of change, loss values, etc. There is also a possibility of resolution degradation if the numbers in the cube are nearly equal to zero.
It is an object of the present invention to provide an improved method for exploring for exceptions in multidimensional data that may include negative values, zero, and positive values.
It is another object of the present invention to provide such a method that by determining a function that is defined for all values of the multidimensional data.
It is yet another object of the present invention to provide such a method that efficiently recognizes exceptions by calculating a standarized residual value for an expected value of data.
These and other objects of the present invention are achieved by a method for providing a function for use in detecting a presence of an exception in multidimensional data, comprising the steps of (a) partitioning the multidimensional data into at least a first region and a second region; (b) assigning a first region-specific function to the first region and a second region-specific function to the second region; and (c) determining a combined function from the first region-specific function and the second region-specific function. The combined function is used to calculate an expected value of the multidimensional data for distinguishing the presence of an exception.
Another embodiment of the present invention is a method for providing a function for use in detecting a presence of an exception in multidimensional data. The method comprises the steps of (a) partitioning the multidimensional data into at least a first region and a second region, (b) assigning a first region-specific function to the first region and a second region-specific function to the second region, wherein each of the first and second region-specific functions satisfy a set of predetermined conditions in its assigned region, (c) determining a combined function from the first region-specific function and the second region-specific function, and (d) adjusting a limit of the first region and a limit of the second region so that the combined function satisfies the set of predetermined conditions for all values of the multidimensional data. The combined function is used to calculate an expected value of the multidimensional data for distinguishing the presence of an exception.