1. Field of the Invention
The present invention is directed to the field of computer-based multidimensional data modeling. It is more particularly directed to interpreting, explaining, and manipulating exceptions in multidimensional data on a computer system.
2. Description of the Background Art
On-Line Analytical Processing (OLAP) is a computing technique for summarizing, consolidating, viewing, analyzing, applying formulae to, and synthesizing data according to multiple dimensions. OLAP software enables users, such as analysts, managers and executives, to gain insight into performance of an enterprise through rapid access to a wide variety of data xe2x80x9cviewsxe2x80x9d or xe2x80x9cdimensionsxe2x80x9d that are organized to reflect the multidimensional nature of the enterprise performance data. An increasingly popular data model for OLAP applications is the multidimensional database (MDDB), which is also known as the xe2x80x9cdata cube.xe2x80x9d OLAP data cubes are often used by a data analyst for interactive exploration of performance data for finding regions of anomalies in the data, which are also referred to as xe2x80x9cexceptionsxe2x80x9d or xe2x80x9cdeviations.xe2x80x9d Problem areas and new opportunities associated with the enterprise are often identified when an anomaly in the enterprise data is located.
An exception represents the degree of surprise associated with data that is included in an OLAP data cube. An exception may be defined by means of an example. Given a two-dimensional data cube having xe2x80x9cpxe2x80x9d values along a first dimension xe2x80x9cA,xe2x80x9d and xe2x80x9cqxe2x80x9d values along a second dimension xe2x80x9cB,xe2x80x9d the element or quantity corresponding to the ith value of dimension A and jth value of dimension B is denoted as, xe2x80x9cyij.xe2x80x9d To estimate the exception, yij, in this data cube, an expected value, xe2x80x9cŷij,xe2x80x9d of yij is calculated as a function, xe2x80x9cf(),xe2x80x9d of three terms: (1) a term xe2x80x9cxcexcxe2x80x9d that denotes a trend that is common to all y values of the cube, (2) a term xe2x80x9cxcex1ixe2x80x9d that denotes special trends along the ith row with respect to the rest of the cube, and (3) a term xe2x80x9cxcex2jxe2x80x9d that denotes special trends along the jth column with respect to the rest of the cube. The residual difference xe2x80x9crijxe2x80x9d between the expected value ŷij=f(xcexc,xcex1i,xcex2j) and the actual value yij represents the relative importance of the exception, yij, based on its position in the cube.
By means of further explanation, when a data cube has three dimensions, for example, with dimension, xe2x80x9cC,xe2x80x9d being the third dimension, the expected value ŷijk is calculated by taking into account not only the kth value of the third dimension, but also the three values corresponding to the pairs (i,j) in the AB plane, (i,k) in the AC plane and (j,k) in the BC plane. The expected value ŷijk is then expressed as a function of seven terms as:
ŷijk=f(xcexc,xcex1i,xcex2j,xcex3k,(xcex1xcex2)ij,(xcex1xcex3)ik,(xcex3xcex2)kj),xe2x80x83xe2x80x83(1)
where (xcex1xcex2)ij denotes the contribution of the ijth value in the AB plane, (xcex1xcex3)ik denotes the contribution of jkth value in the AC plane, and (xcex3xcex2)kj denotes the contribution of the kjth value in the BC plane. In general, for any k-dimensional cube, the y value can be expressed as the sum of the coefficients corresponding to each of the 2kxe2x88x921 levels of aggregations or group-bys of the cube. The xe2x80x9ccoefficientxe2x80x9d represents a component that provides information used in making predictions about the expected value of ŷ and a xe2x80x9cgroup-byxe2x80x9d represents different combinations of the dimensions associated with the multidimensional cube. In the present example, group-bys include xe2x80x9cABxe2x80x9d and xe2x80x9cABC.xe2x80x9d Therefore, a coefficient is a group-by component that contributes to predictability of a cell in a multidimensional cube. The coefficient model may be used to make predictions about the expected value of an exception.
By means of example, a three-dimensional cube will be considered. The function, f() can take several forms or models, such as an additive form, where function f() is a simple addition of all its arguments, and a multiplicative form, where function f() is a product of its arguments. It will be appreciated by those skilled in the art that the multiplicative form can be transformed to the additive form by performing a logarithm on the original data values. For a multiplicative model, the yijk values denote the log of the original y-values of the cube. The log is used to remove bias associated with the distribution. That is, taking the log will tend to normalize the distribution. The choice of the best form of the function depends on the particular class of data, and is preferably selected by a user having understanding and experience with the data at hand. For example, the distribution of the data is one of the factors that may be used to determine the best form of the function.
The final form of Equation One as shown in Equation Two is,
yijk=ŷijk+rijk=xcexc+xcex1i+xcex2j+xcex3k+(xcex1xcex2)ij+(xcex1xcex3)ik+(xcex3xcex2)kj,xe2x80x83xe2x80x83(2)
where rijk is the residual difference between the expected value ŷijk and the actual value yijk. The relative importance of an exception is based on the value of its residual. That is, the higher the value of the residual, the higher the importance of the exception.
There are several ways of deriving values of the coefficients of Equation Two. One way of deriving coefficients is shown in U.S. Pat. No. 6,094,651. The approach is a mean-based solution where the coefficients are estimated by taking the logs of all the relevant numbers and then the mean of the previous result. Taking the log will distribute the numbers so that the effect of large differences in the values of the cells is reduced. When the mean is derived a trend may be observed. In general, the coefficient corresponding to any group-by, xe2x80x9cG,xe2x80x9d is recursively determined, according to the mean-based solution, by subtracting the coefficients from group-bys that are at a smaller level of detail than, G, from the average y value at G.
The mean-based approach for calculating the coefficients is not particularly robust in the presence of extremely large numbers that are outliers. An xe2x80x9coutlierxe2x80x9d represents data that is related to a coefficient that deviates from the trend of the data by a significant amount. There are statistical methods for deciding when to keep or discard these suspected outlier data points. A number of well-known alternative approaches for handling large outliers can be used, such as the Median Polish Method and the Square Combining Method, disclosed by D. Hoaglin et al., Exploring Data Tables, Trends and Shapes, Wiley Series in Probability, 1988, and incorporated by reference herein. These two alternative approaches are based on using a xe2x80x9cmedianxe2x80x9d instead of xe2x80x9cmeanxe2x80x9d for calculating the coefficients. Nevertheless, these alternative approaches have an associated high computational cost. Consequently, the mean-based approach is preferred for most OLAP data sets because significantly large outliers are uncommon in most data sets.
The method for determining a residual, xe2x80x9crijk,xe2x80x9d may be determined from Equation Two as shown in Equation Three.
rijk=|yijkxe2x88x92ŷijk|xe2x80x83xe2x80x83(3)
The greater the value of rijk, the more likely that the cell in the multidimensional data for which an expected value is being calculated is an exception in the data model. However, the residual value may need to be standardized for a meaningful comparison of multidimensional data. A xe2x80x9cstandardized residual valuexe2x80x9d is calculated as shown in Equation Four.
sr=|yijkxe2x88x92ŷijk|/"sgr"ijkxe2x80x83xe2x80x83(4)
The step of standardization is performed because the magnitude of the residual may appear to be significantly larger than the other values considered. Considering the magnitude of the residual alone can be misleading because the residual should be evaluated in relation to the data in the neighboring cells. Normalization of the data is achieved by applying a standard deviation ("sgr"ijk) to the process. It will be appreciated by those skilled in the art that there are many methods of calculating a standard deviation associated with data in the multidimensional cube. The standardized residual can then be used to rank the exceptions that are found. The higher the value of the standardized residual, the higher is the rank. The first exception in a decreasingly sorted array of exceptions will have the highest value of the standardized residual. A residual approach however is limited since the user views actual data and not the residual details, therefore the interpretation and explanation of an exception is not always obvious to the user.
The process of determining and analyzing a multidimensional cube exception is quantitative, while the analyst would like to use a qualitative approach. The information that is viewed in a quantitative approach, such as the coefficient approach, can be overwhelming. This happens because the number of possible two-dimensional or three-dimensional views that spawn three or two dimensions from the OLAP multidimensional sub-cube increases steeply. The number of possible three-dimensional views for N dimensions are (N)*(Nxe2x88x921)*(Nxe2x88x922)/6 and two-dimensional views are (N)*(Nxe2x88x921)/2. For example, if we have a cube with 7 dimensions, then the number of views for the end-user to analyze, xe2x80x9cC,xe2x80x9d are C=(7*6)/2=21 for two-dimensional views, and C=(7*6*5)/6=35 for three-dimensional views. In the absence of any formal way to focus on a few dimensions, the analyst has to view all thirty-five or twenty-one views to identify the best view for an exception, which makes it difficult to interpret and explain an exception.
There is a solution described in U.S. Pat. No. 6,094,651 that addresses exceptions and uses the concept of maximal terms. In general, a coefficient approach is limited since large coefficients are typically associated with smaller dimensional terms and the explanations are often too broad, spanning more data than necessary. This method looks at two-dimensional or three-dimensional views. However, the views are used for analysis rather than to examine individual dimensions. This limits the amount of data that is used to determine trends in the data.
Missing values are not considered in current exception-related solutions. For example, in a dimension in which a portion of the cells have no value the coefficient approach or the residual approach may generate information that indicates a more significant exception than actually exists.
Another limitation in the current solutions is that a framework for assigning linguistic labels to an exception is not often provided. The information is presented in a quantitative manner instead of a qualitative manner. This limits data analysts from obtaining information about the value of the data that is presented.
From the foregoing it will be apparent that there is still a need to improve the interpretation, explanation, and manipulation of exceptions in multidimensional data on a computer system.
An embodiment of the present invention relates to systems, methods, and computer products that interpret, explain, and manipulate exceptions in multidimensional data on a computer system. The present invention is related to the field of computer-based multidimensional data modeling often used by data analysts. The present invention assists the data analyst by providing a simplified view of the multidimensional data that enables analysis of the important results of data exception exploration.
The preferred embodiment of the present invention operates with an exception-enhancing module and enables determination and analysis of an exception in a multidimensional cube by a qualitative approach. The overwhelming amount of information that may be viewed in a quantitative approach is reduced to the important information, according to the present invention. More particularly, the preferred embodiment of the present invention enables interpretation and explanation of selected exceptions in multidimensional data.
Further, the preferred embodiment of the present invention incorporates the effect of density of the data along each dimension. That is, in a dimension in which a portion of the cells have no value the preferred embodiment of the present invention re-evaluates the relative importance of an exception to ensure that the density of the data is considered. In the preferred embodiment of the present invention, the user may set a threshold value that indicates the acceptable data density in a data cube.
The preferred embodiment of the present invention also provides the framework necessary to assign linguistic meaning to relative exceptions associated with each dimension. This enables data analysts to obtain information about the value of the data that is presented.
An embodiment of the present invention is achieved by systems, methods, and computer products that interpret and explain exceptions that are selected from multidimensional data. The method comprises (a) associating at least one dimension with each selected exception, (b) qualifying individual contributions for each dimension related to the exception of the multidimensional data, (c) assigning a density threshold preset rule to each dimension; (d) determining a density correction factor of the contribution, (e) determining a density-corrected contribution based on the density correction factor and the density threshold preset rule; (f) normalizing the density-corrected contributions for the dimensions associated with the exception, and (g) sorting the dimensions based on the normalized, density-corrected contributions for selection of the best two-dimensional or three-dimensional view for the exception. Qualification of individual contributions includes considering the influence of generations associated with the dimension. That is, generations associated with a dimension typically represent specific information that is related to the top-level dimensional category.
Further, the sorted data may be visually represented, such as by a histogram or a pie chart. An alternate embodiment of the present invention generates a framework for assigning linguistic meaning to the contributions from the relevant dimensions.
Other aspects and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.