This invention relates generally to a method of reducing dimensionality of a set of attributes used to characterize a sparse data set and, more particularly, a method of reducing dimensionality of a set of attributes based on the calculated variance of the data values associated with each of the attributes.
Data mining identifies and extracts relevant data in a computer accessed database. In certain applications, a data set in a database may be in tabular or matrix format wherein the rows of the matrix represent individual observations and the columns of the matrix represent various attributes of the data. The cells of the matrix contain data values. Each observation or row of the matrix contains a data value for each of the attributes, that is, if a matrix data set has m observations (rows) and n attributes (columns), there will be m×n data values. In many applications, the number of non-trivial (that is, nonzero) values per observation is much smaller than n. An example of this phenomenon occurs when attributes represent products sold in a supermarket while observations represent individual customers' market baskets. Most customers purchase a very small fraction of all available products during a shopping trip, so the vast majority of entries in each row is zero. This condition is commonly referred to as the matrix being "sparse." When this condition of a sparse matrix does not hold, the matrix is commonly referred to as being "dense."
It is often advantageous to obtain a reduced set of k attributes to characterize data in the matrix where k<n. This will be referred to as reducing the dimensionality of the matrix data set. One technique that has been used to reduce the number of attributes characterizing a matrix data set is referred to as singular value decomposition (SVD). The SVD method generates a set of k equations (describing the new attributes) with each new attribute being a linear combination of the original n attributes. Disadvantages of the SVD method include:
1) computational complexity: the SVD method requires computation time, CT, on the order of CT=A*Q*k*log(n), where A is a constant, Q is the number of nonzero entries in a data matrix, k is the number of attributes in the reduced set of attributes and n is the number of attributes in the original set of attributes;
2) results in a dense matrix: because each new attribute is a linear combination of the original attributes, the SVD method results in a matrix of data values that is dense; and
3) resulting data is nonintuitive: the results of applying the SVD method do not have an intuitive interpretation since each of the resulting k attributes is a linear combination of the original attributes. Thus, while originally each attribute corresponded to, for example, a particular product, a new attribute might be something like "2.0*white bread−0.3*cheddar cheese+0.7*peanuts". It is generally very difficult to extract the "meaning" of such an attribute.
Values of an attribute may be continuous (e.g., age, height, or weight of a respondent) or discrete (e.g., sex, race, year). Discrete attributes, that is, attributes having data values that are discrete variables, have a finite number of data values. Certain discrete attributes have only two data values (0 and 1) associated with the attribute, e.g., sex (male or female), where male=0 and female=1. Such discrete attributes will be referred to as dichotomous discrete attributes.
While the SVD method has several disadvantages, its major advantage is that it is a very effective methodology with regard to maintaining the distance structure between observations. Essentially, if the attribute data values associated with each observation are viewed as an n dimensional vector, the distance between two observations may be calculated as follows:
Define:
Observation no. 1: let the first data row, R1=[d11, d12, d13, d14, d15, . . . , d1n] where d11 is the data value for observation no. 1 and attribute no. 1, d12 is the data value for observation no. 1 and attribute no. 2, . . . , and d1n is the data value for observation no. 1 and attribute no. n.
Observation no. 2: let the second data row, R2=[d21, d22, d23, d24, d25, . . . , d2n] where d21 is the data value for observation no. 2 and attribute no. 1, d22 is the data value for observation no. 2 and attribute no. 2, . . . , and d2n is the data value for observation no. 2 and attribute no. n.
Calculate the distance value (DIST 1-2) between the pair of observations nos. 1 and 2 as follows:
$$\mathrm{DIST}_{1\text{-}2} = \left[ (d_{11}-d_{21})^2 + (d_{12}-d_{22})^2 + (d_{13}-d_{23})^2 + (d_{14}-d_{24})^2 + \cdots + (d_{1n}-d_{2n})^2 \right]^{1/2}$$
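By way of illustration only, the distance calculation above may be sketched in Python as follows. The function name `euclidean_distance` and the two example rows are hypothetical and are not part of the described method; the rows are assumed to be plain sequences of n attribute values.

```python
import math

def euclidean_distance(row_a, row_b):
    """Distance between two observations viewed as n-dimensional vectors."""
    if len(row_a) != len(row_b):
        raise ValueError("observations must have the same number of attributes")
    # Sum the squared attribute-by-attribute differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row_a, row_b)))

# Two hypothetical observations over n = 5 dichotomous attributes.
r1 = [1, 0, 0, 1, 0]
r2 = [0, 0, 0, 1, 1]
dist_1_2 = euclidean_distance(r1, r2)  # the rows differ in two attributes
```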
Assuming similar distance values are determined for each of the m*(m-1)/2 pairs of the m observations, the SVD method has been found to be "robust" in maintaining the distance structure between all pairs of observations while reducing the dimension of the attributes characterizing the data set from n attributes to k attributes. The SVD method is "robust" in maintaining the distance structure between the data points in the following sense. Let the distortion of a data point (a row) be equal to the square of the difference between its original distance from the origin and its distance from the origin after dimensionality reduction. Then, among all possible dimensionality reductions to k dimensions from the n original dimensions, the SVD method minimizes the sum of the distortion over all points.
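The distortion measure defined above may be illustrated by the following minimal Python sketch. It assumes, for simplicity, that the dimensionality reduction consists of keeping a subset of the original columns; it does not perform an SVD. The names `norm`, `total_distortion`, and `kept_columns` are hypothetical.

```python
import math

def norm(row):
    """Distance of a data point (row) from the origin."""
    return math.sqrt(sum(v * v for v in row))

def total_distortion(data, kept_columns):
    """Sum over all rows of (original norm - reduced norm) squared."""
    total = 0.0
    for row in data:
        reduced = [row[j] for j in kept_columns]
        total += (norm(row) - norm(reduced)) ** 2
    return total

# Hypothetical data: m = 2 observations, n = 3 attributes.
data = [
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 2.0],
]
pair_count = len(data) * (len(data) - 1) // 2              # m*(m-1)/2 pairs of observations
distortion = total_distortion(data, kept_columns=[0, 1])   # reduce to k = 2 attributes
```

In this example the first row is unaffected by dropping the third attribute, so all of the distortion comes from the second row.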
The present invention provides a dimensionality reduction method useful in sparse data sets that is effective and efficient when implemented in software for use in connection with data mining of a database management system. The dimensionality reduction method for use in a sparse database substantially preserves the distance structure between observations. Sparse data matrices are predominant in most real world data mining applications.
The attribute reduction method of the present invention may be used for both continuous and discrete attribute data and is best suited for matrix data sets that are sparse, that is, data sets that have a high proportion of zero values. Many real world marketing-related databases have a large number of discrete attributes that have dichotomous or two state data values, most of which have zero values. For example, assume that a company sells a large variety of products and the company has established a matrix in its database to track customer purchases. The rows of the matrix represent customers of the company and the attributes of the matrix correspond to whether or not a customer has purchased each of the products sold by the company, with the cell in row i and column j having a value of 1 if customer i has purchased product j and a value of 0 if customer i has not purchased product j. In a matrix such as this, it could easily be the case that 90% or more of the cell data values or cell entries have a value of zero.
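A customer-by-product matrix of the kind just described may be sketched in Python as shown below. The data are invented for illustration, and storing only the nonzero cells as (row, column, value) triples is one possible representation assumed here, not one prescribed by the method.

```python
# Hypothetical purchase matrix: m = 4 customers (rows), n = 10 products (columns).
# A cell is 1 if customer i has purchased product j, and 0 otherwise.
purchases = [
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
m = len(purchases)
n = len(purchases[0])

# Keep only the nonzero cells as (row, column, value) triples.
nonzero_entries = [
    (i, j, value)
    for i, row in enumerate(purchases)
    for j, value in enumerate(row)
    if value != 0
]
q = len(nonzero_entries)                 # Q, the number of nonzero entries
zero_fraction = (m * n - q) / (m * n)    # proportion of cells that are zero
```

Here only 4 of the 40 cells are nonzero, so the matrix meets the 90% threshold mentioned above.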
It is an object of this invention to provide a method of efficiently reducing the number of attributes that characterize a set of data values in a data matrix D where the data matrix is sparse, that is, has a high proportion of zero values. The data matrix D is defined by an m×n matrix wherein the m rows of the matrix represent individual observations, e.g., a customer, an event or occurrence, a location, etc., and the n columns represent attributes of the observations. The cells of the matrix D contain specific data values for the associated observation and attribute. A reduced dimension matrix, Dnew, including m observations and k attributes (k<n) is desired. It is assumed that the magnitude of the value of k is set by a user or is arbitrarily set based on the number of attributes in the full scale matrix D.
One aspect of the present invention is a dimensionality reduction method of selecting k attributes from a set of n attributes to characterize data values in a data matrix D having m observations. The value of the data matrix entry in row i and column j is denoted as dij. The steps of the method include the following:
a) for each of the attributes Aj (j=1, . . . , n), calculate the variance of the attribute, where the variance of attribute Aj is calculated as follows: $$\mathrm{Var}(A_j) = \frac{1}{m}\sum_{i=1}^{m}\left(d_{ij}-\mathrm{Mean}(A_j)\right)^2, \quad \text{where } \mathrm{Mean}(A_j) = \frac{1}{m}\sum_{i=1}^{m} d_{ij};$$
b) select the k attributes having the greatest variance values; and
c) generate an m×k data matrix Dnew by selecting those data values corresponding to the m observations and the selected k attributes.
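Steps a) through c) may be sketched, purely for illustration, in Python as follows. The sketch assumes the matrix D is held as a dense list of rows. The function names, the rule for breaking ties between equal variances (lower column index first), and the choice to keep the selected columns in their original order are assumptions made here and are not specified by the method.

```python
def column_variances(data):
    """Step a): variance of each attribute, using the 1/m (population) form."""
    m = len(data)
    n = len(data[0])
    means = [sum(row[j] for row in data) / m for j in range(n)]
    return [sum((row[j] - means[j]) ** 2 for row in data) / m for j in range(n)]

def reduce_dimensionality(data, k):
    """Steps b) and c): keep the k attributes having the greatest variance."""
    variances = column_variances(data)
    # Rank columns by descending variance; ties go to the lower index (assumption).
    ranked = sorted(range(len(variances)), key=lambda j: (-variances[j], j))
    selected = sorted(ranked[:k])  # keep original column order (assumption)
    d_new = [[row[j] for j in selected] for row in data]
    return selected, d_new

# Hypothetical matrix D: m = 4 observations, n = 4 dichotomous attributes.
d = [
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
]
selected, d_new = reduce_dimensionality(d, k=2)
```

In this example the second and fourth attributes have the same value for every observation, so their variance is zero and they are the ones dropped.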
In the dimensionality reduction method of the present invention, the reduced set of k attributes accurately characterize the data of the original matrix D. By "accurately characterize the data" it is meant that the selected k attributes are attributes in the original matrix D that can be best used to effectively differentiate between observations and are useful, for example, in making predictions about an observation based on the data values of the k attributes for the observation. For example, an attribute that has the same data values for each of the observations cannot be used to differentiate between the observations and would not be useful in making predictions about an observation. By "accurately characterize the data" it is also meant that the distance structure between pairs of observations will be maintained to the extent possible between the original matrix D and the reduced dimension matrix Dnew. The fact that the reduced set of k attributes selected by the dimensionality reduction method of the present invention "accurately characterize the data" is borne out by favorable experimental results. Moreover, for the case where the covariance matrix is diagonal, one can prove that the method of the present invention produces the exact same results as the SVD method.
When operated on a sparse data set (90% or more zero entries), the dimensionality reduction method of the present invention generally preserves the distance structure between the observations while requiring a computation time, CT, on the order of CT=B*Q, where B is some constant value and Q is the total number of nonzero entries in the matrix D. In comparison the SVD method generally requires a computation time, CT, on the order of CT=A*Q*k*log(n), where A is a constant, Q is the number of nonzero entries in a data matrix, k is the number of attributes in the reduced set of attributes and n is the number of attributes in the original set of attributes. As can be seen from the respective CT equations, the sparser the data the faster both methods are run on a computer. However, the dimensionality reduction method of the present invention is computationally faster than the SVD method by a factor of k*log(n) and, additionally, the method of the present invention is also faster because empirically it has been found that B is a much smaller constant than A. Over a sparse data set, the dimensionality reduction method of the present invention preserves the favorable distance structure property of the SVD method while reducing the computational time required compared with the SVD method. Moreover, if an explicit reconstruction of the points in the reduced dimensionality space is needed, then SVD requires an extra reconstruction step. The extra reconstruction step takes time RT=C*m*k, where C is a constant, m is the number of observations and k is the number of attributes in the reduced set of attributes. The dimensionality reduction method of the present invention, on the other hand, requires no reconstruction time since the representation in the reduced dimensionality space is merely the result of ignoring certain n-k of the original attributes.
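One way in which the attribute variances might be obtained in time proportional to Q is sketched below in Python. This is an illustrative assumption rather than a described implementation: it relies on the standard identity Var(Aj) = (1/m)*SUM(dij^2) - Mean(Aj)^2, to which zero entries contribute nothing, and on the nonzero cells being available as (row, column, value) triples. The function name and the example data are hypothetical.

```python
def sparse_column_variances(entries, m, n):
    """Variance of each of n attributes from the nonzero cells only.

    entries: iterable of (i, j, value) triples for the nonzero cells of D.
    m: total number of observations, including rows that are entirely zero.
    """
    sums = [0.0] * n
    sums_of_squares = [0.0] * n
    # One pass over the Q nonzero entries; zero cells add nothing to either sum.
    for _i, j, value in entries:
        sums[j] += value
        sums_of_squares[j] += value * value
    variances = []
    for j in range(n):
        mean = sums[j] / m
        variances.append(sums_of_squares[j] / m - mean * mean)
    return variances

# Hypothetical sparse matrix: m = 4 observations, n = 4 attributes, Q = 7 nonzero cells.
entries = [(0, 0, 1), (0, 3, 1), (1, 3, 1), (2, 0, 1), (2, 2, 1), (2, 3, 1), (3, 3, 1)]
variances = sparse_column_variances(entries, m=4, n=4)
```

The loop over the entries performs a fixed amount of work per nonzero cell, and the final loop adds work proportional to n. For continuous attributes with large means, this one-pass form can lose numerical precision relative to the two-pass definition, which is a known trade-off of the identity used.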
Moreover, in the method of the present invention, the k attributes of the new data matrix Dnew are simply a subset of the n attributes of the original data matrix D. Since the original data set D is sparse, so too is the new data matrix Dnew. The SVD method does not have this desirable property. Thus, while both the method of the present invention and the SVD method output an m×k new data matrix, the method of the present invention outputs a sparse new data matrix while the SVD method outputs a dense new data matrix. Stated another way, key to the dimensionality reduction method of the present invention is the fact that the selected k attributes are a subset of the original n attributes and generation of the new reduced dimension matrix Dnew only involves picking those data values from the original matrix D that correspond to the selected k attributes. This is in contrast to the SVD method wherein each data value of the reduced dimension matrix must be calculated as a linear combination of each of the original attribute data values. The ease of computation, the intuitive nature of the data values selected for inclusion in the Dnew matrix, and the maintenance of the sparseness of the original D matrix of the method of the present invention versus the SVD method are important advantages of the present invention.
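The preservation of sparseness may be illustrated by the following Python sketch, which assumes that D is stored as (row, column, value) triples for its nonzero cells and that the k selected column indices are already known. The function name `select_columns_sparse` and the example data are hypothetical.

```python
def select_columns_sparse(entries, selected):
    """Build Dnew by keeping only the nonzero cells that fall in the selected columns."""
    # Map each original column index to its position in Dnew.
    new_index = {j: position for position, j in enumerate(selected)}
    return [(i, new_index[j], value) for i, j, value in entries if j in new_index]

# Hypothetical sparse matrix D with 7 nonzero cells; columns 0 and 2 are selected (k = 2).
entries = [(0, 0, 1), (0, 3, 1), (1, 3, 1), (2, 0, 1), (2, 2, 1), (2, 3, 1), (3, 3, 1)]
d_new_entries = select_columns_sparse(entries, selected=[0, 2])
```

No data value is computed in this step; cells are only picked or discarded, so Dnew can never contain more nonzero entries than D.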
These and other objects, advantages, and features of the present invention are described in detail in conjunction with the accompanying drawings.