The present invention relates to databases, and more specifically to summarizing a large data set.
Data mining attempts to find patterns and relationships among data in a large data set and to predict future results based on those patterns and relationships by searching for and fitting a statistical model to the data set. One of the chief obstacles to effective data mining is the difficulty of managing and analyzing a very large data set. The process of model searching and model fitting often requires many passes over the data, yet fitting a large data set into physical memory may be infeasible.
One approach to the problem is constructing summaries of the large data set on which to base the desired analysis. However, devising general-purpose summaries is difficult. For a particular model, such as a set of multiple regression analyses, statistical theory often suggests sufficient statistics that can be computed in a single pass over a large data file without holding the file in memory. However, the appropriate summaries or statistics depend on having the desired model fixed in advance. The problem is that constructing summaries for permitting effective data mining depends on the desired model, whereas selecting a well-fitting model first requires data mining.
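By way of illustration only, the single-pass computation of sufficient statistics mentioned above may be sketched for the simplest case, a regression of one response on one predictor. The function names and the sample data are hypothetical and are not taken from the original disclosure; the sketch assumes the data arrive as an iterable of (x, y) pairs.

```python
def accumulate_stats(pairs):
    """Read (x, y) pairs once and keep only five running sums."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return n, sx, sy, sxx, sxy


def fit_line(stats):
    """Least-squares intercept and slope computed from the sums alone."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical data lying on the line y = 1 + 2x.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
intercept, slope = fit_line(accumulate_stats(iter(data)))
print(intercept, slope)
```

The five sums suffice for this one model only. Fitting a different model, for example one with a quadratic term, would require sums that were not collected, which is the dependency on a fixed model described above.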
A second approach to the problem is drawing a random sample of the large data set on which to base the analysis. A random sample is easy to draw, and analysis of the sample may take advantage of any statistical method, unrestricted by a possibly unfortunate choice of summary statistics. The biggest disadvantage of random sampling, however, is that sampling variance introduces inaccuracy.
An aspect of the present invention features techniques for representing a large data file with a condensed summary having the same format as the large data file.
In general, in a first aspect, the invention features a method for summarizing an original large data set with a representative data set. First, a first set of characteristic values describing relationships between a plurality of variables of original data elements is determined. Second, for each of the first set of characteristic values, a statistical representation over the set of original data elements is determined. The statistical representation may be an average value, a weighted average value having more weight assigned to more important or more accurate data elements, a sum of values, or moments of the data set. The statistical representation may be updated when new data elements are received. Finally, the set of representative data elements is generated so that the statistical representation of each of a second set of characteristic values over the set of representative data elements is substantially similar to the statistical representation corresponding to each of the first set of characteristic values. The second set of characteristic values describes relationships between a plurality of variables of representative data elements, similar to the first set of characteristic values. The representative data set may be generated by correlating a Taylor series approximation of the set of original data elements to a Taylor series approximation of the set of representative data elements, using a Newton-Raphson iterative scheme.
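As a minimal sketch of the first two steps, and of updating the statistical representation when new data elements are received, the following assumes that each characteristic value is a product of variables up to a chosen order and that the statistical representation is its average, that is, a moment. The class name `MomentAccumulator`, its methods, and the sample elements are hypothetical. The sketch does not implement the Newton-Raphson generation step.

```python
from itertools import combinations_with_replacement
from math import prod


class MomentAccumulator:
    """Running sums of products of variables, up to max_order."""

    def __init__(self, n_vars, max_order):
        self.count = 0
        # Key (0, 0, 1) stands for the characteristic value x0 * x0 * x1.
        self.sums = {
            idx: 0.0
            for order in range(1, max_order + 1)
            for idx in combinations_with_replacement(range(n_vars), order)
        }

    def update(self, element):
        """Fold one newly received data element into the running sums."""
        self.count += 1
        for idx in self.sums:
            self.sums[idx] += prod(element[i] for i in idx)

    def moments(self):
        """Average of each characteristic value over all elements seen."""
        if self.count == 0:
            raise ValueError("no data elements received yet")
        return {idx: s / self.count for idx, s in self.sums.items()}


acc = MomentAccumulator(n_vars=2, max_order=2)
for element in [(1.0, 2.0), (3.0, 4.0)]:
    acc.update(element)
m = acc.moments()
print(m)
```

Because only sums and a count are stored, a later data element changes the representation without any need to revisit earlier elements.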
Some data sets are very large. Consequently, the invention additionally features a method for assigning original data elements to groups prior to determining the statistical representation. Original data elements having a common characteristic, such as a common value of one or more categorical variables, may be assigned to the same group. The common characteristic could additionally or alternatively include a common range of values of one or more quantitative variables. The common characteristic could additionally or alternatively include the value of a statistical characteristic of a plurality of quantitative variables. In such a case, the quantitative variables may need to be standardized. One method is to subtract the mean value from each variable and divide by the standard deviation. The statistical characteristic could then include a range of distances from the origin to the standardized point, and/or the value of a standardized variable relative to the values of the remaining standardized variables.
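The grouping described above may be sketched as follows, for illustration only. The function `assign_groups`, the `shell_edges` parameter, and the sample records are hypothetical. Taking the variable with the largest absolute standardized value as the measure of one standardized variable relative to the others is one possible reading, adopted here as an assumption. The sketch also assumes that every quantitative variable has a nonzero standard deviation.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, pstdev


def assign_groups(records, shell_edges):
    """Map (category, distance shell, dominant variable) to record indices.

    records is a list of (category, [quantitative values]) pairs.
    """
    n_vars = len(records[0][1])
    columns = [[r[1][j] for r in records] for j in range(n_vars)]
    means = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]  # assumed nonzero

    groups = defaultdict(list)
    for i, (category, xs) in enumerate(records):
        # Standardize: subtract the mean, divide by the standard deviation.
        z = [(x - mu) / sd for x, mu, sd in zip(xs, means, sds)]
        # Distance from the origin to the standardized point.
        dist = sqrt(sum(v * v for v in z))
        shell = sum(dist > edge for edge in shell_edges)
        # Variable that is largest in magnitude relative to the others.
        dominant = max(range(n_vars), key=lambda j: abs(z[j]))
        groups[(category, shell, dominant)].append(i)
    return dict(groups)


records = [
    ("a", [0.0, 0.0]),
    ("a", [2.0, 0.0]),
    ("a", [0.0, 2.0]),
    ("a", [2.0, 2.0]),
    ("b", [1.0, 1.0]),
]
groups = assign_groups(records, shell_edges=[1.0])
print(groups)
```

Each record falls into exactly one group, so the statistical representation can subsequently be determined group by group.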
In general, in a second aspect, the invention features a method including assigning original data elements to a group, determining moments of the data elements assigned to the group, and generating representative weighted data elements having moments substantially similar to the moments of the original data elements.
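A toy instance of the generating step, limited to one group and one quantitative variable, is sketched below. It uses a closed-form construction and not an iterative scheme: two weighted points placed one standard deviation on either side of the mean reproduce the count and the first and second moments of the group. The names `squash_1d` and `weighted_moment` and the sample values are hypothetical.

```python
from math import sqrt


def squash_1d(values):
    """Return two (value, weight) pairs matching count, mean and variance."""
    n = len(values)
    mu = sum(values) / n
    sd = sqrt(sum((v - mu) ** 2 for v in values) / n)
    # Equal weights that sum to the number of original data elements.
    return [(mu - sd, n / 2), (mu + sd, n / 2)]


def weighted_moment(points, order):
    """Weighted average of value ** order over (value, weight) pairs."""
    total = sum(w for _, w in points)
    return sum(w * x ** order for x, w in points) / total


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
points = squash_1d(values)

# Moments of order 1 and 2 before and after squashing.
raw = [sum(v ** k for v in values) / len(values) for k in (1, 2)]
squashed = [weighted_moment(points, k) for k in (1, 2)]
print(raw, squashed)
```

Matching higher-order moments, or moments involving several variables, generally requires more points and a numerical solution; this sketch makes no attempt at that.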
In general, in a third aspect, the invention features a computer program product, including a storage device containing computer readable program code. The program code may include code for performing the above-mentioned methods.
In general, in a fourth aspect, the invention features a data structure having representative data elements. Each representative data element has a plurality of quantitative variables and an associated weight variable. The quantitative variables and the weight variable are combinable as representative weighted moments. The weighted moments are substantially similar to moments of a plurality of original data elements. There are fewer representative data elements than original data elements. The sum of the weight variables over all representative data elements represents the number of original data elements.
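One possible in-memory layout for such a data structure is sketched below. The class `RepresentativeElement`, the helper `weighted_moment`, and the sample data are hypothetical. The example is deliberately trivial: the original data elements contain exact duplicates, so the representative elements reproduce the moments exactly, whereas in general the moments would only be substantially similar.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class RepresentativeElement:
    values: tuple   # quantitative variables
    weight: float   # number of original elements this one stands for


def weighted_moment(elements, powers):
    """Weighted average of the product of values[i] ** powers[i]."""
    total = sum(e.weight for e in elements)
    return sum(
        e.weight * prod(v ** p for v, p in zip(e.values, powers))
        for e in elements
    ) / total


original = [(1.0, 2.0), (1.0, 2.0), (1.0, 2.0), (3.0, 4.0)]
representative = [
    RepresentativeElement((1.0, 2.0), 3.0),
    RepresentativeElement((3.0, 4.0), 1.0),
]

# Cross moment of the two variables in the original and the summary.
orig_xy = sum(x * y for x, y in original) / len(original)
rep_xy = weighted_moment(representative, (1, 1))
total_weight = sum(e.weight for e in representative)
print(orig_xy, rep_xy, total_weight)
```

Apart from the added weight, each representative element carries the same variables as an original data element, so a weighted analysis routine can consume the summary in place of the original file.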
In general, in a fifth aspect, the invention features a data processing system having a memory for storing statistical information about a set of original data elements, each original data element having a plurality of variables. The system also includes an application program for generating a data structure having a plurality of representative data elements. The representative data elements have the same variables as the original data elements, with the addition of a weight variable. Statistical information about the weighted representative data elements is substantially similar to the statistical information about the set of original data elements. There are fewer representative data elements than original data elements. The statistical information may include moments of varying order.