This application is related to the cofiled, coassigned, and copending U.S. Application No. 09/298,600, which is entitled “Fast Clustering with Sparse Data,” and is hereby incorporated by reference.
This invention relates generally to data modeling, and more particularly to extracting two-way counts utilizing a sparse representation of the initial data set.
Data modeling has become an important tool in solving complex and large real-world computerizable problems. For example, a web site such as www.msnbc.com has many stories available on any given day or month. The operators of such a web site may desire to know whether there are any commonalities associated with the viewership of a given set of stories. That is, if a hypothetical user reads one given story, can it be said with any probability that the user is likely to read another given story? Answering this type of inquiry allows the operators of the web site to better organize their site, for example, which may in turn yield increased readership.
For problems such as these, data analysts frequently turn to advanced statistical tools. Such tools include building and analyzing statistical models such as naïve-Bayes models, decision trees, and branchings, which are a special class of Bayesian-network structures, all of which are known within the art. To construct these models, generally two-way counts must first be extracted from the source data. Two-way counts for a pair of discrete variables define, for each pair of states of the two variables (each pair of states being a unique pair of one variable having a given value and the other variable having another given value, such that no pair has the same values for the variables as does another pair), the number of records in which that pair of states occurs in the data. In other words, the counts summarize the information that the data provides about the relationship between the two variables, assuming that this relationship is not influenced by the values for any of the other variables in the domain.
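By way of illustration only, the definition of two-way counts given above can be sketched in Python as follows. The function name `two_way_counts`, the attribute names `story_a` and `story_b`, and the toy records are hypothetical and are not taken from the data described herein; the sketch scans every record in full and does not use a sparse representation.

```python
from collections import Counter


def two_way_counts(records, var_a, var_b):
    """Count the records that exhibit each (state of var_a, state of var_b) pair."""
    counts = Counter()
    for record in records:
        counts[(record[var_a], record[var_b])] += 1
    return counts


# Hypothetical toy data: 1 means the user read the story, 0 means the user did not.
records = [
    {"story_a": 1, "story_b": 1},
    {"story_a": 1, "story_b": 0},
    {"story_a": 0, "story_b": 0},
    {"story_a": 1, "story_b": 1},
]

counts = two_way_counts(records, "story_a", "story_b")
```

In this sketch, each key of `counts` is one pair of states, and each value is the number of records in which that pair occurs; a pair that never occurs has a count of zero.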
A disadvantage of extracting two-way counts is that, generally, as the size of the data set increases, the running time of the extraction increases even more so. This is problematic for applications such as the web site example just described, because typically the data set can run into the millions of records, impeding timely analysis thereof. Thus, a data analyst may not build models that are based on two-way counts extraction as much as he or she would like to.
For these and other reasons, there is a need for the present invention.
The invention relates to extraction of two-way counts utilizing a sparse representation of the data set. In one embodiment, a data set is first input. The data set has a plurality of records. Each record has at least one attribute, where each attribute has a default value. The method stores a sparse representation of each record, such that the value of an attribute of the record is stored only if it varies from the default value (that is, if the value equals the default value, it is not stored). A data model is then generated, utilizing the sparse representation. Generation of the data model includes initially extracting two-way counts from the sparse representation. Finally, the model is output.
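As a minimal sketch of the sparse representation just described, and not a definitive implementation, a record may be held as a mapping that contains only the attributes whose values differ from their defaults. The helper names `to_sparse` and `to_dense`, the attribute names, and the choice of a Python dictionary are assumptions made for illustration.

```python
def to_sparse(record, defaults):
    """Keep only the attributes whose value varies from the default value."""
    return {attr: value for attr, value in record.items() if value != defaults[attr]}


def to_dense(sparse_record, defaults):
    """Recover the full record by filling every unstored attribute with its default."""
    dense = dict(defaults)
    dense.update(sparse_record)
    return dense


# Hypothetical attributes: each default is 0 ("story not read").
defaults = {"story_a": 0, "story_b": 0, "story_c": 0}
record = {"story_a": 0, "story_b": 1, "story_c": 0}

sparse = to_sparse(record, defaults)
restored = to_dense(sparse, defaults)
```

Here `sparse` stores a single attribute, `story_b`, and `restored` is equal to the original record, so no information is lost by omitting the default values.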
In one embodiment, extracting the two-way counts from the sparse representation includes explicitly counting two-way counts only for values of the attributes that vary from the default values, and explicitly counting one-way counts also only for values of the attributes that vary from the default values. The remaining one- and two-way counts are then derived. For a data set where most attributes of most records are equal to default values, this embodiment of the invention greatly speeds the run time of extracting two-way counts, and, thus, greatly decreases the run time in which statistical models utilizing two-way counts can be generated.
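One way the explicit counting and the subsequent derivation might be carried out for a single pair of attributes is sketched below in Python. The derivation step shown, which subtracts explicit counts from one-way totals and from the number of records, is an assumption; the summary above does not state the formulas used. The function name `extract_two_way_counts`, the attribute names, and the toy records are hypothetical.

```python
from collections import Counter


def extract_two_way_counts(sparse_records, attr_a, attr_b, default_a, default_b):
    """Two-way counts for (attr_a, attr_b) from records that store only non-default values."""
    n = 0
    one_a, one_b, two = Counter(), Counter(), Counter()

    # Explicit pass: only stored (non-default) values are ever touched.
    for rec in sparse_records:
        n += 1
        has_a, has_b = attr_a in rec, attr_b in rec
        if has_a:
            one_a[rec[attr_a]] += 1
        if has_b:
            one_b[rec[attr_b]] += 1
        if has_a and has_b:
            two[(rec[attr_a], rec[attr_b])] += 1

    # Totals of the explicit two-way counts per non-default state of each attribute.
    row, col = Counter(), Counter()
    for (a, b), c in two.items():
        row[a] += c
        col[b] += c

    # Derived counts: every pair in which at least one attribute is at its default.
    full = Counter(two)
    for a, c in one_a.items():
        full[(a, default_b)] = c - row[a]
    for b, c in one_b.items():
        full[(default_a, b)] = c - col[b]
    full[(default_a, default_b)] = (
        n - sum(one_a.values()) - sum(one_b.values()) + sum(two.values())
    )
    return full


# Hypothetical sparse records; an absent attribute has the default value 0.
sparse_records = [
    {"story_a": 1, "story_b": 1},
    {"story_a": 1},
    {},
    {"story_a": 1, "story_b": 1},
    {"story_b": 1},
]

full = extract_two_way_counts(sparse_records, "story_a", "story_b", 0, 0)
```

In this sketch the work done per record is proportional to the number of stored values rather than to the total number of attributes, which is the source of the speedup when most values equal their defaults.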
The invention includes computer-implemented methods, machine-readable media, computerized systems, and computers of varying scopes. Other aspects, embodiments and advantages of the invention, beyond those described here, will become apparent by reading the detailed description and with reference to the drawings.