This invention relates generally to data modeling, and more particularly to efficient data modeling utilizing a sparse representation of the initial data set.
Data modeling has become an important tool in solving complex and large real-world computerizable problems. For example, a web site such as www.msnbc.com has many stories available on any given day or month. The operators of such a web site may desire to know whether there are any commonalties associated with the viewership of a given set of programs. That is, if a hypothetical user reads one given story, can with any probability it be said that the user is likely to read another given story. Yielding the answer to this type of inquiry allows the operators of the web site to better organize their site, for example, which may in turn yield increased readership.
For problems such as these, data analysts frequently turn to advanced statistical tools. One such tool is the Expectation Maximization algorithm (hereinafter, the EM algorithm), known within the art, many times used in conjunction with a naxc3xafve-Bayes (or, Bayesian) model, as also known within the art. The EM algorithm permits a data analyst to assign each data record (e.g., the stories a given user has read) to a cluster of like data records, where each cluster has a probabilistic dependence for each story. Thus, assignment of a user into a cluster is useful to predict what other stories or types of stories a user may have also read, or may read in the future. (Generally, a cluster is a collection of records, where each record has a similar set of attribute values.)
A disadvantage to utilization of the EM algorithm is that generally, as the size of the data set increases, the running time of the algorithm increases even moreso. This is because the EM algorithm typically cycles over the records in a data set many times, in an iterative manner, until a given convergence criterion is met. This is problematic for problems such as the web site example just described, because typically the data set can run into the millions of records, impeding timely analysis thereof. Thus, a data analyst may not utilize the EM algorithm as much as he or she would like to.
For these and other reasons, there is a need for the present invention.
The invention relates to efficient data modeling utilizing a sparse representation of the data set. In one embodiment, a data set is first input. The data set has a plurality of records. Each record has at least one attribute, where each attribute has a default value. The attributes of one record can in one embodiment largely overlap with the attributes of another record, although the specific values of the attributes for one record may not so overlap. The method stores a sparse representation of each record, such that the value of each attribute of the record is stored only if it varies from the default value (that is, if the value equals the default value, it is not stored). A data model is then generated, utilizing the sparse representation, and the model is output.
Embodiments of the invention, by utilizing the sparse representation summarized in the preceding paragraph, provide for more efficient and less time-intensive data modeling of the data set. For example, in the context of the EM algorithm, the determination of a joint probability distribution over the cluster membership and the record values is generally required to be made for each record. This typically requires determination of a posterior probability for each attribute of each record, which in the prior art can be a time-consuming process. As a further example, embodiments of the invention provide for speeded-up clustering, as can be appreciated by those of ordinary skill within the art.
However, under embodiments of the invention, determination of a posterior probability can involve only initially generating a joint probability based on the default value of an attribute, and determining the posterior probability based on this joint probabilityxe2x80x94such that this posterior probability is used for any record having the default value for this attribute. In other words, the posterior probability does not have to be generated anew for every record with this attribute, but rather once for the default value of the attribute, and then subsequently only for those records having a value for this attribute varying from the default value. This greatly decreases the amount of time it takes for the EM algorithm to be run.
Thus, embodiments of the invention allow for use of the EM algorithm to cluster a sparse data set, without examining the value of every attribute of every record. In instances where each record contains, on average, only a small set of attributes that are not in their default state, embodiments of the invention can yield great decreases in run time.
The invention includes computer-implemented methods, machine-readable media, computerized systems, and computers of varying scopes. Other aspects, embodiments and advantages of the invention, beyond those described here, will become apparent by reading the detailed description and with reference to the drawings.