Large amounts of data are increasingly being collected for a wide variety of reasons. On the Internet alone, web browsing habits and electronic commerce activities of consumers result in large amounts of data that can be analyzed. For example, a web site operator may want to analyze data concerning the web browsing habits of its visitors to improve its web site and make the site more popular. As another example, an electronic commerce provider may want to analyze data concerning the purchases of consumers to increase its overall revenue and profit.
For such data to be analyzed, usually a statistical, probabilistic, or another type of model is constructed. In technical terms, after the type of model is selected, one or more model parameters are determined from the data. For a given type of model, the model parameters govern how the model operates with respect to the data. The data in this case can refer to observed data, as opposed to generated or unobserved data, because the data has been collected from real-world interactions. After being constructed, the resulting model is queried or otherwise used to provide desired answers regarding the data. Use of a model in this way is generally known as data mining.
Frequently, the observed data that is used to construct a model is incomplete. This means that, for some records of the data, the values of one or more variables are missing. For example, data concerning the web browsing habits of web site visitors may include records corresponding to the individual visitors. Ideally, the record for each visitor indicates whether a given web page was visited by him or her. The variables correspond to the web pages for which visitation is being tracked. The variables may be binary, in that each of them has two possible values. The first value corresponds to the web page being visited, whereas the second value corresponds to the web page not being visited. For each visitor, there ideally is a value for each variable corresponding to the visitation or non-visitation of a given web page. However, where the data is incomplete, some of the visitors may have missing values for some of the variables.
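The record layout described above can be pictured with a small sketch. The page names, the records, and the helper functions below are hypothetical and serve only to illustrate binary variables with missing values; they are not taken from any particular data set.

```python
from typing import Dict, List, Optional

# Hypothetical tracked web pages; each one is a binary variable.
PAGES = ["home", "products", "checkout"]

# One record per visitor: True = visited, False = not visited, None = value missing.
records: List[Dict[str, Optional[bool]]] = [
    {"home": True, "products": True, "checkout": False},   # complete record
    {"home": True, "products": None, "checkout": None},    # two missing values
    {"home": None, "products": False, "checkout": False},  # one missing value
]


def missing_counts(rows, variables):
    """Count, for each variable, how many records have no value for it."""
    return {v: sum(1 for r in rows if r.get(v) is None) for v in variables}


def is_complete(row, variables):
    """A record is complete only if every tracked variable has a value."""
    return all(row.get(v) is not None for v in variables)
```

In this sketch the data is incomplete because at least one record fails `is_complete`.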
Constructing a model from incomplete data is more difficult than constructing the model from data having no missing values. More technically, determining the parameters for a selected type of model is not easily accomplished from incomplete data. For this reason, an existing approach to model parameter estimation in light of the incomplete observed data may be used.
One well known approach to parameter estimation for incomplete data is the expectation maximization (EM) algorithm. The EM algorithm is described in detail in the reference Dempster et al., Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B, 39, 1-38 (1977). The EM algorithm is also referred to as the standard EM algorithm, and is outlined as the method 100 of FIG. 1. The algorithm includes two steps, an expectation step, or E-step (102), and a maximization step, or M-step (104). The EM algorithm is an iterative algorithm. For each iteration of 102 and 104, the algorithm improves the parameters. After each iteration, the algorithm determines whether a given convergence criterion has been satisfied (106). If the criterion has been satisfied, then the algorithm is done (108). Otherwise, the algorithm continues with another iteration (110, 102 and 104).
More technically, the EM algorithm estimates parameters, θ, based on incomplete data, y. The EM algorithm is a maximization procedure for a function ƒ(θ|y), where the actual function ƒ(θ|y) depends on the type of estimation being performed. For example, for what is known as maximum likelihood estimation, the function is a log-likelihood function. For what is known as maximum a posteriori estimation, the function is a log-posterior function.
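As a sketch of the two cases just named, the function being maximized may be written out as follows. The symbols p(y|θ), for the likelihood of the observed data, and p(θ), for a prior over the parameters, are assumptions introduced here for illustration.

```latex
\begin{align*}
f_{\mathrm{ML}}(\theta \mid y)  &= \log p(y \mid \theta), \\
f_{\mathrm{MAP}}(\theta \mid y) &= \log p(y \mid \theta) + \log p(\theta) - \log p(y).
\end{align*}
```

The last term, log p(y), does not depend on θ, so it does not affect which parameters maximize the log-posterior.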
In the E-step of 102, a conditional expectation is constructed based on the current estimated parameters and the incomplete observed data. That is, the E-step finds expected completions of the incomplete data given the current parameters to construct the conditional expectation for the complete data model under consideration. The expected completion, or conditional expectation, is known as the Q function. For the nth iteration, the Q function is mathematically written as Q(θ|θn,y). This notation reflects that the Q function is a function of the parameters, θ, and that it is constructed based on the current parameters, θn, as well as the observed incomplete data, y.
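For maximum likelihood estimation, the Q function is commonly written as a conditional expectation of the complete-data log-likelihood. In the sketch below, x denotes the complete data (the observed data y together with the missing values); the symbol x is an assumption introduced here and does not appear elsewhere in this description.

```latex
\[
Q(\theta \mid \theta_n, y)
  = \mathbb{E}\bigl[\, \log p(x \mid \theta) \;\big|\; y,\ \theta_n \,\bigr]
\]
```

The expectation is taken over the missing values, using their distribution given the observed data y and the current parameters θn.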
In the M-step of 104, the conditional expectation is maximized to obtain new parameters. That is, the M-step maximizes the Q function with respect to the parameters, θ, to obtain the parameters θn+1 that are used in the next iteration. In FIG. 1, the setting of the current parameters to the new parameters is performed in 110 prior to the next iteration of the E-step in 102 and the M-step in 104.
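The loop of FIG. 1 can be sketched in code under some loudly stated assumptions. The model below is a hypothetical one-dimensional mixture of unit-variance Gaussians, in which the unobserved component of each record plays the role of the missing value. The function names, the synthetic data, and the log-likelihood convergence test are all illustrative choices, not part of the referenced algorithm description.

```python
import math
import random


def sample_data(n_per_component=200, seed=0):
    """Hypothetical synthetic data: two clusters centered at -2 and 3."""
    rng = random.Random(seed)
    return ([rng.gauss(-2.0, 1.0) for _ in range(n_per_component)]
            + [rng.gauss(3.0, 1.0) for _ in range(n_per_component)])


def e_step(data, means, weights):
    """E-step (102): expected sufficient statistics under the current parameters."""
    k = len(means)
    resp_sum = [0.0] * k    # expected number of records per component
    resp_x_sum = [0.0] * k  # expected sum of record values per component
    log_lik = 0.0
    for x in data:
        # Weighted unit-variance Gaussian density of x under each component.
        dens = [w * math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)
                for w, m in zip(weights, means)]
        total = sum(dens)
        log_lik += math.log(total)
        for c in range(k):
            r = dens[c] / total  # posterior probability of component c
            resp_sum[c] += r
            resp_x_sum[c] += r * x
    return resp_sum, resp_x_sum, log_lik


def m_step(resp_sum, resp_x_sum):
    """M-step (104): parameters maximizing the Q function for these statistics."""
    n = sum(resp_sum)
    means = [sx / s for s, sx in zip(resp_sum, resp_x_sum)]
    weights = [s / n for s in resp_sum]
    return means, weights


def standard_em(data, means, weights, tol=1e-8, max_iter=1000):
    """Iterate E-step and M-step until the log-likelihood gain drops below tol."""
    prev_log_lik = -math.inf
    for iteration in range(1, max_iter + 1):
        resp_sum, resp_x_sum, log_lik = e_step(data, means, weights)
        means, weights = m_step(resp_sum, resp_x_sum)  # new parameters (110)
        if log_lik - prev_log_lik < tol:               # convergence test (106)
            break
        prev_log_lik = log_lik
    return means, weights, iteration
```

A call such as `standard_em(sample_data(), [-1.0, 1.0], [0.5, 0.5])` returns the estimated means, the mixing weights, and the number of iterations used. Note that every call to `e_step` visits every record.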
A primary disadvantage to using the EM algorithm to obtain the parameters for a model from incomplete observed data is that it is not useful where the observed data is large. More technically, the EM algorithm is computationally impractical for large amounts of incomplete observed data. This is because the algorithm often requires many iterations before convergence is reached, with each iteration including an E-step. The time necessary to perform the E-step depends linearly on the number of records in the observed data. That is, as the amount of data increases linearly, the amount of time necessary to perform the E-step also increases linearly.
To overcome this disadvantage, incremental-type EM algorithms can alternatively be used. Incremental-type EM algorithms only perform the E-step for a small subset of the incomplete observed data, referred to as a block of data. The EM algorithm as a result becomes tractable even for large amounts of incomplete data. The most common incremental-type EM algorithm is known simply as the incremental EM algorithm. The incremental EM algorithm is described in detail in the reference Neal and Hinton, A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, in Learning in Graphical Models, Jordan (ed.), Kluwer Academic Publishers: The Netherlands, pp. 355-371 (1998).
However, other kinds of incremental-type EM algorithms also exist and are known, such as the forgetful EM algorithm. The forgetful EM algorithm is described in detail in the reference Nowlan, Soft Competitive Adaptation: Neural Network Learning Algorithms Based on Fitting Statistical Mixtures, Ph.D. thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh (1991). It is also described in detail in the reference Sato and Ishii, On-line EM Algorithm for the Normalized Gaussian Network, Neural Computation, 12(2), 407-432 (2000).
The incremental EM algorithm is outlined as the method 200 of FIG. 2. The incremental EM algorithm first partitions the incomplete observed data into a number of blocks (202). The algorithm still includes an E-step (204) and an M-step (208), with the Q function being determined between them (206). The algorithm is still iterative. For each iteration of 204, 206, and 208, the model parameters improve. After each iteration, the algorithm determines whether the convergence criterion has been satisfied (210). If the criterion has been satisfied, then the algorithm is done (212). Otherwise, the algorithm continues with another iteration of 204, 206, and 208.
More technically, in 202 the data, y, is partitioned into k blocks, y1, y2, . . . , yk. Preferably, the blocks are substantially the same size, although this is not necessary. The E-step in the nth iteration is performed in 204 with respect to a single block, yi, i=n mod k. This results in the construction of an incremental conditional expectation given the current model parameters, referred to as the incremental Q function, or the IQ function, IQ(θ|θn,yi). This is the fraction of the conditional expectation of the incomplete data, the Q function, with respect to the single block, yi. To determine the rest of the Q function, the IQ functions for the other blocks, yj, j≠i, previously determined are added in 206 to the IQ function just determined in 204.
The determination of the Q function in 206 can be written mathematically as:
$$Q(\theta \mid \theta_n, y) = \mathrm{IQ}(\theta \mid \theta_n, y_i) + \sum_{j \neq i} \mathrm{IQ}(\theta \mid \theta_{\mathrm{previous}}, y_j). \tag{1}$$
The term Q(θ|θn,y) in equation (1) is the Q function. The term IQ(θ|θn,yi) is the IQ function determined in 204. The term $\sum_{j \neq i} \mathrm{IQ}(\theta \mid \theta_{\mathrm{previous}}, y_j)$ is the sum of the IQ functions for the blocks other than that for which the IQ function was determined in 204 in the current iteration n. These IQ functions include the IQ function most recently determined for each of the other blocks. Therefore, θprevious in IQ(θ|θprevious,yj) refers to a particular one of the previous k−1 parameterizations for which the IQ function was determined. Note that until the current iteration n is greater than the number of blocks k, the IQ functions for blocks j>n are zero because they have not yet been determined.
The E-step of 204 thus iterates through the blocks of data cyclically. The M-step of 208 is performed as before, maximizing the conditional expectation to obtain new parameters. That is, the M-step of 208 maximizes the Q function constructed in 206 with respect to the parameters, θ, to obtain the new parameters, θn+1, that are set as the current parameters in 214 and used in the next iteration. The E-step and M-step are repeated in successive iterations until the convergence criterion is satisfied.
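The method of FIG. 2 can be sketched in code as follows. As an illustrative assumption, the model is a hypothetical one-dimensional mixture of unit-variance Gaussians, for which each IQ function is fully captured by a pair of per-block sums, so that adding IQ functions amounts to adding those sums. The function names, the interleaved partition, and the once-per-pass convergence test are choices made for this sketch only. Blocks are indexed from zero here.

```python
import math
import random


def sample_data(n_per_component=200, seed=0):
    """Hypothetical synthetic data: two clusters centered at -2 and 3."""
    rng = random.Random(seed)
    return ([rng.gauss(-2.0, 1.0) for _ in range(n_per_component)]
            + [rng.gauss(3.0, 1.0) for _ in range(n_per_component)])


def block_statistics(block, means, weights):
    """Partial E-step (204) on one block; the sums stand in for its IQ function."""
    k = len(means)
    resp_sum = [0.0] * k
    resp_x_sum = [0.0] * k
    for x in block:
        # The Gaussian normalizing constant cancels in the ratio below.
        dens = [w * math.exp(-0.5 * (x - m) ** 2) for w, m in zip(weights, means)]
        total = sum(dens)
        for c in range(k):
            r = dens[c] / total
            resp_sum[c] += r
            resp_x_sum[c] += r * x
    return resp_sum, resp_x_sum


def incremental_em(data, num_blocks, means, weights, tol=1e-6, max_passes=200):
    if not 1 <= num_blocks <= len(data):
        raise ValueError("num_blocks must be between 1 and len(data)")
    k = len(means)
    # 202: partition the data into blocks of nearly equal size.
    blocks = [data[b::num_blocks] for b in range(num_blocks)]
    # Most recent statistics per block; None means "not yet determined" (zero).
    stored = [None] * num_blocks
    for n in range(max_passes * num_blocks):
        i = n % num_blocks  # blocks are visited cyclically
        if i == 0:
            means_at_pass_start = list(means)
        stored[i] = block_statistics(blocks[i], means, weights)  # 204
        # 206: the new block statistics plus those last stored for the others.
        total_s = [0.0] * k
        total_sx = [0.0] * k
        for stats in stored:
            if stats is None:
                continue
            for c in range(k):
                total_s[c] += stats[0][c]
                total_sx[c] += stats[1][c]
        # 208: M-step on the combined statistics.
        grand_total = sum(total_s)
        means = [sx / s for s, sx in zip(total_s, total_sx)]
        weights = [s / grand_total for s in total_s]
        # 210: in this sketch, convergence is tested once per pass over the blocks.
        if i == num_blocks - 1:
            change = max(abs(a - b) for a, b in zip(means, means_at_pass_start))
            if change < tol:
                break
    return means, weights, n + 1
```

Each iteration touches only one block in `block_statistics`, while the sum in step 206 runs over stored per-block statistics rather than over the records themselves.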
The forgetful EM algorithm is similar to the incremental EM algorithm outlined as the method 200 of FIG. 2. The difference is that the Q function is not determined in 206 from a set of IQ functions, as indicated in equation (1). Rather, the Q function is determined as a decaying average of recently traversed data blocks. This is written mathematically as:
$$Q(\theta \mid \theta_n, y) = \mathrm{IQ}(\theta \mid \theta_n, y_i) + \gamma \cdot Q(\theta \mid \theta_{n-1}, y'). \tag{2}$$
In equation (2), y′ is either all the incomplete data, or as much of the incomplete data as has been traversed so far. γ is the decay function. In performing the forgetful EM algorithm, the IQ functions do not need to be saved to determine the Q function, as they are in the incremental EM algorithm.
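The decaying update of equation (2) can be sketched under the same illustrative assumptions as before: a hypothetical one-dimensional mixture of unit-variance Gaussians whose Q function is represented by running sums. Here γ is treated as a constant between 0 and 1, which is an assumption of this sketch, and a fixed number of passes replaces a convergence test. All function names are hypothetical.

```python
import math
import random


def sample_data(n_per_component=200, seed=0):
    """Hypothetical synthetic data: two clusters centered at -2 and 3."""
    rng = random.Random(seed)
    return ([rng.gauss(-2.0, 1.0) for _ in range(n_per_component)]
            + [rng.gauss(3.0, 1.0) for _ in range(n_per_component)])


def block_statistics(block, means, weights):
    """Partial E-step on one block; the sums stand in for its IQ function."""
    k = len(means)
    resp_sum = [0.0] * k
    resp_x_sum = [0.0] * k
    for x in block:
        # The Gaussian normalizing constant cancels in the ratio below.
        dens = [w * math.exp(-0.5 * (x - m) ** 2) for w, m in zip(weights, means)]
        total = sum(dens)
        for c in range(k):
            r = dens[c] / total
            resp_sum[c] += r
            resp_x_sum[c] += r * x
    return resp_sum, resp_x_sum


def forgetful_em(data, num_blocks, means, weights, gamma=0.9, num_passes=20):
    if not 1 <= num_blocks <= len(data):
        raise ValueError("num_blocks must be between 1 and len(data)")
    k = len(means)
    blocks = [data[b::num_blocks] for b in range(num_blocks)]
    # Running statistics play the role of Q(θ|θn−1, y′); no per-block copies are kept.
    run_s = [0.0] * k
    run_sx = [0.0] * k
    for n in range(num_passes * num_blocks):
        i = n % num_blocks
        new_s, new_sx = block_statistics(blocks[i], means, weights)
        # Equation (2): new block statistics plus the decayed running statistics.
        run_s = [a + gamma * b for a, b in zip(new_s, run_s)]
        run_sx = [a + gamma * b for a, b in zip(new_sx, run_sx)]
        # M-step on the decayed statistics.
        grand_total = sum(run_s)
        means = [sx / s for s, sx in zip(run_s, run_sx)]
        weights = [s / grand_total for s in run_s]
    return means, weights
```

Only the two running lists survive between iterations, which reflects the point that the individual IQ functions do not need to be saved.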
The incremental-type EM algorithms advantageously overcome the large data problem of the standard EM algorithm. They do this by partitioning the data into a number of blocks, and performing the E-step with respect to only one block of data in each iteration. However, the incremental-type EM algorithms do not provide guidance as to how the incomplete observed data should be initially partitioned into blocks. This means that to use one of the incremental-type EM algorithms a trial-and-error approach must usually be employed to determine the block size that yields sufficiently fast performance of the algorithm. Experience or intuition may govern the initial selection of the size of the blocks into which the data is partitioned, which can then be decreased or increased to learn whether algorithm performance is improved or worsened. Alternatively, the size of the blocks into which the data is partitioned may be initially and permanently set, without examining whether decreasing or increasing the size yields better algorithm performance.
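The trial-and-error search described above might look like the following sketch. It reruns a simplified incremental EM for several candidate numbers of blocks and reports how many records the E-steps touched before a per-pass convergence test was met. That count is used here as a rough, hypothetical proxy for running time. The model (a one-dimensional mixture of unit-variance Gaussians), the function names, and the candidate values are all assumptions made for illustration.

```python
import math
import random


def sample_data(n_per_component=200, seed=0):
    """Hypothetical synthetic data: two clusters centered at -2 and 3."""
    rng = random.Random(seed)
    return ([rng.gauss(-2.0, 1.0) for _ in range(n_per_component)]
            + [rng.gauss(3.0, 1.0) for _ in range(n_per_component)])


def block_statistics(block, means, weights):
    """Partial E-step on one block; returns per-component sums."""
    k = len(means)
    resp_sum = [0.0] * k
    resp_x_sum = [0.0] * k
    for x in block:
        dens = [w * math.exp(-0.5 * (x - m) ** 2) for w, m in zip(weights, means)]
        total = sum(dens)
        for c in range(k):
            r = dens[c] / total
            resp_sum[c] += r
            resp_x_sum[c] += r * x
    return resp_sum, resp_x_sum


def records_until_converged(data, num_blocks, means, weights,
                            tol=1e-6, max_passes=200):
    """Run incremental EM; return the number of records visited by E-steps."""
    if not 1 <= num_blocks <= len(data):
        raise ValueError("num_blocks must be between 1 and len(data)")
    k = len(means)
    blocks = [data[b::num_blocks] for b in range(num_blocks)]
    stored = [None] * num_blocks
    visited = 0
    for n in range(max_passes * num_blocks):
        i = n % num_blocks
        if i == 0:
            means_at_pass_start = list(means)
        stored[i] = block_statistics(blocks[i], means, weights)
        visited += len(blocks[i])
        # Combine the statistics stored so far, then re-estimate the parameters.
        done = [s for s in stored if s is not None]
        total_s = [sum(s[0][c] for s in done) for c in range(k)]
        total_sx = [sum(s[1][c] for s in done) for c in range(k)]
        means = [sx / s for s, sx in zip(total_s, total_sx)]
        weights = [s / sum(total_s) for s in total_s]
        if i == num_blocks - 1:
            change = max(abs(a - b) for a, b in zip(means, means_at_pass_start))
            if change < tol:
                break
    return visited


def compare_block_counts(data, candidates, means, weights):
    """Trial and error: one full run per candidate number of blocks."""
    return {k: records_until_converged(data, k, list(means), list(weights))
            for k in candidates}
```

A call such as `compare_block_counts(sample_data(), [1, 4, 16], [-1.0, 1.0], [0.5, 0.5])` returns one count per candidate. Each candidate requires a complete run of the algorithm, which is the cost the passage above describes.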
Regardless of the exact prior art approach used to select the size of the blocks into which the incomplete observed data is partitioned, the process is less than ideal. If the statistician or other person responsible for constructing the model chooses the size of blocks poorly, perhaps due to inexperience, model construction will unnecessarily be slowed. Alternatively, the statistician or other person, even if experienced, may feel compelled to try many different block sizes, to yield the best performance of the selected incremental-type algorithm. This, too, causes model construction to be slowed. For these and other reasons, there is a need for the present invention.