1. Field of the Invention
This invention relates generally to data processing systems and more specifically to incrementally adding terms to a trained probabilistic latent semantic analysis (PLSA) model.
2. Background
Latent Semantic Analysis (LSA) is often used before the classification or retrieval of information.
Probabilistic Latent Semantic Analysis (PLSA) has been shown empirically for a number of tasks to be an improvement to LSA. PLSA is a statistical latent class (or aspect) model that provides excellent results in several information retrieval related tasks. PLSA has the positive feature of assigning probability distributions over latent classes to both documents and terms. Most other clustering techniques do a hard assignment to only one class and/or cluster based only on one aspect (for example, only on terms or only on documents). Aspect models are also successfully used in other areas of natural language processing such as language modeling. PLSA addresses the problems of polysemy and synonymy in natural language processing. In synonymy, different writers use different words to describe the same idea. In polysemy, the same word can have multiple meanings. PLSA was first described by Thomas Hofmann in a paper entitled Probabilistic Latent Semantic Indexing, published in the Proceedings of SIGIR-99, pages 35-44 in 1999 and hereby included by reference in its entirety. Additional information about PLSA is disclosed in U.S. Pat. No. 6,687,696, that issued on Feb. 3, 2004 by Hofmann et al. entitled “System and method for personalized search, information filtering, and for generating recommendations utilizing statistical latent class models,” also hereby included by reference in its entirety.
The PLSA model is fitted to training data by using the Expectation Maximization (EM) algorithm. PLSA assigns probability distributions over latent classes to both terms and documents and thereby allows terms and/or documents to belong to more than one latent class, rather than to only one class as is true of many other classification methods. PLSA models represent the joint probability of a document d and a word w based on a latent class variable z as:
$$P(d,w) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d) \qquad (1)$$

which can be read as: the joint probability of document d and term w, P(d,w), is equal to the probability of document d, P(d), times the sum over the latent classes of the probability of term w given latent class z, P(w|z), times the probability of latent class z given document d, P(z|d).
The PLSA model for generation is: a document within a document collection (d∈D) is chosen with probability P(d). For each word in document d, a latent class within the latent class collection (z∈Z) is chosen with probability P(z|d) that in turn is used to choose a term from the term collection (w∈W) with probability P(w|z).
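The generative process just described can be illustrated with a minimal sketch. The function name `sample_document_words` and the toy parameter values are hypothetical and are not part of the model as disclosed; they only show the two-stage draw (latent class first, then term) for a document that has already been chosen.

```python
import random

def sample_document_words(p_z_given_d, p_w_given_z, n_words, rng):
    """Draw n_words terms for one document: a latent class z is drawn from
    P(z|d), then a term w is drawn from P(w|z)."""
    classes = list(p_z_given_d)
    words = []
    for _ in range(n_words):
        z = rng.choices(classes, weights=[p_z_given_d[c] for c in classes])[0]
        terms = list(p_w_given_z[z])
        w = rng.choices(terms, weights=[p_w_given_z[z][t] for t in terms])[0]
        words.append(w)
    return words

# Hypothetical toy parameters (assumed for illustration only).
P_Z_GIVEN_D = {"z1": 0.9, "z2": 0.1}
P_W_GIVEN_Z = {
    "z1": {"bank": 0.5, "loan": 0.5},
    "z2": {"bank": 0.3, "river": 0.7},
}

words = sample_document_words(P_Z_GIVEN_D, P_W_GIVEN_Z, 10, random.Random(0))
```

Note that the term "bank" can be generated by either latent class, which is how the model accommodates polysemy.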
A PLSA model is trained by fitting the model to the document collection D by maximizing the log-likelihood function L:
$$L = \sum_{d \in D} \sum_{w \in d} f(d,w) \log P(d,w) \qquad (2)$$

where f(d,w) represents the number of times w occurs in d.
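As an illustration of equations 1 and 2, the following sketch evaluates the log-likelihood on a toy collection. The names `joint_probability` and `log_likelihood`, the dictionary layout, and all numeric values are assumptions made for this example only.

```python
import math

def joint_probability(d, w, p_d, p_w_given_z, p_z_given_d):
    """Equation 1: P(d,w) = P(d) * sum_z P(w|z) * P(z|d)."""
    mixture = sum(
        p_w_given_z[z].get(w, 0.0) * p_z_given_d[d][z] for z in p_w_given_z
    )
    return p_d[d] * mixture

def log_likelihood(counts, p_d, p_w_given_z, p_z_given_d):
    """Equation 2: sum over documents d and terms w in d of f(d,w) * log P(d,w)."""
    total = 0.0
    for d, term_counts in counts.items():
        for w, f in term_counts.items():
            total += f * math.log(
                joint_probability(d, w, p_d, p_w_given_z, p_z_given_d)
            )
    return total

# Hypothetical toy data: f(d,w) term counts and hand-picked parameters.
counts = {"d1": {"bank": 2, "loan": 1}, "d2": {"river": 3}}
p_d = {"d1": 0.5, "d2": 0.5}
p_z_given_d = {"d1": {"z1": 1.0, "z2": 0.0}, "d2": {"z1": 0.0, "z2": 1.0}}
p_w_given_z = {"z1": {"bank": 0.5, "loan": 0.5}, "z2": {"river": 1.0}}

L = log_likelihood(counts, p_d, p_w_given_z, p_z_given_d)
```

With these values, P(d1,bank) = P(d1,loan) = 0.25 and P(d2,river) = 0.5, so L = 3 log 0.25 + 3 log 0.5 = 9 log 0.5.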
Maximization can be done by using the EM algorithm, which consists of an expectation step (the E-step) and a maximization step (the M-step). The E-step is:
$$P(z \mid d,w) = \frac{P(z \mid d)\, P(w \mid z)}{\sum_{z'} P(z' \mid d)\, P(w \mid z')} \qquad (3)$$
The maximization step (the M-step) is:
$$P(w \mid z) = \frac{\sum_{d} f(d,w)\, P(z \mid d,w)}{\sum_{d,w'} f(d,w')\, P(z \mid d,w')} \qquad (4)$$

$$P(z \mid d) = \frac{\sum_{w} f(d,w)\, P(z \mid d,w)}{\sum_{w,z'} f(d,w)\, P(z' \mid d,w)} \qquad (5)$$
The E and M steps are alternately applied in the EM algorithm.
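One alternation of the E-step (equation 3) and the M-step (equations 4 and 5) might be sketched as follows. The function names `e_step` and `m_step`, the dictionary-based data layout, and the initial parameter values are hypothetical; the sketch assumes strictly positive initial parameters so that no denominator is zero.

```python
def e_step(counts, p_w_given_z, p_z_given_d):
    """Equation 3: posterior P(z|d,w) for every observed (d, w) pair."""
    posterior = {}
    for d, term_counts in counts.items():
        for w in term_counts:
            joint = {z: p_z_given_d[d][z] * p_w_given_z[z][w] for z in p_w_given_z}
            norm = sum(joint.values())
            posterior[(d, w)] = {z: v / norm for z, v in joint.items()}
    return posterior

def m_step(counts, posterior, classes, vocab):
    """Equations 4 and 5: re-estimate P(w|z) and P(z|d) from the posteriors."""
    p_w_given_z = {}
    for z in classes:
        # Numerator of equation 4 for each term; requires a pass over all documents.
        num = {w: 0.0 for w in vocab}
        for d, term_counts in counts.items():
            for w, f in term_counts.items():
                num[w] += f * posterior[(d, w)][z]
        denom = sum(num.values())
        p_w_given_z[z] = {w: v / denom for w, v in num.items()}
    p_z_given_d = {}
    for d, term_counts in counts.items():
        # Numerator of equation 5 for each class; uses only the terms of document d.
        num = {
            z: sum(f * posterior[(d, w)][z] for w, f in term_counts.items())
            for z in classes
        }
        denom = sum(num.values())
        p_z_given_d[d] = {z: v / denom for z, v in num.items()}
    return p_w_given_z, p_z_given_d

# Hypothetical toy collection and starting parameters.
CLASSES = ["z1", "z2"]
VOCAB = ["bank", "loan", "river"]
counts = {"d1": {"bank": 2, "loan": 1}, "d2": {"bank": 1, "river": 3}}
p_w_given_z = {
    "z1": {"bank": 0.4, "loan": 0.4, "river": 0.2},
    "z2": {"bank": 0.3, "loan": 0.2, "river": 0.5},
}
p_z_given_d = {"d1": {"z1": 0.6, "z2": 0.4}, "d2": {"z1": 0.4, "z2": 0.6}}

posterior = e_step(counts, p_w_given_z, p_z_given_d)
new_p_w_given_z, new_p_z_given_d = m_step(counts, posterior, CLASSES, VOCAB)
```

In training, the two calls would be repeated, feeding the re-estimated parameters back into `e_step` on each pass.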
The parameters in the trained PLSA model can be accessed to determine relationships (such as similarities) between the documents that are known to the model, that is, between any subset (including the entire set) of the documents that were in the initial training set together with those documents that have been folded-in to the trained PLSA model after the model was trained. In addition, a new document (a query document) can be added to the trained PLSA model by the folding-in process (subsequently described). Once the query document has been folded-in, the probabilities that the query document matches other documents in the model can be determined from the model parameters. Folding-in does not add new terms from the query document into the trained PLSA model.
Equations 1-5 use a parameterization based on P(z|d) instead of the parameterization P(d|z) used in the Hofmann paper. The P(z|d) parameterization is advantageous because it allows an identical representation of the EM-algorithm to be used for both training and folding-in.
The parameters of the PLSA model can be accessed and used to prepare a result responsive to and/or based on the current state of the model. The parameters (for example, P(z|d)) can also be compared between documents to find similarities between the documents in the set of documents D. This allows high quality associations between documents not necessarily based on similar words, but based on similar concepts per the latent classes.
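One possible way to compare documents through their P(z|d) parameters is sketched below. The choice of cosine similarity is an assumption made for illustration (the text does not prescribe a particular measure), and the function name `cosine_similarity` and the toy distributions are hypothetical.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two latent-class distributions P(z|d)."""
    classes = set(p) | set(q)
    dot = sum(p.get(z, 0.0) * q.get(z, 0.0) for z in classes)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

# Hypothetical P(z|d) values for three documents.
p_z_given_d = {
    "d1": {"z1": 0.9, "z2": 0.1},
    "d2": {"z1": 0.8, "z2": 0.2},
    "d3": {"z1": 0.1, "z2": 0.9},
}

sim_d1_d2 = cosine_similarity(p_z_given_d["d1"], p_z_given_d["d2"])
sim_d1_d3 = cosine_similarity(p_z_given_d["d1"], p_z_given_d["d3"])
```

Here d1 and d2 score as more similar than d1 and d3 because they load on the same latent class, whether or not they share any terms.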
Folding-in is the process of computing a representation (for example, a smoothed term vector) for a query document q and adding that representation into the trained PLSA model. The folding-in process results in a new P(z|d) parameter related to the representation of the query document. New terms in the query document are not added to the trained PLSA model by the folding-in process. New terms in the query document (and the document representation itself) are only included in the prior art PLSA model when the model is retrained.
At the start of the training process, the parameters (P(w|z) and P(z|d)) are either initialized randomly, uniformly or according to some prior knowledge. One skilled in the art will understand that during training, the E and M-steps are repeatedly applied until the parameters stabilize. Stabilization can be determined by using some threshold on characteristics of the model, such as the change in the log-likelihood, L, (equation 2) or the change in the parameters, or by capping the number of iterations. One skilled in the art will understand that stabilization conditions can be empirically determined.
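The initialization options and a stabilization test of the kind described above could be sketched as follows. The helper names (`random_distribution`, `uniform_distribution`, `has_stabilized`) and the default threshold and iteration cap are hypothetical; as noted, suitable values would be determined empirically.

```python
import random

def random_distribution(keys, rng):
    """Random initialization: positive weights normalized to sum to one."""
    # The small offset keeps every entry strictly positive.
    weights = [rng.random() + 1e-6 for _ in keys]
    total = sum(weights)
    return {k: w / total for k, w in zip(keys, weights)}

def uniform_distribution(keys):
    """Uniform initialization: every entry gets the same probability."""
    return {k: 1.0 / len(keys) for k in keys}

def has_stabilized(prev_ll, curr_ll, iteration, tol=1e-4, max_iters=100):
    """Stop when the change in log-likelihood L falls below tol, or when the
    iteration cap is reached (both values are assumed, not prescribed)."""
    return abs(curr_ll - prev_ll) < tol or iteration >= max_iters

vocab = ["bank", "loan", "river"]
init_random = random_distribution(vocab, random.Random(0))
init_uniform = uniform_distribution(vocab)
```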
A folding-in process uses the parameters P(w|z) obtained by the training process to determine the probability of a latent class conditioned on a query document, P(z|q); that is, roughly, the extent to which the query document (for example, a document or query string q) is characterized by the “topic” represented by the latent class, for each latent class in the set of latent classes. Folding-in uses the EM-algorithm as in the training process: the E-step is identical, while the M-step keeps all the P(w|z) constant and re-calculates P(z|q) (the probability of the latent class given the query document). Usually, only a small number of iterations of the E- and M-steps is sufficient for the folding-in process.
As can be seen in the M-step, calculation of each P(z|d) for folding-in does not require summation over all documents d in the training collection (as the summations in equation 5 are over w and z′ and not over d). In particular, it allows a query document q to be added independently of other documents d (known to the model) if the model parameters P(w|z) are kept constant.
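The folding-in procedure can be sketched under stated assumptions as follows: P(w|z) is held constant, P(z|q) starts uniform, and a fixed number of iterations is run. The function name `fold_in`, the iteration count, the uniform fallback for a query with no known terms, and the toy parameters are all hypothetical.

```python
def fold_in(query_counts, p_w_given_z, n_iters=20):
    """Estimate P(z|q) for a query document with P(w|z) kept constant."""
    classes = list(p_w_given_z)
    # Terms unknown to the trained model are skipped: folding-in adds no new terms.
    known = {
        w: f
        for w, f in query_counts.items()
        if any(p_w_given_z[z].get(w, 0.0) > 0.0 for z in classes)
    }
    p_z_given_q = {z: 1.0 / len(classes) for z in classes}
    if not known:
        return p_z_given_q  # assumed fallback: nothing to fit, stay uniform
    for _ in range(n_iters):
        num = {z: 0.0 for z in classes}
        for w, f in known.items():
            # E-step (equation 3) with d replaced by q.
            joint = {z: p_z_given_q[z] * p_w_given_z[z].get(w, 0.0) for z in classes}
            norm = sum(joint.values())
            for z in classes:
                num[z] += f * joint[z] / norm
        # M-step (equation 5) for P(z|q) only; no other document is consulted.
        total = sum(num.values())
        p_z_given_q = {z: v / total for z, v in num.items()}
    return p_z_given_q

# Hypothetical trained parameters and query term counts.
p_w_given_z = {
    "z1": {"bank": 0.5, "loan": 0.5},
    "z2": {"bank": 0.2, "river": 0.8},
}
p_z_given_q = fold_in({"loan": 2, "bank": 1}, p_w_given_z)
```

Because the loop touches only the query's own term counts and the fixed P(w|z), the query can be folded-in without revisiting any training document.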
The M-step for the model parameters P(w|z), however, requires summation over all documents d (as shown by equation 4). Therefore, a run over the complete training collection (including those documents that have been folded-in) is required to update the trained PLSA model with new terms or to change an existing parameter. This is inefficient in both memory and time and thus cannot reasonably be performed in an on-line or real-time environment.
A further disadvantage of the PLSA model can also be seen in the M-step. The PLSA model parameters satisfy:
$$\sum_{w \in W} P(w \mid z) = 1 \qquad (6)$$

for each z, which implies P(w|z)=0 for each w∉W. This is a sum (equal to one) over a set of trained terms, of a plurality of probabilities that each member within the set of trained terms belongs to a latent class z. Therefore, the model cannot incorporate new terms that did not occur in the original training set and thus do not exist in W. Thus, new terms from a folded-in document are ignored.
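The consequence of equation 6 for a query document can be illustrated with a small sketch. The helper names `term_probability` and `split_query_terms`, the example term "blockchain", and the parameter values are hypothetical.

```python
def term_probability(w, z, p_w_given_z):
    """P(w|z); zero for any term outside the trained term set W."""
    return p_w_given_z[z].get(w, 0.0)

def split_query_terms(query_counts, p_w_given_z):
    """Separate query terms into those in W and those the model must ignore."""
    vocabulary = set()
    for dist in p_w_given_z.values():
        vocabulary.update(dist)
    known = {w: f for w, f in query_counts.items() if w in vocabulary}
    ignored = {w: f for w, f in query_counts.items() if w not in vocabulary}
    return known, ignored

# Hypothetical trained parameters; each P(.|z) sums to one over W (equation 6).
p_w_given_z = {
    "z1": {"bank": 0.5, "loan": 0.5, "river": 0.0},
    "z2": {"bank": 0.2, "loan": 0.0, "river": 0.8},
}
known, ignored = split_query_terms({"loan": 2, "blockchain": 5}, p_w_given_z)
unseen_probs = [term_probability("blockchain", z, p_w_given_z) for z in p_w_given_z]
```

Since all of the probability mass for each latent class is already allocated to the terms in W, the unseen term receives probability zero under every class and contributes nothing when the document is folded-in.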
The PLSA model thus has two problems. First, new terms (words not seen in the training set) cannot be added to the model without retraining. Second, updating model parameters requires a run over the complete training data as well as the folded-in data, and this process is inefficient. In applications where new documents are continually being folded-in to the model (such as newspaper articles, technical articles, etc., as well as transcripts of communication intercepts), terms newly defined in the folded-in documents are not incorporated into the model until the model is retrained. Thus, the model parameters can become less accurate as new documents are folded-in to the model.
It would be advantageous to address these problems.