In the field of natural language processing, there is demand for handling the meanings behind words rather than treating text data as mere symbol sequences. Recently, much attention has been given to devices that estimate latent topics (hereinafter referred to as latent topic estimation devices).
A topic is data representing a notion, meaning, or field that lies behind each word. A latent topic is not a topic manually defined beforehand; it is a topic extracted automatically from document data alone, on the basis of the assumption that “words that have similar topics are likely to co-occur in the same document”. In the following description, a latent topic is sometimes simply referred to as a topic.
Latent topic estimation is processing that takes an input of document data, posits that k latent topics lie behind words contained in the document, and estimates a value representing whether or not each of the words relates to each of the 0-th to (k−1)-th latent topics.
Known latent topic estimation methods include latent semantic analysis (LSA), probabilistic latent semantic analysis, and latent Dirichlet allocation (LDA).
LDA in particular is described here. LDA is a latent topic estimation method that assumes that each document is a mixture of k latent topics. LDA is predicated on a document generative model based on this assumption and, in accordance with the generative model, can estimate a probability distribution representing the relation between words and latent topics.
A word generative model in LDA will be described first.
Generation of documents in LDA is determined by the following two parameters.
α_{t}
β_{t, v}
α_{t} is a parameter of the Dirichlet distribution that generates a topic t. β_{t, v} represents the probability of a word v being chosen from a topic t (the word topic probability). Note that _{t, v} indicates that the subscripts t, v are written below β.
A generative model in LDA is a model for generating words by the following procedure in accordance with these parameters. For a document j, the generative model first determines a mixture ratio θ_{j, t} (0≦t<k) of latent topics in accordance with the Dirichlet distribution of parameter α. Then, generation of a word is repeated a number of times equal to the length of the document in accordance with the mixture ratio. Each word is generated by choosing one topic t in accordance with the topic mixture ratio θ_{j, t} and then choosing a word v in accordance with the probabilities β_{t, v}.
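The generative procedure above can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the toy values of α and β, and the document length are assumptions made for the example and do not come from the method being described. The Dirichlet draw is implemented by normalizing gamma-distributed samples.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, beta, length, rng):
    """Generate word ids for one document.

    alpha: list of k Dirichlet parameters (alpha_t).
    beta:  k rows; beta[t][v] is the probability of word v under topic t.
    """
    k = len(alpha)
    theta = sample_dirichlet(alpha, rng)  # mixture ratio theta_{j,t}
    words = []
    for _ in range(length):
        # Choose one topic t according to the mixture ratio.
        t = rng.choices(range(k), weights=theta)[0]
        # Choose a word v according to the word topic probabilities of topic t.
        v = rng.choices(range(len(beta[t])), weights=beta[t])[0]
        words.append(v)
    return theta, words

# Toy parameters (hypothetical): 2 topics, vocabulary of 4 word ids.
rng = random.Random(0)
alpha = [0.5, 0.5]
beta = [[0.7, 0.3, 0.0, 0.0],   # topic 0 only emits words 0 and 1
        [0.0, 0.0, 0.4, 0.6]]   # topic 1 only emits words 2 and 3
theta, words = generate_document(alpha, beta, length=10, rng=rng)
```

Because θ_{j, t} is drawn per document, a single generated document can contain words from both topics, which is the property that the mixture assumption is meant to capture.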
LDA allows α and β to be estimated by assuming such a generative model as described above and giving document data. The estimation is based on the maximum likelihood principle and is accomplished by computing α_{t} and β_{t, v} that are likely to replicate a set of document data.
LDA differs from the other latent topic estimation methods in that it represents the latent topics of a document by the mixture ratio θ_{j, t}, so that a document can have multiple topics. Because a document written in a natural language often contains multiple topics, LDA can estimate word topic probabilities more accurately than the other latent topic estimation methods.
NPL 1 describes a method for estimating α and β incrementally (every time a document is added). A latent topic estimation device to which the method described in NPL 1 is applied repeats computations of the parameters given below when a document j is given, thereby estimating the word topic probability β. FIG. 9 is a diagram illustrating an exemplary configuration of the latent topic estimation device to which the method described in NPL 1 is applied.
The latent topic estimation device illustrated in FIG. 9 repeats computations of the following parameters in order to estimate β.
γ_{j, t}^{k}
φ_{j, i, t}^{k}
n_{j, t}^{k}
n_{j, t, v}^{k}
γ_{j, t}^{k} is a parameter (document topic parameter) of a Dirichlet distribution representing the probability of a topic t appearing in document j. Note that ^{k} indicates that the superscript k is written above γ. φ_{j, i, t}^{k} is the probability (document word topic probability) of the i-th word in document j being assigned to topic t. n_{j, t}^{k} is the expected number of assignments to topic t in document j (the number of document topics). n_{j, t, v}^{k} is the expected number of times a word v is assigned to topic t in document j (the number of word topics).
FIG. 9 illustrates a configuration of the latent topic estimation device and focuses only on estimation of word topic probabilities β.
The latent topic estimation device illustrated in FIG. 9 includes a document data addition unit 501 adding document data that includes one or more words and is input by a user operation or an external program, a topic estimation unit 502 estimating latent topics by repeatedly computing document word topic probability in accordance with a generative model premised on a mixture distribution of topics for an added document, a topic distribution storage unit 504 storing the number of word topics computed by the topic estimation unit 502, a data update unit 503 updating data in the topic distribution storage unit 504 on the basis of the number of word topics computed by the topic estimation unit 502, and a word topic distribution output unit 505 which, when called by a user operation or an external program, computes word topic probability on the basis of the number of word topics in the topic distribution storage unit 504 and outputs the result.
A flow of processing in the latent topic estimation device illustrated in FIG. 9 will be described below. FIG. 10 is a flowchart illustrating topic estimation processing performed in the latent topic estimation device illustrated in FIG. 9.
First, when a document including one or more words is added to the document data addition unit 501, the latent topic estimation device illustrated in FIG. 9 starts the processing. The added document is input into the topic estimation unit 502. The topic estimation unit 502 checks words in the document data in sequence and repeatedly updates the document word topic probability, the number of document topics, the number of word topics, and the document topic parameter to perform probability estimation.
The processing by the topic estimation unit 502 illustrated in FIG. 10 will be described by using Equations 1 to 4, 2′ and 4′.
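[Math. 1]
$$\phi_{j,i,t}^{k} = \frac{\beta_{t,w_{j,i}}^{k} \exp\left\{\psi\left(\gamma_{j,t}^{k}\right) - \psi\left(\sum_{t'} \gamma_{j,t'}^{k}\right)\right\}}{\sum_{t'} \beta_{t',w_{j,i}}^{k} \exp\left\{\psi\left(\gamma_{j,t'}^{k}\right) - \psi\left(\sum_{t''} \gamma_{j,t''}^{k}\right)\right\}} \qquad \text{(Eq. 1)}$$
[Math. 2]
$$n_{j,t}^{k} = n_{j,t}^{\mathrm{old}} - \phi_{j,i,t}^{\mathrm{old}} + \phi_{j,i,t}^{k}, \qquad n_{j,t,v}^{k} = n_{j,t,v}^{\mathrm{old}} + \left(\phi_{j,i,t}^{k} - \phi_{j,i,t}^{\mathrm{old}}\right) I\left(w_{j,i} = v\right) \qquad \text{(Eq. 2)}$$
[Math. 3]
$$n_{j,t}^{\mathrm{old}} = \sum_{i} \phi_{j,i,t}^{\mathrm{old}}, \qquad n_{j,t,v}^{\mathrm{old}} = \sum_{i} \phi_{j,i,t}^{\mathrm{old}} \, I\left(w_{j,i} = v\right) \qquad \text{(Eq. 2′)}$$
[Math. 4]
$$\gamma_{j,t}^{k} = \alpha_{t}^{k} + n_{j,t}^{k} \qquad \text{(Eq. 3)}$$
[Math. 5]
$$\beta_{t,v}^{k} = \frac{\lambda_{0} + A_{k,t,v} + n_{j,t,v}^{k}}{\sum_{t'} \left(\lambda_{0} + A_{k,t',v} + n_{j,t',v}^{k}\right)} \qquad \text{(Eq. 4)}$$
[Math. 6]
$$\beta_{t,v}^{k} = \frac{\lambda_{0} + n_{j,t,v}^{\mathrm{old}}}{\sum_{t'} \left(\lambda_{0} + n_{j,t',v}^{\mathrm{old}}\right)} \qquad \text{(Eq. 4′)}$$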
When a document j made up of N_{j} words is added, the topic estimation unit 502 computes initial values of the following parameters (step n1).
φ_{j, i, t}^{old} (0≦t<k, 0≦i<N_{j})
n_{j, t}^{old} (0≦t<k)
n_{j, t, v}^{old} (0≦t<k)
γ_{j, t}^{k} (0≦t<k)
β_{t, v}^{k} (0≦t<k)
n_{j, t}^{old} is the initial value of the number of document topics and is computed according to Equation 2′. n_{j, t, v}^{old} is the initial value of the number of word topics and is computed according to Equation 2′. γ_{j, t}^{k} is the initial value of the document topic parameter and is computed according to Equation 3. β_{t, v}^{k} is the initial value of the word topic probability and is computed according to Equation 4′.
Note that φ_{j, i, t}^{old} is the initial value of the document word topic probability and is randomly assigned.
The function I (condition) in Equations 2 and 2′ returns 1 when a condition is satisfied, and otherwise returns 0. w_{j, i} represents the i-th word in document j.
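As a rough illustration of step n1, the following Python sketch computes the initial counts of Equation 2′ and the initial document topic parameter of Equation 3 from randomly assigned φ_{j, i, t}^{old}. Several points are assumptions of the sketch rather than statements of the method: the function and variable names are hypothetical, each random φ row is normalized over topics so that it behaves as a probability, Equation 3 is evaluated with the initial counts, and the initial β of Equation 4′ is left out for brevity.

```python
import random

def init_document(words, k, alpha, rng):
    """Step n1 for one document; words holds the word ids w_{j,i}."""
    # phi^{old}: random values, normalized over topics (an assumption).
    phi_old = []
    for _ in words:
        row = [rng.random() + 1e-9 for _ in range(k)]
        total = sum(row)
        phi_old.append([x / total for x in row])

    # Equation 2': expected counts summed over the word positions i.
    n_jt = [sum(row[t] for row in phi_old) for t in range(k)]
    n_jtv = {}  # keyed by (t, v); I(w_{j,i} = v) selects the matching word
    for i, v in enumerate(words):
        for t in range(k):
            n_jtv[(t, v)] = n_jtv.get((t, v), 0.0) + phi_old[i][t]

    # Equation 3, evaluated with the initial counts.
    gamma = [alpha[t] + n_jt[t] for t in range(k)]
    return phi_old, n_jt, n_jtv, gamma

# Toy document (hypothetical): four words, word id 2 occurs twice.
words = [0, 2, 2, 1]
alpha = [0.1, 0.1, 0.1]
phi_old, n_jt, n_jtv, gamma = init_document(words, k=3, alpha=alpha,
                                            rng=random.Random(1))
```

With normalized φ rows, the number of document topics summed over t equals the document length, and the number of word topics for a given word summed over t equals the number of occurrences of that word.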
Then, the topic estimation unit 502 performs processing for updating the values of φ_{j, i, t}^{k}, β_{t, v}^{k} and γ_{j, t}^{k} for each topic t (0≦t<k) for each word (step n2). The update processing is accomplished by computing Equations 1, 2, 3 and 4 in order.
In Equation 1, ψ(x) represents the digamma function and exp(x) represents the exponential function. A_{t, v} in Equation 4 is stored in the topic distribution storage unit 504. Note that when there is no corresponding value in the topic distribution storage unit 504, for example when the first document is added, 0 is assigned to A_{t, v}.
When the parameter updating for all of the words is completed, the topic estimation unit 502 replaces φ_{j, i, t}^{old}, n_{j, t}^{old}, and n_{j, t, v}^{old} with the values φ_{j, i, t}^{k}, n_{j, t}^{k}, and n_{j, t, v}^{k} computed in the current topic estimation in preparation for the next update processing. Then the topic estimation unit 502 performs update processing in accordance with Equations 1 to 4 again for each word.
The topic estimation unit 502 then determines whether to end the processing (step n3). The number of iterations of step n2 performed after a document is added is stored and, upon completion of a certain number of iterations (Yes, at step n3), the topic estimation unit 502 ends the processing.
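Steps n1 to n3 can be sketched as the loop below. This is a hedged illustration under stated assumptions, not the implementation of NPL 1. All names, the value of λ_0, and the iteration count are hypothetical. The digamma function is a simple pure-Python approximation. Each word's change is applied to the counts in place, so the "old" counts are read as running values. The denominator of Equation 4 is summed over topics, following the equation as it appears in this section.

```python
import math
import random

def digamma(x):
    """Digamma psi(x) for x > 0: recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - f * (1.0 / 12 - f * (1.0 / 120 - f / 252)))

def estimate_document(words, k, alpha, stored, rng, lam0=0.01, iterations=20):
    """words: word ids w_{j,i}; stored: dict (t, v) -> A_{t,v} (missing = 0)."""
    # Step n1: random normalized phi, then Equations 2' and 3.
    phi = []
    for _ in words:
        row = [rng.random() + 1e-9 for _ in range(k)]
        total = sum(row)
        phi.append([x / total for x in row])
    n_jt = [sum(row[t] for row in phi) for t in range(k)]
    n_jtv = {}
    for i, v in enumerate(words):
        for t in range(k):
            n_jtv[(t, v)] = n_jtv.get((t, v), 0.0) + phi[i][t]
    gamma = [alpha[t] + n_jt[t] for t in range(k)]

    def beta(t, v):
        # Equation 4 as written here: denominator summed over topics.
        def count(u):
            return lam0 + stored.get((u, v), 0.0) + n_jtv.get((u, v), 0.0)
        return count(t) / sum(count(u) for u in range(k))

    for _ in range(iterations):            # step n3: fixed iteration count
        for i, v in enumerate(words):      # step n2: update for each word
            psi_sum = digamma(sum(gamma))
            weights = [beta(t, v) * math.exp(digamma(gamma[t]) - psi_sum)
                       for t in range(k)]
            total = sum(weights)
            for t in range(k):
                phi_new = weights[t] / total                     # Equation 1
                delta = phi_new - phi[i][t]
                n_jt[t] += delta                                 # Equation 2
                n_jtv[(t, v)] = n_jtv.get((t, v), 0.0) + delta
                gamma[t] = alpha[t] + n_jt[t]                    # Equation 3
                phi[i][t] = phi_new        # becomes phi^{old} for the next pass
    return phi, n_jt, n_jtv, gamma

words = [0, 0, 1, 2, 2, 3]
alpha = [0.1, 0.1]
phi, n_jt, n_jtv, gamma = estimate_document(words, k=2, alpha=alpha,
                                            stored={}, rng=random.Random(0))
```

The inner loops run over every topic for every word in every iteration, so the cost of one document grows linearly with the number of topics k.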
The data update unit 503 updates the value in the topic distribution storage unit 504 on the basis of the number of word topics n_{j, t, v} among the values computed by the topic estimation unit 502. The update is performed according to Equation 5.
[Math. 7]
$$A_{k,t,v} = A_{k,t,v} + n_{j,t,v}^{k} \qquad \text{(Eq. 5)}$$
The word topic distribution output unit 505 is called by a user operation or an external program. The word topic distribution output unit 505 outputs β_{t, v} according to Equation 6 on the basis of the value in the topic distribution storage unit 504.
[Math. 8]
$$\beta_{t,v}^{k} = \frac{\lambda_{0} + A_{k,t,v}}{\sum_{t'} \left(\lambda_{0} + A_{k,t',v}\right)} \qquad \text{(Eq. 6)}$$
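The roles of the data update unit 503 (Equation 5) and the word topic distribution output unit 505 (Equation 6) can be illustrated with the short Python sketch below. The dictionary-based store, the function names, the value of λ_0, and the toy counts are all assumptions made for the example. The denominator is summed over topics, following Equation 6 as it appears in this section.

```python
def update_store(stored, n_jtv):
    """Equation 5: add the number of word topics of a document to the store."""
    for key, value in n_jtv.items():
        stored[key] = stored.get(key, 0.0) + value

def word_topic_probability(stored, k, vocab, lam0=0.01):
    """Equation 6: beta_{t,v} from the stored counts A_{t,v} (missing = 0)."""
    beta = {}
    for v in vocab:
        denom = sum(lam0 + stored.get((t, v), 0.0) for t in range(k))
        for t in range(k):
            beta[(t, v)] = (lam0 + stored.get((t, v), 0.0)) / denom
    return beta

# Toy usage (hypothetical counts): two documents are added in turn.
stored = {}
update_store(stored, {(0, 0): 1.5, (1, 0): 0.5, (0, 1): 0.2, (1, 1): 1.8})
update_store(stored, {(0, 0): 0.5})
beta = word_topic_probability(stored, k=2, vocab=[0, 1])
```

Only the accumulated counts are kept between documents, which is why this method does not need to store the documents themselves.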
This method does not store all documents and does not repeat the estimation processing for all documents; it repeats estimation only for the added document each time a document is added. The method is known to be capable of efficient probability estimation and to operate faster than common LDA. However, the speed is still not high enough: the processing time is proportional to the number of topics, so much time is required when the number of topics k is large. This problem may be addressed by using hierarchical clustering.
NPL 2 describes a hierarchical clustering method. In order to estimate the latent topics in document data, this method recursively performs clustering (= topic estimation) with two clusters (= the number of topics) to divide the data into two. This enables topics to be assigned to each document on the order of log(K). Although the method is a technique in a similar field, it merely assigns topics to documents and cannot estimate the probability of a topic being assigned to a word. Furthermore, a single topic is assigned to each piece of data, so a mixture of multiple topics cannot be represented.