Many online activities are associated with a particular geographic location. For example, people may generate a personal web log (“blog”) that provides accounts of their recent trips, read news articles relating to events in their local communities, and search the web to find local restaurants. The identification of the geographic location associated with a document (e.g., web page, blog, or query) is useful in many applications. For example, many location-based web applications have been developed to support mobile devices and local searching needs. Such location-based web applications include navigation systems, location-based search systems, local advertisement systems, geographic retrieval systems, and so on. These web applications typically need to detect a geographic location of a web resource and match it with the user's current location. For example, a cellular phone user may want to find a restaurant that is near the user's current location. A web application could match the user's current location as indicated by the cellular phone with the location of restaurants to identify which restaurants may be nearby.
Although some web pages have been manually annotated with metadata that describes the associated geographic locations, most web pages have not. Various techniques have been developed to mine the geographic location of documents. Such techniques are generally based on gazetteers and disambiguation algorithms. For example, one technique may extract locations by looking up every word of the document in a gazetteer to identify the words that correspond to a location. Such techniques, however, have problems. First, many geographic terms may have nongeographic meanings. The word “Java” may represent an island in Indonesia, a programming language, a coffee brand, a French band, and so on. Second, many locations share the same name. In the United States, there are at least 17 cities named “Columbus.” Third, documents may contain geographic locations that are not of interest, such as copyright information about the location of a publisher. Finally, some geographic location information may be implicit in words of the document that do not directly correspond to a geographic location and thus would not show up in a gazetteer. For example, the word “Sunni” may have a strong implicit correlation to the location of Iraq.
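The word-by-word gazetteer lookup described above, and the ways it can fail, may be illustrated by the following minimal sketch. The gazetteer contents, the name `GAZETTEER`, and the function `lookup_locations` are hypothetical and are chosen only for illustration; an actual gazetteer would be far larger and would be paired with a disambiguation algorithm.

```python
import re

# Hypothetical toy gazetteer: lowercase surface form -> candidate locations.
GAZETTEER = {
    "java": ["Java, Indonesia"],
    "columbus": ["Columbus, Ohio", "Columbus, Georgia", "Columbus, Indiana"],
    "iraq": ["Iraq"],
}


def lookup_locations(text):
    """Return (word, candidates) for every word of the text found in the gazetteer."""
    matches = []
    for word in re.findall(r"[A-Za-z]+", text):
        candidates = GAZETTEER.get(word.lower())
        if candidates:
            matches.append((word, candidates))
    return matches


# A nongeographic use of "Java" is still reported as a location.
false_positive = lookup_locations("I wrote the parser in Java last week")

# "Columbus" matches several locations, so the lookup alone cannot pick one.
ambiguous = lookup_locations("We drove to Columbus yesterday")

# "Sunni" is not a place name, so the implicit link to Iraq is missed.
missed = lookup_locations("Sunni leaders met today")
```

In this sketch the first call reports a location for a sentence about a programming language, the second returns three candidates for one word, and the third returns nothing.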
Many applications may use the topics of a document in their processing. For example, the topics can be used by search applications and document summarization applications. Many different techniques have been developed to identify the topics of documents. Latent Dirichlet Allocation (LDA) is a technique that identifies a fixed number of latent topics in a collection of documents based on the similarity of the words of the documents. Each word in the collection of documents has a probability of being related to each of the latent topics. For example, if the fixed number of latent topics is five, the word “magic” may have probabilities of 0.02, 0.04, 0.01, 0.01, and 0.02 for each of the latent topics. Based on the probabilities of the words of a document, each document has a probability of being related to each of the latent topics. For example, a document may have the probabilities of 0.2, 0.1, 0.1, 0.1, and 0.5 for each of the latent topics. Given a collection, LDA learns the probability that each word of the collection relates to each latent topic and the probability that each document in the collection relates to each topic. LDA then uses the learned probabilities to calculate the probability that a document not in the collection relates to each of the latent topics.
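The example probabilities above may be combined as in the following minimal sketch. The sketch assumes that the per-topic probabilities for “magic” are read as the probability of the word given each topic; the names `doc_topics`, `magic_given_topic`, `word_probability`, and `dominant_topic` are hypothetical.

```python
# Document-to-topic probabilities from the example above (five latent topics).
doc_topics = [0.2, 0.1, 0.1, 0.1, 0.5]

# Probabilities of the word "magic" for each topic, read here (an assumption)
# as p(word | topic).
magic_given_topic = [0.02, 0.04, 0.01, 0.01, 0.02]


def word_probability(topic_probs, word_given_topic):
    """Mixture probability of a word: sum over topics of p(topic) * p(word | topic)."""
    return sum(t * w for t, w in zip(topic_probs, word_given_topic))


def dominant_topic(topic_probs):
    """Index of the topic to which the document is most strongly related."""
    return max(range(len(topic_probs)), key=lambda i: topic_probs[i])


p_magic = word_probability(doc_topics, magic_given_topic)  # approximately 0.02
best = dominant_topic(doc_topics)  # index 4, the topic with probability 0.5
```

Under this reading, the document's five topic probabilities sum to one, and the word's probability in the document is a weighted average of its per-topic probabilities.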
More specifically, LDA provides a generative probabilistic model of a collection of documents based on a Dirichlet distribution. The documents of the collection are represented as random mixtures over latent topics, and each topic is characterized by a distribution of words. Using inference techniques, LDA learns, from a collection of documents, parameters of the model that represent the relationship between the words of the documents and the latent topics. The parameters of the model include k representing the number of latent topics, $\beta_{ij}$ representing the probability that word $w_j$ is related to topic $z_i$, θ representing a Dirichlet random variable that is a k-element vector of the probability that a document relates to each of the k topics, and α representing a k-element vector indicating the probability that a document within the collection relates to each of the k topics. FIG. 1 is a graphical representation of the LDA model. The rectangle 101 represents the collection of D documents, and the rectangle 102 represents the N words in a document. As shown, there is one β and one α for the entire collection, one θ for each document, and one topic z for each word. LDA represents the probability distribution for the parameter θ as follows:
$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}$$
where Γ(x) represents the Gamma function. LDA represents the joint distribution of a topic mixture as follows:
$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)$$
where $p(z_n \mid \theta)$ represents $\theta_i$ for the unique i such that $z_n^i = 1$. This joint distribution represents the probability of any combination of θ, z, and w given α and β where w represents a vector of words of the document and z represents a vector with a topic for each word of the document. LDA represents a marginal distribution of a document as follows:
$$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta$$
The marginal distribution represents the probability of the document w given α and β. LDA represents the probability of the collection as the product of the marginal probabilities of the documents as follows:
$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$$
where D represents the collection, M represents the number of documents in the collection, $N_d$ represents the number of words in document d, $\theta_d$ represents θ for document d, $z_{dn}$ represents z for word n of document d, and $w_{dn}$ represents w for word n of document d.
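The Dirichlet density, the marginal distribution of a document, and the probability of a collection given above may be evaluated for very small inputs as in the following sketch. The sketch makes several assumptions: words are vocabulary indices, `beta[i][j]` holds the probability of word j under topic i, and the integral over θ is computed exactly by enumerating every topic assignment and using the closed-form Dirichlet moment. This brute-force enumeration is a swapped-in technique chosen for clarity and is practical only for a handful of words. All function names and the example values of `alpha` and `beta` are hypothetical.

```python
import math
from itertools import product


def dirichlet_pdf(theta, alpha):
    """Density p(theta | alpha) for theta strictly inside the simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))
    return math.exp(log_norm + log_kernel)


def expected_theta_product(counts, alpha):
    """Closed form of E[prod_i theta_i ** counts[i]] under Dirichlet(alpha)."""
    total = sum(alpha)
    log_value = math.lgamma(total) - math.lgamma(total + sum(counts))
    for a, c in zip(alpha, counts):
        log_value += math.lgamma(a + c) - math.lgamma(a)
    return math.exp(log_value)


def marginal_likelihood(words, alpha, beta):
    """Exact p(w | alpha, beta), summing over all k ** N topic assignments."""
    k = len(alpha)
    total = 0.0
    for z in product(range(k), repeat=len(words)):
        counts = [z.count(i) for i in range(k)]
        word_term = math.prod(beta[topic][word] for topic, word in zip(z, words))
        total += expected_theta_product(counts, alpha) * word_term
    return total


def collection_likelihood(documents, alpha, beta):
    """p(D | alpha, beta) as the product of the per-document marginals."""
    return math.prod(marginal_likelihood(doc, alpha, beta) for doc in documents)


# Hypothetical example: two topics over a three-word vocabulary.
alpha = [0.5, 0.5]
beta = [[0.7, 0.2, 0.1],
        [0.1, 0.3, 0.6]]
documents = [[0, 0, 1], [2, 2]]
likelihood = collection_likelihood(documents, alpha, beta)
```

Because the number of topic assignments grows as k to the power N, the enumeration quickly becomes infeasible for documents of realistic length.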
LDA estimates the parameters using a variational expectation maximization (“EM”) procedure. The procedure maximizes a lower bound on the log likelihood with respect to variational parameters and then, for fixed values of the variational parameters, maximizes the lower bound with respect to the model parameters α and β. Once the parameters are learned, LDA can calculate the joint distribution for θ and z given w, α, and β as follows:
$$p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$$
Since the solution is computationally complex, LDA may use an approximation based on variational inference as described in Blei, D., Ng, A., and Jordan, M., “Latent Dirichlet Allocation,” Journal of Machine Learning Research, 3:993-1022, January 2003. Thus, given a document, LDA can apply this equation to determine the probability distribution of the topics for the document and for each word within the document.
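For a document of only a few words, the posterior distribution above may be computed exactly, which also shows why an approximation is needed in practice. The following sketch is not the variational inference procedure of Blei et al.; it is a brute-force enumeration that assumes words are vocabulary indices, that `beta[i][j]` is the strictly positive probability of word j under topic i, and that θ is integrated out analytically. The names `log_weight` and `exact_posterior` and the example values are hypothetical.

```python
import math
from itertools import product


def log_weight(z, words, alpha, beta):
    """log p(z, w | alpha, beta), with theta integrated out in closed form."""
    total = sum(alpha)
    value = math.lgamma(total) - math.lgamma(total + len(words))
    for i, a in enumerate(alpha):
        value += math.lgamma(a + z.count(i)) - math.lgamma(a)
    for topic, word in zip(z, words):
        value += math.log(beta[topic][word])
    return value


def exact_posterior(words, alpha, beta):
    """Return p(z | w) for every assignment z and the posterior mean of theta."""
    k, n = len(alpha), len(words)
    assignments = list(product(range(k), repeat=n))  # k ** n assignments
    weights = [math.exp(log_weight(z, words, alpha, beta)) for z in assignments]
    evidence = sum(weights)  # equals p(w | alpha, beta)
    posterior_z = {z: wt / evidence for z, wt in zip(assignments, weights)}
    # Given z, theta is Dirichlet(alpha + counts), whose mean for topic i is
    # (alpha_i + count_i) / (sum(alpha) + n); average that mean over p(z | w).
    theta_mean = [
        sum(p * (alpha[i] + z.count(i)) / (sum(alpha) + n)
            for z, p in posterior_z.items())
        for i in range(k)
    ]
    return posterior_z, theta_mean


# Hypothetical example: two topics over a three-word vocabulary.
alpha = [0.5, 0.5]
beta = [[0.7, 0.2, 0.1],
        [0.1, 0.3, 0.6]]
posterior_z, theta_mean = exact_posterior([0, 0, 0], alpha, beta)
```

In this example every word of the document is far more probable under the first topic, so the posterior mean of θ is weighted toward that topic. The enumeration visits k to the power N assignments, which is the computational cost that the variational approximation avoids.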