In many practical situations, sections of text with different patterns of vocabulary usage may refer to the same subject matter while using different key terms to express the same meanings. For example, different regional dialects of the same language, different levels of formality or technicality in discourse, and different styles of writing all represent such differences in vocabulary usage; we may refer to all of these as dialects in a broader sense. An important problem then is: given query terms in one dialect, how can one reliably return relevant sections written in different dialects? Solving this problem would have practical value in information retrieval, where searching for useful information in an unfamiliar domain is difficult when the key terminology differs. Examples of such situations are user manuals for different programming languages, user manuals for products of different brands, or course catalogues from different universities.
In statistics, latent Dirichlet allocation (LDA) is a generative model that attempts to find clusters of words, known as topics, by analyzing the co-occurrence of words across documents. LDA and its extensions model each document as a mixture of topics, where each word is generated from one of the topics.
LDA is a generative model in the sense that it specifies a probabilistic procedure for generating the words in documents. A set of multinomial topic probabilities is drawn for each document, and a set of multinomial word probabilities is drawn for each topic, both from Dirichlet priors. Then, for each word position in the document, a topic is drawn according to the document's topic probabilities; finally, a word is drawn according to that topic's word probability distribution. When observing data, however, the topic distribution of each document, the probability distribution of words given topics, and the topic that generated each word in the document are not known. Inference in LDA generally means estimating the posterior distributions of the topic probabilities in each document, the probabilities of the words given the topics, and the assignment of a topic to each word.
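The generative procedure described above can be illustrated with a minimal sketch. It is not taken from any particular LDA implementation: the function names, the symbols `theta` (document-topic probabilities) and `phi` (topic-word probabilities), and the symmetric concentration values `alpha` and `beta` are all assumptions chosen for illustration. Only the sampling direction is shown; inference, which recovers these quantities from observed words, is not.

```python
import random


def sample_dirichlet(rng, concentration, dim):
    """Draw from a symmetric Dirichlet by normalizing Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus following the LDA generative story.

    alpha and beta are assumed symmetric Dirichlet concentrations;
    words and topics are represented by integer ids.
    """
    rng = random.Random(seed)
    # One word distribution per topic, drawn from a Dirichlet prior.
    phi = [sample_dirichlet(rng, beta, vocab_size) for _ in range(n_topics)]
    theta, docs, assignments = [], [], []
    for _ in range(n_docs):
        # One topic distribution per document, drawn from a Dirichlet prior.
        theta_d = sample_dirichlet(rng, alpha, n_topics)
        # For each word position, draw a topic from the document's mixture...
        z_d = rng.choices(range(n_topics), weights=theta_d, k=doc_len)
        # ...then draw a word from that topic's word distribution.
        w_d = [rng.choices(range(vocab_size), weights=phi[k], k=1)[0]
               for k in z_d]
        theta.append(theta_d)
        assignments.append(z_d)
        docs.append(w_d)
    return theta, phi, docs, assignments
```

In this sketch only `docs` would be observed in practice; `theta`, `phi`, and `assignments` are the latent quantities whose posterior distributions inference tries to estimate.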
Although LDA itself is not intended to model dialect dependencies, several extensions of it have been developed for this purpose.
Word-sense disambiguation methods using topic models attempt to learn a polysemous word's hidden sense according to a predefined labelled hierarchy of words. Some models for multi-lingual corpora require aligned or syntactically similar documents. Others work on unaligned documents, but they model corresponding topics in different vocabularies. In comparison, our method is completely unsupervised and models dialects within shared vocabularies.
One related work in these respects is the “dialect topic model” (diaTM), which associates different documents in a corpus with different draws from both a mixture of dialects and a mixture of topics. We consider applications where each corpus is associated with just one dialect and all corpora share a universal set of topics, while each corpus can associate different terminologies with each topic. This would account for systematic changes in language across corpora (corresponding to dialects) without imposing differences in the topics. The structure of the “dialect topic model” does not facilitate the formulation of such constraints, as it allows each corpus to define different sets of topics.
Further related works are the topic-adapted latent Dirichlet allocation model (τLDA), which models a technicality hierarchy in parallel with the topic hierarchy, and the hierarchical latent Dirichlet allocation (hLDA) model, which models a tree-structured hierarchy for the learned topics using the nested Chinese restaurant process. These models are best suited to documents of differing levels of specificity (or “technicality”), which is not necessarily the case in the applications we consider.
Another problem with the above methods is that they are unable to directly identify the sets of equivalent terms which vary as a function of the dialect. This indicates a failure to precisely model the inherent constraints of the problem, and could lead to inaccurate results for information retrieval.