Embodiments of the present invention relate generally to methods and systems for representing content expressed in a spoken language and, more particularly, to representing a plurality of languages in a lexicon based on a unified language model.
Language modeling is used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, etc. Language modeling is also used in other processes such as searching, information storage and retrieval, data compression, etc. For example, in language processing, a language model can be used to identify words in a sequence. In information retrieval, a language model can be used to rank or order retrieved documents based on the probability that each document's contents would match the terms of a query.
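By way of illustration only, the ranking use described above can be sketched as follows. The sketch assumes a unigram model with additive smoothing and whitespace tokenization; these choices, and the names `query_log_likelihood`, `rank_documents`, and `alpha`, are hypothetical and are not drawn from any particular system.

```python
import math
from collections import Counter


def query_log_likelihood(query, document, vocab_size, alpha=1.0):
    """Log-probability of the query terms under a smoothed unigram model of the document."""
    doc_tokens = document.lower().split()  # assumption: whitespace tokenization
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    score = 0.0
    for term in query.lower().split():
        # Additive smoothing keeps unseen terms from yielding zero probability.
        score += math.log((counts[term] + alpha) / (total + alpha * vocab_size))
    return score


def rank_documents(query, documents):
    """Return the documents ordered from most to least likely to match the query."""
    vocab = {tok for doc in documents for tok in doc.lower().split()}
    scored = [(query_log_likelihood(query, doc, len(vocab)), doc) for doc in documents]
    return [doc for _, doc in sorted(scored, key=lambda pair: pair[0], reverse=True)]


docs = [
    "the cat sat on the mat",
    "dogs chase the ball",
    "stock prices fell sharply",
]
ranked = rank_documents("cat mat", docs)
```

In this sketch, the document containing both query terms receives the highest score and is therefore ranked first.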
One example of a language model is the “bag-of-words” model. The bag-of-words model is a simplifying representation in which a text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. However, current approaches to language modeling, including the bag-of-words model, focus on individual languages. That is, the model for one language is exclusive to that particular language. When dealing with more than one language, a different model is needed for each language. This creates inefficiencies and overhead when dealing with multiple languages. Hence, there is a need for improved methods and systems for representing content expressed in a spoken language.
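For illustration, the bag-of-words representation described above can be sketched as a multiset of word counts. The sketch assumes lowercasing and whitespace tokenization, and the function name `bag_of_words` is hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Represent a text as a multiset of its words (assumes whitespace tokenization)."""
    return Counter(text.lower().split())


# Two texts with the same words in a different order.
first = bag_of_words("John likes movies and Mary likes movies too")
second = bag_of_words("movies too likes Mary and movies likes John")

same_bag = first == second        # word order is disregarded
likes_count = first["likes"]      # multiplicity is kept
```

Because word order is discarded, the two texts map to the same bag, while the count stored for each repeated word preserves its multiplicity.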