In automatic speech recognition (ASR), a statistical language model plays an important role. The statistical language model is obtained by modeling how frequently a word or a sequence of multiple words (hereinafter, also referred to as a “word string”) appears in a corpus that contains a large number of natural language sentences.
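As an illustration of what modeling appearance frequency can mean in practice, the following minimal sketch counts word strings of length n in a tiny corpus and derives a maximum-likelihood conditional probability from the counts. The function name `count_ngrams`, the whitespace tokenization, and the two sample sentences are assumptions made for this example only; practical language models also apply smoothing, which is omitted here.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count every contiguous n-word string in whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Hypothetical two-sentence corpus, for illustration only.
corpus = ["the market rose today", "the market fell today"]
unigrams = count_ngrams(corpus, 1)
bigrams = count_ngrams(corpus, 2)

# Maximum-likelihood estimate: P(rose | market) = count(market rose) / count(market)
p_rose_given_market = bigrams[("market", "rose")] / unigrams[("market",)]
```

With so few sentences, most word strings are never observed at all, which is the data shortage that motivates collecting a larger corpus.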
Training a language model requires a training corpus collected from a field that matches, as closely as possible, the target field (also referred to as a “target domain”) of an automatic speech recognition application. Constructing such a training corpus requires an enormous number of sentences in the target field (hereinafter, also referred to as a “target field corpus”). Unfortunately, the amount of natural language sentences associated with the target field is typically limited. Therefore, it is difficult to collect a large corpus in the target field. In particular, where the target field is a specialized field (e.g., a financial field or a scientific field), collecting a corpus in the target field is even more difficult.
Typically, collecting a large number of natural language training sentences requires a dictation (transcription) operation in which a person listens to an utterance in the target field and converts the utterance into a text sentence. However, since this operation is performed manually, the cost is high. Accordingly, the number of text sentences that can be easily acquired by a manual process is limited.
In such a situation, machine-readable documents that can be collected relatively easily can be used. Examples include enormous amounts of text from newspapers, crawled web text, and social networking services (e.g., Facebook®, Twitter®, Google+®, Myspace®, LinkedIn® and LINE® worldwide, and, e.g., Mixi®, GREE®, Mobage® and Ameba® in Japan) (hereinafter, also referred to as an “out-of-target-field corpus”). Techniques for selecting, from such machine-readable documents, the natural language sentences required for training a language model have been developed.
However, simply increasing the amount of natural language sentences is insufficient. It is desirable to construct a language model from appropriate natural language sentences that conform to the target field of the application (e.g., an automatic speech recognition application) to which the language model is applied.
Accordingly, training a language model using sentences contained in a small-scale corpus in the target field and an enormous amount of sentences in out-of-target-field corpora is a practical scenario.
Thus, the selection of sentences from out-of-target-field corpora using a statistical model estimated from corpora in the target field has been researched.
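One simple way to picture this kind of selection is sketched below: a unigram model is estimated from a small target field corpus, each out-of-target-field sentence is scored by its per-word cross-entropy under that model, and only sentences scoring below a threshold are kept. This is a minimal sketch under stated assumptions, not the method of any particular prior work; the function names, the add-one smoothing, the threshold value, and the sample sentences are all hypothetical.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Estimate unigram counts from a whitespace-tokenized corpus."""
    counts = Counter(w for s in sentences for w in s.split())
    return counts, sum(counts.values())

def cross_entropy(sentence, counts, total, vocab_size):
    """Per-word cross-entropy in bits, using add-one smoothing (an assumption)."""
    words = sentence.split()
    log_prob = sum(
        math.log2((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return -log_prob / len(words)

def select_sentences(target_corpus, candidates, threshold):
    """Keep candidate sentences that the target-field model finds likely."""
    counts, total = train_unigram(target_corpus)
    # Vocabulary covers both corpora so unseen words get non-zero probability.
    vocab = set(counts) | {w for s in candidates for w in s.split()}
    return [
        s for s in candidates
        if s.split() and cross_entropy(s, counts, total, len(vocab)) < threshold
    ]

# Hypothetical data: a tiny financial target corpus and two candidates.
target_corpus = ["stock prices rose", "bond prices fell"]
candidates = ["stock prices fell", "my cat sleeps all day"]
selected = select_sentences(target_corpus, candidates, threshold=3.5)
```

A lower cross-entropy means the sentence resembles the target field more closely; in this toy data only the finance-like candidate passes the threshold.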
Japanese patent JP2012-78647A describes a language model training apparatus that is used together with means for storing a machine-readable corpus containing multiple natural language sentences, and that trains, from the corpus, a language model suitable for a specific usage. The apparatus includes: a template storing means for storing a word string template preliminarily prepared for the specific usage; a word string extracting means for extracting from the corpus a word string pattern matching the word string template stored in the template storing means; a transformation means for transforming the word string pattern extracted by the word string extracting means on the basis of a transformational rule preliminarily prepared for generating word strings in a natural language having a form suited to a preliminarily selected purpose; and a training means for training the language model using word strings output from the transformation means as training data.
Japanese patent JP2012-83543A describes a language model generating device including: a corpus analyzing means for analyzing text in a corpus including a set of world wide web (web) pages, an extracting means for extracting at least one word appropriate for a document type set according to a speech recognition target based on an analysis result by the corpus analyzing means, a word set generating means for generating a word set from the at least one word extracted by the extracting means, a web page acquiring means for causing a retrieval engine to perform a retrieval process using the word set generated by the word set generating means as a retrieval query of the retrieval engine on the Internet and acquiring a web page linked from the retrieval result, and a language model generating means for generating a language model for speech recognition from the web page acquired by the web page acquiring means.
David Guthrie et al., “A Closer Look at Skip-gram Modelling” describes a method of using skip-grams to address the problem of data sparsity (Abstract). As indicated by the “2-skip-bi-grams” and “2-skip-tri-grams” described in the section “2. Defining skip-grams” on page 1222, with skip-grams a word in a word string is deleted (skipped) so that the words before and after the deleted word become adjacent to each other, and bi-grams and tri-grams are formed from the resulting word string.
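To make the idea concrete, the following minimal sketch enumerates skip-grams under the general k-skip-n-gram definition, in which up to k words in total may be skipped between the n words of a gram (skipping zero words yields the ordinary n-grams). The function name `skip_grams` and the sample sentence are assumptions made for illustration.

```python
from itertools import combinations

def skip_grams(words, n, k):
    """Return all n-word grams in which at most k words in total are skipped."""
    grams = []
    for start in range(len(words) - n + 1):
        # The gram begins at `start`; its remaining n - 1 words are drawn,
        # in order, from the next n - 1 + k positions of the sentence.
        window = range(start + 1, min(start + n + k, len(words)))
        for rest in combinations(window, n - 1):
            grams.append(tuple(words[i] for i in (start,) + rest))
    return grams

# Illustrative five-word sentence.
words = "insurgents killed in ongoing fighting".split()
two_skip_bigrams = skip_grams(words, n=2, k=2)
two_skip_trigrams = skip_grams(words, n=3, k=2)
```

For example, `("insurgents", "ongoing")` is a 2-skip-bi-gram because two words are skipped between its members, whereas `("insurgents", "fighting")` is not, since it would require skipping three. Because each sentence yields more grams than ordinary n-gram extraction does, more word strings are observed from the same amount of text.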