Speech recognition is the process of transforming human speech into text. In recent years, statistical models have been commonly used in speech recognition systems. Namely, if input speech is designated as X and an output character string is designated as W, then speech recognition is the process of outputting the word string W having the maximum posterior probability P(W|X) for the input X. According to Bayes' rule, the posterior probability P(W|X) is given by (Eq. 1) below.
    P(W|X) = P(W) × P(X|W) / P(X)    [Eq. 1]
Here, the probability models producing P(X|W) and P(W) in (Eq. 1) above are referred to, respectively, as “acoustic models” and “language models”. They are trained using large-scale electronic speech and language data collections called corpora. Among language models, N-gram models, which predict the probability of occurrence of the next word from the preceding (n−1) words, are widely used and require large amounts of text for robust recognition.
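The N-gram approach described above can be sketched as a minimal bigram (n = 2) model trained by maximum-likelihood counting. This is an illustrative sketch only; the function names and toy corpus are hypothetical and do not correspond to any particular system.

```python
from collections import defaultdict

def train_bigram_model(sentences):
    """Count unigrams and bigrams from tokenized sentences."""
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        for i in range(len(padded) - 1):
            unigram[padded[i]] += 1
            bigram[(padded[i], padded[i + 1])] += 1
    return unigram, bigram

def bigram_prob(unigram, bigram, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0 if prev is unseen."""
    if unigram[prev] == 0:
        return 0.0
    return bigram[(prev, word)] / unigram[prev]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram_model(corpus)
print(bigram_prob(uni, bi, "the", "cat"))  # 0.5
```

With only two sentences the model already illustrates why large amounts of text are needed: any word pair absent from the corpus receives probability zero, which is what smoothing techniques address in practice.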
In addition, in order to achieve a high degree of recognition accuracy in speech recognition, it is desirable to train the acoustic models and language models used for speech recognition on data collected in the same environment as the input speech environment. For acoustic models, speech data from the same speakers and data with the same type of acoustics (noise, etc.) are suggested as data collected in the same environment as the input speech environment. For language models, data that matches the input speech in discourse style and topics is suggested.
As far as discourse style is concerned, for example, the written language used in newspapers, etc. differs from the language people use in everyday conversation (spoken language). Thus, when the input speech is obtained from news broadcasts, a high degree of recognition accuracy can be achieved if language model training is carried out using data obtained from similar oral presentations (relatively close to written language). In addition, when the input speech is composed of colloquial language, a high degree of recognition accuracy can be achieved by performing language model training on spoken language corpora.
Research into spoken language is actively pursued by various companies and research institutions. It should be noted that, until recently, corpora have been based on written language because it is difficult to build a spoken language corpus. However, large-scale corpora focused on spoken language, represented by the Corpus of Spontaneous Japanese (CSJ), etc., have been built in recent years, and they are now widely used for language model training.
Incidentally, both the written-language and spoken-language corpora mentioned above are represented in standard language, and currently there are almost no comprehensive dialect corpora. For this reason, no language models directed to dialects have been created to date, and the method of creating them has been generally unknown.
However, dialects are made up of standard language vocabulary and vocabulary specific to the region where said dialects are used. In addition, a large portion of the vocabulary specific to the region can be paraphrased using the standard language vocabulary. It can also be said that the standard language vocabulary (and phrases) can be transformed into a different, dialect-containing vocabulary (and phrases).
Thus, when a language model for the task in question (target task) cannot be created, it is contemplated to utilize methods, whereby a language model for the target task is created using text data related to a generic task that is different from the target task (e.g. see Patent Document 1). Specifically, assuming that the generic task is standard language and the target task is dialect, it is believed that a language model directed to a dialect can be created by practicing the language model creation method disclosed in Patent Document 1.
Here, FIG. 17 is used to describe a language model training device (language model creation device) that performs the language model creation method disclosed in Patent Document 1. FIG. 17 is a block diagram illustrating the configuration of a conventional language model training device. The language model training device illustrated in FIG. 17 is the language model training device disclosed in Patent Document 1.
As shown in FIG. 17, the language model training device is made up of a target task language data storage section 101, generic task language data storage section 102, similar word pair extracting means 103, similar word string combining means 104, and language model generating means 105. The target task language data storage section 101 holds text data for the target task. The generic task language data storage section 102 holds text data for generic tasks including tasks different from the target task.
A conventional language model training device of this configuration, which is illustrated in FIG. 17, operates in the following manner. First of all, the similar word pair extracting means 103, similar word string combining means 104, and language model generating means 105 read the respective data used for language model training from the target task language data storage section 101 and generic task language data storage section 102.
Next, the similar word pair extracting means 103 uses a pre-defined distance measure to calculate an interword distance for an arbitrary combination of words contained in the data read from these storage sections. Either cross-entropy or the Euclidean distance of n-gram occurrence probability can be used as the interword distance. Then, if the computed value of this interword distance is smaller than a pre-configured value, the similar word pair extracting means 103 sends this pair of similar words to the similar word string combining means 104. It should be noted that, in the discussion below, within the pairs of similar words, words contained in the target task text data are designated as WT and words contained in the generic task text data are designated as WG.
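The similar-pair extraction step can be sketched as follows, using the Euclidean distance of n-gram (here, bigram successor) occurrence probabilities as the interword distance. The function names, the threshold value, and the toy word lists are illustrative assumptions, not the exact procedure of Patent Document 1.

```python
import math
from collections import defaultdict

def successor_distribution(sentences, word):
    """Estimate P(next | word) from bigram counts — the word's context profile."""
    counts = defaultdict(int)
    total = 0
    for words in sentences:
        for i in range(len(words) - 1):
            if words[i] == word:
                counts[words[i + 1]] += 1
                total += 1
    return {w: c / total for w, c in counts.items()} if total else {}

def euclidean_distance(p, q):
    """Euclidean distance between two sparse probability distributions."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))

def extract_similar_pairs(target_words, generic_words,
                          target_data, generic_data, threshold):
    """Return pairs (WT, WG) whose distributions lie within the threshold."""
    pairs = []
    for wt in target_words:
        pt = successor_distribution(target_data, wt)
        for wg in generic_words:
            pg = successor_distribution(generic_data, wg)
            if euclidean_distance(pt, pg) < threshold:
                pairs.append((wt, wg))
    return pairs

pairs = extract_similar_pairs(["ore"], ["watashi"],
                              [["ore", "wa", "iku"]],
                              [["watashi", "wa", "iku"]],
                              threshold=0.5)
print(pairs)  # [('ore', 'watashi')]
```

Here the two pronouns are judged similar because they occur in identical following contexts, which is the intuition behind distribution-based interword distances.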
Next, the similar word string combining means 104 retrieves the respective word strings of arbitrary length stored in the target task language data storage section 101 and the generic task language data storage section 102. The similar word string combining means 104 then looks up the pairs of similar words (WT, WG) received from the similar word pair extracting means 103 and decides whether or not the word WG, which belongs to the generic task, is contained in the target task word strings.
Then, if the generic task word WG is contained in the target task word strings, the similar word string combining means 104 replaces the generic task word WG with the target task word WT in those word strings. Furthermore, the similar word string combining means 104 decides whether or not the substituted word strings are present in the generic task or target task language data and sends the substituted word strings to the language model generating means 105 if they are not present.
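The substitution step above can be sketched as follows. The word strings, the similar pair, and the set of already-present strings are illustrative; only strings not already found in the language data are emitted, mirroring the check performed before sending strings onward.

```python
def substitute_word_strings(word_strings, similar_pairs, existing_strings):
    """For each word string containing a generic task word WG, produce a new
    string with WG replaced by its similar target task word WT; keep only
    strings not already present in the language data."""
    new_strings = []
    for ws in word_strings:
        for wt, wg in similar_pairs:
            if wg in ws:
                candidate = tuple(wt if w == wg else w for w in ws)
                if candidate not in existing_strings and candidate not in new_strings:
                    new_strings.append(candidate)
    return new_strings

word_strings = [("watashi", "wa", "iku")]
similar = [("ore", "watashi")]  # (WT, WG)
print(substitute_word_strings(word_strings, similar, set(word_strings)))
# [('ore', 'wa', 'iku')]
```

The duplicate check matters: without it, substituted strings that already exist in either corpus would be counted twice during language model training and skew the n-gram statistics.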
Finally, the language model generating means 105 creates a language model using text data contained in the target task language data storage section 101, text data contained in the generic task language data storage section 102, and word string data sent from the similar word string combining means 104.
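The final step can be sketched as pooling the three data sources before handing the combined corpus to an n-gram training routine. The training routine is passed in as a parameter here, and all names are illustrative.

```python
def build_language_model(target_data, generic_data, substituted_strings, train_fn):
    """Pool target task text, generic task text, and the substituted word
    strings, then hand the combined corpus to a training routine."""
    combined = (list(target_data) + list(generic_data)
                + [list(s) for s in substituted_strings])
    return train_fn(combined)

# With an identity "trainer", the pooled corpus is visible directly.
corpus = build_language_model([["a", "b"]], [["c"]], [("d", "e")], lambda c: c)
print(corpus)  # [['a', 'b'], ['c'], ['d', 'e']]
```

Any n-gram counting routine can serve as `train_fn`; the point of the step is simply that the substituted strings are treated as additional training text alongside both original corpora.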
It is believed that the language model training device illustrated in FIG. 17 enables the creation of a language model directed to a dialect when dialectal text data is held in the target task language data storage section 101 and standard language text data is held in the generic task language data storage section 102.