Generally, a technique for recognizing the speech of an unspecified person is called speaker independent speech recognition, and a technique for recognizing the speech of a specified person is called speaker dependent speech recognition.
In one method of speech recognition, the phonemes composing a word are defined as speech units, and speech is recognized using a speech model in which each phoneme is modeled with speech parameters. Taking the word “Hokkaido” as an example, a speech model of “Hokkaido” is created as a network of nine phonemes, “h”, “o”, “ts”, “k”, “a”, “i”, “d”, “o” and “u”, linked in series. To recognize another word such as “Aomori” or “Akita”, a speech model matching that word must be prepared. In the case of speaker independent speech recognition, the speech model is modeled with speech parameters common to many persons.
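The serial phoneme network described above can be sketched as follows. This is a minimal illustration only: the `PhonemeModel` class and the `build_word_model` function are hypothetical names introduced here, and the acoustic parameters of each phoneme are omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeModel:
    """Placeholder for one phoneme's speech model (parameters omitted)."""
    symbol: str


def build_word_model(phonemes: List[str]) -> List[PhonemeModel]:
    """Link phoneme models in series, in speaking order, to form a word model."""
    return [PhonemeModel(symbol) for symbol in phonemes]


# Phoneme sequence for "Hokkaido" exactly as listed in the text.
HOKKAIDO = ["h", "o", "ts", "k", "a", "i", "d", "o", "u"]

word_model = build_word_model(HOKKAIDO)
print(len(word_model))                            # 9
print("-".join(m.symbol for m in word_model))     # h-o-ts-k-a-i-d-o-u
```

A separate list of this kind would be needed for each additional word to be recognized.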
As a technique for speaker independent speech recognition using speech models of phonemes, the Hidden Markov Model (hereinafter referred to simply as HMM) is generally known, and is described in detail in, for example, “Digital Signal Processing of Speech/Sound Information” (under joint authorship of Kiyohiro Shikano, Tetsu Nakamura and Shiro Ise, SHOKODO CO., LTD.).
A method for speaker independent speech recognition by HMM will be briefly described with reference to FIGS. 7, 8A and 8B. FIG. 7 shows a phoneme set with phonemes classified into predetermined sections. FIGS. 8A and 8B show a concept of a speech model modeled with a network of phonemes linked in series.
According to HMM, in the case of the Japanese language, a word is first composed as a network of phonemes linked in series, using phonemes drawn from the vowels, fricatives, affricates, plosives, semivowels and nasals shown in FIG. 7. A state transition matching the word is then created, and for each state, a transition probability representing the probability of making a transition to the next state and an output probability representing the probability of outputting a speech parameter when making that transition are specified, whereby a speech model is created. For example, the speech model for the word “Hokkaido” can be modeled as a network of nine phonemes linked in series in the order of speaking, as shown in FIG. 8A. The state transition of the HMM of each phoneme is shown in FIG. 8B.
Here, a(I,J) in FIG. 8B shows the transition probability from state I to state J; for example, a(1,1) in the figure shows the transition probability from state 1 to state 1.
Furthermore, b(I,x) shows the output probability in state I when the speech parameter x is obtained; for example, b(1,x) in the figure shows the output probability of state 1 when the speech parameter x is obtained.
Furthermore, p(I) in FIG. 8B shows the probability of state I, and is expressed by the following formula (1).
p(I) = max(p(I) × a(I,I), p(I−1) × a(I−1,I)) × b(I,x)  (1)
In the above formula (1), “max” is a function that selects the maximum value among its arguments.
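One update of formula (1) over all states of a left-to-right model can be sketched as below. The sketch assumes, as in the usual Viterbi reading, that the values of p on the right-hand side come from the previous frame and that the first state has no predecessor inside the model. The function name, the argument layout and the toy probabilities are illustrative assumptions, not values from the source.

```python
from typing import Callable, List, Sequence


def update_state_probabilities(
    p: Sequence[float],
    a_self: Sequence[float],
    a_next: Sequence[float],
    b: Callable[[int, float], float],
    x: float,
) -> List[float]:
    """Apply formula (1) once to every state of a left-to-right model.

    p[i]      : p(I) from the previous frame (index i corresponds to I - 1)
    a_self[i] : a(I, I), probability of staying in state I
    a_next[i] : a(I, I+1), probability of moving on to the next state
    b(i, x)   : output probability of state I for speech parameter x
    """
    updated = []
    for i in range(len(p)):
        stay = p[i] * a_self[i]
        # Assumption: the first state cannot be entered from a previous state.
        enter = p[i - 1] * a_next[i - 1] if i > 0 else 0.0
        updated.append(max(stay, enter) * b(i, x))
    return updated


# Toy values chosen only for illustration.
p0 = [1.0, 0.0, 0.0]
a_self = [0.75, 0.75, 0.75]
a_next = [0.25, 0.25, 0.25]


def flat_output(state: int, x: float) -> float:
    """Hypothetical output probability: the same value for every state."""
    return 0.5


p1 = update_state_probabilities(p0, a_self, a_next, flat_output, x=0.0)
print(p1)  # [0.375, 0.125, 0.0]
```

The new values are written to a separate list so that every state is updated from the same previous-frame probabilities.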
Recognition of speech consisting of a relatively long word sequence, using a plurality of such speech models, will now be described in detail with reference to FIG. 9. One example is recognition of a word sequence in which the name of a prefecture and the name of a city, town or village are linked, as in an address. FIG. 9 shows the configuration of a speech model network 500.
As shown in FIG. 9, the speech model network 500 comprises a pose 502 that detects a silent portion (a pause) of input speech, a speech model group 504 grouping a plurality of speech models capable of recognizing speech of the names of prefectures, a speech model group 506 grouping a plurality of speech models capable of recognizing speech of the names of cities under prefectures, a speech model group 508 grouping a plurality of speech models capable of recognizing speech of the names of wards or towns under cities, a speech model group 510 grouping a plurality of speech models capable of recognizing speech of the names of districts under wards or towns, and a pose 512 that detects a silent portion of input speech.
The speech model group 504 groups speech models that correspond to prefectures and are capable of recognizing speech of the names of the prefectures, and is linked to the pose 502.
The speech model group 506 groups speech models that correspond to cities and are capable of recognizing speech of the names of the cities, and is linked to the speech models belonging to the speech model group 504. In the example of FIG. 9, the speech model group 506 grouping speech models capable of recognizing speech of the names of cities belonging to Kanagawa prefecture is linked to the one speech model in the speech model group 504 that is capable of recognizing speech of Kanagawa prefecture.
The speech model group 508 groups speech models that correspond to wards or towns and are capable of recognizing speech of the names of the wards or towns, and is linked to the speech models belonging to the speech model group 506. In the example of FIG. 9, the speech model group 508 grouping speech models capable of recognizing speech of the names of towns belonging to Fujisawa city is linked to the one speech model in the speech model group 506 that is capable of recognizing speech of Fujisawa city.
The speech model group 510 groups speech models that correspond to districts and are capable of recognizing speech of the names of the districts, and is linked to the speech models belonging to the speech model group 508. In the example of FIG. 9, the speech model group 510 grouping speech models capable of recognizing speech of the names of districts belonging to North ward is linked to the one speech model in the speech model group 508 that is capable of recognizing speech of North ward.
The pose 512 is linked to the speech model group 508 or speech model group 510.
Furthermore, in these link relationships, as a speech parameter is given, a change in occurrence probability is propagated in the order of the pose 502, the speech model group 504, the speech model group 506, the speech model group 508, the speech model group 510 and the pose 512, or in the order of the pose 502, the speech model group 504, the speech model group 506, the speech model group 508 and the pose 512.
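The link relationships above can be pictured as a directed graph from the pose 502 to the pose 512. The sketch below is a simplified illustration: it includes only the place names mentioned in the text (Kanagawa, Fujisawa, North ward), while "town A", "district A" and "district B" are hypothetical placeholders, and the exact links of FIG. 9 are assumed rather than reproduced.

```python
from typing import Dict, List

# Successor links between nodes of the network (simplified, partly hypothetical).
LINKS: Dict[str, List[str]] = {
    "pose_502": ["Kanagawa"],                     # to speech model group 504
    "Kanagawa": ["Fujisawa"],                     # to speech model group 506
    "Fujisawa": ["North ward", "town A"],         # to speech model group 508
    "North ward": ["district A", "district B"],   # to speech model group 510
    "town A": ["pose_512"],                       # group 508 linked directly to pose 512
    "district A": ["pose_512"],                   # group 510 linked to pose 512
    "district B": ["pose_512"],
    "pose_512": [],
}


def enumerate_paths(node: str, links: Dict[str, List[str]]) -> List[List[str]]:
    """Return every path from `node` to a node without successors."""
    successors = links[node]
    if not successors:
        return [[node]]
    return [[node] + rest for s in successors for rest in enumerate_paths(s, links)]


paths = enumerate_paths("pose_502", LINKS)
for path in paths:
    print(" -> ".join(path))
```

Each printed path corresponds to one of the two propagation orders described above: one passing through the speech model group 510 and one going from the speech model group 508 directly to the pose 512.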
In this way, for speaker independent speech recognition, a plurality of speech models are prepared in advance, and the speech models are placed in a memory such as a RAM to recognize speech.
In this method, however, the number of word combinations to be recognized increases explosively as the number of linked words increases. The memory capacity required for speech recognition processing by the Viterbi algorithm or the like increases accordingly, and in a built-in system such as a car navigation system, the memory capacity that the system must provide increases. For example, when place names in Japan are recognized, the number of words to be recognized is about 3500 in a speech model network capable of recognizing a word sequence having the name of a prefecture followed by the name of a city, town or village, whereas it exceeds one hundred thousand in a speech model network capable of recognizing a word sequence having the names of a prefecture and a city, town or village followed by the name of a ward, county or the like.
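The scale of this growth can be illustrated with rough arithmetic. In the sketch below, only the two word counts come from the text, with 100,000 used as a conservative stand-in for the larger one. The average number of phonemes per word, the number of states per phoneme and the bytes of working memory per state are purely hypothetical assumptions, so the resulting figures are illustrative and not measured values.

```python
def working_memory_bytes(
    num_words: int,
    phonemes_per_word: int = 8,    # hypothetical average
    states_per_phoneme: int = 3,   # hypothetical model topology
    bytes_per_state: int = 4,      # hypothetical: one 32-bit probability per state
) -> int:
    """Estimate working memory if every state of every word model is kept in RAM."""
    return num_words * phonemes_per_word * states_per_phoneme * bytes_per_state


small = working_memory_bytes(3_500)     # prefecture + city/town/village
large = working_memory_bytes(100_000)   # additionally followed by ward, county, etc.

print(small)                     # 336000
print(large)                     # 9600000
print(round(large / small, 1))   # 28.6
```

Under these assumptions the working memory grows in proportion to the number of words, which is why adding one more level of place names raises the requirement so sharply.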
The present invention has been made in view of these unsolved problems of the prior art, and has as its object the provision of a speech recognition device that can suitably be used to reduce the memory capacity required for speaker independent speech recognition.