The technology for recognizing registered words unique to a particular person is generally called specific speaker speech recognition. In the specific speaker speech recognition, a particular person first registers his or her own utterances of the words he or she wants recognized. Specifically, this task involves converting speech samples of the words, which the speaker produces in advance by uttering them, into sequences of feature parameters (called templates) and storing each sequence together with its word label in a storage device such as a memory or a hard disk. Known methods of converting speech samples into a sequence of feature parameters include cepstrum analysis and linear prediction analysis. They are detailed in “Digital Signal Processing of Speech/Sound Information” (by K. Kano, T. Nakamura and S. Ise, published by Shokodo). The specific speaker speech recognition matches the feature parameter sequence converted from the input speech against the feature parameter sequences stored in the storage device and outputs, as the recognition result, the word label whose feature parameter sequence is most similar to the one converted from the input speech.
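By way of illustration only, the registration step described above might be sketched as follows. The sketch assumes cepstrum analysis as the feature extractor and computes a real cepstrum per frame with a naive discrete Fourier transform. The function names (dft, real_cepstrum, to_template) and the frame length, hop size and number of coefficients are hypothetical choices made for this example and are not taken from the cited reference.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(frame, floor=1e-12):
    """Real cepstrum: inverse DFT of the log magnitude spectrum of one frame."""
    n = len(frame)
    # Floor the magnitude so that log() never receives zero.
    log_mag = [math.log(max(abs(v), floor)) for v in dft(frame)]
    # For a real frame the log magnitude spectrum is real and even,
    # so the inverse DFT is real; keep only the real part.
    return [sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def to_template(samples, frame_len=8, hop=4, n_coeffs=4):
    """Cut samples into overlapping frames and keep the first
    n_coeffs cepstral coefficients of each frame (one template)."""
    starts = range(0, len(samples) - frame_len + 1, hop)
    return [real_cepstrum(samples[i:i + frame_len])[:n_coeffs] for i in starts]

# Registration: store each template together with its word label.
# The label and the synthetic samples below are placeholders.
samples = [math.sin(0.7 * i) + 1.5 for i in range(16)]
storage = {"registered_word_1": to_template(samples)}
```

A practical front end would typically add windowing, a fast Fourier transform and a perceptually motivated frequency scale; these are omitted here to keep the sketch short.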
A widely used method of comparing the feature parameter sequence stored in the storage device and the feature parameter sequence converted from the input speech is dynamic time warping (DTW) based on dynamic programming. This method is detailed in the “Digital Signal Processing of Speech/Sound Information.”
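The matching step described above may be sketched as follows, purely as an illustration. The sketch assumes a Euclidean distance between frames and the basic symmetric step pattern (horizontal, vertical and diagonal moves with equal weight); other local constraints are equally possible. The names frame_distance, dtw_distance and recognize, and the example labels, are hypothetical.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature parameter vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b):
    """Accumulated distance of the best time alignment of two sequences,
    computed by dynamic programming."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning the first i frames
    # of seq_a with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_b frame repeated
                                 cost[i][j - 1],      # seq_a frame repeated
                                 cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]

def recognize(input_seq, templates):
    """Return the word label whose stored template is closest to the input."""
    return min(templates, key=lambda label: dtw_distance(input_seq, templates[label]))

# Placeholder templates with one-dimensional features.
templates = {
    "title_a": [[0.0], [1.0], [2.0]],
    "title_b": [[5.0], [5.0], [5.0]],
}
result = recognize([[0.0], [0.0], [1.0], [2.0]], templates)
```

In this example the input is a time-stretched version of the first template, so the warping absorbs the repeated frame and the first label is selected.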
The technology for recognizing generic words common to unspecified persons is generally called unspecified speaker speech recognition. In the unspecified speaker speech recognition, information on feature parameters of generic words common to unspecified speakers is stored in advance in a storage device, and thus there is no need to register the speech of the words the user wants recognized, as is required in the specific speaker speech recognition. Known methods of converting speech samples into a sequence of feature parameters include cepstrum analysis and linear prediction analysis, as in the specific speaker speech recognition. Generating the information on feature parameters of generic words common to unspecified speakers, and comparing this information with the feature parameter sequence converted from the input speech, are generally performed by a method using a Hidden Markov Model (HMM).
The unspecified speaker speech recognition is also detailed in the “Digital Signal Processing of Speech/Sound Information.” In the case of the Japanese language, for example, it is assumed that each speech unit is composed of phonemes from the phoneme set described in chapter 2 of the “Digital Signal Processing of Speech/Sound Information”, and that each individual phoneme is modeled by an HMM. Table 1 lists the labels of the phoneme set.
TABLE 1
Vowel: a, i, u, e, o
Fricative: f, z, s, zh, sh, h
Plosive-fricative: dz, ts, dh, ch
Plosive: b, p, d, t, g, k
Half-vowel: w, r, y
Nasal: m, n, ng
The pronunciation of “CD”, for instance, can be modeled with a network of phoneme labels common to speakers (referred to as a generic word label sequence), as shown in FIG. 2A.
The pronunciation of “MD”, for instance, can be modeled with the generic word label sequence shown in FIG. 2B. By preparing HMM-based phoneme model data and generic word label sequences, a person skilled in the art can construct an unspecified speaker speech recognizer using the Viterbi algorithm, which is described in chapter 4 of the “Digital Signal Processing of Speech/Sound Information.”
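As a hedged illustration of the Viterbi algorithm mentioned above, the following sketch finds the most likely state sequence of a small discrete HMM in the log domain. The two states, the observation symbols and all probabilities are invented for this example; they do not correspond to the phoneme models or to the label sequences of FIG. 2A and FIG. 2B. A practical recognizer would use continuous feature vectors and one model per phoneme.

```python
import math

def log(p):
    """Logarithm that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (log probability, state path) of the most likely state sequence."""
    # best[t][s]: log probability of the best path that ends in state s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the back pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[-1][last], path

# Hypothetical left-to-right model with two states and two symbols.
states = ["p1", "p2"]
log_start = {"p1": log(1.0), "p2": log(0.0)}
log_trans = {"p1": {"p1": log(0.5), "p2": log(0.5)},
             "p2": {"p1": log(0.0), "p2": log(1.0)}}
log_emit = {"p1": {"x": log(0.9), "y": log(0.1)},
            "p2": {"x": log(0.1), "y": log(0.9)}}

score, path = viterbi(["x", "x", "y", "y"], states, log_start, log_trans, log_emit)
```

Working in the log domain turns the products of probabilities into sums, which avoids numerical underflow on long observation sequences.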
In a speech recognizer, there is a need for a function to recognize a mixed vocabulary made up of registered words unique to a particular speaker and generic words common to unspecified speakers. For example, in car audio equipment, there is a need to control such devices as “CD” and “MD” with voice commands for safety reasons. Because these device names are the same for all users, this requirement can be met by the unspecified speaker speech recognition technology, eliminating the registration process required by the specific speaker speech recognition technology. This is advantageous in terms of user interface.
There is also a need for a capability to select and play a desired CD from among a plurality of CDs inserted in a CD changer. In this case, the titles and singer names of the CDs inserted in the CD changer are considered to differ depending on the user. Thus, the specific speaker speech recognition technology, rather than the conventional unspecified speaker speech recognition, must be applied. That is, the user needs to register by voice, in advance, the titles and singer names of the CDs to be inserted in the CD changer. If speech recognition can be performed on a mixed vocabulary consisting of device names such as “CD” or “MD” and CD titles and singer names, there is no need to switch between a mode that can recognize the generic words common to unspecified speakers, such as “CD” or “MD”, and a mode that can recognize the registered words unique to a particular speaker, such as CD titles and singer names. This would provide a user-friendly speech recognition function.
The specific speaker speech recognition has mostly used a DTW-based technique, and the unspecified speaker speech recognition an HMM-based technique. One possible solution to the needs described above may be to combine the DTW-based specific speaker speech recognition and the HMM-based unspecified speaker speech recognition. However, the measures that these two methods use in matching the parameter sequence of the input speech against the information on the parameter sequences of the vocabulary stored in a storage device generally differ from each other. Hence, it is not easy to decide which of two candidate words is closer to the input speech: the registered word unique to a particular speaker that the DTW-based specific speaker speech recognition determines to be closest to the input speech, or the generic word common to unspecified speakers that the HMM-based unspecified speaker speech recognition determines to be closest to the input speech.
With the DTW-based specific speaker speech recognition, it is also possible to realize the unspecified speaker speech recognition by collecting utterances of each generic word from a plurality of speakers and storing a plurality of templates for that word. Using the DTW in this way can meet the above-described needs. This method, however, has the drawbacks that the plurality of templates for each generic word takes up extra storage space in the storage device, that the time taken by the DTW to reference the plurality of templates increases, and that speech samples need to be collected from a large number of speakers whenever the generic words are to be changed.
To summarize, when the speech recognizer is mounted on car audio equipment, for example, the use of the unspecified speaker speech recognizer is advantageous for the manufacturer because there is no need to register a large number of speech samples of the user, but it gives the user the disadvantage that the recognition accuracy is slightly lower than that of the specific speaker speech recognizer.
Although the specific speaker speech recognizer has a higher recognition accuracy, it is extremely difficult for the manufacturer to extract feature parameters from each individual user's speech samples and store them in the speech recognizer in advance. If the user registers his or her own speech, recording many words is very burdensome.
Further, because the conventional method used for the specific speaker speech recognition and the method used for the unspecified speaker speech recognition differ in kind and nature, incorporating these two speech recognition methods into a single apparatus results in an increased size of the apparatus.