1. Field of Invention
This invention relates to a speech recognition method, a speech recognition device, and a recording medium on which is recorded a speech recognition processing program, in which improved recognition capability is achieved by enabling speaker adaptation processing in which the speaker can be registered with respect to a specified word among recognizable words.
2. Description of Related Art
Electronic devices which use speech recognition technology have recently come into use in various fields. One example is a so-called sound clock. In this sound clock, a current time and an alarm time can be set by sound, and the sound clock can inform a user of the current time by sound.
This type of sound clock can be used as a toy for children in addition to being used as a daily necessity. It is desired that the cost of the device itself be as low as possible. Because of this, the CPU processing capability and memory capacity which can be used are severely limited. One of the problems to be solved is to provide functions with high capability under these limitations.
Conventionally, many devices which use this type of speech recognition can perform speech recognition for a non-specific speaker, but in order to do so, a large amount of standard speaker sound model data is needed, and a large-capacity ROM and a CPU with high processing ability are needed. As a result, the cost ends up being high.
Furthermore, even for non-specific speakers, the age range and gender of the likely users are limited to a certain degree, depending upon the type of the device. As a result, standard speaker model data which is limited to a certain area is acceptable. Because of this, even if a large amount of standard speaker sound model data is provided, much of it is wasted. Additionally, there was a problem with the recognition percentage, because such data corresponds to a wide variety of non-specific speakers only in an averaged manner.
In order to solve this problem, relatively inexpensive speech recognition LSIs exist which enable non-specific speaker recognition with respect to a plurality of recognizable words which are prepared in advance, and which, at the same time, provide a registration type of speech recognition in which the sound of a specific speaker is registered for that speaker.
In this type of speech recognition LSI, a word which is prepared in advance can certainly be recognized even for a sound of a non-specific speaker. Furthermore, sound data of a specific speaker can be registered for the specific speaker. Therefore, it is thought that speech recognition with high capability can be realized for a wide variety of speakers.
However, with respect to this type of conventional speech recognition LSI, recognition can be performed at a high recognition percentage for speech recognition for a registered specific speaker, but if the gender and/or age range of speakers is wide, the recognition percentage can be significantly decreased in general with respect to speech recognition for non-specific speakers.
Furthermore, in order to improve the recognition percentage for a non-specific speaker, there are devices in which a speaker speaks several dozen words and speaker adaptation is performed based upon this sound data.
However, in general, the speaker adaptation function can often be applied only to a device with a CPU with high processing capability and a large-capacity memory. It is often difficult to apply this function to a device with severe limitations on the processing capability of the CPU and the memory capacity, such as toys and daily necessities for which low cost is strongly demanded.
Therefore, one aspect of this invention is to register sound data which is obtained as a speaker who uses the device speaks a specified word and, at the same time, to significantly improve the speech recognition percentage for that speaker (the recognition target speaker) by performing speaker adaptation using this registration data and the standard speaker sound model data.
In order to achieve this aspect, the speech recognition method and apparatus of this invention may have standard speaker sound model data which has been created from sound data of a plurality of non-specific speakers, and can recognize a plurality of predetermined words. Among the plurality of words which can be recognized, several words are selected as registration words. The recognition target speaker speaks the respective registration words, and registration word data is created and saved for the respective registration words from the sound data. When the registration words are spoken by the recognition target speaker, speech recognition is performed by using the registration word data. When other recognizable words are spoken, speech recognition is performed using the standard speaker sound model data.
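As a purely illustrative sketch, and not the disclosed implementation, the division between registration word data and standard speaker sound model data described above might be expressed as follows in Python. The single feature vector per word, the Euclidean distance scoring, and the names `Recognizer`, `register`, `reference_for`, and `recognize` are all assumptions introduced only for this example.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Recognizer:
    """Toy recognizer holding one reference feature vector per recognizable word."""

    def __init__(self, standard_models, registration_words):
        # word -> vector derived from many non-specific speakers (assumed format)
        self.standard_models = dict(standard_models)
        # Subset of the recognizable words that the target speaker may register.
        self.registration_words = set(registration_words)
        # word -> vector created from the target speaker's own utterance
        self.registration_data = {}

    def register(self, word, features):
        if word not in self.registration_words:
            raise ValueError(f"{word!r} is not a registration word")
        self.registration_data[word] = list(features)

    def reference_for(self, word):
        # Registered words are matched against the speaker's own data;
        # every other word falls back to the standard speaker model.
        return self.registration_data.get(word, self.standard_models[word])

    def recognize(self, features):
        return min(self.standard_models,
                   key=lambda w: distance(features, self.reference_for(w)))


# Hypothetical data: this speaker's "good morning" lies far from the standard model.
standard = {"good morning": [1.0, 1.0], "one": [5.0, 5.0], "two": [9.0, 1.0]}
rec = Recognizer(standard, registration_words=["good morning"])
utterance = [4.0, 4.5]
before = rec.recognize(utterance)        # matched against standard models only
rec.register("good morning", utterance)
after = rec.recognize(utterance)         # matched against the registered data
```

In this toy data the utterance is misrecognized before registration and correctly recognized afterwards, while unregistered words such as "two" continue to be matched against the standard model.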
In addition, the plurality of recognizable words may be divided according to the respective type of words and prepared as word sets which correspond to the respective divisions. The device is set to recognize words which belong to a given word set in the operating scene at a given point in time. The device determines which word set a word is to be input from at the current point in time and, based upon the determination result, recognizes the word which has been input in the scene.
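The scene-dependent restriction to one word set can be illustrated with the following minimal Python sketch. The scene names, the contents of `WORD_SETS`, and the assumption that a similarity score is already available for every recognizable word are hypothetical and serve only to show how limiting the candidates changes the result.

```python
# Hypothetical word sets, one per operating scene of the device.
WORD_SETS = {
    "hour": {"one", "two", "three"},
    "confirm": {"yes", "no"},
}


def recognize_in_scene(scene, scores):
    """Pick the best-scoring word among those allowed in the current scene.

    scores: word -> similarity (higher is better) for all recognizable words.
    """
    candidates = WORD_SETS[scene]
    return max(candidates, key=lambda w: scores[w])


# Assumed similarity scores for a single input utterance.
scores = {"one": 0.4, "two": 0.7, "three": 0.2, "yes": 0.9, "no": 0.1}
hour_result = recognize_in_scene("hour", scores)
confirm_result = recognize_in_scene("confirm", scores)
```

Although "yes" has the highest overall score in this example, it is never a candidate while the device is in the "hour" scene, so it cannot be confused with a word of that set.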
Furthermore, it is also acceptable to focus the recognition target speaker into an area which is set in advance based upon age and gender, create specific speaker group sound model data from the sound data of a plurality of non-specific speakers which belong to the area, and save this as the standard speaker group sound model data.
The recognition target speakers can include a plurality of speaker groups based upon the characteristics of the sound. The specific speaker group sound model data can also include specific speaker group sound model data corresponding to the plurality of speaker groups which have been created from the sound data of a plurality of non-specific speakers which belong to the respective speaker groups.
In addition, speaker learning processing may be performed using the registration word data, the standard speaker sound model data, and the specific speaker group sound model data, such that when a recognizable word other than one of the registration words is recognized, adaptation processing can be performed using the post-speaker learning data, and speech recognition is performed.
Additionally, the speaker learning processing may create an inputting speaker code book using a code book mapping method and any of the code books which have been created based upon the standard speaker sound model data or the specific speaker group sound model data. Furthermore, by using a universal code book, the inputting speaker code book may be vector-quantized, and a quantized inputting speaker code book may be created.
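The two steps named above, code book mapping followed by vector quantization with a universal code book, can be sketched in simplified form as follows. This is a sketch under stated assumptions and not the disclosed procedure: it assumes that reference frames and inputting speaker frames have already been time-aligned in pairs, that each mapped code vector is the mean of the speaker frames aligned to it, and that code vectors with no aligned frames are left unchanged. The function names are hypothetical.

```python
def nearest(vec, codebook):
    """Index of the code vector in `codebook` closest to `vec` (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])))


def map_codebook(ref_codebook, aligned_pairs):
    """Build an inputting speaker code book from a reference code book.

    aligned_pairs: list of (reference_frame, speaker_frame), assumed to be
    time-aligned already (for example by DP matching of a registration word).
    """
    buckets = {i: [] for i in range(len(ref_codebook))}
    for ref_frame, spk_frame in aligned_pairs:
        buckets[nearest(ref_frame, ref_codebook)].append(spk_frame)

    mapped = []
    for i, ref_vec in enumerate(ref_codebook):
        frames = buckets[i]
        if frames:
            # Mean of the speaker frames that correspond to this code vector.
            mapped.append([sum(f[d] for f in frames) / len(frames)
                           for d in range(len(ref_vec))])
        else:
            # Assumption: code vectors never observed are kept as they are.
            mapped.append(list(ref_vec))
    return mapped


def quantize_codebook(speaker_codebook, universal_codebook):
    """Replace each speaker code vector by the index of the nearest universal one."""
    return [nearest(v, universal_codebook) for v in speaker_codebook]


# Hypothetical two-dimensional data.
ref_codebook = [[0.0, 0.0], [10.0, 10.0]]
aligned_pairs = [([0.5, 0.0], [2.0, 2.0]), ([0.0, 0.5], [4.0, 2.0])]
universal = [[0.0, 0.0], [3.0, 3.0], [6.0, 6.0], [9.0, 9.0], [12.0, 12.0]]

speaker_codebook = map_codebook(ref_codebook, aligned_pairs)
quantized = quantize_codebook(speaker_codebook, universal)
```

After quantization, only one small index per code vector needs to be kept for the speaker, since the universal code book itself is shared and can be held in read-only storage.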
Furthermore, the speech recognition device of this invention may have standard speaker sound model data which has been created from the sound data of a plurality of non-specific speakers, and which can recognize a predetermined plurality of words. The speech recognition part has at least a sound analysis unit that analyzes sound which has been obtained as a speaker speaks, several words among the plurality of recognizable words being selected as registration words; registration word data which has been created for the respective registration words from sound data which has been obtained by having the recognition target speaker speak the respective registration words; and a controller which, when one of the registration words is spoken by the recognition target speaker, performs speech recognition using the registration word data, and which, when recognizable words other than the registration words are spoken, performs speech recognition using the standard speaker sound model data.
In this type of speech recognition device, the plurality of recognizable words may be divided according to the respective word type, and are prepared as word sets corresponding to the respective divisions. The device is set to recognize words in a given word set in the operating scene at that point in time. It is determined which word set a word is input from at the current point in time. Recognition of words input in the scene is performed based upon the determination result.
Furthermore, it is also acceptable to focus the recognition target speaker into an area which is set in advance, based upon age and gender, create specific speaker group sound model data from the sound data of a plurality of non-specific speakers which belong to the area, and save this as the standard speaker group sound model data.
In addition, the recognition target speakers can include a plurality of speaker groups, based upon the characteristics of the sound. The specific speaker group sound model data can also include specific speaker group sound model data corresponding to the plurality of speaker groups which have been created from the sound data of a plurality of non-specific speakers which belong to the respective speaker groups.
Furthermore, in the speech recognition device of this invention, the speaker learning processing may be performed using the registration word data, the standard speaker sound model data, and the specific speaker group sound model data. When recognizable words other than the registration words are recognized, speaker adaptation is performed using the post-speaker learning data, and speech recognition is performed.
Additionally, the speaker learning processing may create an inputting speaker code book using a code book mapping method and any of the code books which have been created based upon the standard speaker sound model data or the specific speaker group sound model data. By using a universal code book, the inputting speaker code book may be vector-quantized, and a quantized inputting speaker code book may be created.
Furthermore, a recording medium on which is recorded the speech recognition processing program of this invention may have standard speaker sound model data which has been created from the sound data of a plurality of non-specific speakers, and is a recording medium on which is recorded a speech recognition processing program which can recognize a plurality of predetermined words. The processing program may include a procedure which, with respect to several words which are selected as registration words among the plurality of recognizable words, creates and saves registration word data for the respective registration words from the sound data which has been obtained as the recognition target speaker speaks, and a procedure which, when a registration word is spoken by the recognition target speaker, performs speech recognition using the registration word data, and which, when a recognizable word other than the registration words is spoken, recognizes the sound using the standard speaker sound model data.
The plurality of recognizable words may be divided according to the respective word type, and are prepared as word sets corresponding to the respective divisions. The device is set to recognize words in a given word set in the operating scene at that point in time. It is determined which word set a word is input from at the current point in time. Recognition of words input in the scene is performed based upon the determination result.
In addition, it is also acceptable to include a procedure that focuses the recognition target speaker into an area based upon age and gender, to create specific speaker group sound model data from the sound data of a plurality of non-specific speakers which belong to the area, and save this as the standard speaker group sound model data.
Furthermore, in the procedure where the specific speaker group sound model data is created and saved as the standard speaker group sound model data, the recognition target speakers may include a plurality of speaker groups based upon the characteristics of sound, and for specific speaker group sound model data, specific speaker group sound model data may be created corresponding to the plurality of speaker groups from sound data from a plurality of non-specific speakers which belong to the respective speaker groups.
In addition, the program has a procedure which performs speaker learning processing using the registration word data and the standard speaker sound model data or the specific speaker group sound model data. When a recognizable word other than the registration words is recognized, speaker adaptation is performed using post-speaker learning data, and speech recognition is performed.
Thus, this invention may select several words as registration words from among a plurality of recognizable words which are prepared in advance, and may create and save registration word data for the respective registration words from the sound data obtained when the speaker who is the recognition target speaks the respective registration words. Words which are frequently used are mainly selected as the registration words. A word which is frequently used has a high possibility of being spoken by a speaker in various situations. For example, a frequently used word is often input from a position distant from the device. Even if a word is thus input from a position distant from the device, correct recognition is required.
Therefore, for example, if several frequently used words among the plurality of recognizable words are registered as registration words, it is possible to improve recognition capability for these words, and the recognition percentage over all the recognizable words can be improved. As a result, the user can use the device conveniently.
Furthermore, in this invention, according to the operation scene at a given point in time, the device may determine which word set words should be input from, and recognition of a word which has been input at the current time can be performed based upon the determination result. For example, when a plurality of registration words belong to the same word set, the scene in which these words will be input is determined for that word set. Therefore, a high recognition percentage can be obtained because it is only necessary to perform recognition processing for the words within the word set.
Additionally, by thus using several words as registration words and creating registration word data corresponding to the speaker, speaker learning processing can be performed using registration word data, standard speaker sound model data, or specific speaker group sound model data. By so doing, even for recognizable words other than registration words, it becomes possible to perform speaker adaptation and speech recognition using post-speaker learning data at the time of recognition. The recognition percentage can be significantly improved for all of the recognizable words, not just the registration words.
The speaker adaptation processing of this invention includes processing which creates an inputting speaker code book using the code book mapping method and any of the code books which have been created based upon standard speaker sound model data or specific speaker group sound model data, and which vector-quantizes this inputting speaker code book using a universal code book which has been created from a wide variety of speakers, thereby creating a quantized inputting speaker code book. When recognition is performed, speaker learning processing is performed using this quantized inputting speaker code book.
Thus, speaker adaptation processing is possible using the quantized inputting speaker code books, in which the data amount has been significantly reduced. Therefore, it is possible to minimize the memory (RAM) in which these are saved. Furthermore, because it is possible to significantly reduce the calculation amount which is needed for the recognition processing, it is possible to significantly reduce the processing burden on the controller (CPU), and a CPU with low processing capability is acceptable.
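The reduction in data amount can be made concrete with simple arithmetic. The following Python sketch compares the storage needed for a code book held as full vectors with the storage needed when each vector is replaced by an index into a universal code book. All sizes used here (256 code vectors, 10 dimensions, 2 bytes per value, a universal code book of 1024 entries) are hypothetical values chosen only for illustration and are not taken from this description.

```python
def raw_codebook_bytes(num_vectors, dim, bytes_per_value):
    """Storage for a code book kept as full feature vectors."""
    return num_vectors * dim * bytes_per_value


def quantized_codebook_bytes(num_vectors, universal_size):
    """Storage when each vector is replaced by an index into a universal code book."""
    # Bits needed to address any entry of the universal code book.
    bits_per_index = max(1, (universal_size - 1).bit_length())
    # Round the packed bit count up to whole bytes.
    return (num_vectors * bits_per_index + 7) // 8


# Hypothetical sizes, for illustration only.
raw = raw_codebook_bytes(num_vectors=256, dim=10, bytes_per_value=2)
packed = quantized_codebook_bytes(num_vectors=256, universal_size=1024)
```

With these assumed sizes, the per-speaker data shrinks from 5120 bytes to 320 bytes. The universal code book itself still has to be stored, but it does not depend on the speaker.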
Furthermore, in this invention, the speaker who is a recognition target is focused into an area which is set in advance based upon age and gender. Specific speaker group sound model data is created and saved from the sound data of a plurality of non-specific speakers which belong to the area, and sound which has been input by the speaker who is the recognition target can be recognized using this specific speaker group sound model data.
This is effective when the user of this device is limited to a certain area, such as predominantly children or predominantly women.
When the speaker who is the recognition target can be thus focused into a certain area, even if standard speaker sound model data which can correspond to a speaker in any area is used, there is still a lot of waste, and not much recognition accuracy can be expected. Therefore, a speaker who is the recognition target among non-specific speakers is focused into an area which is set in advance based upon age and gender. Specific speaker group sound model data is created from the sound data of a plurality of speakers who belong to the area, and by using this specific speaker group sound model data, sound which has been input by the speaker who is the speech recognition target is recognized.
By so doing, because it is acceptable to provide specific speaker group sound model data in response to speaker groups in a certain area, the data amount can be significantly reduced, compared to the conventional standard speaker sound model data which has been created corresponding to a variety of speakers. Because of this, the memory capacity of a memory which saves the data can be minimized, and the burden of the recognition processing on the CPU can be reduced. In addition, because it is sound model data corresponding to specific speaker groups, the recognition capability can be significantly improved.
Additionally, it is also possible to prepare several sets of this type of specific speaker group sound model data, each corresponding to a speaker group in a certain area. For example, sound model data can be prepared for several speaker groups, such as male adults, female adults, and children. This is effective when one device is used by a family.
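How one of several prepared speaker group models might be chosen for a given user can be sketched as follows. This description does not specify the selection method at this point, so the approach shown, choosing the group whose model lies closest to the speaker's registered utterances, is an assumption made only for illustration, as are the function names and the data.

```python
def total_distance(registration_data, group_model):
    """Sum of squared distances between registered utterances and one group's model."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(features, group_model[word]))
        for word, features in registration_data.items()
    )


def select_speaker_group(registration_data, group_models):
    """Return the name of the speaker group whose model best fits the speaker."""
    return min(group_models,
               key=lambda g: total_distance(registration_data, group_models[g]))


# Hypothetical models: speaker group -> word -> reference feature vector.
group_models = {
    "male_adult": {"hello": [1.0, 1.0]},
    "female_adult": {"hello": [3.0, 3.0]},
    "child": {"hello": [6.0, 6.0]},
}
registration_data = {"hello": [5.5, 6.2]}  # assumed utterance of the current user
selected = select_speaker_group(registration_data, group_models)
```

A device could equally let the user choose the group explicitly. The sketch only shows that a few registration words can be enough to pick among a small number of group models.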
Thus, even if specific speaker group sound model data corresponding to several speaker groups is provided, if it is limited to a certain area, the size of the sound model data can be minimized compared to having standard speaker sound model data which has been created corresponding to speakers in all areas. Furthermore, because it is specific speaker group sound model data corresponding to the respective speaker groups, recognition capability can be significantly improved.