1. Field of the Invention
This invention generally relates to voice recognition technologies, and, in particular, to a voice recognition system having a checking scheme for registration of reference voice patterns. More specifically, the present invention relates to a voice recognition system which checks the voice patterns of certain words whose ends are difficult to detect before having the voice patterns registered as reference data.
2. Description of the Prior Art
Various schemes have been developed for recognizing voice information. A typical prior art voice recognition system operates in two modes, i.e., a registration mode and a recognition mode. The system is first set in the registration mode, in which the user pronounces a plurality of words or characters and the pronounced sounds are stored in the form of voice patterns as reference data. The system is then set in the recognition mode, in which an unknown sound made by the user is converted into a voice pattern and compared with each of the reference patterns, thereby selecting the one reference pattern which is most similar to the input unknown sound. In this manner, the input unknown sound can be identified.
FIG. 1 shows a typical prior art sound or voice recognition system which includes a microphone 1 as a transducer for converting a sound or voice in the form of pressure waves into a corresponding electrical signal. The microphone 1 is connected to a feature extractor 2, where the voice electrical signal is subjected to a predetermined processing operation to extract a predetermined feature from it. Such a feature can be a time-frequency pattern if the voice signal is subjected to frequency analysis using a filter bank, or an LPC coefficient if the voice signal is subjected to LPC analysis. Thus, the feature to be extracted depends on the manner in which the voice signal is analyzed. The feature extractor 2 is connected to a common contact point S.sub.0 of a selection switch S, which can be connected to either one of a pair of contacts S.sub.1 and S.sub.2. When the system is to be set in the registration mode, the switch S is operated to establish a connection between the common contact S.sub.0 and the contact S.sub.1. Under this condition, the extracted feature of the voice signal is stored in a memory 3 as a reference pattern. This process is repeated as many times as desired to store a desired number of reference patterns in the memory 3.
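As an illustration of the kind of time-frequency pattern mentioned above, the following minimal sketch computes, for each frame of a sampled signal, the energy in a few equal-width frequency bands. It is not taken from the system of FIG. 1: the function names `band_energies` and `time_frequency_pattern`, the use of a naive discrete Fourier transform in place of an analog filter bank, the non-overlapping frames, and the equal-width bands are all assumptions made for the example.

```python
import cmath
import math


def band_energies(frame, n_bands):
    """Energy of one frame in n_bands equal-width bands (naive DFT).

    Assumes len(frame) // 2 >= n_bands. Bins left over when the number
    of bins is not a multiple of n_bands are ignored.
    """
    n = len(frame)
    half = n // 2  # bins 0 .. half-1, i.e. frequencies below Nyquist
    spectrum = [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))) ** 2
        for k in range(half)
    ]
    width = half // n_bands
    return [sum(spectrum[b * width:(b + 1) * width]) for b in range(n_bands)]


def time_frequency_pattern(samples, frame_len, n_bands):
    """One row of band energies per non-overlapping frame of the signal."""
    return [
        band_energies(samples[i:i + frame_len], n_bands)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


# Hypothetical input: a pure tone that falls in the second of four bands.
tone = [math.sin(2 * math.pi * 2 * t / 16) for t in range(32)]
pattern = time_frequency_pattern(tone, frame_len=16, n_bands=4)
```

Each row of `pattern` corresponds to one frame and each column to one band, so the pattern as a whole plays the role of the feature that would be stored in the memory 3 during registration.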
Then, the system of FIG. 1 is set in the recognition mode by operating the switch S to establish a connection between the common contact S.sub.0 and the contact S.sub.2. Under this condition, the user makes a sound, for example, by pronouncing a desired word, and this sound is fed into the microphone 1 to be converted into a corresponding unknown voice signal, which is then processed by the feature extractor 2 so that the feature of the unknown voice signal is extracted. The feature of this unknown voice signal is then compared at a matching unit 4 with each of the reference patterns stored in the memory 3, thereby selecting the one reference pattern which is most similar to the feature of the unknown voice signal as the most likely candidate for the unknown sound. Then, this selected reference pattern is supplied to an output unit 5 to complete the process of recognizing the input unknown sound.
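The two modes of operation described above can be sketched as follows. This is only a toy model under stated assumptions: the class name `VoiceRecognizer`, the representation of a feature as a fixed-length list of numbers, and the use of Euclidean distance as the similarity measure are all hypothetical, since the prior art system is not limited to any particular matching scheme.

```python
import math


class VoiceRecognizer:
    """Toy model of the memory 3 and the matching unit 4 of FIG. 1."""

    def __init__(self):
        # Reference patterns as (label, feature_vector) pairs.
        self.memory = []

    def register(self, label, feature):
        # Registration mode: store the extracted feature as a reference.
        self.memory.append((label, list(feature)))

    def recognize(self, feature):
        # Recognition mode: return the label of the closest reference.
        # Euclidean distance is an assumption made for this sketch.
        if not self.memory:
            raise ValueError("no reference patterns registered")
        best_label, _ = min(
            self.memory,
            key=lambda ref: math.dist(ref[1], feature),
        )
        return best_label


# Hypothetical feature vectors for two registered words.
recognizer = VoiceRecognizer()
recognizer.register("ichi", [1.0, 0.0, 0.2])
recognizer.register("roku", [0.0, 1.0, 0.8])
result = recognizer.recognize([0.9, 0.1, 0.3])  # closest to "ichi"
```

Because recognition can only return one of the stored references, any error made while registering a pattern carries over to every later comparison against it.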
In such a voice recognition system, it is very important to register reference data as accurately as possible. Otherwise, the rate of recognition cannot be improved no matter how advanced the matching scheme may be. Difficulty is often encountered in detecting the end of a voiced word. In particular, there are some words whose ends are difficult to detect when pronounced. For example, in the case of Japanese, most words end with a vowel, and the vowels /i/ and /u/ are difficult to detect because these vowels are often pronounced softly when placed at the end of a word, and, in some cases, they are almost lost when a word having either one of these vowels at the end is pronounced. Numeral "1" is pronounced "ichi" in Japanese, but the last vowel "i" is pronounced softly and often left out. Similarly, numeral "6" is pronounced "roku" and the last vowel "u" is pronounced very softly and is difficult to detect. The sound for the Japanese character /n/ is also pronounced softly and is often lost if it is placed at the end of a word. On the other hand, in the case of English, if a word ends with an explosive sound, such as "pink" or "stop", or with a particular combination of two or more characters, such as "ck" for "back", then the end of the voiced sound tends to be lost and cannot be detected. The English letters which represent explosive sounds include "p", "t", "k", "b", "d" and "g", and the combinations of English characters which are difficult to detect when placed at the end of a word include "ch", "ck" and "th."
This problem is described in more detail with reference to FIG. 12, which shows the voice power pattern for the word "pink." In FIG. 12, the ordinate represents voice power (energy) and the abscissa represents time. Thus, as the word "pink" is pronounced, its voice power level changes with time. The striking feature of a word such as "pink," which has an explosive sound at the end, is that the last sound element /k/ is isolated and rather short in duration. Thus, it is often the case that this last sound element /k/ is not clearly produced and is thus lost. If the voiced sound for "pink" is registered in the voice recognition system under such circumstances, the word "pink" may be registered as "pin" and not as "pink." A typical prior art approach to cope with this problem is to store both "pink" and "pin"; however, such an approach is not advantageous because it requires a large capacity memory for storing reference data, and this is all the more so in the case of English because many words end with a consonant which is pronounced independently.
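The loss of an isolated final sound element can be illustrated with a simple power-based end detector. The sketch below is an assumption made for illustration and is not a detector described for the prior art system: the function name `detect_utterance`, the fixed power threshold, the `max_gap` parameter, and the frame power values are all hypothetical.

```python
def detect_utterance(power, threshold, max_gap):
    """Return (start, end) frame indices of the utterance, end exclusive.

    The utterance starts at the first frame whose power is at or above
    `threshold` and is taken to have ended once more than `max_gap`
    consecutive frames fall below it. Returns None if no frame reaches
    the threshold.
    """
    start = None
    end = None
    gap = 0
    for i, p in enumerate(power):
        if p >= threshold:
            if start is None:
                start = i
            end = i + 1
            gap = 0
        elif start is not None:
            gap += 1
            if gap > max_gap:
                break
    return (start, end) if start is not None else None


# Hypothetical frame powers for "pink": frames 2-6 carry "pin", frames
# 7-9 are the silent interval before the burst, frame 10 is the /k/.
pink = [0, 0, 5, 8, 9, 7, 4, 0, 0, 0, 3, 0, 0]

short_wait = detect_utterance(pink, threshold=2, max_gap=2)  # (2, 7)
long_wait = detect_utterance(pink, threshold=2, max_gap=4)   # (2, 11)
```

With the shorter wait, the detector stops during the silent interval and the stored pattern covers only "pin"; with the longer wait, the burst of /k/ is included. A burst weaker than the threshold would be lost in either case.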
When an English word is pronounced with Japanese sounds, a vowel tends to be added at the end of the word. For example, if "pink" is pronounced with Japanese sounds, it would sound like "pinku," with the addition of "u" at the end. In this case, the explosive sound is not located at the end of the word, but at the second sound element from the end. It is thus believed that, in any language, there are at least some words which include, at the end or near the end of the word, one or more characters which are difficult to detect when pronounced. Therefore, there has been a need to develop a technology which can rectify this problem and thus allow reference data to be stored with high accuracy.