Workers in service industries such as commodity distribution and medical care want speech recognition to make their operations more efficient and to enable hands-free operation.
In these services in particular, it is often necessary to input character strings in which letters and numbers are mixed, such as product model numbers and product IDs. Hence, high speech recognition accuracy for letters and numbers, with few false recognitions, contributes remarkably to improving the efficiency of the services through speech recognition.
However, the utterance of a letter is particularly short, and many letters are pronounced similarly to one another. Accordingly, it is difficult to distinguish the letters from each other precisely.
For example, in the case of “C”, “E”, “T” and the like, most of the portion where the utterance energy is concentrated is the long vowel /iː/ at the end of the utterance, and it is difficult even for a human to distinguish among them.
In particular, in an environment where noise is always present, such as a warehouse or a factory, consonants are masked by the noise and become unclear, which makes recognizing letters even more difficult.
Hence, according to a conventional method, each letter is allocated an English word beginning with that letter, such as A: alpha, B: bravo, and C: Charlie, and the pronunciations of these words are registered in a speech recognition apparatus. The user utters the allocated English words to obtain the corresponding letters.
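The word-based conventional method can be illustrated with a minimal sketch. It assumes the speech recognition apparatus has already returned a list of recognized words. Only alpha, bravo, and Charlie are named in the text; the remaining entries follow the widely used NATO spelling alphabet and are an assumption, as are the names `SPELLING_WORDS` and `decode_spelling_words`.

```python
# Hypothetical word list: only "alpha", "bravo" and "charlie" come from the
# text; the other words follow the NATO spelling alphabet (an assumption).
WORDS = [
    "alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf",
    "hotel", "india", "juliett", "kilo", "lima", "mike", "november",
    "oscar", "papa", "quebec", "romeo", "sierra", "tango", "uniform",
    "victor", "whiskey", "xray", "yankee", "zulu",
]

# Each registered word maps to its initial letter.
SPELLING_WORDS = {word: word[0].upper() for word in WORDS}


def decode_spelling_words(recognized_words):
    """Convert a sequence of recognized words into the letters they stand for."""
    letters = []
    for word in recognized_words:
        key = word.lower()
        if key not in SPELLING_WORDS:
            raise ValueError(f"unregistered word: {word!r}")
        letters.append(SPELLING_WORDS[key])
    return "".join(letters)


if __name__ == "__main__":
    print(decode_spelling_words(["alpha", "delta", "Charlie"]))
```

In this sketch the user has to memorize one word per letter, and the decoder simply rejects any word that is not registered.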
In addition, a method for recognizing letters has been proposed in which the user utters each letter together with the letter that follows it in alphabetical order (see, for example, Patent Literature 1; hereinafter, Patent Literature is referred to as “PTL”).
According to this method, “ADC” is read as “AB DE CD”, for example.
The above method aims to improve the recognition rate over that of single-letter utterances by exploiting the fact that an utterance combining two letters carries a larger amount of acoustic features.
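The pairing scheme of this method can be sketched as follows, working on text rather than audio. The function names `successor`, `encode_pairs`, and `decode_pairs` are hypothetical, and the handling of “Z” (wrapping around to “A”) is an assumption, since the text does not say how the last letter is treated.

```python
import string

LETTERS = string.ascii_uppercase


def successor(letter):
    """Return the next letter in alphabetical order.

    Wrapping from "Z" to "A" is an assumption; the text does not cover it.
    """
    index = LETTERS.index(letter)
    return LETTERS[(index + 1) % len(LETTERS)]


def encode_pairs(text):
    """Build the reading of a string: each letter followed by its successor."""
    return " ".join(ch + successor(ch) for ch in text.upper())


def decode_pairs(spoken):
    """Recover the original string from space-separated letter pairs."""
    result = []
    for pair in spoken.upper().split():
        # A valid pair has two letters, the second being the successor of the first.
        if len(pair) != 2 or pair[0] not in LETTERS or pair[1] != successor(pair[0]):
            raise ValueError(f"invalid pair: {pair!r}")
        result.append(pair[0])
    return "".join(result)


if __name__ == "__main__":
    print(encode_pairs("ADC"))
    print(decode_pairs("AB DE CD"))
```

The second letter of each pair is redundant, so a decoder along these lines can also use it as a consistency check on the first letter.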