1. Field of the Disclosure
The present disclosure relates to speech recognition, and more particularly to a method and an apparatus for selecting a vocabulary closest to an input speech from vocabularies stored in memory.
2. Description of the Related Art
Generally, speech recognition can be defined as “a sequence of procedures for extracting phonological or linguistic information from acoustic information contained in voice and enabling a machine to recognize and process the extracted information.” Voice conversation is recognized to be the most natural and convenient way of exchanging a large amount of information between human beings and machines. However, there is a limitation in that, in order to use voice conversation in communication between human beings and machines, voice must be translated into a machine-comprehensible code. Such a procedure of translating voice into code is speech recognition.
Devices equipped with speech recognizers using speech recognition technology, for example, a computer, a Personal Digital Assistant (PDA) or an electronic home appliance, can receive commands by human voice without requiring a separate input device. For example, when desiring to purchase a movie ticket in advance, a user can obtain the desired result by simply speaking the movie title into a microphone, instead of clicking a mouse or pressing keys on a keyboard several times.
However, in order to implement a speech recognizer for recognizing 10,000 or more vocabularies, it is essential to reduce the required memory size and the number of calculations while maintaining the recognition rate. The reason for this is that portable devices are generally limited in memory size and Central Processing Unit (CPU) performance, and even the memory and CPU specifications of a fixed device cannot be increased without increasing the cost of the device.
Moreover, in a device having a speech recognizer therein, since hardware, an operating system and other software must be operated in addition to speech recognition, only limited memory is available for speech recognition. Thus, it is difficult to recognize large-scale vocabularies using a conventional scheme in such a device.
The following Table 1 shows required memory according to the number of vocabularies to be recognized in a conventional single pass speech recognition scheme, and Table 2 shows the ratios of portions, occupied by an acoustic model, a search network, a token, etc., to the entire required memory. Here, each value shown in Table 1 is in a unit of megabytes (Mbytes), and each value shown in Table 2 is represented in percentage (%).
TABLE 1
                       Acoustic model   Search network   Token   Others   Total
200-word class         2.16             0.10             0.10    0.32     2.68
10,000-word class      2.16             2.98             3.45    0.52     9.11
200,000-word class     2.16             30.00            27.00   2.00     61.16
TABLE 2
                       Acoustic model   Search network   Token   Others   Total
200-word class         80.6             3.7              3.7     11.9     100.0
10,000-word class      23.7             32.7             37.8    5.7      100.0
200,000-word class     3.5              49.1             44.1    3.3      100.0
Referring to Table 1, it can be seen that, as the number of vocabularies to be recognized increases, required memory size rapidly increases from 2.68 Mbytes to 61.16 Mbytes. Further, referring to Table 2, it can be seen that, as the number of vocabularies to be recognized increases, the percentage of the portion occupied by a search network and a token, compared to an acoustic model, rapidly increases.
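The relationship between the two tables can be checked with a short calculation. The following Python sketch recomputes the percentage shares from the Table 1 figures; the names `TABLE_1`, `memory_shares` and `search_and_token_share` are hypothetical and introduced only for this illustration. Recomputed values may differ from Table 2 by 0.1 in places because of rounding.

```python
# Memory figures in Mbytes, copied from Table 1.
TABLE_1 = {
    200: {"acoustic_model": 2.16, "search_network": 0.10,
          "token": 0.10, "others": 0.32},
    10_000: {"acoustic_model": 2.16, "search_network": 2.98,
             "token": 3.45, "others": 0.52},
    200_000: {"acoustic_model": 2.16, "search_network": 30.00,
              "token": 27.00, "others": 2.00},
}


def memory_shares(row):
    """Return each component's share of the row total, in percent."""
    total = sum(row.values())
    return {name: round(100.0 * size / total, 1) for name, size in row.items()}


def search_and_token_share(row):
    """Return the combined share of the search network and the token, in percent."""
    total = sum(row.values())
    return round(100.0 * (row["search_network"] + row["token"]) / total, 1)


if __name__ == "__main__":
    for vocabulary_size, row in TABLE_1.items():
        print(vocabulary_size,
              round(sum(row.values()), 2),      # total required memory
              memory_shares(row),               # per-component shares
              search_and_token_share(row))      # search network + token
```

Under these figures, the acoustic model size stays constant at 2.16 Mbytes, so the growth in total memory comes from the search network and the token.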
The above results arise because the conventional speech recognition scheme loads all networks required for searches into memory. Accordingly, as the number of vocabularies to be recognized increases, the size of memory and the number of calculations rapidly increase. Therefore, it is difficult to recognize large vocabularies in a device having insufficient hardware support through the search method used in the conventional speech recognition scheme.
FIG. 1 is a diagram showing detailed fields of a related speech recognition technology. Technologies for reducing the hardware resources of a speech recognizer are classified into search area optimization technology and acoustic model optimization technology. Further, search area optimization technology is divided into an individual access scheme and a group access scheme.
The individual access scheme uses a model topology technique as disclosed in U.S. Pat. No. 6,178,401 (hereinafter referred to as '401 patent). Further, the group access scheme can be divided into a scheme using a representative lexicon group, and a lattice construction scheme using a small number of representative acoustic models.
The '401 patent, entitled “Method for reducing search complexity in a speech recognition system” and issued to IBM Corporation, is described in brief below.
The technology discloses the steps of storing, for each model-based node of a search network, only the score of the state having the highest score rather than the scores of all states, selecting N candidates based on a terminal score, and performing a detailed search on the N candidates.
As a result, the number of scores to be stored in a required token at the time of searching a network decreases, and distinctiveness of scores after a first search does not increase. Thus, there is an advantage in that the rate of errors caused by node-pruning is low. However, there still is a prevalent problem of increased memory requirement caused by search networks when recognizing large vocabularies, which remains to be overcome.
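The two-stage idea summarized above for the '401 patent can be illustrated with a toy Python sketch. This is not the patented procedure, whose details are not given here; it is a minimal illustration under stated assumptions: each word is modeled as a left-to-right sequence of one-dimensional state means, `log_likelihood` is a hypothetical unit-variance Gaussian score, and all function names are invented for the example. The first pass keeps a single running score per word node, and only the N best candidates receive a detailed search that keeps a score for every state.

```python
import heapq


def log_likelihood(frame, state_mean):
    # Hypothetical acoustic score: unit-variance Gaussian, constant term dropped.
    return -0.5 * (frame - state_mean) ** 2


def coarse_score(frames, states):
    """First pass: one score per word node, using only the best state per frame."""
    score = 0.0
    for frame in frames:
        score += max(log_likelihood(frame, s) for s in states)
    return score


def detailed_score(frames, states):
    """Second pass: left-to-right Viterbi that keeps a score for every state."""
    neg_inf = float("-inf")
    scores = [neg_inf] * len(states)
    scores[0] = log_likelihood(frames[0], states[0])
    for frame in frames[1:]:
        new_scores = [neg_inf] * len(states)
        for i, state in enumerate(states):
            # Either stay in state i or advance from state i - 1.
            best_prev = scores[i] if i == 0 else max(scores[i], scores[i - 1])
            if best_prev > neg_inf:
                new_scores[i] = best_prev + log_likelihood(frame, state)
        scores = new_scores
    return scores[-1]  # score of ending in the final state


def recognize(frames, lexicon, n_best=2):
    """Select N candidates by coarse terminal score, then rescore them in detail."""
    candidates = heapq.nlargest(
        n_best, lexicon, key=lambda word: coarse_score(frames, lexicon[word]))
    return max(candidates, key=lambda word: detailed_score(frames, lexicon[word]))


if __name__ == "__main__":
    lexicon = {"a": [0.0, 1.0, 2.0], "b": [2.0, 1.0, 0.0], "c": [5.0, 5.0, 5.0]}
    print(recognize([0.0, 1.0, 2.0], lexicon, n_best=2))
```

In this toy setup the first pass stores one number per word instead of one per state, which reflects the token-size reduction described above, while the full search network for every word is still held in memory.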