1. Field of the Invention
The present invention relates to an apparatus and method for recognizing voice and, more particularly, to an apparatus and method for recognizing voice, which calculate scores for hidden Markov model states that represent feature parameters for each predetermined phonetic unit, using approximated single waveform probability distributions, and then recalculate, using multiple waveform probability distributions, the scores of only those hidden Markov model states that have higher scores.
2. Description of Related Art
The term ‘voice recognition’ refers to a series of processes of extracting phoneme and linguistic information from acoustic information included in voice, and causing a machine to recognize and respond to it.
Voice recognition algorithms include a Dynamic Time Warping (DTW) algorithm, a Neural Network (NN) algorithm, and a Hidden Markov Model (HMM) algorithm.
Of these voice recognition algorithms, the HMM algorithm statistically models a phonetic unit (phoneme or word), and is one of the voice recognition algorithms that have prevailed since the latter half of the 1980s. The HMM handles variation in a voice signal based on probability, so that it has an advantage in that variation in an input voice can be represented well, compared to Dynamic Programming Matching (DPM). Furthermore, the HMM can learn the parameters of models (coefficients for probability calculation) from a large volume of voice data and can generate better models when it is given high-quality data sets.
Each model obtained by performing modeling using the HMM represents a single phonetic element, and a single phonetic element generally has three states.
FIG. 1 is a diagram showing the states of a conventional HMM, and indicates that a single phonetic element has three states 11, 12 and 13.
In this case, a transition from each state to another state is made. These transitions are determined based on probability, and only state transition from the left to the right (in FIG. 1) is allowed. For example, according to input conditions, a transition from the state (S1) 11 to the state (S2) 12 can be made, or the state (S1) 11 can be maintained through a transition back to itself.
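The left-to-right constraint described above can be illustrated with a small sketch. The probability values, the names TRANSITIONS, is_left_to_right and sample_path, and the absorbing final state are assumptions made for illustration only; they are not taken from FIG. 1 or from any particular recognizer.

```python
import random

# Hypothetical transition probabilities for a three-state left-to-right HMM.
# Row i holds P(next state | current state i). Entries to the left of the
# diagonal are zero, so a state can only repeat itself or move rightward.
# The last state is treated as absorbing here, which is a simplification.
TRANSITIONS = [
    [0.6, 0.4, 0.0],  # S1: stay in S1, or move to S2
    [0.0, 0.7, 0.3],  # S2: stay in S2, or move to S3
    [0.0, 0.0, 1.0],  # S3: stay in S3
]


def is_left_to_right(matrix, tol=1e-9):
    """Return True if each row sums to 1 and no backward transition exists."""
    for i, row in enumerate(matrix):
        if abs(sum(row) - 1.0) > tol:
            return False
        if any(p != 0.0 for p in row[:i]):
            return False
    return True


def sample_path(matrix, n_frames, seed=0):
    """Draw one state sequence of n_frames frames, starting in the first state."""
    rng = random.Random(seed)
    state = 0
    path = [state]
    for _ in range(n_frames - 1):
        state = rng.choices(range(len(matrix)), weights=matrix[state])[0]
        path.append(state)
    return path
```

Any path drawn by sample_path is non-decreasing in state index, which is the property the left-to-right restriction is meant to guarantee.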
The HMM states are the states obtained when a single phonetic element of an input voice signal is divided into a plurality of states, and may be classified into stable states and unstable states. In this case, when the single phonetic element, as shown in FIG. 1, is divided into the three states 11, 12 and 13, the first state 11 is an unstable state, the second state 12 is a stable state, and the third state 13 is an unstable state.
That is, both the first and third states 11 and 13 are unstable states because the first state 11 is affected by the previous state thereof and the third state 13 is affected by the subsequent state thereof, and the second state 12 is a stable state because it is almost completely unaffected by the first and third states 11 and 13.
When a single phonetic element is divided into a plurality of states in practice, the recognition rate varies according to the design of a transition structure. Dividing each phonetic element into a higher number of states is beneficial in terms of increasing the recognition rate, but there is a limitation in artificially manipulating the number of states, so a single phonetic element is generally represented using three divided states.
FIG. 2 is a diagram showing a word search network using the conventional HMM and a process of tracing an optimized word according to an input voice signal. In the network, figures described in respective nodes are indices that represent HMM states.
Such a word search network is optimized by taking the phonetic and lexical features of a recognition target word into account, so that it is called a lexical tree 20.
The optimized path of feature vectors extracted from an arbitrary input voice signal is searched for in the word search network. In this case, the traveling direction thereof may be determined based on probability.
In this case, a probability value for determining the traveling direction can be extracted from a probability distribution for each state. That is, branchable nodes are detected in the word search network, and the probability values of HMM states corresponding to the detected nodes are calculated. Of the nodes, the node having the largest value is determined to be a node to which traveling is directed.
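The node selection described above can be sketched as follows, assuming that each HMM state is modeled by a single diagonal-covariance Gaussian distribution and that scores are compared in the log domain. The state indices, means, variances, and the names STATE_MODELS, log_gaussian and best_next_node are hypothetical. The sketch shows only the local comparison among branchable nodes; a practical decoder would typically also accumulate scores along each path.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


# Hypothetical models: HMM state index -> (mean vector, variance vector).
STATE_MODELS = {
    3: ([0.0, 0.0], [1.0, 1.0]),
    7: ([2.0, 1.0], [1.0, 1.0]),
    9: ([-1.0, 3.0], [1.0, 1.0]),
}


def best_next_node(feature, candidate_states, models=STATE_MODELS):
    """Score each branchable node and return the best one with all scores."""
    scores = {s: log_gaussian(feature, *models[s]) for s in candidate_states}
    best = max(scores, key=scores.get)
    return best, scores


# Usage: the feature vector lies closest to the mean of state 7,
# so state 7 receives the largest score among the candidates.
best, scores = best_next_node([1.9, 1.1], [3, 7, 9])
```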
Japanese Unexamined Pat. No. 2001-125589 disclosed a voice recognition apparatus that generates the acoustic model of a single Gaussian distribution using a speaker's voice data and a learning algorithm, and performs conversion on an HMM model so as to reduce the number of branches in each state. However, the disclosed apparatus uses only a single Gaussian distribution, so that, although the processing speed can be improved, there is no provision for the resulting decrease in the recognition rate.
Accordingly, a voice recognition technology that can improve both the processing speed and the recognition rate is needed.