Speech recognition technology converts voice signals to corresponding character sequences and is widely used in different areas, such as human-machine interaction and audio/video search.
Conventional speech recognition technology often uses a word or a sentence as the basic identification unit. Audio characteristics are extracted from the voice signals. An optimal character sequence for the audio characteristics is calculated in a predetermined decoding search network through Viterbi decoding, and this optimal character sequence is provided as the speech identification result. Specifically, the predetermined decoding search network normally includes an acoustic model, a dictionary and a language model. The acoustic model is normally a hidden Markov model (HMM) based on single phonemes (monophones) or on three-phoneme units (triphones). The dictionary includes a correspondence between words and phonemes. The language model includes a probability relationship among words in a character sequence. After the audio characteristics are input into the decoding search network, a phoneme sequence corresponding to the audio characteristics is identified using the acoustic model. A plurality of candidate words are found in the dictionary using the phoneme sequence. The sequence of candidate words with the highest probability is then selected as the optimal text sequence according to the probability relationship in the language model.
The above-noted conventional approach has some problems. For example, the decoding search network is established based on words. Any change in the dictionary (i.e., the collection of distinguishable words) often requires restructuring the decoding search space, which makes the approach inflexible.
Hence, it is highly desirable to improve the techniques for speech recognition.