Automatic speech recognition (ASR) technology converts the lexical content of human speech into characters that can be read by computers. Speech recognition involves a complex processing flow comprising four main stages: acoustic model training, language model training, decoding resource construction, and decoding. FIG. 1 is a schematic diagram of the main processing flow in a conventional automatic speech recognition system. Referring to FIG. 1, the main processing flow includes:
Steps 101 and 102, acoustic model training is conducted on acoustic material to obtain the acoustic model; similarly, language model training is conducted on the raw corpus to obtain the language model.
The acoustic model is one of the most important components of a speech recognition system. Most mainstream speech recognition systems adopt the Hidden Markov Model (HMM) to construct models; an HMM is a statistical model that describes a Markov process with hidden, unknown parameters. In an HMM, the state is not directly visible, but certain variables affected by the state are visible. The acoustic model describes the probabilistic correspondence between speech and phones. A phone is the minimal phonetic unit divided according to the natural properties of speech: in terms of acoustic properties, it is the minimal phonetic unit divided by sound quality; in terms of physiological properties, a single articulatory action forms a phone.
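To illustrate how an HMM assigns a probability to an observation sequence through hidden states, the following is a minimal sketch of the standard forward algorithm. The two states and all probability values here are hypothetical toy numbers, not parameters of any real acoustic model.

```python
# Minimal sketch of the HMM forward algorithm: hidden states (e.g. phone
# states) are not observed directly, but emit visible observations.

def forward(observations, states, start_p, trans_p, emit_p):
    """Total probability of the observation sequence under the HMM."""
    # alpha[s] = probability of the observed prefix, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[prev] * trans_p[prev][s] for prev in states)
               * emit_p[s][obs]
            for s in states
        }
    return sum(alpha.values())

# Hypothetical two-state model with toy transition/emission probabilities.
states = ("s1", "s2")
start_p = {"s1": 0.6, "s2": 0.4}
trans_p = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit_p = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}

p = forward(("a", "b"), states, start_p, trans_p, emit_p)
```

In a real acoustic model the observations are acoustic feature vectors rather than discrete symbols, and the emission distributions are typically Gaussian mixtures or neural networks.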
The core of the language model is the probability distribution p(s) of a character string s, reflecting the probability that s appears as a sentence. Let wi denote the i-th word in the character string s; then:

p(s)=p(w1w2w3 . . . wn)=p(w1)p(w2|w1)p(w3|w1w2) . . . p(wn|w1w2 . . . wn-1)
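The chain-rule factorization above can be sketched concretely. In practice each conditional probability is approximated by an n-gram model; the sketch below uses a bigram approximation p(wk|w1 . . . wk-1) ≈ p(wk|wk-1) estimated from a tiny hypothetical corpus, which is an assumption for illustration only.

```python
# Bigram approximation of the chain rule p(s) = p(w1) p(w2|w1) ... ,
# with probabilities estimated by relative frequency from a toy corpus.
from collections import Counter

corpus = [
    ["i", "like", "speech"],
    ["i", "like", "music"],
    ["i", "study", "speech"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
total = sum(unigrams.values())

def sentence_prob(words):
    # p(w1): unigram relative frequency;
    # p(wk|wk-1): count(wk-1, wk) / count(wk-1)
    p = unigrams[words[0]] / total
    for a, b in zip(words, words[1:]):
        p *= bigrams[(a, b)] / unigrams[a]
    return p

p = sentence_prob(["i", "like", "speech"])
```

Note how any word pair absent from the corpus receives probability zero here; real language models apply smoothing, which relates directly to the obscure-word problem discussed below.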
Step 103, the decoding resource is constructed according to the acoustic model, the language model, and a preset dictionary. The decoding resource is a Weighted Finite State Transducer (WFST) network.
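A WFST can be pictured as a set of weighted arcs that map input symbols (e.g. phones) to output symbols (e.g. words). The following is a minimal sketch of best-path search over a toy transducer; the arcs and weights are hypothetical, and real systems build the network by composing the HMM, dictionary, and language model transducers with a library such as OpenFst.

```python
# Toy WFST: each arc is (src_state, input_symbol, output_symbol,
# weight, dst_state), where weights act like negative log probabilities
# (lower is better) and "" is an empty output.
arcs = [
    (0, "h", "",   0.5, 1),
    (1, "i", "hi", 0.2, 2),
    (0, "h", "",   0.6, 3),
    (3, "i", "he", 1.5, 2),
]
final_states = {2}

def best_path(inputs, start=0):
    """Return (total_weight, output_symbols) of the cheapest accepting path."""
    # frontier maps state -> (accumulated weight, outputs emitted so far)
    frontier = {start: (0.0, [])}
    for sym in inputs:
        nxt = {}
        for src, inp, out, w, dst in arcs:
            if src in frontier and inp == sym:
                weight = frontier[src][0] + w
                outs = frontier[src][1] + ([out] if out else [])
                if dst not in nxt or weight < nxt[dst][0]:
                    nxt[dst] = (weight, outs)
        frontier = nxt
    # keep only paths that end in a final state
    return min(
        (v for s, v in frontier.items() if s in final_states),
        key=lambda v: v[0],
    )

w, words = best_path(["h", "i"])
```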
Step 104, the input speech is fed into the decoder, which decodes it according to the constructed decoding resource and outputs the character string with the highest probability value as the recognition result of the input speech.
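The selection of the final result can be sketched as follows: among candidate character strings, the decoder combines the acoustic score and the language model score and outputs the hypothesis with the highest total score. The candidate strings, their scores, and the weighting scheme below are hypothetical toy values for illustration.

```python
# Hypothetical candidates with (acoustic log-prob, language-model log-prob).
candidates = {
    "recognize speech": (-12.0, -5.0),
    "wreck a nice beach": (-11.5, -9.0),
}

def decode(cands, lm_weight=1.0):
    # total score = acoustic log-prob + lm_weight * LM log-prob;
    # the hypothesis with the highest total score wins
    return max(cands, key=lambda s: cands[s][0] + lm_weight * cands[s][1])

result = decode(candidates)
```

This also illustrates the failure mode discussed next: when the language model assigns a very low probability to the string the user actually said, a competing string can win the combined score.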
However, most conventional speech recognition technology targets universal applications and constructs models for common speech. In this situation, the training corpus of the language model is based on data collection and the actual input of users. Although this reflects the speech habits of users to some extent and often yields good recognition of everyday expressions, obscure words such as medicine names and place names occur infrequently in the training corpus, so no effective statistical probability model can be formed for them, and the probability values of the character strings corresponding to these obscure words in the language model are very low. Consequently, when obscure words spoken by the user need to be recognized, a problem of data offset often occurs: the recognized character string is not the words the user actually spoke. In other words, recognition accuracy for the speech of obscure words is low, and it is difficult to achieve good recognition results.