Automatic speech recognition is an area of technology that transforms the lexical content of human speech into an input form (e.g., a character string) that can be read by computers. The process of automatic speech recognition typically comprises several operations: generating a language model that contains a plurality of words in a corpus, training an acoustic model to create statistical representations of one or more contrastive units of sound (called “phonemes” or simply “phones”) that make up each word in the corpus, building a decoding network (sometimes called a “decoding resource network”) using the language model and the acoustic model, and finally decoding human speech.
FIG. 1 is a schematic diagram of a processing flow using a conventional automatic speech recognition system. In some circumstances, the processing flow is performed at a speech recognition device. Referring to FIG. 1, the processing flow includes:
Operations 101 and 102, in which an acoustic model is trained using sound samples and, similarly, a language model is trained using a corpus.
The acoustic model is one of the most important aspects of a speech recognition system. Most of the mainstream speech recognition systems adopt Hidden Markov Models (HMM) to construct acoustic models. An HMM is a statistical model which is used to describe a Markov process containing a hidden parameter (e.g., a parameter that is not directly observed). In an HMM, although the hidden parameter is not directly observed, one or more variables affected by the hidden parameter are observed. In the context of speech recognition, a spoken phoneme is considered a hidden parameter, whereas acoustic data received (e.g., by a microphone of the device) is the observed variable. The corresponding probability between the spoken phoneme and the acoustic data is described in the acoustic model (e.g., the acoustic model describes the probability that acoustic data was generated by a user speaking a particular phoneme).
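The relationship between hidden phonemes and observed acoustic data can be illustrated with a small sketch. The example below is a toy Viterbi search over a two-state HMM. The state names, the quantized observation symbols ("low", "high"), and every probability value are assumptions made up for illustration; a practical acoustic model would use continuous feature vectors and many more states.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               previous state on that path)
    first = observations[0]
    best = [{s: (start_p[s] * emit_p[s][first], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Pick the predecessor state that maximizes the path probability.
            p, back = max(
                ((prev[r][0] * trans_p[r][s], r) for r in states),
                key=lambda pair: pair[0],
            )
            column[s] = (p * emit_p[s][obs], back)
        best.append(column)

    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for t in range(len(best) - 1, 0, -1):
        last = best[t][last][1]
        path.append(last)
    path.reverse()
    return path, prob


# Hypothetical toy model: two phonemes and two quantized acoustic symbols.
STATES = ["a", "b"]
START_P = {"a": 0.5, "b": 0.5}
TRANS_P = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.3, "b": 0.7}}
EMIT_P = {"a": {"low": 0.9, "high": 0.1}, "b": {"low": 0.2, "high": 0.8}}

path, prob = viterbi(["low", "low", "high"], STATES, START_P, TRANS_P, EMIT_P)
print(path, prob)
```

Here the emission table plays the role described above: it gives the probability that a given acoustic observation was produced by a particular phoneme.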
In some circumstances, a speech signal received by the device is expressed (e.g., represented) as a triphone. For example, such a triphone can be constructed by including a current phone as well as right and left half phones adjacent to the current phone.
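As a rough illustration, the sketch below expands a phone sequence into context-dependent triphone labels. The "left-center+right" label format, the "sil" boundary symbol, and the function name `to_triphones` are assumptions chosen for the example. It also simplifies by using the whole neighboring phone as context rather than half phones.

```python
def to_triphones(phones, boundary="sil"):
    """Label each phone with its left and right neighbors."""
    # Pad both ends so the first and last phones also have a context.
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


print(to_triphones(["k", "ae", "t"]))
```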
The main structure of the language model is a probability distribution p(s) of a character string s, reflecting the probability of the character string s appearing as a sentence. Suppose w_i stands for the i-th word in the character string s, which contains n words. In this case, the probability distribution p(s) can be written as:
p(s) = p(w_1 w_2 w_3 ... w_n) = p(w_1) p(w_2 | w_1) p(w_3 | w_1 w_2) ... p(w_n | w_1 w_2 ... w_{n-1})
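The chain-rule factorization can be made concrete with a small sketch. The example below does not implement the full factorization; it uses a bigram approximation, in which each word is conditioned only on the word immediately before it, with add-k smoothing. The toy corpus, the function names, and the smoothing scheme are all assumptions for illustration.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count context words and word pairs over a list of sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
        vocab.update(tokens[1:])
    # Reserve one extra vocabulary slot for words never seen in training.
    return contexts, bigrams, len(vocab) + 1


def log_prob(sentence, contexts, bigrams, vocab_size, k=1.0):
    """Log of p(s) under the bigram approximation with add-k smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        numerator = bigrams[(prev, word)] + k
        denominator = contexts[prev] + k * vocab_size
        total += math.log(numerator / denominator)
    return total


# Hypothetical toy corpus.
corpus = ["call my mother", "call my office", "call the office"]
contexts, bigrams, vocab_size = train_bigram(corpus)
common = log_prob("call my office", contexts, bigrams, vocab_size)
rare = log_prob("call my oncologist", contexts, bigrams, vocab_size)
print(common, rare)
```

In this toy setting, the sentence that contains a word absent from the corpus receives a lower log probability than the sentence built only from observed word pairs.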
Operation 103, in which a decoding resource network is constructed according to the acoustic model, language model and a presupposed dictionary. In some circumstances, the decoding resource network is a weighted finite state transducer (WFST) network.
Operation 104, in which speech is input into the decoder, the speech is decoded by the decoder according to the decoding resource network, and a character string with the highest probability value is output as the recognized result of the speech input.
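The final selection step can be sketched as an argmax over candidate character strings. In the example below, each hypothesis carries an acoustic log score and a language-model log score, and the candidate with the highest combined score is returned. The field names, the `lm_weight` parameter, and the score values are hypothetical. A real decoder searches the decoding resource network rather than scoring a fixed list.

```python
def pick_best(hypotheses, lm_weight=1.0):
    """Return the hypothesis with the highest combined log score."""
    return max(
        hypotheses,
        key=lambda h: h["acoustic_logp"] + lm_weight * h["lm_logp"],
    )


# Hypothetical candidates with made-up log scores.
candidates = [
    {"text": "call doctor lee", "acoustic_logp": -12.0, "lm_logp": -4.0},
    {"text": "call doctor li", "acoustic_logp": -11.5, "lm_logp": -9.0},
]
print(pick_best(candidates)["text"])
```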
FIG. 2 is a flowchart diagram of a method for constructing a decoding resource network using conventional technology. Referring to FIG. 2, the method includes: obtaining an Ngram-WFST network by transforming the language model, obtaining a Lexicon-WFST network by transforming the dictionary, and obtaining a Context-WFST network by transforming the acoustic model. These three WFST networks are merged, for example, by first merging the Ngram-WFST network and the Lexicon-WFST network into a Lexicon-Gram-WFST network, then merging the Lexicon-Gram-WFST network with the Context-WFST network. The result of the merging is the decoding resource network, which in this example is a Context-Lexicon-Gram-WFST network.
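One merging step can be sketched as a composition of two weighted transducers. The example below composes a toy lexicon transducer (phones to words) with a toy grammar transducer (words to words, carrying language-model costs). It is a deliberately simplified sketch: the dictionary-based data layout, the function names, and the costs are assumptions, epsilon labels are handled only on the output side of the first transducer, and weights are simply added as negative log probabilities. Production toolkits such as OpenFst implement the general operation.

```python
from collections import deque

EPS = "<eps>"  # label for "no output on this arc"


def compose(t1, t2):
    """Compose t1 and t2, matching t1 output labels to t2 input labels."""
    start = (t1["start"], t2["start"])
    arcs, finals = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        state = queue.popleft()
        q1, q2 = state
        arcs[state] = []
        if q1 in t1["finals"] and q2 in t2["finals"]:
            finals.add(state)
        for ilab, olab, w1, n1 in t1["arcs"].get(q1, []):
            if olab == EPS:
                # t1 emits nothing, so t2 stays in place at no cost.
                matches = [(EPS, 0.0, q2)]
            else:
                matches = [
                    (o2, w2, n2)
                    for i2, o2, w2, n2 in t2["arcs"].get(q2, [])
                    if i2 == olab
                ]
            for o2, w2, n2 in matches:
                nxt = (n1, n2)
                arcs[state].append((ilab, o2, w1 + w2, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}


def transduce(t, inputs):
    """Follow the single matching arc per input symbol (deterministic toy case)."""
    state, outputs, cost = t["start"], [], 0.0
    for sym in inputs:
        _, olab, w, nxt = next(a for a in t["arcs"][state] if a[0] == sym)
        if olab != EPS:
            outputs.append(olab)
        cost += w
        state = nxt
    return outputs, cost, state in t["finals"]


# Toy lexicon: "k ae t" -> cat, "k ae b" -> cab. Arcs: (in, out, weight, next).
lexicon = {"start": 0, "finals": {3}, "arcs": {
    0: [("k", EPS, 0.0, 1)],
    1: [("ae", EPS, 0.0, 2)],
    2: [("t", "cat", 0.0, 3), ("b", "cab", 0.0, 3)],
}}
# Toy unigram grammar with made-up costs (negative log probabilities).
grammar = {"start": 0, "finals": {0}, "arcs": {
    0: [("cat", "cat", 1.0, 0), ("cab", "cab", 3.0, 0)],
}}

lexicon_gram = compose(lexicon, grammar)
print(transduce(lexicon_gram, ["k", "ae", "t"]))
```

The composed network maps a phone sequence directly to a word together with its language-model cost, which is the role the Lexicon-Gram-WFST network plays in the flow above.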
However, most conventional speech recognition technology is based on a universal speech recognition application that constructs models based on common speech. In this situation, the corpus used to train the language model is based on data collected through the actual input of users. Though the speech habits of users are well reflected in such a model, these models struggle to recognize less frequently used (e.g., obscure) words, such as personal names, medicinal names, place names, etc. This is because the probability value of the character string corresponding to the obscure words in the language model is very low. As a result, conventional speech recognition systems too often fail when they need to recognize obscure words spoken by the user.
Thus, what is needed is speech recognition technology (e.g., methods and systems) that can more readily recognize obscure words.