Automatic speech recognition (ASR) is prone to errors.
ASR performs spectral analysis on an audio signal and extracts features, from which it hypothesizes multiple phoneme sequences. Each hypothesis carries a score representing the likelihood that it is correct, given the acoustic analysis of the input audio. ASR then tokenizes the phoneme sequence hypotheses into token sequence hypotheses according to a dictionary, maintaining a score for each hypothesis. Tokens can be alphabetic words such as English words, logographic characters such as Chinese characters, or discernible elemental units of other types of writing systems. Tokenization is imprecise because, for example, English speakers pronounce the phrases “I scream” and “ice cream” almost identically. To deal with such ambiguities, ASR systems use statistics on how frequently word tokens neighbor one another in the spoken language to hypothesize which of multiple token sequence hypotheses is correct. For example, the word “ice” frequently follows the word “eat”, as in “eat ice cream”, but the word “I” rarely follows the word “eat”. Therefore, if the word sequence hypotheses “I scream” and “ice cream” follow the word “eat”, the score of the hypothesis with “ice cream” increases while the score of the hypothesis with “I scream” decreases.
The same ambiguity arises in other languages. For example, Mandarin Chinese speakers pronounce some distinct phrases identically. Speech recognition resolves such cases in the same way: it uses the statistics of known frequencies of neighboring tokens common in the spoken language to raise the score of a token sequence hypothesis whose tokens frequently follow the preceding token, and to lower the score of a hypothesis whose tokens rarely do.
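The dictionary-based tokenization and neighbor-frequency rescoring described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the four-word pronunciation dictionary `LEXICON` (ARPAbet-style phonemes with stress omitted), the invented bigram probabilities in `BIGRAM`, the floor value `UNSEEN`, and the function names `tokenize`, `rescore`, and `best_hypothesis`. It is not a description of how any particular ASR system is implemented.

```python
import math

# Hypothetical pronunciation dictionary (ARPAbet-style phonemes, stress omitted).
LEXICON = {
    "I": ("AY",),
    "ice": ("AY", "S"),
    "scream": ("S", "K", "R", "IY", "M"),
    "cream": ("K", "R", "IY", "M"),
}

# Invented bigram probabilities P(next word | previous word), illustration only.
BIGRAM = {
    ("eat", "ice"): 0.02,
    ("eat", "I"): 0.0001,
    ("ice", "cream"): 0.3,
    ("I", "scream"): 0.001,
}
UNSEEN = 1e-6  # assumed floor for word pairs with no recorded statistics


def tokenize(phonemes):
    """Return every way to split a phoneme sequence into dictionary words."""
    if not phonemes:
        return [[]]
    results = []
    for word, pron in LEXICON.items():
        # A word is a candidate if its pronunciation is a prefix of the input.
        if tuple(phonemes[:len(pron)]) == pron:
            for rest in tokenize(phonemes[len(pron):]):
                results.append([word] + rest)
    return results


def rescore(acoustic_logprob, previous_word, words):
    """Add bigram log probabilities to a hypothesis's acoustic log score."""
    score = acoustic_logprob
    for word in words:
        score += math.log(BIGRAM.get((previous_word, word), UNSEEN))
        previous_word = word
    return score


def best_hypothesis(previous_word, phonemes, acoustic_logprob=0.0):
    """Pick the token sequence with the highest combined score."""
    hypotheses = tokenize(phonemes)
    return max(
        hypotheses,
        key=lambda words: rescore(acoustic_logprob, previous_word, words),
    )
```

Calling `tokenize(["AY", "S", "K", "R", "IY", "M"])` produces both the “I scream” and the “ice cream” token sequences, because the two phrases share one phoneme sequence in this toy dictionary. With “eat” as the preceding word, `best_hypothesis` prefers “ice cream”, since the assumed bigram statistics raise its score relative to “I scream”. Both hypotheses receive the same acoustic score here, so the language statistics alone decide the outcome.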
Conventional speech recognition and natural language understanding systems are relatively inaccurate and slow. They can produce transcriptions that are grammatically incorrect. Furthermore, their grammar rules are complex to create and improve. Also, grammars usually do not capture all of the informal and approximate ways that users express themselves, and as a result have insufficient coverage.