Human beings exchange information with each other in a language, and the language includes two forms: speech and text. Transferring information by using speech is one of the most important basic functions of human beings. With the development of information technologies, a great amount of information also needs to be exchanged between human beings and machines. At present, computers have begun to simulate the process of information exchange between human beings.
The process of information exchange between human beings includes: 1. natural language generation: converting thought generated by the brain into language; 2. speech synthesis: converting the language into speech; 3. speech recognition: recognizing the speech content that expresses the language; 4. natural language understanding: understanding the meaning of the language expressed by the speech. The first two steps are performed by a speaker, and the last two steps are performed by a listener. Speech recognition corresponds to “recognizing the speech content that expresses the language” in the foregoing process; for a device, it means recognizing speech spoken by human beings and converting the speech into text. The following describes speech recognition from several aspects.
(1) Basic Principles of Speech Recognition by a Device:
A speech recognition system is a pattern recognition system, and speech recognition includes the following steps:
1. language input;
2. preprocessing;
3. characteristic extraction, where the extracted characteristics feed two branches: clustering training in step 4, and the recognition operation in steps 5 to 7;
4. clustering training to obtain a template library;
5. performing similarity comparison by using a reference pattern in the template library;
6. performing distortion detection on the result of step 5 during the recognition process, and then proceeding to step 7; and
7. outputting a recognition result.
Preprocessing includes processing, such as sampling and filtering, of speech signals. The function of characteristic extraction is to extract, from the speech signals, several groups of parameters describing characteristics of the signals, for example, energy, formants, and cepstral coefficients, for use in training and recognition. A speech recognition system is established as follows: first, training is performed by using a large amount of speech to obtain a template library; then, a template is read from the template library and compared with the to-be-recognized speech, to obtain a recognition result.
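As a rough illustration of the train-then-match flow described above, the following Python sketch builds a small template library and recognizes an input by finding the closest template. The function names, the use of per-frame energy as the only characteristic, the non-overlapping frames, and the use of simple averaging in place of clustering training are all assumptions made for illustration; they are not details taken from this application. All utterances are assumed to have the same length.

```python
import math

def frame_energies(samples, frame_len):
    """Split samples into non-overlapping frames and return each frame's mean energy."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def train_templates(labelled_utterances, frame_len):
    """Build a template library: the average characteristic vector for each label.

    Averaging stands in for clustering training (an assumption of this sketch).
    """
    sums, counts = {}, {}
    for label, samples in labelled_utterances:
        feats = frame_energies(samples, frame_len)
        if label not in sums:
            sums[label] = [0.0] * len(feats)
            counts[label] = 0
        sums[label] = [a + b for a, b in zip(sums[label], feats)]
        counts[label] += 1
    return {label: [v / counts[label] for v in vec] for label, vec in sums.items()}

def recognize(samples, templates, frame_len):
    """Return the label whose template is closest to the input's characteristics."""
    feats = frame_energies(samples, frame_len)
    return min(templates, key=lambda label: math.dist(feats, templates[label]))
```

A call such as `recognize(samples, train_templates(data, 2), 2)` returns the label of the nearest template; a practical system would use richer characteristics such as cepstral coefficients.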
The following describes terms used in this application file:
Training: Analyzing speech characteristic parameters in advance, making a speech template, and storing the speech template in a speech parameter library, where the template may also be called a model, and there are mainly two types of models: an acoustic model (AM) and a language model (LM). The acoustic model is used to recognize “sound” from a sound signal, and the language model is used to convert the sound into “text”;
Recognition: Obtaining a speech parameter by analyzing to-be-recognized speech in the same way as for training, comparing the parameter with the reference templates in the library one by one, finding, by using a determining method, the template closest to the speech characteristics, and obtaining a recognition result, where the recognition result herein is in the form of text;
Distortion measure: A standard is required for comparison, and the standard is a “distortion measure” between speech characteristic parameter vectors, which is used for comparison during speech recognition. There are a plurality of formulas for calculating the distortion measure, for example, calculating a distance between speech characteristic parameter vectors. More specifically, the distortion measure between a speech characteristic parameter vector A(x1, y1) and a speech characteristic parameter vector B(x2, y2) may be:
$$D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
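The distance-based distortion measure above can be sketched in Python as follows. The function name is hypothetical, and the generalization from two components to vectors of any equal length is an assumption of this sketch.

```python
import math

def euclidean_distortion(a, b):
    """Euclidean distance between two characteristic parameter vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For A(0, 0) and B(3, 4), `euclidean_distortion((0, 0), (3, 4))` evaluates to 5.0.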
Main recognition frameworks: dynamic time warping (DTW), which is based on template matching, and the hidden Markov model (HMM), which is based on a statistical model.
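For orientation, a minimal Python sketch of dynamic time warping is given below. It is the textbook dynamic-programming formulation, not an implementation taken from this application; the function name, the use of scalar characteristics, and the absolute difference as the local distortion are assumptions made for illustration.

```python
def dtw_distance(seq_a, seq_b):
    """Accumulated distortion of the best time alignment between two sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] is the best accumulated cost of aligning seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]
```

Because the alignment may stretch or compress time, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0.0 even though the sequences differ in length, which is why DTW suits speech spoken at different speeds.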
(2) Models of Speech Recognition:
An acoustic model is used to recognize “sound” from a sound signal, and a language model is used to convert the sound into “text”.
The basic problem of statistical speech recognition is as follows: given an input signal or characteristic sequence O = {O1, O2, ..., On} and a word list V = {w1, w2, ..., wL}, where M words are arbitrarily selected from V to form a word sequence W = (w1, w2, ..., wM), find the word sequence W* that corresponds to the characteristic sequence O, so that:
$$W^{*} = \arg\max_{W} P(W \mid O)$$
According to the Bayes formula, the foregoing formula may be written as:
$$W^{*} = \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)}$$
where P(O|W) is an acoustic model, and P(W) is a language model. It can be seen from the foregoing that the two models are a basis of the automatic speech recognition (ASR) technology.
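As a short supplementary step that the text does not state explicitly: P(O) does not depend on the candidate word sequence W, so it does not affect which W attains the maximum. Under that standard observation, the search can be written without the denominator, and in practice it is commonly carried out on log probabilities:

```latex
W^{*} = \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
      = \arg\max_{W} P(O \mid W)\, P(W)
      = \arg\max_{W} \bigl[ \log P(O \mid W) + \log P(W) \bigr]
```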
It can be seen from the foregoing that the acoustic model gives the probability of a characteristic sequence given a word sequence, and a large amount of speech data and corresponding text data needs to be obtained for training, to obtain an acoustic model for each word. However, any language contains a great number of words, which leads to a huge number of word-level acoustic models, and also causes an excessively large calculation volume and an excessively long calculation time during recognition. To resolve this problem, it is noted that a word is formed by phones, and a phone is the smallest pronunciation unit, for example, an onset and a rime in Chinese, or a syllable in English. The number of phones is relatively small, for example, about 60 in English. The problem can therefore be well resolved by establishing an acoustic model with a phone as a unit. Another advantage is that the number of phones is fixed, so an acoustic model does not need to be reestablished when the content of a word list changes.
An acoustic model with a phone as a unit also needs a corresponding pronunciation dictionary. The dictionary provides a pronunciation for each word in a word list; for Chinese, a pinyin annotation of each word is listed, for example, “zh ong g uo”.
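The role of the pronunciation dictionary can be sketched in Python as a mapping from each word in the word list to its phone sequence. The dictionary keys, the second entry, and the function name are hypothetical and are chosen only for illustration; only the phone sequence “zh ong g uo” comes from the text.

```python
# Hypothetical pronunciation dictionary: word -> sequence of phones
# (onsets and rimes for Chinese pinyin).
PRONUNCIATION_DICTIONARY = {
    "zhongguo": ["zh", "ong", "g", "uo"],
    "shanghai": ["sh", "ang", "h", "ai"],  # illustrative entry, not from the text
}

def words_to_phones(words, dictionary=PRONUNCIATION_DICTIONARY):
    """Expand a word sequence into the phone sequence scored by a phone-level acoustic model."""
    phones = []
    for word in words:
        # A word missing from the dictionary cannot be recognized; raise KeyError.
        phones.extend(dictionary[word])
    return phones
```

Adding a word to the word list then requires only a new dictionary entry, not a new acoustic model, because the set of phones is unchanged.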
A language model is the probability of a word sequence, and the probability can be decomposed into a product of the probabilities that two or three words occur in succession.
Two-word grammar (bigram): Occurrence of each word Wi is affected only by the word Wi−1 in front of the word Wi.
$$P(W) = \prod_{i=1}^{M} P(W_i \mid W_{i-1})$$
Three-word grammar (trigram): Occurrence of each word Wi is affected only by the two words Wi−2 and Wi−1 in front of the word Wi.
$$P(W) = \prod_{i=1}^{M} P(W_i \mid W_{i-2}, W_{i-1})$$
Higher-order decompositions may be used as required. Training of the language model requires only text data, from which statistics about the occurrence probability of two or more successive words are collected. It should be noted that the text data needs to be sufficient to cover all words in a word list, and when the content of the word list changes, the language model needs to be updated accordingly, to cover all the words in the word list.
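A minimal Python sketch of collecting two-word statistics from text data and evaluating the probability of a word sequence is given below. The function names, the sentence-start marker `<s>`, and the unsmoothed relative-frequency estimate are assumptions made for illustration; practical systems normally add smoothing so that unseen word pairs do not receive zero probability.

```python
from collections import Counter

START = "<s>"  # hypothetical sentence-start marker, used as W0

def train_bigram(sentences):
    """Count how often each word follows each preceding word in the text data."""
    context_counts = Counter()
    pair_counts = Counter()
    for words in sentences:
        prev = START
        for word in words:
            context_counts[prev] += 1
            pair_counts[(prev, word)] += 1
            prev = word
    return context_counts, pair_counts

def sequence_probability(words, context_counts, pair_counts):
    """P(W) as the product over i of P(Wi | Wi-1), by relative frequency."""
    prob = 1.0
    prev = START
    for word in words:
        if context_counts[prev] == 0:
            return 0.0  # preceding word never seen in the text data
        prob *= pair_counts[(prev, word)] / context_counts[prev]
        prev = word
    return prob
```

With the two training sentences "turn on the radio" and "turn off the radio", the sequence "turn on the radio" receives probability 0.5, because "turn" is followed by "on" in one of its two occurrences and every other transition always occurs.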
In conclusion, for a speech recognition system, a phone set, a word list, and a pronunciation dictionary need to be input during model training, and a pronunciation dictionary, an acoustic model, and a language model need to be input during recognition.
(3) Factors Affecting a Recognition Rate of Speech Recognition
The recognition rate is defined as a percentage of input speech that is correctly recognized. The factors affecting a recognition rate include the following aspects:
1. Size of a Word List
Accuracy of recognizing one word from 10 words is far greater than that of recognizing one word from 1000 words. A larger word list means more choices and more similar acoustic and linguistic content, that is, acoustic confusability and linguistic confusability are higher. Therefore, the speech recognition rate for a large word list is relatively low, and it is difficult to improve the recognition rate.
Acoustic confusability: “Chang ting” versus “Cheng qing”.
Linguistic confusability: “Shang / hai nan” (go to Hainan) versus “Shang hai / nan men” (south gate of Shanghai).
2. Speech Recognition for a Specific Field
In a specific field, syntax rules of a language are relatively fixed, and therefore linguistic confusability is relatively low, and a recognition difficulty is relatively low.
3. Noise
In one aspect, noise reduces the intelligibility of speech; in another aspect, people's pronunciation changes greatly in a noisy environment, for example, a louder voice, a slower speaking speed, and a higher pitch.
4. Training Data Volume
Sufficient data needs to be provided for training each state, and a larger word list requires a larger data volume. Where data for a state is insufficient, the training program aggregates some states by using a tying method and trains them by using the same data; therefore, the recognition rate of the system is affected to some extent.
(4) Speech Command System
A speech command system is a set of devices or a system that recognizes a speech command and obtains text, and executes, according to the text, an action specified by the command, to meet a user's requirement. The speech command indicates that a user uses speech as a control interface, for example, a user inputs speech “turn on the radio” to control turning-on of a radio. Speech command recognition converts a speech command into text, and belongs to one type of speech recognition.
If a user expects that the system can recognize a specific speech command set, the user needs to provide a command word list, and also provide a corresponding pronunciation dictionary. The system uses the command word list as a word list and obtains, by training, a language model corresponding to the command word list. An acoustic model uses a phone as a unit, training of the acoustic model is irrelevant to the user, and only a speech library is required for training. A device reads a pronunciation dictionary, an acoustic model, and a language model; then, the device receives a speech command input by a user, executes speech recognition, and executes a corresponding operation according to a recognition result.
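The final step, executing an operation according to the recognition result, can be sketched in Python as a lookup from recognized text to an action. The function name, the case-insensitive matching, and the table of actions are hypothetical and serve only as an illustration; the speech recognition itself is assumed to have already produced the text.

```python
def execute_command(recognized_text, actions):
    """Run the action registered for the recognized command text.

    `actions` maps command text from the command word list to a callable.
    Returns True if a command was executed, False if the text is not a known command.
    """
    action = actions.get(recognized_text.strip().lower())
    if action is None:
        return False
    action()
    return True
```

For example, with `actions = {"turn on the radio": radio_on}`, where `radio_on` is a caller-supplied function, the recognition result "Turn on the radio" triggers `radio_on()`, while text outside the command word list is ignored.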
In the foregoing solutions, the user needs to preset a speech command list and a pronunciation dictionary. A new language model is obtained by training based on the speech command list and a language model. After the training process is complete, the recognizable speech command list, the pronunciation dictionary, and the language model are fixed, and they do not change after the recognition process is entered. If the command word list needs to be changed, the recognition process needs to be suspended and the training process restarted. Therefore, a large speech command word list and a large pronunciation dictionary are generally provided and used to train the language model.