Automatic Speech Recognition (ASR) systems are used in many applications to convert speech to text, for example digital dictation on a computer system or voice command recognition in embedded systems such as those provided in modern cars. Such systems take a digitised audio signal of an utterance, such as speech, as input and provide the text transcription of the audio signal as output. ASR is memory and processing power intensive, which is particularly problematic for embedded applications where limited use of resources and low cost are desirable.
The recognition is achieved by taking short samples of the speech, converting them to feature vectors that represent a speech segment, and mapping sequences of these vectors to possible sequences or concatenations of text units or words. The system associates a probability or likelihood with text unit sequences given a sequence of feature vectors, depending on how well they correspond to the feature vectors. The particular sequence of text units having the highest probability is prima facie the most likely textual transcription of the speech or feature vector sequence.
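At its core, this mapping amounts to scoring each candidate text unit sequence against the observed feature vectors and selecting the highest-scoring one. A minimal sketch, in which the candidate sequences and the scoring function are purely hypothetical stand-ins for the probability model described above:

```python
def most_likely_transcription(candidates, score):
    """Return the candidate text-unit sequence with the highest
    likelihood score. 'score' is a hypothetical function mapping a
    candidate sequence to its likelihood given the observed feature
    vectors."""
    return max(candidates, key=score)

# Illustrative likelihoods for two candidate transcriptions:
scores = {
    ("open", "the", "window"): 0.9,
    ("oaken", "the", "widow"): 0.1,
}
best = most_likely_transcription(list(scores), scores.get)
```

In practice the candidate set is far too large to enumerate; the decoding network and token passing described below exist precisely to make this maximisation tractable.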
A typical application would be an on-board speech recognition system on a car. The effective resources available to the system may be limited to 1 megabyte RAM and 1 megabyte ROM memory and 100 MIPS of CPU power. Typical input sentences could be “open the window” and “navigate to Baker Street”. The actual footprint required differs strongly between a small command and control system (which perhaps only needs to recognise some 100 short phrases such as “start cd player”) and a navigation system (which may need to recognise thousands of street names).
Depending on the application, the set of all possible text unit sequences (sentences) can be small or very large. A language model represents a constraint on possible text unit sequences which make sense in the application. This is combined with a lexicon, which contains one or more pronunciations for every text unit. Using the language model and the lexicon a decoding network is constructed, such that a path through the network corresponds to a specific pronunciation of a specific text unit concatenation. An acoustic model is used to assign likelihood values to any path through the decoding network. These values depend on how closely the pronunciation implied in the path matches the observed feature vectors.
The decoding network represents the (often huge) number of paths in an efficient way by representing the paths as a network that connects nodes with arcs, possibly using techniques such as null nodes (that only serve to connect other nodes). A typical decoding network contains labels on arcs that represent text units, such that all paths together represent all valid sequences of text units in a particular language domain, for example the totality of valid commands in an in-car voice command recognition system. The nodes in such a network each represent one step in the chain of observations of feature vectors. This is usually associated with one or more states, but as noted above there are also null nodes which don't map to any state. A state is a multidimensional probability density function that enables calculating likelihoods of observations. One state can be associated to multiple nodes in one path, reflecting multiple occurrences of a sound, or in different paths, representing the same sound in different potential utterances.
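The node-and-arc structure described above can be sketched as a small data structure. The class and field names here are illustrative, not taken from any particular system; the example builds a trivial network for the single command “start cd player”, with a null start node and one state per word:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    # A null node has state_id None and only serves to connect other nodes.
    # Otherwise state_id refers to a state: a probability density function
    # used to score observed feature vectors.
    state_id: Optional[int] = None
    arcs: list = field(default_factory=list)

@dataclass
class Arc:
    target: Node
    label: Optional[str] = None    # text unit emitted on this arc, if any
    log_trans_prob: float = 0.0    # log transition probability

# A single-path network: (null start) -> "start" -> "cd" -> "player" -> end.
end = Node(state_id=3)
player = Node(state_id=2, arcs=[Arc(end, label="player")])
cd = Node(state_id=1, arcs=[Arc(player, label="cd")])
start = Node(state_id=None, arcs=[Arc(cd, label="start")])  # null start node
```

A real network would merge shared sounds so that one state is referenced from multiple nodes, and would contain many alternative paths rather than this single chain.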
A calculation is then performed to determine which path is the most likely, and in many applications this is taken as the textual transcription of the speech segment. In the above in-car command recognition system, this transcribed command is then input for a controller to, for example, open the window. This calculation is typically carried out using the Viterbi algorithm. Alternatively the Baum-Welch (or Forward-Backward) algorithm may be used. These algorithms can be formulated as Token Passing algorithms, as described in Token Passing: a simple conceptual model for connected speech recognition systems, by S. J. Young, N. H. Russell, J. H. S. Thornton, Cambridge University Engineering Department, Jul. 31, 1989.
These algorithms can be thought of as using tokens that are associated with a node in the decoding network and represent the best partial path from the start node up to that node. Each token is a (logical) data structure stored in memory and associated with a text unit or word history corresponding to the best partial path leading to that node. The tokens also comprise a likelihood “score” for the word history.
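Such a token can be represented as a small record. A minimal sketch, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class Token:
    node_id: int          # node in the decoding network this token sits at
    score: float          # log-likelihood "score" of the best partial path
    word_history: tuple   # text units along that partial path, in order

# The initial token: empty word history, placed at the start node.
t = Token(node_id=0, score=0.0, word_history=())
```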
In many applications, the N-best word sequences are required, for example in case the user or speaker indicates that the best or highest likelihood sequence is incorrect, the next best or second highest likelihood sequence is offered as an alternative, and so on up to N. In the N-best case, not only the best but the N best paths up to every node have to be stored. The algorithms can handle this by extending a token such that it contains N word histories and associates a likelihood or score with every such word history. A further reason for maintaining the N best paths up to every node is the use of a statistical language model, which provides a score based on the relative frequency of text unit sequences, which can be added to the likelihoods inside the token. In the specific case of using words as text units and considering the last three words, this is commonly known as a trigram language model. In that case it is still possible to provide alternative sequences to the application if required.
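When two such extended tokens meet at a node, their entries must be combined so that only the N best distinct word histories survive. A minimal sketch of that merge, assuming each token's entries are (score, word_history) pairs and that duplicate histories keep only their higher score (the Viterbi choice):

```python
def merge_n_best(entries_a, entries_b, n):
    """Merge two lists of (score, word_history) pairs from two tokens
    meeting at a node, keeping the N best. A history appearing in both
    keeps only its higher score (Viterbi-style)."""
    best = {}
    for score, history in entries_a + entries_b:
        if history not in best or score > best[history]:
            best[history] = score
    # Sort by score, highest first, and truncate to the N best.
    merged = sorted(((s, h) for h, s in best.items()), reverse=True)
    return merged[:n]
```

For example, merging a token holding histories ("open",) and ("over",) with one holding ("open",) and ("oaken",) at N=2 keeps the better score for ("open",) and drops the weakest alternative.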
In these algorithms, a first token is created with an empty word history and is associated with the start node. After this, for every new feature vector, every token is copied to all nodes that it can reach through network arcs. There are also ‘self-loop’ arcs, which connect a node with itself, effectively making it possible for a token to remain at a node for some time. Every likelihood is updated with the likelihood of the feature vector given that state and also with a transition probability associated with the arc leading to the next node. When two or more tokens with equal word history meet, either the highest likelihood (Viterbi) or the combination is used (Baum-Welch). When two or more tokens with different word history meet, either the best one is chosen (1-best) or a selection from the various word histories is chosen that reflects the N best from the two tokens.
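One propagation step of the 1-best (Viterbi) variant can be sketched as follows. The representation is simplified and hypothetical: tokens are (score, word_history) pairs keyed by node, arcs carry a log transition probability and an optional text unit label, and the emission function stands in for the per-state acoustic likelihood of the current feature vector:

```python
import math

def viterbi_step(tokens, arcs, emission_logprob):
    """Propagate all tokens along arcs for one new feature vector.

    tokens: {node: (log_score, word_history)}
    arcs:   {node: [(next_node, log_trans_prob, label_or_None), ...]},
            where a self-loop arc lists the node itself as next_node
    emission_logprob(node): log-likelihood of the current feature vector
            given the state associated with 'node' (hypothetical hook)

    When tokens with different histories meet at a node, only the
    best-scoring one is kept (the 1-best / Viterbi choice)."""
    new_tokens = {}
    for node, (score, history) in tokens.items():
        for nxt, log_trans, label in arcs.get(node, []):
            new_history = history + (label,) if label else history
            new_score = score + log_trans + emission_logprob(nxt)
            if nxt not in new_tokens or new_score > new_tokens[nxt][0]:
                new_tokens[nxt] = (new_score, new_history)
    return new_tokens
```

A Baum-Welch variant would log-add the scores of meeting tokens instead of taking the maximum; the N-best variant would keep a merged list of histories per node rather than a single pair.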
Processing through the network may halt after a predetermined end node is reached, and/or after a certain period, for example corresponding to the end of the speech segment. If successful, the token associated with the end node will contain a likelihood score corresponding to the or each sequence of nodes in the path(s) leading to the end node.
In a practical network containing perhaps thousands of nodes and far more possible paths, this has implications for memory space and CPU requirements. Various techniques are used to reduce the processing and/or the amount of memory resources utilised in the token passing process. For example pruning is used to delete tokens corresponding to very unlikely sequences so that further processing associated with those sequences can be halted in order to free up processing power and memory space.
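A common form of this is beam pruning: any token whose score falls too far below the current best is discarded. A minimal sketch, using the same hypothetical (score, word_history) token representation as above:

```python
def beam_prune(tokens, beam_width):
    """Discard tokens whose log score falls more than beam_width below
    the best current score, so their unlikely paths are not extended
    and their memory can be freed.

    tokens: {node: (log_score, word_history)}"""
    best = max(score for score, _ in tokens.values())
    return {node: tok for node, tok in tokens.items()
            if tok[0] >= best - beam_width}
```

The beam width trades accuracy against resources: a narrow beam saves memory and processing but risks pruning the path that would ultimately have scored best.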
Even with these existing techniques, ASR systems require significant processing power and memory resources, which is particularly problematic in smaller embedded applications such as in-car voice command recognition systems where there is a desire to minimise processor and/or memory resources.