Speaker independent phoneme based command word recognition and name dialling on portable devices such as mobile telephones and personal digital assistants have attracted significant interest recently. A phoneme based speaker independent recognition system provides a way around keypad limitations and offers more convenient hands-free operation. This allows safer use of portable devices in, for example, car environments. The speaker independence makes the system particularly attractive from a user point of view compared to speaker dependent systems. For large vocabularies, for example, command word lists or names in a phonebook, training of a speaker dependent recogniser is too tedious to be useful.
In contrast, a phoneme based speaker independent system is ready to use ‘out of the box’, i.e. it does not require any training session from the speaker. All that is required is a textual representation of the words or names in the recognition vocabulary along with some means of phonetically transcribing the text. However, speaker independent systems are only capable of supporting a single language or a few languages at the same time, so that a separate set of phoneme models must be stored in the device for each supported language or set of languages. This increases the static memory requirements for the phoneme models.
Speech recognition in unknown environments is a very challenging task, as the recogniser must be robust in the presence of the noise and distortion encountered in the operating environment. In addition, the recogniser must be of sufficiently low complexity to be able to run on portable devices like mobile phones which inherently have limited memory and computational resources. Although the computational power of portable devices is rapidly increasing with time, the number of applications required to run simultaneously is also increasing. Therefore, complexity and memory requirements of any application running on a portable device will always be an issue.
A simple model of a conventional general purpose speech recognition system is shown in FIG. 1. Speech frames are derived from a speech signal using a speech pre-processor 1 and processed by a time alignment and pattern matching module 2 in accordance with an acoustic model 3 and a language model 4 to produce a recognition result. The language model includes a lexicon 5 which defines the vocabulary of the recogniser.
The pre-processor 1 transforms the raw acoustic waveform of the speech signal into an intermediate compressed representation that is used for subsequent processing. Typically, the pre-processor 1 is capable of compressing the speech data by a factor of 10 by extracting a set of feature vectors from the speech signal that preserves information about the uttered message. Commonly used techniques for pre-processing are filter bank analysis, linear prediction analysis, perceptual linear prediction and cepstral analysis.
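The framing step that underlies such pre-processing can be sketched as follows. This is a minimal illustration only: the function names, the 200-sample frame length, the 80-sample frame shift and the single log-energy feature are all assumptions chosen for brevity. The log-energy feature merely stands in for the filter bank or cepstral features a real pre-processor would compute.

```python
import math


def frame_signal(samples, frame_len, frame_shift):
    """Split a list of samples into overlapping frames of frame_len samples."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, frame_shift)]


def log_energy(frame):
    """Log of the frame energy; the small constant avoids log(0) on silence."""
    return math.log(sum(x * x for x in frame) + 1e-10)


def extract_features(samples, frame_len=200, frame_shift=80):
    """Return one (toy, one-dimensional) feature vector per frame."""
    return [[log_energy(f)] for f in frame_signal(samples, frame_len, frame_shift)]


# One second of a 440 Hz tone sampled at 8 kHz (hypothetical input).
signal = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
features = extract_features(signal)
```

Each frame of the waveform is reduced to a short feature vector, so the number of values passed to the later stages is far smaller than the number of raw samples.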
Since the duration of the words to be recognised is not known in advance, the process of time alignment and pattern matching is required to align hypothesised word sequences to the acoustic signal. The time alignment and pattern matching process uses information from both the acoustic model 3 and the language model 4 to assign a sequence of words to the sequence of speech frames. The acoustic model enables the speech frames to be translated to the basic units of a language such as words, syllables or phonemes that can be concatenated under the constraints imposed by the language model to form meaningful sentences. The time alignment method depends on the form of the acoustic model. Two well-known methods are dynamic time warping and Hidden Markov Modelling.
Dynamic time warping is a so-called template based approach in which the acoustic model is a collection of pre-recorded word templates. The basic principle of dynamic time warping is to align an utterance to be recognised to each of the template words and then to select the word or word sequence that provides the best alignment. However, this technique suffers from a number of drawbacks including the difficulty of modelling acoustic variability between speakers and the difficulty of providing templates for speech units other than whole words.
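The template matching principle described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance between feature vectors and the simplest step pattern (match, insertion, deletion). The names `dtw_distance` and `recognise` and the toy one-dimensional templates are hypothetical.

```python
def dtw_distance(a, b):
    """Accumulated cost of the best time alignment between two sequences
    of feature vectors a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            cost[i][j] = local + min(cost[i - 1][j],      # stretch the template frame
                                     cost[i][j - 1],      # stretch the utterance frame
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognise(utterance, templates):
    """Return the template word whose alignment with the utterance is cheapest."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


templates = {
    "yes": [[0.0], [1.0], [2.0]],
    "no": [[5.0], [5.0], [4.0]],
}
# A slower rendition of "yes": the same shape, stretched in time.
utterance = [[0.0], [0.0], [1.0], [2.0], [2.0]]
best = recognise(utterance, templates)
```

Because every template must be aligned in full against the utterance, the cost grows with both the vocabulary size and the template lengths.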
As a result of the problems associated with dynamic time warping, much of the recent work in speech recognition has concentrated on hidden Markov modelling (HMM), which removes the need to create a reference template by using a probabilistic acoustic model. In continuous speech recognition, the word models are typically constructed as a sequence of phoneme acoustic hidden Markov models corresponding to the word in question. A phoneme acoustic model is a statistical model, which gives the probability that a segment of the acoustic data belongs to the phoneme class represented by the model. Decoding in HMM models is done using, for example, a Viterbi or Forward decoder. Reference is directed to Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition”, Proc. IEEE, vol. 77, no. 2, February 1989, for an in-depth explanation of hidden Markov models.
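Viterbi decoding can be sketched in the log domain as follows. This is a minimal illustration rather than a description of any particular recogniser. The function name, the two-state left-to-right model and all probability values are assumptions made for the example.

```python
import math

NEG_INF = float("-inf")  # log of a zero probability


def viterbi(log_init, log_trans, log_emit):
    """Return (best log score, best state path).

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log likelihood of frame t under state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t - 1][s] is the best predecessor of state s at frame t
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[prev] + log_trans[prev][s] + log_emit[t][s])
            pointers.append(prev)
        score = new_score
        back.append(pointers)
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


log = math.log
log_init = [0.0, NEG_INF]                 # always start in state 0
log_trans = [[log(0.5), log(0.5)],        # state 0: stay or advance
             [NEG_INF, 0.0]]              # state 1: stay only
log_emit = [[log(0.9), log(0.1)],
            [log(0.9), log(0.1)],
            [log(0.1), log(0.9)],
            [log(0.1), log(0.9)]]
best_score, best_path = viterbi(log_init, log_trans, log_emit)
```

Working with log probabilities turns products into sums, which avoids numerical underflow on long utterances.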
A variant of the HMM model is known as the Hidden Neural Network model, which is an HMM/neural network hybrid. Reference is directed to [1] S. K. Riis, “Hidden Markov Models and Neural Networks for Speech Recognition”, Ph.D. Thesis, Department of Mathematical Modelling, Technical University of Denmark, May 1998 and [2] S. K. Riis and O. Viikki “Low Complexity Speaker Independent Command Word Recognition in Car Environments”, Proc. of the ICASSP, Vol. 2, pp. 1743-1746, Istanbul, May 2000, for a detailed explanation of HNNs.
One problem with the conventional approach to speech recognition is that every time a word boundary is hypothesised, the lexicon 5 which forms part of the language model has to be searched. For even a modest size of vocabulary, this search is computationally expensive. Several approximate fast match and pruning strategies have been proposed in order to speed up the search. Many of these use multi-pass decoding algorithms in which each pass prepares information for the next one, thereby reducing the size of the search space.
A further problem with conventional speech recognition is that the recogniser can have a preference for words of a certain length. For example, if non-uniform transition probabilities are used between states in HNN- or HMM-based recognisers, the recogniser often tends to favour short words over long words in the lexicon, or long words over short ones, depending on the transition probabilities.
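One way to see how such a length preference can arise is to look at the score contributed by the transitions alone. The sketch below rests on a strong simplifying assumption, made purely for illustration: a left-to-right word model in which every state has the same self-loop probability `p_stay` and the same exit probability `1 - p_stay`. The function name and all numbers are hypothetical.

```python
import math


def transition_log_score(n_frames, n_states, p_stay):
    """Log probability contributed by the transitions alone when n_frames
    speech frames are aligned to a left-to-right model with n_states states.

    Every state is visited at least once, so the path contains n_states
    exit transitions and (n_frames - n_states) self-loops.
    """
    if not 0 < n_states <= n_frames:
        raise ValueError("need 0 < n_states <= n_frames")
    if not 0.0 < p_stay < 1.0:
        raise ValueError("p_stay must lie strictly between 0 and 1")
    stays = n_frames - n_states
    exits = n_states
    return stays * math.log(p_stay) + exits * math.log(1.0 - p_stay)


frames = 30
short_word, long_word = 6, 18  # number of states in each word model

# Self-loops more likely than exits: the model with fewer states scores higher.
short_favoured = (transition_log_score(frames, short_word, 0.8)
                  > transition_log_score(frames, long_word, 0.8))

# Exits more likely than self-loops: the model with more states scores higher.
long_favoured = (transition_log_score(frames, long_word, 0.2)
                 > transition_log_score(frames, short_word, 0.2))
```

Under this simplification the bias disappears only when the stay and exit probabilities are equal, since every path over the same number of frames then receives the same transition score.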
The present invention aims to address the above problems.