1. Field of the Invention
The present invention relates to a speech processing apparatus and method. In particular, embodiments of the present invention are applicable to speech recognition.
2. Description of Related Art
Speech recognition is a process by which an unknown speech utterance is identified. There are several different types of speech recognition systems currently available which can be categorised in several ways. For example, some systems are speaker dependent, whereas others are speaker independent. Some systems operate for a large vocabulary of words (e.g. >10,000 words) while others only operate with a limited-size vocabulary (e.g. <1000 words). Some systems can only recognise isolated words/phrases, whereas others can recognise continuous speech comprising a series of connected phrases or words.
In a limited vocabulary system, speech recognition is performed by comparing features of an unknown utterance with speech models formulated from features of known words which are stored in a database. The acoustic models of the known words are determined during a training session in which one or more samples of the known words are used to generate reference patterns therefor. The reference patterns may be acoustic templates of the modelled speech or statistical models, such as Hidden Markov Models.
To recognise the unknown utterance, the speech recognition apparatus extracts a pattern (or features) from the utterance and compares it against each reference pattern stored in the database. During decoding, a scoring technique is used to provide a measure of how well each reference pattern, or each combination of reference patterns, matches the pattern extracted from the input utterance. The unknown utterance is then recognised as the word(s) associated with the reference pattern(s) which most closely match the unknown utterance.
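By way of illustration only, the comparison of an extracted pattern against stored reference patterns might be sketched as follows. The sketch assumes acoustic templates scored by dynamic time warping (DTW), which is one common scoring technique and is not necessarily the one used by any particular system. The function names `dtw_distance` and `recognise`, and the toy one-dimensional feature vectors, are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors.

    Assumption: Euclidean distance between frames and the basic
    three-way step pattern (insertion, deletion, match).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b's frame
                                 cost[i][j - 1],      # stretch a's frame
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognise(features, references):
    """Return the word whose reference template scores best (lowest distance)."""
    scores = {word: dtw_distance(features, template)
              for word, template in references.items()}
    best_word = min(scores, key=scores.get)
    return best_word, scores


# Hypothetical reference templates, as would be produced by a training session.
REFERENCES = {
    "yes": [(0.0,), (1.0,), (2.0,)],
    "no": [(5.0,), (5.0,), (4.0,)],
}

# Hypothetical features extracted from an unknown utterance.
utterance = [(0.0,), (0.0,), (1.0,), (2.0,)]
word, scores = recognise(utterance, REFERENCES)
```

In this sketch a lower score indicates a closer match, so the utterance is recognised as the word with the minimum distance. A statistical model such as a Hidden Markov Model would instead assign a likelihood to the utterance, and the highest-scoring model would be chosen.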
The generation of speech models for use with speech recognition systems is a difficult task. Large amounts of high-quality speech data from many speakers must be collected. The data must then be accurately transcribed and then used to train speech models using computationally intensive algorithms. Some of the speech data is then used to evaluate the recognition accuracy of the generated models. Typically, it is necessary to experiment with the number and complexity of models for a particular application, so there may be many iterations of model training and testing (and possibly data collection) before a final speech model is settled upon.
Typically, in view of the expertise required, the generation of speech models takes place within an acoustic speech recognition research lab. It is, however, desirable that users lacking speech recognition expertise should also be able to develop their own speech models for their own applications.