1. Field of the Invention
This invention relates to speech processing and in particular to speech recognition.
2. Description of Related Art
Developers of speech recognition apparatus have the ultimate aim of producing machines with which a person can interact in a completely natural manner, without constraints. The interface between man and machine would ideally be completely seamless.
This is a vision that is getting closer to achievement but full fluency between man and machine has not yet been achieved. For fluency, an automated recogniser would require an infinite vocabulary of words and would need to be able to understand the speech of every user, irrespective of their accent, enunciation etc. Present technology and our limited understanding of how human beings understand speech make this unfeasible.
Current speech recognition apparatus includes data which relates to the limited vocabulary that the apparatus is capable of recognising. The data generally relates to statistical models or templates representing the words of the limited vocabulary. During recognition an input signal is compared with the stored data to determine the similarity between the input signal and the stored data. If a close enough match is found the input signal is generally deemed to be recognised as that model or template (or sequence of models or templates) which provides the closest match.
The templates or models are generally formed by measuring particular features of input speech. The feature measurements are usually the output of some form of spectral analysis technique, such as a filter bank analyser, a linear predictive coding analysis or a discrete transform analysis. The feature measurements of one or more training inputs corresponding to the same speech sound (i.e. a particular word, phrase etc.) are typically used to create one or more reference patterns representative of the features of that sound. The reference pattern can be a template, derived from some type of averaging technique, or it can be a model that characterises the statistics of the features of the training inputs for a particular sound.
An unknown input is then compared with the reference pattern for each sound of the recognition vocabulary and a measure of similarity between the unknown input and each reference pattern is computed. This pattern classification step can include a global time alignment procedure (known as dynamic time warping, DTW) which compensates for different rates of speaking. The similarity measures are then used to decide which reference pattern best matches the unknown input and hence what is deemed to be recognised.
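The comparison described above can be illustrated with a minimal sketch of DTW-based template matching. The names dtw_distance and recognise, the toy one-dimensional feature vectors and the acceptance threshold are assumptions made for illustration only; a practical recogniser would use spectral feature vectors and would normally apply path constraints.

```python
import math

def dtw_distance(a, b):
    """Accumulated distance of the best time alignment between two
    sequences of feature vectors (each vector is a list of floats)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = local + min(cost[i - 1][j],      # input frame repeated
                                     cost[i][j - 1],      # template frame repeated
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def recognise(unknown, templates, threshold):
    """Return the label of the closest template, or None if even the
    closest match is not close enough (threshold is a hypothetical value)."""
    best_label, best_dist = None, float("inf")
    for label, template in templates.items():
        d = dtw_distance(unknown, template)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= threshold else None

# Toy templates; the second input is the word "yes" spoken more slowly.
templates = {"yes": [[0.0], [1.0], [2.0]], "no": [[5.0], [5.0], [5.0]]}
slow_yes = [[0.0], [0.0], [1.0], [1.0], [2.0]]
print(recognise(slow_yes, templates, threshold=1.0))
```

The slower utterance still aligns with the "yes" template because the warping path is allowed to repeat template frames, which is the compensation for speaking rate referred to above.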
The intended use of the speech recogniser can also determine the characteristics of the system. For instance a system that is designed to be speaker dependent only requires training inputs from a single speaker. Thus the models or templates represent the input speech of a particular speaker rather than the average speech for a number of users. Whilst such a system has a good recognition rate for the speaker from whom the training inputs were received, such a system is obviously not suitable for use by other users.
Speaker independent recognition relies on word models being formed from the speech signals of a plurality of speakers. Statistical models or templates representing all the training speech signals of each particular speech input are formed for subsequent recognition purposes. Whilst speaker independent systems perform relatively well for a large number of users, the performance of a speaker independent system is likely to be low for a user having an accent, intonation, enunciation etc. that differs significantly from the training samples.
In order to extend the acceptable vocabulary, sufficient training samples of the additional vocabulary have to be obtained. This is a time consuming operation, which may not be justified if the vocabulary is changing repeatedly.
It is known to provide speech recognition systems in which the vocabulary that a system is to be able to recognise may be extended by a service provider inputting the additional vocabulary in text form. An example of such a system is Flexword from AT&T. In such a system words are converted from text form into their phonetic transcriptions according to linguistic rules. It is these transcriptions that are used in a recogniser which has acoustic models of each of the phonemes.
The number of phonemes in a language is often a matter of judgement and may depend upon the particular linguist involved. In the English language there are around 40 phonemes as shown in Table 1.
Any reference herein to phonemes or sub-words relates to any convenient building block of words, for instance phonemes, strings of phonemes, allophones etc. The terms phoneme and sub-word are used interchangeably herein and refer to this broader interpretation.
For recognition purposes, a network of the phonemically transcribed text can then be formed from stored models representing the individual phonemes. During recognition, input speech is compared to the strings of reference models representing each allowable word or phrase. The models representing the individual phonemes may be generated in a speaker independent manner, from the speech signals of a number of different speakers. Any suitable models may be used, such as Hidden Markov Models.
Such a system does not make any allowance for deviations from the standard phonemic transcriptions of words, for instance if a person has a strong accent. Thus, even though a user has spoken a word that is in the vocabulary of the system, the input speech may not be recognised as such.
It is desirable to be able to adapt a speaker independent system so that it is feasible for use by a user with a pronunciation that differs from the modelled speaker. European patent application no. 453649 describes such an apparatus in which the allowed words of the apparatus vocabulary are modelled by a concatenation of models representing sub-units of words e.g. phonemes. The “word” models, i.e. the stored concatenations, are then trained to a particular user's speech by estimating new parameters for the word model from the user's speech. Thus known, predefined word models (formed from a concatenation of phoneme models) are adapted to suit a particular user.
Similarly European patent application no. 508225 describes a speech recognition apparatus in which words to be recognised are stored together with a phoneme sequence representing the word. During training a user speaks the words of the vocabulary and the parameters of the phoneme models are adapted to the user's input.
In both of these known systems, a predefined vocabulary is required in the form of concatenated sequences of phonemes. However in many cases it would be desirable for a user to add words to the vocabulary, such words being specific to that user. One known means for providing an actual user with this flexibility involves using speaker dependent technology to form new word models which are then stored in a separate lexicon. The user has to speak each word one or more times to train the system. These speaker dependent models are usually formed using DTW or similar techniques which require relatively large amounts of memory to store each user's templates. Typically, each word for each user would occupy at least 125 bytes (and possibly over 2 kilobytes). This means that with a 20 word vocabulary, between 2.5 and 40 kilobytes must be downloaded into the recogniser before recognition can start. Furthermore, a telephone network based service with just 1000 users would need between 2.5 and 20 Mbytes disc storage just for the users' templates. An example of such a service is a repertory dialler in which a user defines the people he wishes to call, so that subsequently a phone call can be placed by speaking the name of the intended recipient.
European patent application no. 590173 describes a system in which a user, who speaks a word unknown to a recognition system, can correct the word and add this word to the vocabulary of the system. The only described method for making the new word known to the recognition system is by input via a keyboard.
In accordance with the invention a method of generating vocabulary for speech recognition apparatus comprises receiving an input speech signal representing an utterance; generating from each utterance a coded representation identifying from a plurality of reference sub-word representations a sequence of reference sub-word representations which most closely resembles the utterance; and storing the generated coded representation of the utterance for subsequent recognition purposes.
Such a method allows a user to choose new words without the need to form new acoustic models of each of the words, each word or phrase being modelled as a sequence of reference sub-word representations unique to that user. This does not require any previous knowledge regarding the words to be added to the vocabulary, thus allowing a user to add any desired word or phrase.
The coded representations of the words chosen by a user are likely to bear a closer resemblance to the user's spoken speech than models formed from text. In addition, the coded representations require a memory capacity that is at least an order of magnitude less than that needed to store the word representations as DTW models (although this may be at a slight cost in accuracy).
Preferably, the generation of the coded representation is unconstrained by grammatical rules i.e. any sub-word representation can be followed by any other. Alternatively, a bigram grammar may be used which imposes transition probabilities between each pair of sub-words e.g. phonemes. Thus a pair of phonemes that do not usually occur in a given language (for instance P H in the English language) has a low transition probability.
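The difference between unconstrained generation and a bigram grammar can be illustrated with a simplified sketch. It assumes that each frame of input speech has already been scored against every reference sub-word (the frame_scores dictionaries of log-likelihoods below are invented), and it treats each sub-word as a single state rather than as a multi-state model. The function name decode_subwords, the floor probability and the example transition probabilities are all hypothetical.

```python
import math

def decode_subwords(frame_scores, subwords, log_trans=None, floor=1e-6):
    """Viterbi search for the most likely sub-word sequence.

    frame_scores: one dict per frame mapping sub-word -> log-likelihood.
    log_trans:    dict mapping (previous, next) -> log transition probability.
                  None means unconstrained: any sub-word may follow any other
                  at no cost.
    """
    def trans(p, q):
        if p == q or log_trans is None:
            return 0.0  # staying in a sub-word, or no grammar applied
        return log_trans.get((p, q), math.log(floor))

    best = {p: frame_scores[0][p] for p in subwords}
    backpointers = []
    for scores in frame_scores[1:]:
        new_best, ptr = {}, {}
        for q in subwords:
            prev = max(subwords, key=lambda p: best[p] + trans(p, q))
            new_best[q] = best[prev] + trans(prev, q) + scores[q]
            ptr[q] = prev
        best = new_best
        backpointers.append(ptr)

    # Trace back the best path, then merge runs of the same sub-word.
    path = [max(subwords, key=lambda p: best[p])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    coded = [path[0]]
    for p in path[1:]:
        if p != coded[-1]:
            coded.append(p)
    return coded

subwords = ["p", "h", "a"]
frames = [
    {"p": -1.0, "h": -5.0, "a": -5.0},
    {"p": -5.0, "h": -1.0, "a": -1.5},  # acoustically ambiguous frame
    {"p": -5.0, "h": -5.0, "a": -1.0},
]
bigram = {("p", "h"): math.log(0.001),  # P followed by H is rare
          ("p", "a"): math.log(0.5),
          ("h", "a"): math.log(0.5)}
print(decode_subwords(frames, subwords))          # unconstrained
print(decode_subwords(frames, subwords, bigram))  # bigram grammar
```

In this toy input the unconstrained search follows the acoustics alone, whereas the low P H transition probability steers the bigram search away from that pair when the acoustic evidence is ambiguous.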
Coded representations of more than one speech signal representing the same utterance may be generated. Any anomalies in the coded representation will then be accounted for. For instance, if an utterance is made over a noisy telephone line, the coded representation of the utterance may bear little resemblance to the coded representations of the same utterance over a clear telephone line. It may be appropriate to receive three training inputs of an utterance and discard any coded representation that differs significantly from the others. Alternatively all the coded representations may be retained. Whether or not all the coded representations are stored is determined by the developer of the apparatus.
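One possible way of discarding an anomalous coded representation is sketched below. The text does not specify how the difference between coded representations is measured; the edit distance between sub-word sequences, the nearest-neighbour criterion and the max_distance value used here are assumptions chosen for illustration, and the example transcriptions are invented.

```python
def edit_distance(a, b):
    """Number of sub-word insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

def discard_anomalies(representations, max_distance=2):
    """Keep each coded representation that has at least one other
    representation within max_distance edits; discard the rest."""
    if len(representations) < 2:
        return list(representations)
    kept = []
    for i, rep in enumerate(representations):
        nearest = min(edit_distance(rep, other)
                      for j, other in enumerate(representations) if j != i)
        if nearest <= max_distance:
            kept.append(rep)
    return kept

# Three training inputs for one name; the third was badly transcribed,
# as might happen over a noisy line.
training = [
    ["d", "ey", "v"],
    ["d", "ey", "v", "ih"],
    ["t", "uw", "k", "ah", "s"],
]
print(discard_anomalies(training))
```

Setting max_distance to a large value corresponds to the alternative of retaining all the coded representations.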
In accordance with a second aspect of the invention vocabulary generation apparatus comprises deriving means for deriving feature samples from an input speech signal; a sub-word recogniser for generating from each sample of input speech signal a coded representation identifying from a plurality of reference sub-word representations a sequence of reference sub-word representations which most closely resembles the input speech signal; and a store for storing the coded representation of the input speech signal for subsequent recognition purposes.
The apparatus is intended to be associated with a speech recogniser which is configured to recognise the utterances represented by the coded representations. During recognition, the speech recogniser compares unknown input speech signals with the sequences of sub-word representations represented by the coded representations stored in the store and outputs a signal indicative of recognition or otherwise.
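The recognition step can be sketched in the same simplified setting: each stored coded representation is treated as a left-to-right sequence of sub-words, and the unknown input is scored against each stored sequence. The per-frame log-likelihoods, the single-state treatment of each sub-word, the names align_score and recognise and the rejection threshold are assumptions for illustration, not details taken from the described apparatus.

```python
NEG_INF = float("-inf")

def align_score(frame_scores, sequence):
    """Best log-score of aligning the frames, in order, to the sub-words of
    sequence, with every sub-word covering at least one frame."""
    n, m = len(frame_scores), len(sequence)
    if n < m or m == 0:
        return NEG_INF
    best = [NEG_INF] * m  # best[j]: best score ending in sub-word j
    best[0] = frame_scores[0][sequence[0]]
    for t in range(1, n):
        new = [NEG_INF] * m
        for j in range(m):
            prev = max(best[j], best[j - 1] if j > 0 else NEG_INF)
            if prev > NEG_INF:
                new[j] = prev + frame_scores[t][sequence[j]]
        best = new
    return best[m - 1]

def recognise(frame_scores, store, min_score_per_frame):
    """Return the label whose stored coded representation best explains the
    input, or None if no stored sequence scores well enough."""
    if not frame_scores:
        return None
    best_label, best_score = None, NEG_INF
    for label, sequences in store.items():
        for seq in sequences:  # a label may have several representations
            score = align_score(frame_scores, seq)
            if score > best_score:
                best_label, best_score = label, score
    if best_score / len(frame_scores) < min_score_per_frame:
        return None
    return best_label

SUBWORDS = ["d", "ey", "v", "p", "ae", "t", "s"]

def frame(likely):
    """Toy frame in which one sub-word is clearly the best acoustic match."""
    return {p: (-1.0 if p == likely else -6.0) for p in SUBWORDS}

store = {"dave": [["d", "ey", "v"]], "pat": [["p", "ae", "t"]]}
utterance = [frame(p) for p in ["d", "d", "ey", "v"]]
print(recognise(utterance, store, min_score_per_frame=-2.0))
```

Returning None plays the part of the output signal indicating that the input was not recognised.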
Preferably the grammar of the sub-word recogniser is loosely constrained. For instance, the sub-word recogniser may be constrained to recognise any sequence of sub-word units, bounded by line noise. Alternatively a bigram grammar may be used which imposes transition probabilities between each pair of phonemes.
The speech recognition apparatus may be configured to recognise also some pre-defined words. Preferably, the pre-defined words are also stored as coded representations of the sub-word transcriptions of the pre-defined words. The pre-defined words and the words chosen by a user are thus modelled using the same reference sub-words. The speech recogniser may be configured so as to recognise predefined words spoken in conjunction with user selected words.
Preferably the reference sub-word representations represent phonemes. Each sub-word representation may be a statistical model of a plurality of speakers' input speech containing the particular sub-word. Preferably the models are Hidden Markov models although other models may be used.