The present invention relates to voice recognizers generally and to voice recognizers which use LPC vocoder data as input.
Voice recognizers are well known in the art and are used in many applications. For example, voice recognition is used in command and control applications for mobile devices, in computer Dictaphones, in children's toys and in car telephones. In all of these systems, the voice signal is digitized and then parametrized. The parametrized input signal is compared to reference parametrized signals whose utterances are known. The recognized utterance is the utterance associated with the reference signal which best matches the input signal.
Voice recognition systems have found particular use in voice dialing systems where, when a user says the name of the person he wishes to call, the voice recognition system recognizes the name from a previously provided reference list and provides the phone number associated with the recognized name. The telephone then dials the number. The result is that the user is connected to his destination without having to look for the dialed number and/or use his hands to dial the number.
Voice dialing is especially important for car mobile telephones, where the user is typically the driver of the car and thus must continually concentrate on the road. If the driver wants to call someone, it is much safer for the driver to speak the name of the person to be called than to dial the number himself.
FIG. 1, to which reference is now made, shows the major elements of a digital mobile telephone. Typically, a mobile telephone includes a microphone 10, a speaker 12, a unit 14 which converts between analog and digital signals, a vocoder 16 implemented in a digital signal processing (DSP) chip labeled DSP-1, an operating system 18 implemented in a microcontroller or a central processing unit (CPU), a radio frequency interface unit 19 and an antenna 20. On transmit, the microphone 10 generates analog voice signals which are digitized by unit 14. The vocoder 16 compresses the voice samples to reduce the amount of data to be transmitted, via RF unit 19 and antenna 20, to another mobile telephone. The antenna 20 of the receiving mobile telephone provides the received signal, via RF unit 19, to vocoder 16 which, in turn, decompresses the received signal into voice samples. Unit 14 converts the voice samples to an analog signal which speaker 12 projects. The operating system 18 controls the operation of the mobile telephone.
For voice dialing systems, the mobile telephone additionally includes a voice recognizer 22, implemented in a separate DSP chip labeled DSP-2, which receives the digitized voice samples as input, parametrizes the voice signal and matches the parametrized input signal to reference voice signals. The voice recognizer 22 typically either provides the identification of the matched signal to the operating system 18 or, if a phone number is associated with the matched signal, the recognizer 22 provides the associated phone number.
FIG. 2, to which reference is now made, generally illustrates the operation of voice recognizer 22. The digitized voice samples are organized into frames, of a predetermined length such as 5-20 msec, and it is these frames which are provided (step 28) to recognizer 22. For each frame, the recognizer 22 first calculates (step 30) the energy of the frame.
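The framing and energy calculation of steps 28 and 30 can be illustrated by the following minimal sketch. It assumes non-overlapping frames and takes the frame energy to be the sum of squared samples; neither choice is specified above, and the function name `frame_energies` is hypothetical.

```python
def frame_energies(samples, frame_len):
    """Return one energy value per complete frame of `samples`.

    Assumptions (not taken from the text): frames do not overlap,
    energy is the sum of squared samples, and a trailing partial
    frame is discarded.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies
```

At an assumed sampling rate of 8 kHz, a 20 msec frame would correspond to `frame_len = 160`.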
FIG. 3, to which reference is now also made, illustrates the per-frame energy for the spoken word “RICHARD”, as a function of time. The energy signal has two bumps 31 and 33, corresponding with the two syllables of the word. Where no word is spoken, as indicated by reference numeral 35, and even between syllables, the energy level is significantly lower.
Thus, the recognizer 22 searches (step 32 of FIG. 2) for the start and end of a word within the energy signal. The start of a word is defined as the point 37 where a significant rise in energy begins after the energy signal has been low for more than a predetermined length of time. The end of a word is defined as the point 39 where a significant drop in energy finishes after which the energy signal remains low for more than a predetermined length of time. In FIG. 3, the start point 37 occurs at about 0.37 sec and endpoint 39 occurs at about 0.85 sec.
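The word search of step 32 can be sketched as follows, under simplifying assumptions: a single fixed `threshold` separates low from high energy, and the predetermined length of time is expressed as a count of frames, `min_silence`. Both parameters and the function name `find_word` are hypothetical, and the sketch uses "at least" rather than "more than" the predetermined length.

```python
def find_word(energies, threshold, min_silence):
    """Return (start, end) frame indices of the first word, or None.

    A frame is "low" when its energy is below `threshold`.
    The word starts at the first non-low frame that follows at least
    `min_silence` consecutive low frames, and ends at the last non-low
    frame that is followed by at least `min_silence` low frames.
    Short dips, such as the gap between syllables, do not end the word.
    """
    low_run = 0        # number of consecutive low frames seen so far
    start = None       # index of the first frame of the word
    last_high = None   # index of the most recent non-low frame in the word
    for i, energy in enumerate(energies):
        if energy < threshold:
            low_run += 1
            if start is not None and low_run >= min_silence:
                return (start, last_high)
        else:
            if start is None and low_run >= min_silence:
                start = i
            if start is not None:
                last_high = i
            low_run = 0
    return None  # no complete word (start plus trailing silence) was found
```

With 20 msec frames, the start and end points of FIG. 3 (about 0.37 sec and 0.85 sec) would fall near frame indices 18 and 42.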
If a word is found, as checked in step 34, the voice recognizer 22 performs (step 36) a linear prediction coding (LPC) analysis to produce parameters of the spoken word. In step 38, the voice recognizer 22 calculates recognition features of the spoken word and, in step 40, the voice recognizer 22 searches for a match from among recognition features of reference words in a reference library. Alternatively, the voice recognizer 22 stores the recognition features in the reference library, in a process known as “training”.
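The text does not state how the LPC analysis of step 36 is carried out. One common approach, shown here only as an illustrative sketch, is the autocorrelation method solved with the Levinson-Durbin recursion. The function names are hypothetical, and the sign convention assumed is that each sample is predicted as a weighted sum of the preceding samples.

```python
def autocorr(frame, order):
    """Autocorrelation values r[0..order] of one frame of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Return (coefficients, prediction_error) for the given autocorrelation.

    Convention assumed: x[n] is approximated by sum(a[k] * x[n - k])
    for k = 1..order. The frame must not be silent (r[0] must be > 0).
    """
    a = [0.0] * (order + 1)   # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this recursion stage.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err
```

For example, `levinson_durbin(autocorr(frame, 10), 10)` would give a tenth-order predictor for one frame; this computation over sampled voice is the kind of per-frame work that the recognizer 22 must perform.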
Unfortunately, the voice recognition process is computationally intensive and, thus, must be implemented in the second DSP chip, DSP-2. This adds significant cost to the mobile telephone.
An object of the present invention is to provide a voice recognizer which operates with compressed voice data, compressed by LPC-based vocoders, rather than with sampled voice data, thereby to reduce the amount of computation which the recognizer must perform. Accordingly, the voice recognition can be implemented in the microcontroller or CPU which also implements the operating system. Since the voice recognizer does not analyze the voice signal, the microcontroller or CPU can be one of limited processing power and/or one which does not receive the voice signal.
Moreover, the present invention provides a feature generator which can extract the same type of feature data, for use in recognition, from different types of LPC based vocoders. Thus, the present invention performs the same recognition (e.g. matching and training) operations on compressed voice data which is compressed by different types of LPC based vocoders.
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for recognizing a spoken word using linear prediction coding (LPC) based vocoder data without completely reconstructing the voice data. The vocoder based recognizer implements the method described herein. The method includes the steps of generating at least one energy estimate per frame of the vocoder data and searching for word boundaries in the vocoder data using the associated energy estimates. If a word is found, the LPC word parameters are extracted from the vocoder data associated with the word and recognition features are calculated from the extracted LPC word parameters. Finally, the recognition features are matched with previously stored recognition features of other words, thereby to recognize the spoken word.
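The final matching step, together with the storing of features during training, can be illustrated by the sketch below. It is a deliberate simplification: each word is assumed to be represented by one fixed-length feature vector and the best match is the nearest stored vector by Euclidean distance, whereas the text does not specify the feature type or the matching rule. The class `ReferenceLibrary` and its methods are hypothetical.

```python
import math


class ReferenceLibrary:
    """Toy store of recognition features keyed by word (hypothetical)."""

    def __init__(self):
        self._entries = {}

    def train(self, word, features):
        """Store (or overwrite) the recognition features for `word`."""
        self._entries[word] = list(features)

    def match(self, features):
        """Return (best_word, distance); best_word is None if nothing fits."""
        best_word, best_dist = None, math.inf
        for word, ref in self._entries.items():
            if len(ref) != len(features):
                continue  # vectors of different length cannot be compared here
            dist = math.dist(ref, features)
            if dist < best_dist:
                best_word, best_dist = word, dist
        return best_word, best_dist
```

A practical recognizer would typically compare sequences of per-frame features and reject matches whose distance is too large; both refinements are omitted here.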
Additionally, in accordance with a preferred embodiment of the present invention, the energy is estimated from residual data found in the vocoder data. This estimation can be performed in many ways. In one embodiment, the residual data is reconstructed from the vocoder data and the estimate is formed from the norm of the residual data. In another embodiment, a pitch-gain value is extracted from the vocoder data and this value is used as the energy estimate. In a further embodiment, the pitch-gain values, lag values and remnant data are extracted from the vocoder data. A remnant signal is generated from the remnant data and from that, a remnant energy estimate is produced. A non-remnant energy estimate is produced from a non-remnant portion of the residual by using the pitch-gain value and a previous energy estimate defined by the lag value. Finally, the two energy estimates, remnant and non-remnant, are combined.
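Two of these embodiments can be sketched as follows. The first function takes the norm of a reconstructed residual frame. The second combines a remnant and a non-remnant estimate under assumptions that are not stated above: the residual is modeled as the pitch gain times the residual one lag earlier plus the remnant, the cross term between the two parts is ignored, energies are squared quantities kept per subframe, and the lag is mapped to an earlier subframe by rounding. All names and parameters are hypothetical.

```python
import math


def residual_norm_energy(residual):
    """Energy estimate as the Euclidean norm of a reconstructed residual."""
    return math.sqrt(sum(x * x for x in residual))


def combined_energy(remnant, pitch_gain, lag, history, subframe_len):
    """Combine remnant and non-remnant energy estimates for one subframe.

    `history` holds the estimates of earlier subframes, oldest first.
    Assumed model: residual[n] = pitch_gain * residual[n - lag] + remnant[n],
    so the non-remnant energy is pitch_gain**2 times the estimate of the
    subframe that lies `lag` samples back. The cross term is ignored.
    """
    remnant_energy = sum(x * x for x in remnant)
    # Map the lag (in samples) to a whole number of subframes, at least one.
    back = max(1, round(lag / subframe_len))
    previous = history[-back] if back <= len(history) else 0.0
    non_remnant_energy = (pitch_gain ** 2) * previous
    return remnant_energy + non_remnant_energy
```

Because the second function needs only the decoded remnant and two decoded parameters per subframe, it avoids reconstructing the full residual, which is consistent with the aim of reducing computation.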
Moreover, in accordance with a preferred embodiment of the present invention, the vocoder data can be from any of the following vocoders: Regular Pulse Excitation-Long Term Prediction (RPE-LTP) full and half rate, Qualcomm Code Excited Linear Prediction (QCELP) 8 and 13 Kbps, Enhanced Variable Rate Codec (EVRC), Low Delay Code Excited Linear Prediction (LD CELP), Vector Sum Excited Linear Prediction (VSELP), Conjugate Structure Algebraic Code Excited Linear Prediction (CS ACELP), Enhanced Full Rate Vocoder and LPC 10.
There is also provided, in accordance with a further preferred embodiment of the present invention, a digital cellular telephone which includes a mobile telephone operating system, an LPC based vocoder and a vocoder based voice recognizer. The recognizer includes a front end processor which processes the vocoder data to determine when a word was spoken and to generate recognition features of the spoken word, and a recognizer which at least recognizes the spoken word as one of a set of reference words.
Further, in accordance with a preferred embodiment of the present invention, the front end processor includes an energy estimator, an LPC parameter extractor and a recognition feature generator. The energy estimator uses residual information forming part of the vocoder data to estimate the energy of a voice signal. The LPC parameter extractor extracts the LPC parameters of the vocoder data. The recognition feature generator generates the recognition features from the LPC parameters.
Still further, in accordance with a preferred embodiment of the present invention, the front end processor is selectably operable with multiple vocoder types.