Voice recognizers are well known in the art and are used in many applications. For example, voice recognition is used in command and control applications for mobile devices, in computer Dictaphones, in children's toys and in car telephones. In all of these systems, the voice signal is digitized and then parametrized. The parametrized input signal is compared to reference parametrized signals whose utterances are known. The recognized utterance is the utterance associated with the reference signal which best matches the input signal.
Voice recognition systems have found particular use in voice dialing systems where, when a user says the name of the person he wishes to call, the voice recognition system recognizes the name from a previously provided reference list and provides the phone number associated with the recognized name. The telephone then dials the number. The result is that the user is connected to his destination without having to look up the number and/or use his hands to dial it.
Voice dialing is especially important for car mobile telephones, where the user is typically the driver of the car and thus must continually concentrate on the road. If the driver wants to call someone, it is much safer for the driver to speak the name of the person to be called than to dial the number himself.
FIG. 1, to which reference is now made, shows the major elements of a digital mobile telephone. Typically, a mobile telephone includes a microphone 10, a speaker 12, a unit 14 which converts between analog and digital signals, a vocoder 16 implemented in a digital signal processing (DSP) chip labeled DSP-1, an operating system 18 implemented in a microcontroller or a central processing unit (CPU), a radio frequency interface unit 19 and an antenna 20. On transmit, the microphone 10 generates analog voice signals which are digitized by unit 14. The vocoder 16 compresses the voice samples to reduce the amount of data to be transmitted, via RF unit 19 and antenna 20, to another mobile telephone. The antenna 20 of the receiving mobile telephone provides the received signal, via RF unit 19, to vocoder 16 which, in turn, decompresses the received signal into voice samples. Unit 14 converts the voice samples to an analog signal which speaker 12 projects. The operating system 18 controls the operation of the mobile telephone.
For voice dialing systems, the mobile telephone additionally includes a voice recognizer 22, implemented in a separate DSP chip labeled DSP-2, which receives the digitized voice samples as input, parametrizes the voice signal and matches the parametrized input signal to reference voice signals. The voice recognizer 22 typically either provides the identification of the matched signal to the operating system 18 or, if a phone number is associated with the matched signal, the recognizer 22 provides the associated phone number.
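By way of illustration only, the matching and phone number lookup described above can be sketched as follows. The sketch assumes that each utterance has already been reduced to a fixed-length feature vector and that the best match is the reference at the smallest Euclidean distance; neither assumption is stated in the text, and the names best_match and reference_library are hypothetical.

```python
import math


def best_match(features, reference_library):
    """Return (name, phone_number) of the closest reference, or None.

    reference_library maps a name to a (feature_vector, phone_number)
    pair; phone_number may be None when no number is associated.
    Euclidean distance is an assumption made for this sketch only.
    """
    best_name = None
    best_distance = math.inf
    for name, (reference, _phone) in reference_library.items():
        distance = math.sqrt(
            sum((x - y) ** 2 for x, y in zip(features, reference))
        )
        if distance < best_distance:
            best_name, best_distance = name, distance
    if best_name is None:
        return None  # empty library: nothing to match against
    return best_name, reference_library[best_name][1]


# Hypothetical two-entry library with made-up feature vectors.
library = {
    "RICHARD": ([1.0, 0.0, 2.0], "555-0100"),
    "ANNA": ([0.0, 3.0, 1.0], None),
}
match = best_match([0.9, 0.2, 1.8], library)
```

A practical recognizer would normally have to compare utterances of different durations, which a fixed-length distance does not address.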
FIG. 2, to which reference is now made, generally illustrates the operation of voice recognizer 22. The digitized voice samples are organized into frames of a predetermined length, such as 5-20 msec, and it is these frames which are provided (step 28) to recognizer 22. For each frame, the recognizer 22 first calculates (step 30) the energy of the frame.
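The framing and energy calculation of steps 28 and 30 can be sketched as below. The sketch assumes an 8 kHz sampling rate, non-overlapping 10 msec frames and energy defined as the sum of squared samples; these values and the name frame_energies are illustrative assumptions, not details taken from the text.

```python
def frame_energies(samples, sample_rate_hz=8000, frame_ms=10):
    """Split samples into non-overlapping frames and return each frame's energy.

    Energy is taken here as the sum of squared samples (an assumption).
    A trailing partial frame is discarded.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame))
    return energies


# Two full 80-sample frames followed by a partial frame that is dropped.
example = frame_energies([1] * 80 + [2] * 80 + [0] * 40)
```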
FIG. 3, to which reference is now also made, illustrates the per frame energy for the spoken word "RICHARD", as a function of time. The energy signal has two bumps 31 and 33, corresponding to the two syllables of the word. Where no word is spoken, as indicated by reference numeral 35, and even between syllables, the energy level is significantly lower.
Thus, the recognizer 22 searches (step 32 of FIG. 2) for the start and end of a word within the energy signal. The start of a word is defined as the point 37 where a significant rise in energy begins after the energy signal has been low for more than a predetermined length of time. The end of a word is defined as the point 39 where a significant drop in energy finishes after which the energy signal remains low for more than a predetermined length of time. In FIG. 3, the start point 37 occurs at about 0.37 sec and the end point 39 occurs at about 0.85 sec.
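The search of step 32 can be sketched as a simple threshold detector over the per frame energies. The sketch assumes a single fixed energy threshold and measures the "predetermined length of time" in frames; it marks the start at the first frame at or above the threshold rather than at the exact point where the rise begins. The function name find_word and its parameters are hypothetical.

```python
def find_word(energies, threshold, min_silence_frames):
    """Return (start_frame, end_frame) of the first word found, or None.

    A word starts at the first high-energy frame that follows at least
    min_silence_frames low-energy frames, and ends at the last high-energy
    frame that is followed by at least min_silence_frames low-energy frames.
    Shorter dips, such as those between syllables, do not end the word.
    """
    low_run = 0        # consecutive low-energy frames seen so far
    start = None       # index of the first frame of the word
    last_high = None   # most recent high-energy frame inside the word
    for i, energy in enumerate(energies):
        if energy < threshold:
            low_run += 1
            if start is not None and low_run >= min_silence_frames:
                return start, last_high
        else:
            if start is None and low_run >= min_silence_frames:
                start = i
            if start is not None:
                last_high = i
            low_run = 0
    return None  # no complete word (start and end) was found


# Two "syllables" separated by a one-frame dip, bounded by silence.
word = find_word([0, 0, 0, 5, 6, 1, 7, 8, 0, 0, 0, 0],
                 threshold=3, min_silence_frames=3)
```

With 10 msec frames, a returned pair of frame indices converts to times by multiplying by 0.01 sec.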
If a word is found, as checked in step 34, the voice recognizer 22 performs (step 36) a linear prediction coding (LPC) analysis to produce parameters of the spoken word. In step 38, the voice recognizer 22 calculates recognition features of the spoken word and, in step 40, the voice recognizer 22 searches for a match from among recognition features of reference words in a reference library. Alternatively, the voice recognizer 22 stores the recognition features in the reference library, in a process known as "training".
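The LPC analysis of step 36 can be illustrated with the autocorrelation method and the Levinson-Durbin recursion, which is one common way of obtaining LPC parameters. The text does not state which method the recognizer 22 uses, so the method, the predictor order and the function names below are assumptions made for this sketch.

```python
def autocorrelation(frame, max_lag):
    """Return autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc(frame, order):
    """Return (coefficients, residual_energy) for one frame.

    The coefficients a[1..order] predict x[n] as
    sum(a[j] * x[n - j] for j in 1..order), and are obtained with the
    Levinson-Durbin recursion on the frame's autocorrelation.
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0:
            break  # silent or perfectly predictable frame
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error  # reflection coefficient for this stage
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error


# Second-order analysis of a short, made-up frame.
frame = [1.0, 2.0, 3.0, 2.0, 1.0, 0.0, -1.0]
coefficients, residual = lpc(frame, 2)
```

The recognition features of step 38 would then be derived from such parameters; how that derivation is performed is not specified here.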
Unfortunately, the voice recognition process is computationally intensive and, thus, must be implemented in the second DSP chip, DSP-2. This adds significant cost to the mobile telephone.