I. Field of the Invention
The present invention pertains generally to the field of communications, and more specifically to voice templates for speaker-independent voice recognition systems.
II. Background
Voice recognition (VR) represents one of the most important techniques to endow a machine with simulated intelligence to recognize user or user-voiced commands and to facilitate human interface with the machine. VR also represents a key technique for human speech understanding. Systems that employ techniques to recover a linguistic message from an acoustic speech signal are called voice recognizers. The term “voice recognizer” is used herein to mean generally any spoken-user-interface-enabled device. A voice recognizer typically comprises an acoustic processor and a word decoder. The acoustic processor extracts a sequence of information-bearing features, or vectors, necessary to achieve VR of the incoming raw speech. The word decoder decodes the sequence of features, or vectors, to yield a meaningful and desired output format such as a sequence of linguistic words corresponding to the input utterance.
The acoustic processor represents a front-end speech analysis subsystem in a voice recognizer. In response to an input speech signal, the acoustic processor provides an appropriate representation to characterize the time-varying speech signal. The acoustic processor should discard irrelevant information such as background noise, channel distortion, speaker characteristics, and manner of speaking. Efficient acoustic processing furnishes voice recognizers with enhanced acoustic discrimination power. To this end, a useful characteristic to be analyzed is the short time spectral envelope. Two commonly used spectral analysis techniques for characterizing the short time spectral envelope are linear predictive coding (LPC) and filter-bank-based spectral modeling. Exemplary LPC techniques are described in U.S. Pat. No. 5,414,796, which is assigned to the assignee of the present invention and fully incorporated herein by reference, and L. B. Rabiner and R. W. Schafer, Digital Processing of Speech Signals 396-453 (1978), which is also fully incorporated herein by reference.
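As an illustration of the LPC analysis mentioned above, the following is a minimal sketch of the textbook autocorrelation method solved with the Levinson-Durbin recursion. It is not taken from the cited references; the function names `autocorrelation` and `lpc_coefficients`, the sign convention A(z) = 1 + a1 z^-1 + ..., and the use of a plain list of samples as a frame are assumptions made for this example.

```python
def autocorrelation(frame, max_lag):
    """Return autocorrelation values r[0..max_lag] of one speech frame."""
    return [
        sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        for lag in range(max_lag + 1)
    ]


def lpc_coefficients(frame, order):
    """Estimate LPC coefficients with the Levinson-Durbin recursion.

    Returns (a, error), where a[0] == 1.0 and the prediction-error filter is
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order (assumed sign convention).
    """
    r = autocorrelation(frame, order)
    if r[0] == 0:
        raise ValueError("frame has zero energy")
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error
        # Update the lower-order coefficients, then append the new one.
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        error *= 1.0 - k * k
    return a, error
```

For a decaying exponential frame such as `[0.5 ** n for n in range(50)]`, each sample is half the previous one, so the first coefficient comes out close to -0.5 under this sign convention.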
The use of VR (also commonly referred to as speech recognition) is becoming increasingly important for safety reasons. For example, VR may be used to replace the manual task of pushing buttons on a wireless telephone keypad. This is especially important when a user is initiating a telephone call while driving a car. When using a phone without VR, the driver must remove one hand from the steering wheel and look at the phone keypad while pushing the buttons to dial the call. These acts increase the likelihood of a car accident. A speech-enabled phone (i.e., a phone designed for speech recognition) would allow the driver to place telephone calls while continuously watching the road. And a hands-free car-kit system would additionally permit the driver to maintain both hands on the steering wheel during call initiation.
Speech recognition devices are classified as either speaker-dependent or speaker-independent devices. Speaker-dependent devices, which are more common, are trained to recognize commands from particular users. In contrast, speaker-independent devices are capable of accepting voice commands from any user. To increase the performance of a given VR system, whether speaker-dependent or speaker-independent, training is required to equip the system with valid parameters. In other words, the system needs to learn before it can function optimally.
A speaker-dependent VR device typically operates in two phases, a training phase and a recognition phase. In the training phase, the VR system prompts the user to speak each of the words in the system's vocabulary once or twice (typically twice) so the system can learn the characteristics of the user's speech for these particular words or phrases. An exemplary vocabulary for a hands-free car kit might include the digits on the keypad; the keywords “call,” “send,” “dial,” “cancel,” “clear,” “add,” “delete,” “history,” “program,” “yes,” and “no”; and the names of a predefined number of commonly called coworkers, friends, or family members. Once training is complete, the user can initiate calls in the recognition phase by speaking the trained keywords, which the VR device recognizes by comparing the spoken utterances with the previously trained utterances (stored as templates) and taking the best match. For example, if the name “John” were one of the trained names, the user could initiate a call to John by saying the phrase “Call John.” The VR system would recognize the words “Call” and “John,” and would dial the number that the user had previously entered as John's telephone number.
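The recognition phase described above, in which a spoken utterance is compared with stored templates and the best match is taken, can be sketched as follows. This is a simplified illustration only: the Euclidean frame distance, the assumption that the utterance and every template have the same number of frames, and the names `frame_distance`, `utterance_distance`, and `recognize` are all hypothetical choices for this example, not details given in the text.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def utterance_distance(utterance, template):
    """Mean frame-by-frame distance between an utterance and a template.

    Assumes both sequences were already normalized to the same frame count.
    """
    if len(utterance) != len(template):
        raise ValueError("utterance and template must have equal frame counts")
    total = sum(frame_distance(u, t) for u, t in zip(utterance, template))
    return total / len(utterance)


def recognize(utterance, templates):
    """Return the vocabulary word whose template is closest to the utterance."""
    return min(templates, key=lambda word: utterance_distance(utterance, templates[word]))
```

With hypothetical two-frame templates for "call" and "john", an utterance whose features lie near the "call" template is labeled "call". A practical recognizer would also need to handle utterances of differing length, which this sketch deliberately leaves out.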
A speaker-independent VR device also uses a training template that contains a prerecorded vocabulary of a predefined size (e.g., certain control words, the numbers zero through nine, and yes and no). A large number of speakers (e.g., 100) must be recorded saying each word in the vocabulary.
Conventionally, speaker-independent VR templates are constructed by comparing a testing database containing words spoken by a first set of speakers (typically 100 speakers) to a training database containing the same words spoken by a second set of speakers (as many as the first set). One word, spoken by one user, is typically referred to as an utterance. Each utterance of the training database is first time normalized and then quantized (typically vector quantized in accordance with known techniques) before being tested for convergence with the utterances of the testing database. However, the time normalization technique relies upon information obtained only from individual frames (periodic segments of an utterance) having maximum differences from the previous frame. It would be advantageous to provide a method for building speaker-independent VR templates that uses more of the information in a given utterance. It would be further desirable to increase the accuracy, or convergence, of conventional techniques for building speaker-independent VR templates based upon the type of utterance. Thus, there is a need for a method of constructing speaker-independent speech recognition templates that provides enhanced accuracy and uses a greater amount of information in the utterances.
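The vector quantization step mentioned above is not specified further in the text. As one commonly used possibility, the sketch below trains a codebook with a plain k-means (Lloyd) iteration and maps a vector to its nearest codebook entry. The function names, the evenly spaced initialization, and the iteration limit are assumptions made for this illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(vector, codebook):
    """Index of the codebook entry closest to the given vector."""
    return min(range(len(codebook)), key=lambda i: squared_distance(vector, codebook[i]))


def train_codebook(vectors, codebook_size, iterations=20):
    """Build a codebook by k-means, starting from evenly spaced input vectors."""
    if not 1 <= codebook_size <= len(vectors):
        raise ValueError("codebook_size must be between 1 and len(vectors)")
    step = len(vectors) / codebook_size
    codebook = [list(vectors[int(i * step)]) for i in range(codebook_size)]
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        clusters = [[] for _ in codebook]
        for v in vectors:
            clusters[nearest_index(v, codebook)].append(v)
        # Recompute each centroid; keep the old one if its cluster is empty.
        updated = []
        for centroid, members in zip(codebook, clusters):
            if members:
                dim = len(members[0])
                updated.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
            else:
                updated.append(centroid)
        if updated == codebook:
            break
        codebook = updated
    return codebook
```

Once trained, `nearest_index(vector, codebook)` gives the quantized representation of any new feature vector.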
The present invention is directed to a method of constructing speaker-independent speech recognition templates that provides enhanced accuracy and uses a greater amount of information in the utterances. Accordingly, in one aspect of the invention, a method of creating speech templates for use in a speaker-independent speech recognition system is provided. The method advantageously includes segmenting each utterance of a first plurality of utterances to generate a plurality of time-clustered segments for each utterance, each time-clustered segment being represented by a spectral mean; quantizing the plurality of spectral means for all of the first plurality of utterances to generate a plurality of template vectors; comparing each one of the plurality of template vectors with a second plurality of utterances to generate at least one comparison result; matching the first plurality of utterances with the plurality of template vectors if the at least one comparison result exceeds at least one predefined threshold value, to generate an optimal matching path result; partitioning the first plurality of utterances in time in accordance with the optimal matching path result; and repeating the quantizing, comparing, matching, and partitioning until the at least one comparison result does not exceed any at least one predefined threshold value.
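To make the first step of the method concrete, the sketch below represents an utterance as a list of per-frame spectral vectors, splits it into contiguous segments in time, and represents each segment by the mean of its frames. The text does not state here how segment boundaries are chosen, so the near-equal-length split used below is a stand-in assumption, and the name `segment_means` is hypothetical.

```python
def segment_means(frames, num_segments):
    """Split frames into contiguous segments and return each segment's mean.

    frames: list of equal-length spectral vectors, one per frame.
    num_segments: number of time segments (assumed to be given).
    Boundaries are placed at near-equal intervals, an assumption of this sketch.
    """
    n = len(frames)
    if num_segments < 1 or n < num_segments:
        raise ValueError("need at least one frame per segment")
    means = []
    for s in range(num_segments):
        start = s * n // num_segments
        end = (s + 1) * n // num_segments
        segment = frames[start:end]
        dim = len(segment[0])
        means.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return means
```

The spectral means from all training utterances would then be pooled and quantized to form the template vectors, and the segment boundaries would be revised on later iterations according to the matching path rather than kept at equal intervals.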