1. Technical Field
The present invention relates generally to speech recognition. More particularly, the present invention is directed to a system and method for accurately recognizing continuous human speech from any speaker.
2. Background Information
Linguists, scientists and engineers have endeavored for many years to construct machines that can recognize human speech. Although in recent years this goal has begun to be realized in certain respects, currently available systems have not been able to produce results that even closely emulate human performance. This inability to provide satisfactory speech recognition is due primarily to the difficulties involved in extracting and identifying the individual sounds that make up human speech. These difficulties are exacerbated by the wide acoustic variations that occur between different speakers.
In simple terms, speech may be considered as a sequence of sounds taken from a set of forty or so basic sounds called "phonemes." Different sounds, or phonemes, are produced by varying the shape of the vocal tract through muscular control of the speech articulators (lips, tongue, jaw, etc.). A particular sequence of phonemes collectively represents a word or a phrase. Thus, extraction of the particular phonemes contained within a speech signal is necessary to achieve speech recognition.
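By way of illustration only, the relationship between a phoneme sequence and a word can be sketched as a simple lookup. The phoneme symbols, the three-entry lexicon, and the function name below are hypothetical and are not part of the described system; the sketch also assumes that the phonemes have already been extracted and that word boundaries are known, which, as discussed in this section, is generally not the case for continuous speech.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical mini-lexicon for illustration only: each word is keyed by
# the ordered sequence of phoneme symbols that represents it.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("K", "AE", "T"): "cat",
    ("D", "AO", "G"): "dog",
    ("S", "P", "IY", "CH"): "speech",
}


def phonemes_to_word(phonemes: List[str]) -> Optional[str]:
    """Return the word for an exact phoneme sequence, or None if unknown."""
    return LEXICON.get(tuple(phonemes))


if __name__ == "__main__":
    print(phonemes_to_word(["K", "AE", "T"]))  # a sequence in the lexicon
    print(phonemes_to_word(["K", "AE"]))       # an incomplete sequence
```

The lookup succeeds only on an exact match, which is why errors in phoneme extraction translate directly into recognition errors.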
However, a number of factors make phoneme extraction extremely difficult. For instance, wide acoustic variations occur when the same phoneme is spoken by different speakers. This is due to differences in the speakers' vocal apparatus, such as vocal-tract length. Moreover, the same speaker may produce acoustically different versions of the same phoneme from one rendition to the next. Also, there are often no identifiable boundaries between sounds or even between words. Other difficulties result from the fact that phonemes are spoken with wide variations in dialect, intonation, rhythm, stress, volume, and pitch. Finally, the speech signal may contain wide variations in speech-related noises that make it difficult to accurately identify and extract the phonemes.
Currently available speech recognition devices attempt to minimize the above problems and variations by providing only a limited number of functions and capabilities. For instance, many existing systems are classified as "speaker-dependent" systems. A speaker-dependent system must be "trained" to a single speaker's voice by obtaining and storing a database of patterns for each vocabulary word uttered by that particular speaker. The primary disadvantage of such systems is that they are "single speaker" systems, and can only be utilized by the speaker who has completed the time-consuming training process. Further, the vocabulary of such systems is limited to the specific words contained in the database. Finally, these systems typically cannot recognize naturally spoken continuous speech, and require the user to pronounce words separated by distinct periods of silence.
Currently available "speaker-independent" systems are also severely limited in function. Although any speaker can use the system without the need for training, these systems can only recognize words from an extremely small vocabulary. Further, they too require that the words be spoken in isolation with distinct pauses between words, and thus cannot recognize naturally spoken continuous speech.