1. Field of the Invention
The present invention relates generally to speech recognition, and more particularly to real-time speech recognition for recognizing speaker-independent, connected or continuous speech.
2. Description of the Background Art
Speech recognition refers to the ability of a machine or device to receive, analyze and recognize human speech. Speech recognition is also often referred to as voice recognition. Speech recognition may potentially allow humans to interface with machines and devices in an easy, quick, productive and reliable manner. Accurate and reliable speech recognition is therefore highly sought after.
Speech recognition gives humans the capability of verbally generating documents, recording or transcribing speech, and audibly controlling devices. Speech recognition is desirable because speech occurs at a much faster rate than manual operations, such as typing on a keyboard or operating controls. A good typist can type about 80 words per minute, while typical speech can be in the range of about 200 or more words per minute.
In addition, speech recognition can allow remote control of electronic devices. Many applications exist for impaired persons who cannot operate conventional devices, such as persons who are at least partially paralyzed, blind, or medicated. For example, a computer or computer operated appliances could be speech controlled.
Moreover, speech recognition may be used for hands-free operation of conventional devices. For example, one current application is the use of speech recognition for operating a cellular phone, such as in a vehicle. This may be desirable because the driver's attention should stay on the road.
Speech recognition processes and speech recognition devices currently exist. However, there are several difficulties that have prevented speech recognition from becoming practical and widely available. The main obstacle has been the wide variation in speech between persons. Different speakers have different speech characteristics, making speech recognition difficult or at best not satisfactorily reliable. For example, useful speech recognition must be able to identify not only words but also small word variations. Speech recognition must be able to differentiate between homonyms by using context. Speech recognition must be able to recognize silence, such as gaps between words. This may be difficult if the speaker is speaking rapidly and running words together. Speech recognition systems may have difficulty adjusting to changes in the pace or volume of speech, and may be frustrated by accents or brogues that affect the speech.
Speech recognition technology has existed for some time in the prior art, and has become fairly reasonable in price. However, it has not yet achieved satisfactory reliability and is therefore not widely used. For example, devices and methods currently exist that capture speech and convert it into text, but they generally require extensive training and make too many mistakes.
FIG. 1 shows a representative audio signal in a time domain. The audio signal is generated by capture and conversion of an audio stream into an electronic voice stream signal, usually through a microphone or other sound transducer. Generally, audible sound exists in the range of about 20 hertz (Hz) to about 20 kilohertz (kHz). Speech occupies a smaller subset of these frequencies. The electronic voice stream signal may be filtered and amplified and is generally digitized for processing.
FIG. 2 shows the voice stream after it has been converted from the time domain into the frequency domain. Conversion to the frequency domain offers advantages over the time domain. Human speech is generated by the mouth and the throat, and contains many different harmonics (it is generally not composed of a single component frequency). The audible speech signal of FIG. 2, therefore, is composed of many different frequency components at different amplitude levels. In the frequency domain, the speech recognition device may be able to more easily analyze the voice stream and detect meaning based on the frequency components of the voice stream.
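The time-to-frequency conversion described above can be illustrated with a minimal sketch. It is not taken from any particular prior art device: the 8 kHz sample rate, the 400-sample frame, the synthetic two-tone "voiced" frame, and the function name dft_magnitudes are all assumptions chosen for illustration, and a naive discrete Fourier transform stands in for whatever transform a real device would use.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive O(n^2) discrete Fourier transform.

    Returns the normalized magnitude of each frequency bin from 0 Hz
    up to the Nyquist frequency.
    """
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        magnitudes.append(abs(acc) / n)
    return magnitudes

SAMPLE_RATE = 8000  # Hz (assumed for this sketch)
FRAME_LEN = 400     # samples per frame, i.e. 50 ms (assumed)

# Synthetic frame: a 200 Hz fundamental plus a weaker harmonic at 400 Hz.
frame = [math.sin(2 * math.pi * 200 * t / SAMPLE_RATE)
         + 0.5 * math.sin(2 * math.pi * 400 * t / SAMPLE_RATE)
         for t in range(FRAME_LEN)]

mags = dft_magnitudes(frame)
bin_hz = SAMPLE_RATE / FRAME_LEN  # width of one frequency bin in Hz

# The two strongest bins correspond to the two component frequencies.
strongest = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
peak_freqs = sorted(k * bin_hz for k in strongest)
print(peak_freqs)
```

In the time domain the two components are superimposed in a single waveform; in the frequency domain they appear as two separate peaks whose heights reflect their relative amplitudes.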
FIG. 3 shows how the digitized frequency domain response may be digitally represented and stored. Each digital level may represent a frequency or band of frequencies. For example, if the input voice stream is in the range of 1 kilohertz (kHz) to 10 kHz, and is separated into 128 frequency spectrum bands, each band (and corresponding frequency bin) would contain a digital value or amplitude for about 70 Hz of the speech frequency spectrum. This value may be varied in order to accommodate different portions of the audible sound spectrum. Speech does not typically employ all of the frequencies in the audible frequency range of 20 Hz to 20 kHz. Therefore, a speech recognition device may analyze only the frequencies from 1 kHz to 10 kHz, for example.
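The band arithmetic in the example above can be checked with a short sketch. The function names band_width and band_index are hypothetical, and equal-width bands are an assumption; the text says only that the spectrum is separated into 128 bands.

```python
LOW_HZ = 1_000     # lower edge of the analyzed range (from the example)
HIGH_HZ = 10_000   # upper edge of the analyzed range (from the example)
NUM_BANDS = 128    # number of frequency bins (from the example)

def band_width(low_hz, high_hz, num_bands):
    """Width in Hz of each band, assuming equal-width bands."""
    return (high_hz - low_hz) / num_bands

def band_index(freq_hz, low_hz, high_hz, num_bands):
    """Index of the bin that would hold freq_hz, or None if out of range."""
    if not (low_hz <= freq_hz < high_hz):
        return None
    idx = int((freq_hz - low_hz) / band_width(low_hz, high_hz, num_bands))
    return min(idx, num_bands - 1)  # guard against float rounding at the top edge

width = band_width(LOW_HZ, HIGH_HZ, NUM_BANDS)
print(width)                                           # 9000 / 128 Hz per band
print(band_index(1_000, LOW_HZ, HIGH_HZ, NUM_BANDS))   # lowest bin
print(band_index(9_999, LOW_HZ, HIGH_HZ, NUM_BANDS))   # highest bin
print(band_index(500, LOW_HZ, HIGH_HZ, NUM_BANDS))     # outside the range
```

Under these assumptions each band spans 9000 / 128 = 70.3125 Hz, which agrees with the figure of about 70 Hz given above.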
Once the voice stream has been converted to the frequency domain, an iterative statistical look-up may be performed to determine the parts of speech in a vocalization. These parts are called phonemes, the smallest units of sound in any particular language. Various languages use phonemes that are not utilized in any other language. The English language designates about 34 different phonemes. The iterative statistical look-up employed by the prior art usually uses hidden Markov modeling (HMM) to statistically compare and determine the phonemes. The iterative statistical look-up compares multiple portions of the voice stream to stored phonemes in order to try to find a match. This generally requires multiple comparisons between a digitized sample and a phoneme database, and therefore a high computational workload. By finding these phonemes, the speech recognition device can create a digital voice stream representation that represents the original vocalizations in a digital, machine-usable form.
The main difficulty encountered in the prior art is that different speakers speak at different rates and therefore the phonemes may be stretched out or compressed. There is no standard length phoneme that a speech recognition device can look for. Therefore, during the comparison process, these time-scale differences must be compensated for.
In the prior art, the time-scale differences are compensated for by using one of two approaches. In the first approach, the dynamic time warping process, statistical modeling stretches or compresses the waveform in order to find a best-fit match of a digitized voice stream segment to a set of stored spectral patterns or templates. The dynamic time warping process uses a procedure that dynamically alters the time dimension to minimize the accumulated distance score for each template.
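The accumulated distance minimization performed by dynamic time warping can be sketched as follows. This is the textbook dynamic-programming formulation, shown under simplifying assumptions: each frame is a single number rather than a spectral vector, the local distance is an absolute difference, and the template names and values are invented for illustration.

```python
def dtw_distance(segment, template):
    """Minimum accumulated distance between two sequences under time warping."""
    inf = float("inf")
    n, m = len(segment), len(template)
    # cost[i][j] = best accumulated distance aligning the first i frames of
    # the segment with the first j frames of the template.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(segment[i - 1] - template[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch the template
                                     cost[i][j - 1],      # compress the template
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def best_template(segment, templates):
    """Name of the stored template with the lowest accumulated distance."""
    return min(templates, key=lambda name: dtw_distance(segment, templates[name]))

# Hypothetical stored templates (one scalar feature per frame).
templates = {"rise": [1, 2, 3, 4], "fall": [4, 3, 2, 1]}

# The same "rise" pattern spoken at half speed: every frame is doubled.
slow_rise = [1, 1, 2, 2, 3, 3, 4, 4]

print(best_template(slow_rise, templates))
print(dtw_distance(slow_rise, templates["rise"]))
```

The slowed-down input still matches the "rise" template with zero accumulated distance because the warping path may map several input frames onto one template frame. The nested loops also show why the approach is costly: every segment requires a full grid of comparisons against every template.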
In a second prior art approach, the hidden Markov model (HMM) method characterizes speech as a plurality of statistical chains. The HMM method creates a statistical, finite-state Markov chain for each vocabulary word during training. The HMM method then computes the probability of generating the state sequence for each vocabulary word. The word with the highest accumulated probability is selected as the correct identification. Under the HMM method, time alignment is obtained indirectly through the sequence of states.
The prior art speech recognition approaches have drawbacks. One drawback is that the prior art approach is not sufficiently accurate due to the many variations between speakers. The prior art speech recognition suffers from mistakes and may produce an output that does not quite match what was spoken.
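The per-word scoring described above can be sketched with the standard forward algorithm, which accumulates the probability of an observation sequence over all state paths of a word model. Everything concrete here is an assumption made for illustration: the two-word vocabulary, the two-state left-to-right models, the discrete symbols "lo" and "hi", and all probability values. A trained system would estimate these parameters from data.

```python
def forward_probability(observations, start, trans, emit):
    """Probability that the model generates the observations (forward algorithm)."""
    states = range(len(start))
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [start[s] * emit[s][observations[0]] for s in states]
    for symbol in observations[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in states) * emit[s][symbol]
                 for s in states]
    return sum(alpha)

def recognize(observations, word_models):
    """Vocabulary word whose model gives the highest accumulated probability."""
    return max(word_models,
               key=lambda w: forward_probability(observations, *word_models[w]))

# Left-to-right chain: stay in state 0 or move to state 1, then stay there.
START = [1.0, 0.0]
TRANS = [[0.6, 0.4],
         [0.0, 1.0]]

# Hypothetical word models as (start, transition, emission) tuples.
word_models = {
    "up":   (START, TRANS, [{"lo": 0.9, "hi": 0.1}, {"lo": 0.1, "hi": 0.9}]),
    "down": (START, TRANS, [{"lo": 0.1, "hi": 0.9}, {"lo": 0.9, "hi": 0.1}]),
}

fast = ["lo", "hi"]                    # quickly spoken
slow = ["lo", "lo", "hi", "hi", "hi"]  # the same pattern, drawn out

print(recognize(fast, word_models))
print(recognize(slow, word_models))
```

Both the fast and the slow utterance score highest against the same word model. The self-transitions let a state absorb any number of consecutive frames, which is the sense in which time alignment is obtained indirectly through the sequence of states.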
Another drawback is that the prior art method is computationally intensive. Both the dynamic time warping approach and the HMM statistical approach require many comparisons in order to find a match and many iterations in order to temporally stretch or compress the digitized voice stream sample to fit samples in the phoneme database.
There have been many attempts in the prior art to increase speech recognition accuracy and/or to decrease computational time. One way of somewhat reducing computational requirements and increasing accuracy is to limit the library of phonemes and/or words to a small set and ignore all utterances not in the library. This is acceptable for applications requiring only limited speech recognition capability, such as operating a phone where only a limited number of vocal commands are needed. However, it is not acceptable for general uses that require a large vocabulary (i.e., normal conversational speech).
Another prior art approach is speaker-dependent speech recognition, wherein the speech recognition device is trained to a particular person's voice. Therefore, only the particular speaker is recognized, and that speaker must go through a training or “enrollment” process of reading or inputting a particular speech into the speech recognition device. A higher accuracy is achieved without increased cost or increased computational time. The drawback is that speaker-dependent voice recognition is limited to one person, requires lengthy training periods, may require a lot of computation cycles, and is limited to applications where the speaker's identity is known a priori.
What is needed, therefore, are improvements in speech recognition technology.