Speech recognition, the machine translation of spoken utterances into a stream of recognized words or phrases, has received considerable attention from researchers in recent years. In the last decade, speech recognition systems have improved enough to become available to an ever larger number of consumers in the marketplace.
A number of applications utilizing speech recognition technology are currently being implemented in the telephone network environment, including the digital cellular network environment. For example, a telephone user's spoken commands may now determine call routing or how a call is billed (e.g., "collect call please" or "calling card"). Similarly, a telephone user may transact business by dialing a merchant's automated system and speaking a credit card number instead of dialing one. Further and future use of speech recognition technology in the digital cellular environment could enhance service in many ways.
The Internet, which has also grown and become more popular in recent years, provides another environment in which subscribers may benefit extensively from further use of speech recognition technology. For example, in the future, commercially available systems may allow a user at a remote station to specify, via voice commands, instructions which are then transmitted to an Internet host and executed.
However, Internet connection lines and digital cellular channels have limited transmission capacity with respect to audio or real-time speech. As a result, applications which involve real-time processing of large amounts of speech data over these media will often require data compression (or data encoding) prior to transmission. For example, the low bandwidth of the digital cellular medium requires voice data compression at ratios from 5-to-1 to 10-to-1, depending on the compression algorithm used. Compression algorithms used in some Internet browsers operate in this range as well.
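As a rough illustration of what these ratios mean in practice, the following sketch computes the compressed bit rate for an assumed source rate. The 64,000 bit/s figure (8 kHz sampling at 8 bits per sample, the conventional rate for digitized telephone speech) is an assumption chosen for the example and is not stated above; the function name `compressed_bitrate` is likewise hypothetical.

```python
def compressed_bitrate(source_bps: float, ratio: float) -> float:
    """Return the bit rate after compressing a stream by `ratio`-to-1."""
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return source_bps / ratio


# Assumed source (not from the text): 8,000 samples/s * 8 bits/sample.
SOURCE_BPS = 8000 * 8

# The two ends of the 5-to-1 to 10-to-1 range mentioned above.
for ratio in (5, 10):
    rate = compressed_bitrate(SOURCE_BPS, ratio)
    print(f"{ratio}-to-1 compression: {rate:.0f} bit/s")
```

Under this assumption, the stated range corresponds to a compressed stream of roughly 6,400 to 12,800 bit/s.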
Thus, in the network environment, voice data must often be compressed prior to transmission. Once the data reaches a speech recognition engine at a remote site, the limited network bandwidth is no longer a factor. Therefore, it is common practice to decompress (or decode and reconstruct) the voice data at that point to obtain a digital representation of the original acoustic signal (i.e., a waveform). The waveform can then be processed as though it had originally been generated at the remote site. This procedure (i.e., compress-transmit-decompress) allows speech recognition applications to be implemented in the network environment and overcomes issues relating to bandwidth limitation.
However, there are a number of disadvantages associated with this procedure. In particular, it generally involves redundant processing steps, as some of the work done during compression is repeated by the recognition "front-end" processing.
Much of the speech compression done today is performed by "vocoders." Rather than create a compressed, digital approximation of the speech signal (i.e., an approximation of a waveform representation), vocoders instead construct digital approximations of components or characteristics of speech implied by a given speech model. For example, a model may define speech in terms of the frequency of vocal cord movement (pitch), the intensity or loudness of vocal cord movement (energy), and the resonance of the vocal tract (spectrum). The vocoding algorithm then applies signal processing techniques to the speech signal, retaining only specific signal components, including those that measure pitch, energy and spectral speech characteristics.
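The kind of per-frame decomposition described above can be sketched as follows. This is a minimal illustration, not the algorithm of any particular vocoder: it assumes autocorrelation-based pitch estimation and linear predictive coding (LPC, solved with the Levinson-Durbin recursion) as stand-ins for the pitch and spectral analyses, and all function names, the 60-400 Hz pitch search range, and the test signal are hypothetical choices made for the example.

```python
import math


def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], the short-time autocorrelation of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def frame_energy(frame):
    """Energy component: mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def estimate_pitch_hz(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Pitch component: lag of the strongest autocorrelation peak.

    Only meaningful for voiced (periodic) frames; the search range
    fmin..fmax is an assumed range for speech.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    r = autocorrelation(frame, max_lag)
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: r[lag])
    return sample_rate / best_lag


def lpc_coefficients(frame, order):
    """Spectral component: LPC predictor a[1..order] via Levinson-Durbin,
    such that x[n] is approximated by sum(a[k] * x[n - k])."""
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err == 0.0:
            break  # silent or perfectly predicted frame; stop early
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:]


def analyze_frame(frame, sample_rate, lpc_order=10):
    """Reduce one frame of samples to pitch, energy and spectral parameters."""
    return {
        "pitch_hz": estimate_pitch_hz(frame, sample_rate),
        "energy": frame_energy(frame),
        "lpc": lpc_coefficients(frame, lpc_order),
    }


# Hypothetical test signal: 30 ms of a 100 Hz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 100 * i / SAMPLE_RATE) for i in range(240)]
params = analyze_frame(frame, SAMPLE_RATE, lpc_order=2)
print(params)
```

In this sketch, the 240 samples of the frame are replaced by a handful of parameters, which is the sense in which a vocoder transmits characteristics of speech rather than an approximation of the waveform itself.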
In similar fashion, a speech recognition system operates by applying signal processing techniques to extract spectral and energy information from a stream of incoming speech data. To generate a recognition result, the extracted speech components are converted into a "feature" and then used in the alignment sub-system, in which the incoming feature is compared to the representative features of the models.
Thus, when vocoded speech data is reconstructed into a waveform signal (decompressed) prior to speech recognition processing, speech components (or features) are effectively computed twice. First, during compression, the vocoder will decompose the original (digitized) signal into speech components. Then, during recognition processing, if the incoming data is a reconstructed waveform, the recognition facility must again extract the same or similar features from the reconstructed signal.
Obviously, this procedure is not optimally efficient. This is particularly true when the step of determining features from the reconstructed signal (i.e., its digital representation) involves significant computational resources and added processing time.