1. Technical Field
The present invention relates to a system and method of speech processing; in particular, a symbiotic system and method for Automatic Speech Recognition and VoCoding. Hereinafter, the two functions shall be respectively abbreviated as "ASR" and "VC".
2. Discussion of Related Art
"VoCoding" means speech compression and speech regeneration in the field of speech processing. There exist numerous VC concepts and systems that convey a high-quality approximation of human speech for communication over a Channel that has very narrow bandwidth or is capable of handling only a low data rate.
FIG. 1 shows a simplified view of the prior art of VC and its context. Original Speech Sounds 100 are transformed by a Microphone 102, which feeds an Analogue-To-Digital Converter 104. This produces a signal (hereinafter abbreviated as "Sg.") called Original PCM 106, which is a wideband digital representation of speech using "Pulse Code Modulation" or PCM. This feeds a Speech Compressor 108, which produces Sg. Compressed Speech 110, which has a much narrower bandwidth. This is communicated, or stored and recalled, via a Channel 120 with limited bandwidth. Upon receipt of the transmitted signal from the Channel, the Compressed Speech 122 feeds a Speech Regenerator. This produces a Sg. Regenerated PCM 124, which approximates the Sg. Original PCM. The Sg. Regenerated PCM feeds a Digital-to-Analogue Converter 126, which produces an analogue signal, which feeds an Amplifier 128, which feeds a Loudspeaker or Earphone 130, which emits Regenerated Speech Sounds 140, which approximate the Original Speech Sounds 100.
One very widely used VC family is based on "waveform coding". For example, the Speech Compressor measures the serial correlation vector (auto-correlation vector) in the Sg. Original PCM. This correlation is expressed by a vector of Linear Prediction Coefficients (LPC). Also the Speech Compressor measures a "residual" signal which summarizes information not captured by the LPC. This LPC vector and residual are used in the Compressed Speech. A more complicated version is widely used for cellular telephony under the European standard "GSM-610".
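As an illustration of the LPC step described above, the following is a minimal Python sketch, not the method of any particular vocoder or of GSM. It assumes the textbook autocorrelation method solved by the Levinson-Durbin recursion; the function names (autocorrelation, levinson_durbin, residual) and the predictor order are hypothetical choices made for this example.

```python
def autocorrelation(x, max_lag):
    """Serial correlation r[0..max_lag] of one frame of PCM values."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (coeffs, error) with prediction x_hat[n] = sum(coeffs[k] * x[n-1-k])."""
    a = [0.0] * (order + 1)          # a[0] is unused; a[1..order] are the coefficients
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:               # degenerate (e.g. all-zero) frame: stop early
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)         # remaining prediction error energy
    return a[1:], err


def residual(x, coeffs):
    """The part of the signal that the linear predictor does not capture."""
    out = []
    for n in range(len(x)):
        pred = sum(c * x[n - 1 - k] for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        out.append(x[n] - pred)
    return out


# Usage: a decaying exponential is predicted almost perfectly by one coefficient.
frame = [0.9 ** n for n in range(200)]
lpc, _ = levinson_durbin(autocorrelation(frame, 2), 2)
res = residual(frame, lpc)
```

In this toy case the first coefficient comes out close to 0.9 and the residual is nearly zero after the first sample, which is why transmitting the LPC vector plus a compact residual can need far fewer bits than the PCM itself.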
Another very widely used, large VC family is based on "sub-band coding". The Speech Compressor analyzes the Sg. Original PCM into a number of frequency bands, and produces a vector containing the spectrum. Also, the Speech Compressor measures a "residual" signal which summarizes information not captured by this spectrum.
Another VC technique is based on the "Mel-Cepstrum Vector", described in "Cepstral Analysis Synthesis on the Mel Frequency Scale", by S. Imai, in "Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 1983 in Boston" (ICASSP'83), pp. 93-96, as published by the IEEE. The disclosure therein is incorporated by reference herein.
In Imai, there is a Sg. Original PCM with about 10 k PCM values/sec. The Feature Extractor processes this as overlapping frames, at the rate of about 100 frames/sec, each containing about 200 PCM values/frame. For each frame, a Fourier Transform produces a Spectrum Vector of complex numbers. Each complex number is transformed by the complex Magnitude Function, followed by the real Logarithm Function. This produces a "Log Spectrum" of about 100 real values. This vector is "warped" to compensate for the non-uniform frequency sensitivity of human hearing. This warped vector is smoothed, and transformed by the inverse Fourier Transform. This produces a Mel-Cepstrum Vector.
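The framing arithmetic above (about 10 k values/sec, 100 frames/sec, 200 values/frame, about 100 log-spectrum values) can be sketched in Python as follows. This is only an illustration under stated assumptions: a hop of 100 samples is inferred from the quoted rates, a plain unwindowed discrete Fourier transform is used, and the names split_into_frames and log_spectrum are hypothetical. The warping, smoothing, and inverse transform are not shown because the passage does not specify them.

```python
import cmath
import math


def split_into_frames(pcm, frame_len=200, hop=100):
    """Overlapping frames: 10,000 values/sec with a hop of 100 gives about 100 frames/sec."""
    return [pcm[s:s + frame_len] for s in range(0, len(pcm) - frame_len + 1, hop)]


def log_spectrum(frame, floor=1e-10):
    """Fourier transform, then magnitude, then logarithm, for the lower half of the bins."""
    n = len(frame)
    out = []
    for k in range(n // 2):                       # 200 samples -> 100 real values
        c = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        out.append(math.log(max(abs(c), floor)))  # floor avoids log(0); an assumption
    return out


# Usage: one second of a 1 kHz tone sampled at 10 kHz.
pcm = [math.sin(2 * math.pi * 1000 * t / 10000) for t in range(10000)]
frames = split_into_frames(pcm)
spectrum = log_spectrum(frames[0])
peak_bin = max(range(len(spectrum)), key=lambda k: spectrum[k])
```

With 200 samples per frame at 10 kHz, each bin spans 50 Hz, so the 1 kHz tone lands in bin 20.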
The system described by Imai also determines the fundamental pitch for voiced speech. In parallel with the calculation of the Mel-Cepstrum Vector, the Log Spectrum feeds an Inverse Fourier Transform, which produces a "Cepstrum Vector". The frequency of the maximum of this Cepstrum Vector is used to estimate the fundamental Pitch for voiced speech.
This Mel-Cepstrum Vector and Pitch together form the Compressed Speech of the Vocoder described by Imai. This feeds the Regenerator, which includes an Excitation Generator and a Rapidly Adjustable Filter. The Pitch is fed into the Excitation Generator, which produces an Excitation Signal, which drives the Rapidly Adjustable Filter, which produces Regenerated PCM. In a parallel calculation, the Mel-Cepstrum Vector is used to adjust the Rapidly Adjustable Filter. This Filter is designed so that its transfer function matches the given Mel-Cepstrum Vector. This Filter uses many forward and recursive (FIR and IIR) difference equations. These are based on a Padé approximation to the exponential function, and on a Mel-warping operator.
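A minimal Python sketch of this kind of cepstral pitch estimate follows. It is not Imai's implementation: it assumes that the position of the cepstral maximum is read as a lag in samples (so that pitch equals the sample rate divided by that lag), that the search is limited to a hypothetical 100 to 400 Hz range, and that a naive transform is acceptable for a single short frame. The names dft and estimate_pitch_hz are illustrative.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive discrete Fourier transform; the inverse includes the 1/N scaling."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(values[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def estimate_pitch_hz(frame, sample_rate, min_hz=100.0, max_hz=400.0, floor=1e-10):
    """Log magnitude spectrum -> inverse transform -> strongest cepstral peak."""
    log_mag = [math.log(max(abs(c), floor)) for c in dft(frame)]
    cepstrum = [c.real for c in dft(log_mag, inverse=True)]
    lo = int(sample_rate / max_hz)                        # shortest lag searched
    hi = min(int(sample_rate / min_hz), len(frame) // 2)  # longest lag searched
    peak = max(range(lo, hi + 1), key=lambda q: cepstrum[q])
    return sample_rate / peak


# Usage: a synthetic "voiced" frame, one decaying pulse every 50 samples at 10 kHz.
frame = [0.0] * 200
for m, start in enumerate(range(0, 200, 50)):
    frame[start] = 0.5 ** m
pitch = estimate_pitch_hz(frame, 10000)
```

For this synthetic frame the cepstrum peaks at a lag of 50 samples, which corresponds to 200 Hz.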
FIG. 2 is a simplified view of a Prior Art Automatic Speech Recognition (ASR) system. The Original Speech Sound 200 feeds a Microphone 202, which feeds an Analogue-To-Digital Converter 204, which produces Sg. Original PCM 206. This feeds a Feature Extractor 208, which produces a Raw Feature Vector 210. This feeds a Middle Processor 212, which produces a Sg. Adjusted Feature Vector 214. This feeds a Statistical Processor for ASR 220, which uses Acoustic Prototypes 222, Linguistic Statistics 224, and a Vocabulary List 226. The Statistical Processor 220 produces a corresponding Recognized Text 230, which is shown via a Text Display 232. A User Interface is typically used to control and to monitor the ASR processes.
FIG. 3 shows details of a prior art Feature Extractor as used in the IBM Voice Type 3 ASR system. This starts with a Sg. Original PCM 300, which typically has 11 k PCM values/sec. The Feature Extractor processes this as "frames" of 256 successive PCM values. Each second, the Feature Extractor processes 100 partly overlapping frames. Each frame feeds a Fast Fourier Transform 302, which produces a Fourier Spectrum vector 304 of 128 complex numbers.
This Fourier Spectrum is transformed by a Magnitude Function 306 to produce a vector of 128 real numbers. This is transformed by Weighted Summation 308 using a constant matrix of Mel-Band Weights 310. These weights correspond to the non-uniform frequency sensitivity of human hearing. This produces a Sg. Mel-Band Linear Vector 312. This feeds a Logarithmic Function 314 to produce a Sg. Mel-Band Logarithmic Vector 316. This is transformed by the Discrete Cosine Transform 318 using a constant matrix of Discrete Cosine coefficients 320. This produces a Sg. Raw Mel-Cepstrum Vector 322.
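The chain from the 128 magnitudes to the Raw Mel-Cepstrum Vector can be sketched in Python as below. The actual Mel-Band Weights and Discrete Cosine coefficients of the prior-art system are not given here, so everything numeric in this sketch is an assumption: triangular bands equally spaced on a common mel formula, 20 bands, 13 output coefficients, and a DCT-II basis. All function names are hypothetical.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def triangular_mel_weights(num_bands, num_bins, sample_rate):
    """Constant weight matrix (num_bands x num_bins); placeholder for the real weights."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges = [mel_to_hz(top * i / (num_bands + 1)) for i in range(num_bands + 2)]
    bin_hz = nyquist / num_bins
    weights = []
    for b in range(num_bands):
        left, centre, right = edges[b], edges[b + 1], edges[b + 2]
        row = []
        for k in range(num_bins):
            f = k * bin_hz
            if left < f <= centre:
                row.append((f - left) / (centre - left))    # rising edge
            elif centre < f < right:
                row.append((right - f) / (right - centre))  # falling edge
            else:
                row.append(0.0)
        weights.append(row)
    return weights


def dct_matrix(num_out, num_in):
    """Constant DCT-II coefficient matrix (unnormalized); the variant is an assumption."""
    return [[math.cos(math.pi * i * (j + 0.5) / num_in) for j in range(num_in)]
            for i in range(num_out)]


def raw_mel_cepstrum(magnitudes, mel_weights, dct, floor=1e-10):
    """Weighted summation -> logarithm -> discrete cosine transform."""
    linear = [sum(w * m for w, m in zip(row, magnitudes)) for row in mel_weights]
    log_vec = [math.log(max(v, floor)) for v in linear]
    return [sum(c * v for c, v in zip(row, log_vec)) for row in dct]


# Usage with a made-up magnitude spectrum of 128 values.
weights = triangular_mel_weights(20, 128, 11000)
dct = dct_matrix(13, 20)
mags = [1.0 + (k % 7) for k in range(128)]
cepstrum = raw_mel_cepstrum(mags, weights, dct)
```

One property of this arrangement is that scaling the input level changes only the first output coefficient, because the logarithm turns a gain into a constant offset and every higher row of the cosine matrix sums to zero.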
In FIG. 2, the Prior Art ASR system includes a Middle Processor 212. This is shown in more detail in FIG. 4. One input is a Sg. Mel-Cepstrum Vector (MCV) 400. There is a Slope Calculation 405, which measures the rate of change. The Slope Calculation linearly combines five consecutive Mel-Cepstrum Vectors, with weights (-2, -1, 0, +1, +2). The result is the Sg. Delta MCV 410.
Next, several consecutive delta vectors are analyzed, preferably five vectors, by another Slope Calculation 420, to produce a Sg. Delta-Delta MCV 425. Then three vectors (400, 410, 425) are concatenated to form a Sg. Tri-MCV 415.
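The two Slope Calculations and the concatenation can be sketched in Python as follows. The weights (-2, -1, 0, +1, +2) come from the description above; the lack of normalization and the handling of the first and last frames (repeating the edge frame) are assumptions of this sketch, and the names slope and tri_mcv are hypothetical.

```python
def slope(vectors, weights=(-2, -1, 0, 1, 2)):
    """Weighted combination of five consecutive vectors, centred on each frame."""
    half = len(weights) // 2
    last = len(vectors) - 1
    out = []
    for t in range(len(vectors)):
        acc = [0.0] * len(vectors[0])
        for w, offset in zip(weights, range(-half, half + 1)):
            # Assumption: at the edges, the first or last frame is repeated.
            src = vectors[min(max(t + offset, 0), last)]
            for d, v in enumerate(src):
                acc[d] += w * v
        out.append(acc)
    return out


def tri_mcv(mcv):
    """Concatenate each MCV with its delta and delta-delta vectors."""
    delta = slope(mcv)
    delta_delta = slope(delta)
    return [list(m) + d + dd for m, d, dd in zip(mcv, delta, delta_delta)]


# Usage: nine frames of a two-dimensional "MCV" that rises linearly over time.
mcv = [[float(t), 2.0 * t] for t in range(9)]
tri = tri_mcv(mcv)
```

For the middle frame of this linear ramp the delta part is constant (10 times the per-frame step with these unnormalized weights) and the delta-delta part is zero, as expected for a signal with constant slope.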
In FIG. 4, another input is the Sg. Original PCM 435, which feeds a Silence Detector 440. This classifies each frame as Silence, Intermediate, or Speech. This classification is somewhat sophisticated: it considers recently preceding frames, includes a "Finite State Machine", avoids some momentary exceptions, and handles many special cases. This classification controls a Gate 430, which produces a Sg. Adjusted MCV 445. If a frame is NOT Silence, then the Gate copies from the Sg. Tri-MCV 415 to the Sg. Adjusted MCV 445. If the frame is Silence, then nothing is copied to the Sg. Adjusted MCV.
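The gating behaviour can be sketched in Python as below. The classifier shown is a deliberately crude stand-in based on two hypothetical energy thresholds; it does not reproduce the Finite State Machine, the use of preceding frames, or the special cases mentioned above. Only the gate function reflects the rule stated in the text: frames classified as Silence are not copied.

```python
def classify_frames(frame_energies, silence_below, speech_above):
    """Stand-in classifier: label each frame from its energy alone (an assumption)."""
    labels = []
    for energy in frame_energies:
        if energy < silence_below:
            labels.append("silence")
        elif energy > speech_above:
            labels.append("speech")
        else:
            labels.append("intermediate")
    return labels


def gate(tri_vectors, labels):
    """Copy every frame that is NOT silence; silent frames produce no output."""
    return [vec for vec, label in zip(tri_vectors, labels) if label != "silence"]


# Usage with made-up energies and one-element placeholder vectors.
energies = [0.01, 0.2, 3.0, 0.02, 5.0]
labels = classify_frames(energies, silence_below=0.1, speech_above=1.0)
adjusted = gate([[0], [1], [2], [3], [4]], labels)
```

Here the first and fourth frames are dropped, so the adjusted sequence is shorter than the input sequence.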
The ASR system of FIGS. 2 to 4 includes items for audio input of Raw PCM. This typically includes: a User-Speaker; Sound Waves; a Transducer for audio input, preferably a lip-mounted microphone suitable for noise cancellation; Means to reduce noise, such as a noise-cancellation circuit which works with the Transducer; an optional Amplifier; and an Analogue to Digital Converter or "ADC".
Also, the ASR system of FIGS. 2 to 4 includes items for audio output of Regenerated PCM. This typically includes: a Digital to Analogue Converter or "DAC"; an Amplifier; Means for Audio Output, such as an Earphone or Speaker; and Regenerated Sound Waves.
Also, the ASR system of FIGS. 2 to 4 includes items for visual or audio Display of Recognized Text. This may be a visual Display to show text as visual characters. Alternatively, there may be means to convert Text to Speech (PCM), followed by audio output means to convert this to sound.