1. Field of the Invention
This invention relates generally to the synthesis of electrical signals that mimic those of the human voice and other acoustic signals and more particularly the devices and methods to smooth frame boundary effects created during the encoding of the speech and acoustic signals.
2. Description of Related Art
Relevant publications include:
1. Yang et al., "Pitch Synchronous Multi-Band (PSMB) Speech Coding," Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP'95, pp. 516-519, 1995 (describes a pitch-period-based speech coder);
2. Daniel W. Griffin and Jae S. Lim, "Multiband Excitation Vocoder," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 36, No. 8, August 1988, pp. 1223-1235 (describes a multiband excitation model for speech in which the model comprises an excitation spectrum and a spectral envelope);
3. John C. Hardwick and Jae S. Lim, "A 4.8 Kbps Multi-Band Excitation Speech Coder," Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP'88, pp. 374-377, New York, 1988 (describes a speech coder that exploits redundancies to quantize the speech parameters more efficiently);
4. Daniel W. Griffin and Jae S. Lim, "A New Pitch Detection Algorithm," Digital Signal Processing '84, Elsevier Science Publishers, 1984, pp. 395-399 (describes an approach to pitch detection in which the pitch period and spectral envelope are estimated by minimizing a least-squares error criterion between the synthetic spectrum and the original spectrum);
5. Daniel W. Griffin and Jae S. Lim, "A New Model-Based Speech Analysis/Synthesis System," Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP'85, 1985, pp. 513-516 (describes the implementation of a model-based speech analysis/synthesis system in which the short-time spectrum of speech is modeled as an excitation spectrum and a spectral envelope);
6. Robert J. McAulay and Thomas F. Quatieri, "Mid-Rate Coding Based On A Sinusoidal Representation of Speech," Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP'85, 1985, pp. 945-948 (describes a sinusoidal model that represents the speech waveform by the amplitudes, frequencies, and phases of its component sine waves);
7. Robert J. McAulay and Thomas F. Quatieri, "Computationally Efficient Sine Wave Synthesis And Its Application to Sinusoidal Transform Coding," Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP'88, 1988, pp. 370-373 (describes a technique for synthesizing speech from sinusoidal descriptions of the speech signal while relieving the computational complexity inherent in the technique);
8. Xiaoshu Qian and Ramdas Kumaresan, "A Variable Frame Pitch Estimator and Test Results," Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP'96, 1996, pp. 228-231 (describes a new algorithm for identifying voiced sections of a speech waveform and determining their pitch contours); and
9. Ma Wei, "Multiband Excitation Based Vocoders and Their Real-Time Implementation," Dissertation, University of Surrey, Guildford, Surrey, U.K., May 1994, pp. 145-150 (describes vocoder analysis and implementations).
Sinusoidal synthesizers are widely used in multiband-excitation vocoders (voice coder/decoders) and sinusoidal excitation vocoders and are therefore well known in the art. The principle behind these types of coders is to use banks of sinusoidal signal generators to produce excitation signals for voiced speech or music. To smooth frame boundary effects, the phase of each sinusoidal waveform must be interpolated, normally on a sample-by-sample basis. This imposes a large computational burden.
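The source of that computational burden can be seen in a minimal sketch of bank-of-oscillators synthesis. The function and its linear interpolation scheme below are illustrative assumptions, not the patent's method: every sample requires interpolating the amplitude and frequency of every sinusoid and accumulating every phase.

```python
import numpy as np

def synthesize_frame(amps0, amps1, freqs0, freqs1, phases0, n_samples):
    """Sum-of-sinusoids synthesis for one frame (illustrative sketch).

    Amplitude and frequency (radians/sample) are linearly interpolated
    from the previous frame's values (amps0, freqs0) to the current
    frame's (amps1, freqs1), and each oscillator's phase is accumulated
    sample by sample, so the per-sample cost grows with the number of
    sinusoids in the bank.
    """
    amps0, amps1 = np.asarray(amps0, float), np.asarray(amps1, float)
    freqs0, freqs1 = np.asarray(freqs0, float), np.asarray(freqs1, float)
    phases = np.asarray(phases0, dtype=float).copy()
    out = np.zeros(n_samples)
    for n in range(n_samples):
        t = n / n_samples
        amps = (1.0 - t) * amps0 + t * amps1    # per-sample amplitude track
        freqs = (1.0 - t) * freqs0 + t * freqs1  # per-sample frequency track
        phases = phases + freqs                  # per-sample phase accumulation
        out[n] = np.sum(amps * np.cos(phases))
    return out, phases
```

With constant parameters each oscillator reduces to an ordinary cosine, but the inner loop still touches every component on every sample, which is the cost the text refers to.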
There are a number of methods for computing the sinusoidal functions for the signal generators within a digital signal processor (DSP): power series expansion, table look-up, a second-order filter, and a coupled-form oscillator. Power series expansion is an accurate method for generating the sinusoidal functions if the order of the expansion is large enough. Table look-up is generally regarded as a fast approximation method and can give satisfactory accuracy as long as an appropriate table size is chosen. Nevertheless, the table index computation, which is based on the phase computation, requires either conversion of floating-point numbers to integers or integer multiplication with long word lengths. By comparison, the fastest way to generate the sinusoidal functions is a second-order filter used as a sinusoidal oscillator. Although it improves the speed of the computation, it cannot be used in a synthesizer, because it requires linear phase increments, which do not exist in the speech frames.
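The second-order filter oscillator mentioned above can be sketched as follows (an illustrative implementation, with function and parameter names of my own choosing). It generates sin(phase + n*omega) with a single multiply and subtract per sample, but the recurrence coefficient is fixed by omega, which is why the phase increment must stay linear for the whole run:

```python
import math

def second_order_oscillator(omega, phase, n_samples):
    """Second-order filter oscillator (illustrative sketch).

    Generates s[n] = sin(phase + n*omega) via the recurrence
        s[n] = 2*cos(omega)*s[n-1] - s[n-2],
    i.e. one multiply and one subtract per sample once the first two
    samples are seeded. Because the coefficient 2*cos(omega) is baked
    into the recurrence, the frequency (phase increment) cannot change
    during the run -- the limitation noted in the text.
    """
    k = 2.0 * math.cos(omega)
    s_prev2 = math.sin(phase)            # s[0]
    s_prev1 = math.sin(phase + omega)    # s[1]
    out = [s_prev2, s_prev1]
    for _ in range(n_samples - 2):
        s = k * s_prev1 - s_prev2
        out.append(s)
        s_prev2, s_prev1 = s_prev1, s
    return out
```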
One way to solve this problem is to use the coupled-form oscillator. However, the extra computation of the orthogonal samples cancels any speed gain, leaving it no faster than the table look-up method for sinusoidal synthesizer applications.
U.S. Pat. No. 4,937,873 (McAulay et al.) discloses methods and apparatus for reducing the discontinuities between frames of sinusoidally modeled acoustic waveforms, such as speech, that occur when sampling at low frame rates. The disclosed mid-frame interpolation increases the frame rate and maintains the best fit of the phases. However, after the mid-frame estimation, a further stage that generates each speech sample is needed for the overlap-add synthesis. That method generates the speech samples either on a sample-by-sample basis or by an FFT method in the frequency domain. Synthesis in the frequency domain will not provide the sharpness of speech that would be provided by execution in the time domain.
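The overlap-add stage referred to above can be illustrated by a minimal sketch that cross-fades two independently synthesized frames with complementary linear windows. This is a generic textbook form, not the method of the cited patent; the function name and window choice are assumptions:

```python
import numpy as np

def overlap_add(frame_a, frame_b):
    """Overlap-add join of two equal-length synthesis frames
    (illustrative sketch). Complementary linear fade-out/fade-in
    windows smooth the boundary between independently synthesized
    frames: the output starts at frame_a's first sample and ends at
    frame_b's last sample.
    """
    frame_a = np.asarray(frame_a, float)
    frame_b = np.asarray(frame_b, float)
    fade = np.linspace(0.0, 1.0, len(frame_a))  # 0 -> 1 across the overlap
    return (1.0 - fade) * frame_a + fade * frame_b
```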
U.S. Pat. No. 5,179,626 (Thomson) discloses a harmonic coding arrangement in which the magnitude spectrum of the input speech is modeled at the analyzer by a small set of parameters as a continuous spectrum. The synthesizer then determines the spectrum from the parameter set and, from that spectrum, determines a plurality of sinusoids, which are summed to form the synthetic speech.