Accurate representations of speech have been demonstrated using harmonic models in which a sum of sinusoids is used for synthesis. An analyzer partitions speech into overlapping frames, applies a Hamming window to each frame, constructs a magnitude/phase spectrum, and locates individual sinusoids. The magnitude, phase, and frequency of each sinusoid are then transmitted to a synthesizer, which generates the synthetic speech. In an unquantized harmonic speech coding system, the resulting speech quality is virtually transparent in that most people cannot distinguish the original from the synthetic. The difficulty in applying this approach at low bit rates lies in the necessity of coding up to 80 harmonics. (The sinusoids are referred to herein as harmonics, although they are not always harmonically related.) Bit rates below 9.6 kilobits/second are typically achieved by incorporating pitch and voicing or by dropping some or all of the phase information. The result is synthetic speech differing in quality and robustness from the unquantized version.
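By way of illustration only, the analysis and synthesis steps described above may be sketched for a single frame as follows. The sketch is not taken from any cited system. The function names, the naive discrete Fourier transform, the simple local-maximum peak picker, and the `rel_threshold` parameter are all assumptions chosen for brevity. A practical analyzer would use an FFT, overlapping frames, and a more careful method of locating sinusoids.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n (standard 0.54/0.46 coefficients)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def analyze_frame(frame, sample_rate, rel_threshold=0.1):
    """Return (frequency_hz, magnitude, phase) for each spectral peak of one frame.

    rel_threshold is a hypothetical parameter: peaks weaker than this fraction
    of the strongest bin are discarded so window sidelobes are not reported.
    """
    n = len(frame)
    window = hamming(n)
    windowed = [x * w for x, w in zip(frame, window)]
    # Naive DFT over the non-negative frequency bins (an FFT would be used in practice).
    spectrum = [
        sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]
    mags = [abs(c) for c in spectrum]
    # A sinusoid of amplitude A centered on a bin yields |X[k]| of about A * sum(window) / 2.
    gain = sum(window) / 2.0
    floor = rel_threshold * max(mags)
    peaks = []
    for k in range(1, len(mags) - 1):
        if mags[k] >= floor and mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]:
            peaks.append((k * sample_rate / n, mags[k] / gain, cmath.phase(spectrum[k])))
    return peaks


def synthesize(peaks, n, sample_rate):
    """Rebuild n samples as a sum of sinusoids, with phase referenced to the frame start."""
    return [
        sum(a * math.cos(2 * math.pi * f * t / sample_rate + p) for f, a, p in peaks)
        for t in range(n)
    ]
```

In this sketch each sinusoid costs three parameters per frame, which suggests why coding a large number of sinusoids becomes expensive at low bit rates.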
One approach typical of the prior art is disclosed in R. J. McAulay and T. F. Quatieri, "Multirate sinusoidal transform coding at rates from 2.4 kbps to 8 kbps," Proc. IEEE Int. Conf. Acoust., Speech, and Signal Proc., vol. 3, pp. 1645-1648, April 1987. A pitch detector is used to determine a fundamental pitch and the speech spectrum is modeled as a line spectrum at the determined pitch and multiples thereof. The value of the determined pitch is transmitted from the analyzer to the synthesizer which reconstructs the speech as a sum of sinusoids at the fundamental frequency and its multiples. The achievable speech quality is limited in such an arrangement, however, since substantial energy of the input speech is typically present between the lines of the line spectrum and because a separate approach is required for unvoiced speech.
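The line-spectrum model described above may be sketched as follows. This is an illustrative assumption and not the implementation of the cited reference. The function names, the `bandwidth_hz` parameter, and the per-harmonic amplitude and phase lists are hypothetical.

```python
import math


def harmonic_frequencies(f0_hz, bandwidth_hz):
    """Frequencies of the line spectrum: the pitch f0 and its integer multiples within the band."""
    count = int(bandwidth_hz // f0_hz)
    return [f0_hz * k for k in range(1, count + 1)]


def synthesize_line_spectrum(f0_hz, amplitudes, phases, n, sample_rate):
    """Sum of sinusoids at f0, 2*f0, 3*f0, ... with one amplitude and phase per harmonic."""
    return [
        sum(
            a * math.cos(2 * math.pi * f0_hz * (k + 1) * t / sample_rate + p)
            for k, (a, p) in enumerate(zip(amplitudes, phases))
        )
        for t in range(n)
    ]
```

Assuming, for example, a 4 kHz band and a 50 Hz pitch, `harmonic_frequencies(50.0, 4000.0)` returns 80 lines. Because only the pitch value fixes the frequencies, any input energy lying between those lines has no counterpart in the synthesized signal, which is the limitation noted above.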
In view of the foregoing, a recognized problem in the art is the reduced speech quality achievable in known harmonic speech coding arrangements where the spectrum of the input speech is modeled as only a line spectrum (for example, at only a small number of frequencies or at a fundamental frequency and its multiples).