The technical field of this invention is speech transmission and, in particular, methods and devices for pre-processing audio signals prior to broadcast or other transmission.
The problem of speech degradation by natural or man-made disturbances is one which commonly occurs in AM radio broadcasting and ground-to-air communications. Often in these applications, a peak-power limitation is imposed by the transmitter or a dynamic range constraint results either from the sensitivity characteristics of the receiver or from the ambient noise level. Under these constraints, the audio signals are preprocessed to increase intelligibility. Techniques such as dynamic range compression, pre-emphasis and clipping have been applied with limited success to reduce the peak factor of a waveform in order to increase loudness while attempting to preserve important features of the spectral envelope. For a further description of such techniques, see Modulation-Process Techniques for Sound Broadcasting, Tech. 3243-E, Technical Center of the European Broadcasting Union, Bruxelles, Belgium, July 1985, herein incorporated by reference.
There exists a need for better preprocessing techniques for speech transmission, particularly where the spectral magnitude is specified and the goal is to achieve a flattened time-domain envelope which satisfies peak power limitations. In particular, new techniques for accomplishing automatic gain control, (multiband) dynamic range compression, pre-emphasis and phase dispersion would satisfy a long-felt need in the field.
The above-referenced parent application U.S. Ser. No. 712,866 discloses that speech analysis and synthesis as well as coding and time-scale modification can be accomplished simply and effectively by employing a time-frequency representation of the speech waveform which is independent of the speech state. Specifically, a sinusoidal model for the speech waveform is used to develop a new analysis-synthesis technique.
The basic method of U.S. Ser. No. 712,866 includes the steps of: (a) selecting frames (i.e. windows of about 20-40 milliseconds) of samples from the waveform; (b) analyzing each frame of samples to extract a set of frequency components; (c) tracking the components from one frame to the next; and (d) interpolating the values of the components from one frame to the next to obtain a parametric representation of the waveform. A synthetic waveform can then be constructed by generating a series of sine waves corresponding to the parametric representation. The disclosures of U.S. Ser. No. 712,866 are incorporated herein by reference.
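Steps (a) and (b) above can be illustrated with a minimal Python sketch. It is not the implementation disclosed in U.S. Ser. No. 712,866. The function names (`select_frames`, `analyze_frame`), the Hamming window, the direct DFT, and the rule of keeping the `max_peaks` largest local maxima are all assumptions made for illustration only.

```python
import cmath
import math


def select_frames(samples, frame_len, hop):
    """Step (a): split the waveform into overlapping frames of frame_len samples."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def analyze_frame(frame, sample_rate, max_peaks):
    """Step (b): return (frequency_hz, amplitude, phase) for the largest spectral peaks.

    Assumption: a Hamming-windowed DFT stands in for the periodogram analysis.
    """
    n = len(frame)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    gain = sum(window)

    # Direct DFT of the windowed frame, non-negative frequencies only.
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[i] * window[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        spectrum.append(acc)
    mags = [abs(c) for c in spectrum]

    # Local maxima of the magnitude spectrum, largest first.
    peaks = [k for k in range(1, len(mags) - 1)
             if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]]
    peaks.sort(key=lambda k: mags[k], reverse=True)

    # Keep the largest peaks and report them in order of increasing frequency.
    return sorted((k * sample_rate / n, 2 * mags[k] / gain, cmath.phase(spectrum[k]))
                  for k in peaks[:max_peaks])
```

For example, at an assumed sampling rate of 8 kHz, `select_frames(samples, 160, 80)` yields 20 millisecond frames with 50 percent overlap, and `analyze_frame(frame, 8000, 2)` returns the two strongest components of each frame. The frequency resolution of this sketch is limited to the DFT bin spacing; a practical analyzer would typically zero-pad or interpolate between bins.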
In one illustrated embodiment described in detail in U.S. Ser. No. 712,866, the basic method summarized above is employed to choose amplitudes, frequencies, and phases corresponding to the largest peaks in a periodogram of the measured signal, independently of the speech state. In order to reconstruct the speech waveform, the amplitudes, frequencies, and phases of the sine waves estimated on one frame are matched and allowed to continuously evolve into the corresponding parameter set on the successive frame. Because the number of estimated peaks is neither constant nor slowly varying, the matching process is not straightforward. Rapidly varying regions of speech such as unvoiced/voiced transitions can result in large changes in both the location and number of peaks. To account for such rapid movements in spectral energy, the concept of "birth" and "death" of sinusoidal components is employed in a nearest-neighbor matching method based on the frequencies estimated on each frame. If a new peak appears, a "birth" is said to occur and a new track is initiated. If an old peak is not matched, a "death" is said to occur and the corresponding track is allowed to decay to zero. Once the parameters on successive frames have been matched, phase continuity of each sinusoidal component is ensured by unwrapping the phase. In one preferred embodiment, the phase is unwrapped using a cubic phase interpolation function having parameter values that are chosen to satisfy the measured phase and frequency constraints at the frame boundaries while maintaining maximal smoothness over the frame duration. Finally, the corresponding sinusoidal amplitudes are simply interpolated in a linear manner across each frame.
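The matching and phase-unwrapping operations described above can be sketched in Python as follows. This is an illustrative simplification, not the procedure of U.S. Ser. No. 712,866. The names `match_tracks`, `cubic_phase`, and `max_delta_hz` are hypothetical. The greedy closest-pair-first matching is an assumed stand-in for the nearest-neighbor method. The closed-form cubic coefficients follow the form commonly given for sinusoidal analysis-synthesis, which this text does not itself state, and are likewise an assumption here.

```python
import math


def match_tracks(prev_freqs, next_freqs, max_delta_hz):
    """Greedy nearest-neighbor matching of peak frequencies between two frames.

    Returns (continued, deaths, births): continued holds (prev_index, next_index)
    pairs, deaths are unmatched previous peaks, births are unmatched new peaks.
    """
    # All candidate pairs within the allowed frequency jump, closest first.
    candidates = sorted(
        (abs(fp - fn), i, j)
        for i, fp in enumerate(prev_freqs)
        for j, fn in enumerate(next_freqs)
        if abs(fp - fn) <= max_delta_hz
    )
    used_prev, used_next, continued = set(), set(), []
    for _, i, j in candidates:
        if i not in used_prev and j not in used_next:
            continued.append((i, j))
            used_prev.add(i)
            used_next.add(j)
    deaths = [i for i in range(len(prev_freqs)) if i not in used_prev]
    births = [j for j in range(len(next_freqs)) if j not in used_next]
    return sorted(continued), deaths, births


def cubic_phase(theta0, omega0, theta1, omega1, frame_len):
    """Cubic phase track theta(t) = theta0 + omega0*t + alpha*t**2 + beta*t**3.

    Phases are in radians, frequencies in radians per sample, and frame_len in
    samples. The integer m selects the 2*pi multiple that unwraps theta1 so that
    the interpolated phase is as smooth as possible over the frame.
    """
    t_end = frame_len
    d_omega = omega1 - omega0
    m = round(((theta0 + omega0 * t_end - theta1) + d_omega * t_end / 2)
              / (2 * math.pi))
    d = theta1 - theta0 - omega0 * t_end + 2 * math.pi * m
    alpha = 3 * d / t_end ** 2 - d_omega / t_end
    beta = -2 * d / t_end ** 3 + d_omega / t_end ** 2

    def phase(t):
        return theta0 + omega0 * t + alpha * t ** 2 + beta * t ** 3

    return phase, m
```

In this sketch, a track listed in `deaths` would have its amplitude ramped linearly to zero over the next frame, and a track listed in `births` would be ramped up from zero. For continued tracks, `phase(t)` equals the measured phase at the start of the frame, equals the measured phase plus `2*pi*m` at the end of the frame, and has slope `omega0` and `omega1` at the two boundaries.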