The techniques and methodology for the analysis (decomposition), modification, and synthesis (recomposition) of speech signals are generally represented as a single processing unit called a “vocoder”. A vocoder performs this processing efficiently and seeks to generate natural-sounding speech, which makes it a fundamental element of speech synthesis technology. Speech signals (e.g., waveforms) are processed on a short-term basis, typically in frames of approximately 20 to 40 milliseconds, under the assumption that their main statistical properties remain unchanged (i.e., the signal is stationary) within this interval. Most vocoding techniques begin by classifying each frame as either (a) “voiced”, which is primarily the result of periodic action of the vocal cords, or (b) “unvoiced”, which is primarily due to aspiration noise from the lungs, for example.
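The short-term processing described above can be illustrated with a minimal sketch that splits a waveform into overlapping frames. The function name `split_into_frames`, the 25 ms frame length (chosen from within the 20 to 40 millisecond range), and the 10 ms hop between frames are assumptions made for illustration only; the text does not specify a hop size or a particular implementation.

```python
import numpy as np


def split_into_frames(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D waveform into overlapping short-term frames.

    frame_ms and hop_ms are illustrative defaults, not values taken
    from the text (which only gives a 20-40 ms frame range).
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        # Not enough samples for even one full frame.
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row i holds samples [i * hop_len, i * hop_len + frame_len).
    starts = hop_len * np.arange(n_frames)[:, None]
    offsets = np.arange(frame_len)[None, :]
    return signal[starts + offsets]


if __name__ == "__main__":
    one_second = np.random.default_rng(0).standard_normal(16000)
    frames = split_into_frames(one_second, sample_rate=16000)
    print(frames.shape)
```

Each row of the returned array can then be analyzed independently, for example to decide whether the frame is voiced or unvoiced.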
Although voiced frames are dominated by periodic sound, these frames may also include a component of noise or nonharmonic sound. In the next stage of processing, the periodic component of the voiced frames is separated from the unvoiced portion. A popular technique for modeling both voiced and unvoiced contributions is Harmonic plus Noise Modeling (HNM). In the context of HNM, periodic (voiced) and aperiodic (unvoiced) contributions are typically represented as time-varying harmonic and modulated-noise components, respectively.
In accordance with HNM, the content of the voiced and unvoiced components is divided in the frequency domain by a time-varying parameter referred to as the Maximum Voiced Frequency (MVF): the lower band, below the MVF, is modeled entirely by harmonic (voiced) content, whereas the upper band, above the MVF, is modeled by noise alone, as shown in FIG. 1.
To generate or otherwise synthesize speech waveforms, the combined contribution of the bands above and below the MVF is obtained by applying a high-pass filter to modulated white noise and a low-pass filter to a fully harmonic component. The cut-off frequency of both filters is set according to the MVF.
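The band-splitting synthesis described above might be sketched for a single frame as follows. This is a simplified illustration under stated assumptions, not the implementation of any particular vocoder: the function names `brickwall` and `synthesize_hnm_frame` are hypothetical, the filters are idealized FFT-domain (brick-wall) filters rather than practical filter designs, all harmonics have equal amplitude, and the noise envelope is left flat (no time modulation) for brevity.

```python
import numpy as np


def brickwall(x, sample_rate, cutoff_hz, keep="low"):
    """Idealized low-pass or high-pass filter via FFT bin masking (assumption)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    mask = freqs <= cutoff_hz if keep == "low" else freqs > cutoff_hz
    return np.fft.irfft(spectrum * mask, n=len(x))


def synthesize_hnm_frame(f0_hz, mvf_hz, sample_rate, n_samples,
                         noise_gain=0.1, seed=0):
    """Return (frame, harmonic_band, noise_band) for one synthetic frame."""
    t = np.arange(n_samples) / sample_rate
    # Fully harmonic component: every multiple of f0 below Nyquist,
    # with equal amplitudes (an illustrative simplification).
    ks = np.arange(1, int(sample_rate / (2.0 * f0_hz)) + 1)
    ks = ks[ks * f0_hz < sample_rate / 2.0]
    harmonic = np.cos(2.0 * np.pi * f0_hz * ks[:, None] * t[None, :]).sum(axis=0)
    harmonic /= max(len(ks), 1)
    # White noise; a real system would also apply a time-domain envelope.
    noise = noise_gain * np.random.default_rng(seed).standard_normal(n_samples)
    # Low-pass the harmonic part and high-pass the noise at the MVF.
    harmonic_band = brickwall(harmonic, sample_rate, mvf_hz, keep="low")
    noise_band = brickwall(noise, sample_rate, mvf_hz, keep="high")
    return harmonic_band + noise_band, harmonic_band, noise_band


if __name__ == "__main__":
    frame, voiced, unvoiced = synthesize_hnm_frame(
        f0_hz=200.0, mvf_hz=4000.0, sample_rate=16000, n_samples=1600)
    print(frame.shape)
```

Because both filters share the MVF as their cut-off, the two bands do not overlap in frequency, and their sum contains harmonic content only below the MVF and noise only above it.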
A vocoder based on an HNM signal recomposition may produce natural-sounding speech after transformation and/or re-synthesis if both the voiced and noise bands are properly identified and processed. The estimation of the corresponding MVF, however, can be very challenging for numerous reasons, including the following: (a) in the frequency-domain representation of a speech frame, local maxima associated with harmonic components sometimes appear to be distributed non-uniformly across frequencies, and those maxima are separated by bands of noise or other unvoiced elements, as shown in FIG. 2, and (b) some periodic components of the waveform may be found in “unvoiced” frames where the transition from predominantly voiced to predominantly unvoiced frames is smooth, which may result in degradation when regenerating the waveform. There is therefore a need for a technique to account for noise components non-uniformly distributed in “voiced” frames as well as periodic components in “unvoiced” frames.