In the theory of speech generation, the following source-filter model is widely used:

s(t) = e(t) * f(t)

wherein s(t) is the speech signal; e(t) is the glottal source excitation; f(t) is the system function of the vocal tract filter; t represents time; and * represents convolution.
FIG. 1 illustrates such a source-filter model for speech generation. As shown, the input signal from the glottal source is processed (filtered) by the vocal tract filter. At the same time, the vocal tract filter is disturbed, that is, the features (state) of the vocal tract filter vary over time. Noise is added to the output of the vocal tract filter to produce the final speech signal.
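The model of FIG. 1 can be sketched numerically as follows. This is only an illustrative sketch under simplifying assumptions that are not taken from the source: the glottal excitation is approximated by an impulse train, the vocal tract filter by a single fixed damped resonance (so the time variation of the filter is not modeled), and the additive noise is Gaussian. All function names and parameter values are hypothetical.

```python
import math
import random


def convolve(x, h):
    """Full discrete linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        if xi == 0.0:
            continue  # skip zero samples of the sparse excitation
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def impulse_train(n_samples, period):
    """Assumed stand-in for e(t): one unit pulse per pitch period."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator_impulse_response(freq_hz, bandwidth_hz, fs, length):
    """Assumed stand-in for f(t): a single damped cosine (one formant)."""
    decay = math.exp(-math.pi * bandwidth_hz / fs)
    omega = 2.0 * math.pi * freq_hz / fs
    return [decay ** n * math.cos(omega * n) for n in range(length)]


def synthesize(n_samples=400, fs=8000, f0=100, noise_std=0.01, seed=0):
    """Return (e, f, s) with s = e * f + noise, truncated to n_samples."""
    rng = random.Random(seed)
    e = impulse_train(n_samples, fs // f0)
    f = resonator_impulse_response(700.0, 100.0, fs, 200)
    clean = convolve(e, f)[:n_samples]
    s = [c + rng.gauss(0.0, noise_std) for c in clean]
    return e, f, s


if __name__ == "__main__":
    e, f, s = synthesize()
    print(len(s), "samples synthesized")
```

Only s would be observable in practice; e and f are returned here solely so that the sketch can be inspected.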
In such a model, the speech signal is usually easy to record. However, neither the glottal source nor the features of the vocal tract filter can be detected directly. Thus, an important issue in speech analysis is how to estimate both the glottal source and the vocal tract filter features from a given piece of speech.
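The difficulty of this estimation can be seen in a small numerical sketch: several different pairs of excitation and filter produce exactly the same observed signal under convolution. The sequences below are arbitrary toy values chosen for illustration, not speech data, and the helper names are hypothetical.

```python
def convolve(x, h):
    """Full discrete linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def explains(e, f, s, tol=1e-12):
    """True if the pair (e, f) reproduces the observed signal s."""
    y = convolve(e, f)
    return len(y) == len(s) and all(abs(a - b) <= tol for a, b in zip(y, s))


# Observed signal (toy example).
s = [1.0, 3.0, 2.0]

# Distinct (excitation, filter) pairs that all yield the same s.
candidates = [
    ([1.0, 1.0], [1.0, 2.0]),   # one factorization
    ([1.0, 2.0], [1.0, 1.0]),   # roles of source and filter swapped
    ([2.0, 2.0], [0.5, 1.0]),   # gain moved from filter to source
    ([1.0, 3.0, 2.0], [1.0]),   # trivial filter, all structure in source
]

if __name__ == "__main__":
    for e, f in candidates:
        print(e, f, explains(e, f, s))
```

Each candidate passes the check, so the observed signal alone does not determine which pair generated it.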
Estimating both quantities from the speech signal alone is a blind deconvolution problem, which has no unique solution unless additional assumptions are introduced, such as a predefined parameterized model of the glottal source and a model of the vocal tract filter. Predefined parameterized models of the glottal source include the Rosenberg-Klatt (RK) model and the Liljencrants-Fant (LF) model, for which reference can be made to D. H. Klatt & L. C. Klatt, “Analysis, synthesis and perception of voice quality variations among female and male talkers,” J. Acoust. Soc. Am., vol. 87, no. 2, pp. 820-857, 1990, and G. Fant, J. Liljencrants & Q. Lin, “A four-parameter model of glottal flow,” STL-QPSR, Tech. Rep., 1985, respectively. Models of the vocal tract filter include LPC, i.e., an all-pole model, and the pole-zero model. The limitation of these models is that they are oversimplified, having only a few parameters, and are inconsistent with real signals.
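As an illustration of the all-pole (LPC) assumption, the following sketch estimates linear prediction coefficients with the autocorrelation method and the Levinson-Durbin recursion. It is a generic textbook formulation and is not taken from the cited references. The function names are hypothetical, and the predictor convention x[n] ≈ a[1]·x[n-1] + ... + a[p]·x[n-p] is an assumption of this sketch.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite sequence."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns (a[1..order], residual energy)."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for the order-i predictor.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)  # prediction error shrinks at each order
    return a[1:], err


def lpc(x, order):
    """Order-p all-pole fit of x by the autocorrelation method."""
    return levinson_durbin(autocorr(x, order), order)


if __name__ == "__main__":
    # A decaying exponential is well described by a single pole near 0.9.
    x = [0.9 ** n for n in range(200)]
    coeffs, residual = lpc(x, 1)
    print(coeffs, residual)
```

Because the whole vocal tract is summarized by only `order` coefficients, the fit illustrates both the convenience and the restrictiveness of such a model.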
That is to say, prior-art methods typically estimate both the glottal source and the vocal tract filter parameters. Because this is very difficult, subjective assumptions have to be introduced to make the problem better determined, such as applying approximate models to the glottal source or simplifying and reducing the order of the vocal tract filter. All such subjective assumptions and processing affect the accuracy, or even the correctness, of the solution.
Moreover, in many practical application scenarios, speech signals are often ill-conditioned or under-sampled, which limits the applicability of current techniques and prevents them from extracting the full information from a given piece of speech.
In addition, prior-art methods generally rely on the periodicity of speech signals and therefore require pitch marking of the fundamental period, that is, marking the start and end points of each period. However, even when all pitch marking is performed manually, ambiguities sometimes occur, which affects the correctness of the speech analysis.
Therefore, a need exists in the field for a simpler, more accurate, more efficient, and more robust method of speech analysis and synthesis.