The present disclosure relates to a voice analysis apparatus, a voice synthesis apparatus, and a voice analysis synthesis system.
Speech synthesis methods are classified into a unit-selection speech synthesis method and a statistical parametric speech synthesis method.
While the unit-selection speech synthesis method can synthesize high-quality speech, it has limitations such as excessive database dependency and difficulty in transforming voice characteristics. The statistical parametric speech synthesis method has advantages such as low database dependency, a small database size, and easy transformation of voice characteristics, but has the disadvantage that the quality of the synthesized speech is low. Based on these characteristics, one of the two methods is selected for speech synthesis.
As a kind of statistical parametric speech synthesis, the Hidden Markov Model (HMM)-based speech synthesis system is well known. In the HMM-based speech synthesis system, the core factors determining speech quality are the representation/reconstruction of a speech signal, the training accuracy on the sentence database, and the smoothing intensity of the output parameters generated by the training model.
Meanwhile, as related-art speech modeling methods for representation/reconstruction of a speech signal, a Pulse or Noise (PoN) model and a speech transformation and representation using adaptive interpolation of weighted spectrum (STRAIGHT) model have been proposed. The PoN model synthesizes speech by dividing the signal into an excitation part and a spectral part. The STRAIGHT model represents speech using three parameters: a pitch value F0, a spectrum smoothed in the frequency domain, and an aperiodicity parameter for reconstructing the aperiodic component of the signal that is lost in the course of spectral smoothing.
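By way of illustration only, the division into an excitation part and a spectral part in a PoN-style model may be sketched as follows. The sketch assumes that a frame with F0 greater than zero is voiced and is excited by a pulse train, that a frame with F0 equal to zero is unvoiced and is excited by white noise, and that the spectral part is applied as a simple all-pole filter. The function names, the frame layout, and the choice of filter are hypothetical and are not taken from the related art itself.

```python
import random


def pon_excitation(f0_per_frame, frame_len, sample_rate, seed=0):
    """Build a pulse-or-noise excitation signal, one frame per F0 value.

    Assumption: f0 > 0 marks a voiced frame (pulse train at the pitch
    period); f0 == 0 marks an unvoiced frame (white Gaussian noise).
    """
    rng = random.Random(seed)
    excitation = []
    next_pulse = 0.0  # position of the next pulse, carried across frames
    for f0 in f0_per_frame:
        start = len(excitation)
        if f0 > 0:
            period = sample_rate / f0  # pitch period in samples
            frame = [0.0] * frame_len
            # After an unvoiced stretch, restart pulses at the frame start.
            next_pulse = max(next_pulse, float(start))
            while next_pulse < start + frame_len:
                frame[int(next_pulse) - start] = 1.0
                next_pulse += period
        else:
            frame = [rng.gauss(0.0, 1.0) for _ in range(frame_len)]
        excitation.extend(frame)
    return excitation


def all_pole_filter(excitation, a):
    """Apply the spectral part: y[n] = x[n] - sum_k a[k] * y[n-1-k].

    Assumption: the spectral envelope is given as all-pole coefficients.
    """
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for k, coeff in enumerate(a):
            if n - 1 - k >= 0:
                acc -= coeff * out[n - 1 - k]
        out.append(acc)
    return out


# One voiced frame at 200 Hz followed by one unvoiced frame, at 8 kHz.
excitation = pon_excitation([200.0, 0.0], frame_len=80, sample_rate=8000)
speech = all_pole_filter(excitation, [-0.5])
```

In this sketch the binary choice between pulse and noise is made per frame, so a frame cannot be represented as a mixture of periodic and aperiodic components.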
Since the STRAIGHT model uses a small number of parameters, it has the advantage that degradation of the reconstructed speech is small. However, the STRAIGHT model has drawbacks such as difficulty in F0 search and an increase in the complexity of signal representation due to extraction of the aperiodicity spectrum. Thus, a new model for representation/reconstruction of a speech signal is required.