The present invention relates to language processing systems. In particular, the present invention relates to concatenative text-to-speech (TTS) systems, in which speech output is generated by concatenating small stored speech units or segments in series.
Ascertaining segmental boundaries for adjacent speech units used in a corpus-based concatenative TTS system is important in realizing naturalness in the speech output generated by such systems. Prior techniques include manually labeling such boundaries. Although this technique is reliable, it is very laborious and time-consuming, which makes it impractical to apply to a large speech corpus.
Accordingly, a need has developed for an automatic speech segmentation approach with accuracy comparable to that of human experts. Such a system and method would be particularly helpful when speech units are obtained from a large speech corpus. One segmentation method is referred to as “forced alignment” and is widely used in the training stage of HMM-based Automatic Speech Recognition (ASR) systems. However, in performing forced alignment, boundary marks are to some extent underestimated, because the Viterbi algorithm aims to match the wave stream to the whole labeled speech state sequence under a criterion that minimizes the global distance. As a result, boundaries obtained in this manner are often not identical to the best splicing points between speech units. Thus, post-refinement is often performed to search for the most suitable boundary locations. The post-refinement technique uses a small number of manually labeled boundaries to learn the characteristics of human-preferred boundary marks.
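By way of illustration only, the global-distance criterion of forced alignment can be sketched as follows. This is a minimal sketch under stated assumptions, not the alignment procedure of any particular ASR system: the function name `forced_align`, the precomputed `cost` matrix of frame-to-state distances, and the toy values are all hypothetical, and a real HMM-based aligner would also use transition probabilities and acoustic likelihoods. The sketch shows that each boundary is simply a by-product of the path that minimizes the total distance over the whole utterance, rather than being optimized as a splicing point in its own right.

```python
def forced_align(cost):
    """Align frames to a known left-to-right state sequence.

    cost[t][s] is an assumed, precomputed local distance between frame t
    and state s. Every state must occupy at least one frame, and the path
    may only stay in a state or advance to the next one.
    Returns (path, starts, total): the state index per frame, the first
    frame of each state (the boundary marks), and the global distance.
    """
    num_frames = len(cost)
    num_states = len(cost[0])
    if num_frames < num_states:
        raise ValueError("need at least one frame per state")

    inf = float("inf")
    best = [[inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    best[0][0] = cost[0][0]  # the path must start in the first state

    for t in range(1, num_frames):
        for s in range(num_states):
            stay = best[t - 1][s]
            move = best[t - 1][s - 1] if s > 0 else inf
            if move < stay:
                best[t][s] = move + cost[t][s]
                back[t][s] = s - 1
            else:
                best[t][s] = stay + cost[t][s]
                back[t][s] = s

    # Backtrack from the last state at the last frame.
    s = num_states - 1
    path = [s]
    for t in range(num_frames - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    path.reverse()

    starts = [path.index(state) for state in range(num_states)]
    return path, starts, best[num_frames - 1][num_states - 1]


# Toy example (hypothetical distances): 6 frames, 3 states.
toy_cost = [
    [0, 9, 9],
    [0, 9, 9],
    [9, 0, 9],
    [9, 0, 9],
    [9, 9, 0],
    [9, 9, 0],
]
path, starts, total = forced_align(toy_cost)
# path is [0, 0, 1, 1, 2, 2]; the boundary marks fall at frames 2 and 4.
```

Because only the summed distance is minimized, a boundary may shift by several frames whenever doing so lowers the cost elsewhere on the path, which is one way to see why a separate refinement step is commonly applied afterwards.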
Various techniques have been used to refine the boundary locations. These techniques include using Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs), Neural Networks (NNs) and Maximum Likelihood Probabilities (MLPs) to portray the boundary property. Some techniques have included classifying speech units by phonemic context, such as Vowels, Nasals, Liquids, etc., where a refining model was trained for each group. However, this classification is coarse, so that the phonemic context within a single group may vary greatly. For example, /i/ and /u/, which are often clustered into the Vowel group, have quite different formant trajectories. Modeling them with the same refining model causes a loss in precision. An ideal solution is to train an individual model for each pair of speech unit boundaries. However, there are normally too few manually labeled boundaries to train so many individual models.
Although various approaches have been tried to refine segmental boundaries for TTS speech units, none have achieved superior results, and thus improvements are continually needed.