1. Field of the Invention
The present invention relates to automatic labeling used for speech recognition and speech synthesis, and more particularly, to a method and apparatus for rapidly and accurately performing automatic labeling by tuning a phoneme boundary by using an optimum-partitioned classified neural network of an MLP (Multi-Layer Perceptron) type.
2. Description of the Related Art
A speech corpus is a large collection of computer-readable speech data. It serves as the basis for extracting synthesis units and for deriving the phoneme and rhythm rules necessary for speech synthesis. In speech recognition and speaker recognition, the speech corpus is an essential resource for training and evaluating a recognition algorithm. In addition to speech data, the speech corpus contains index information about the speech data. A specified word or sentence may therefore be accessed and heard immediately, and speech materials including particular phoneme sequences or phonetic phenomena may be searched at will. Because speaker information is also included in the speech corpus, speech phenomena that vary by speaker may be analyzed. Labeling provides the additional information about phonetic classifications that makes such searches possible. Units of labeling include the phoneme, syllable, word, and sentence.
In general, automatic phoneme labeling is performed without user intervention, given a phoneme sequence and speech waveform data. In practice, however, because the results of automatic labeling differ from the results of manual labeling, a tuning operation must usually be performed after the automatic labeling. The tuning operation is performed repeatedly and relies on the experience of a user, who listens to the speech signal in synchronization with a manual labeling result. Considerable time is spent on this operation, and thus high-speed automatic labeling cannot be achieved.
In a phoneme labeling technique using an HMM (Hidden Markov Model), acoustic feature variables are segmented through a probability modeling procedure. Since the variables for the probability modeling are generated from a large speech corpus, the model generated for the total training data may be considered an optimum model. However, a phoneme segmentation technique using probability modeling cannot reflect the physical characteristics of the acoustic feature variables of a speech signal. Accordingly, phoneme labeling using the HMM does not account for the acoustic changes that actually exist at a boundary between phonemes. On the other hand, speech segmentation techniques that reflect acoustic changes segment a speech signal using only the transition features of the acoustic feature variables and, unlike automatic labeling, give little consideration to context information. It is therefore difficult to apply a speech segmentation technique directly to automatic labeling.
Methods that adapt speech segmentation techniques to automatic labeling include, for example, a post-processing technique that tunes a result of automatic labeling. This method does not perform phoneme segmentation by speech segmentation itself. Instead, it first performs phoneme segmentation using an HMM and then tunes the segmentation by moving each detected phoneme boundary within a relatively small tuning range. Post-processing techniques include techniques using Gaussian model functions and techniques using neural networks to detect a phoneme boundary. In the latter case, several feature variables based on MFCC (Mel Frequency Cepstral Coefficients) are used as input variables of a neural network, error values are calculated at an output node against target values of 0 and 1 that indicate whether the present input feature variables correspond to a phoneme boundary, and the coefficients of the neural network are learned using a back-propagation algorithm. Because the neural network is not based on probability modeling, this method has the advantage that it can be used directly for automatic labeling. However, because the learned coefficients frequently converge to a local optimum rather than a global optimum, depending on the initial set of coefficients and the learning data, label information tuned using the neural network may include more errors than label information obtained with only the HMM.
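The post-processing step described above, in which an HMM boundary is moved within a small region around its initial position, may be illustrated by the following minimal Python sketch. It is not the implementation of the cited techniques. It assumes that some boundary detector (for example, a neural network with a single output node) has already produced one score per frame, with higher scores indicating a more likely phoneme boundary. The function name `refine_boundary`, the parameter `half_width`, and the score values are hypothetical and chosen for illustration only.

```python
from typing import Sequence


def refine_boundary(scores: Sequence[float], hmm_frame: int, half_width: int) -> int:
    """Return the frame with the highest boundary score that lies within
    +/- half_width frames of the boundary proposed by the HMM."""
    if not 0 <= hmm_frame < len(scores):
        raise ValueError("hmm_frame must index into scores")
    if half_width < 0:
        raise ValueError("half_width must be non-negative")

    # Clip the search region to the valid frame range.
    lo = max(0, hmm_frame - half_width)
    hi = min(len(scores) - 1, hmm_frame + half_width)

    # Highest score wins; ties go to the frame closest to the HMM boundary.
    return max(range(lo, hi + 1), key=lambda t: (scores[t], -abs(t - hmm_frame)))


# Hypothetical per-frame detector outputs (assumed values, not measured data).
scores = [0.1, 0.2, 0.1, 0.7, 0.9, 0.3, 0.95, 0.1]
print(refine_boundary(scores, hmm_frame=3, half_width=2))
```

In this sketch the frame at index 6 has the highest score overall but lies outside the search region, so the boundary is moved only to index 4. Restricting the search in this way keeps the refined boundary close to the HMM result.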
To address this problem, a neural network is first established using all the learning data. The neural network is then applied adaptively, according to whether applying it yields smaller or larger errors for the characteristics of the left and right phonemes. For this, a relatively simplified method is used in which the left and right phonemes are classified as vowels, consonants, or mute sounds. As such, the learning procedure depends on information supplied by a user, and the learning coefficients are determined only for a predetermined phoneme group.
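The coarse grouping described above may be sketched as follows. This is an illustrative Python sketch under stated assumptions and is not taken from the cited references. The phoneme symbols, the sets `VOWELS` and `MUTES`, and the set `HELPFUL_PAIRS` (the class pairs for which neural network tuning is assumed to reduce errors) are hypothetical; in practice such a set would have to be determined from labeled data.

```python
# Hypothetical phoneme inventories, for illustration only.
VOWELS = {"a", "e", "i", "o", "u"}
MUTES = {"sil", "pau"}


def coarse_class(phoneme: str) -> str:
    """Map a phoneme symbol to 'V' (vowel), 'M' (mute sound), or 'C' (consonant)."""
    if phoneme in MUTES:
        return "M"
    if phoneme in VOWELS:
        return "V"
    return "C"


def boundary_group(left: str, right: str) -> tuple[str, str]:
    """Return the (left class, right class) pair for a phoneme boundary."""
    return coarse_class(left), coarse_class(right)


# Assumed set of class pairs where neural network tuning lowers the error.
HELPFUL_PAIRS = {("V", "C"), ("C", "V"), ("M", "C")}


def should_apply_network(left: str, right: str) -> bool:
    """Apply neural network tuning only for boundary groups assumed to benefit."""
    return boundary_group(left, right) in HELPFUL_PAIRS


print(boundary_group("a", "k"), should_apply_network("a", "k"))
print(boundary_group("a", "e"), should_apply_network("a", "e"))
```

With three classes on each side there are only nine boundary groups, which shows how coarse this partition is and why it depends on a grouping fixed in advance by a user.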
Such a method is described by D. T. Toledano in “Neural Network Boundary Refining for Automatic Speech Segmentation,” Proceedings of ICASSP-2000, pp. 3438-3441 (2000), and E. Y. Park, S. H. Kim and J. H. Chung in “Automatic Speech Synthesis Unit Generation with MLP based Postprocessor against Auto-segmented Phone Errors,” Proceedings of International Joint Conference on Neural Networks, pp. 2985-2990 (1999).