1. Field of the Invention
The present invention relates to the technical field of speech synthesis and, more particularly, to an automatic speech segmentation and verification method and system.
2. Description of Related Art
Currently, in the technical field of speech synthesis, concatenative synthesis based on a large speech corpus has become a popular approach to building speech synthesis systems because of the high acoustic quality and natural prosody it provides. The key issues for such systems include the number of synthesis units, the quality of the recorded material, the decision and selection of synthesis unit types, and the generation of natural prosody and smooth concatenation of synthesis units. Owing to improvements in computer processing power and the ubiquity of high-capacity hard discs, prior art speech synthesis systems store thousands of synthesis units from which proper synthesis units can be found.
In the prior art speech synthesis method that uses a large speech corpus, the main source of synthesis units is a predetermined recording script, which is recorded by professional technicians with professional recording equipment. The computer system then automatically segments the recorded files according to the phonetic information in the recording script to extract speech units for the speech synthesis system.
However, the prior art segmentation position is sometimes incorrect, and a huge recording script requires long performance and recording times, during which even a professional technician can make pronunciation errors, insertion errors, and deletion errors, or introduce co-articulation due to excessively quick speech. Since segmentation accuracy and the quality and correctness of the synthesis units directly affect the quality of the synthesized output speech, it is very important to improve the confidence measure used to sieve out bad recording material so that it can be re-recorded.
In addition to checking for segmentation errors, human effort is also required to check the consistency between the recorded speech and the phonetic transcription of the text script that was supposed to be read during recording. However, manual checking is laborious and easily affected by subjective factors, which can produce totally different opinions about the same recording material rather than a consistent, objective standard.
There are many prior art techniques in the field of speech verification. For example, U.S. Pat. No. 6,292,778 uses a speech recognizer and a task-independent utterance verifier to improve the word/phrase/sentence verification rate. In this technique, the utterance verifier employs subword and anti-subword models to produce first and second likelihoods for each recognized subword in the input speech. The utterance verifier determines a subword verification score as the log ratio of the first and second likelihoods, combines the subword verification scores to produce a word/phrase/sentence verification score, and compares that score to a predetermined threshold.
U.S. Pat. No. 6,125,345 uses a speech recognizer to generate one or more confidence measures; the recognized speech output by the speech recognizer is input to a recognition verifier, which outputs one or more further confidence measures. The confidence measures output by the speech recognizer and the recognition verifier are normalized and then input to an integrator. The integrator uses a multi-layer perceptron (MLP) to integrate the various confidence measures and then determines whether the recognized utterance hypothesis generated by the speech recognizer should be accepted or rejected.
U.S. Pat. No. 5,675,706 uses subword-level verification and string-level verification to determine whether an unknown input speech does indeed contain the recognized keyword, or instead consists of speech or other sounds that do not contain any of the predetermined recognizer keywords. The subword-level verification stage verifies each subword segment in the input speech, as determined by a hidden Markov model (HMM) recognizer, to decide whether that segment consists of the sound corresponding to the subword that the HMM recognizer assigned to it. The string-level verification stage then combines the results of the subword-level verification to make the rejection decision for the whole keyword.
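The subword verification scheme described above (a log-likelihood ratio between subword and anti-subword model scores, combined into an utterance-level score that is compared to a threshold) can be sketched as follows. This is only an illustrative sketch, not the patented implementation; the function names, the toy likelihood values, and the use of simple averaging to combine subword scores are assumptions made for illustration.

```python
import math

def subword_llr(p_subword: float, p_anti: float) -> float:
    """Subword verification score: log ratio of the likelihood from the
    subword model to the likelihood from the anti-subword model."""
    return math.log(p_subword) - math.log(p_anti)

def verify_utterance(likelihood_pairs, threshold):
    """Combine per-subword LLR scores (here simply averaged, as one
    plausible combination rule) into an utterance-level verification
    score, and accept the hypothesis if the score meets the threshold."""
    scores = [subword_llr(p_sub, p_anti) for p_sub, p_anti in likelihood_pairs]
    utterance_score = sum(scores) / len(scores)
    return utterance_score >= threshold, utterance_score

# Hypothetical (subword, anti-subword) likelihoods for three subwords:
pairs = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.3)]
accepted, score = verify_utterance(pairs, threshold=0.0)
```

With these toy values each subword model dominates its anti-model, so the averaged log ratio is positive and the utterance is accepted; likelihoods closer to the anti-model would drive the score below the threshold and trigger rejection.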
However, the aforesaid patents are directed to phonetic verification, which recognizes speech whose text content is "unknown," rather than the "known" text content used in a speech corpus. Furthermore, phonetic verification is mainly used to solve out-of-vocabulary (OOV) problems, whereas the corpus-based speech synthesis technique must ensure that the recording and segmentation of every phonetic unit is correct. Moreover, the target of phonetic verification can be a word, a phrase, or a sentence, which differs from the typical target unit (such as a syllable) of corpus-based speech synthesis.
Therefore, it is desirable to provide an automatic speech segmentation and verification method and related system to mitigate and/or obviate the aforementioned problems.