1. Field of the Invention
The present invention relates to spoken dialog systems and more specifically to improvements in the process of building a text-to-speech voice.
2. Introduction
A dialog system may include a text-to-speech (TTS) voice which synthesizes a human voice as part of a natural language dialog. Building a TTS voice is a complicated and expensive process. Concatenative TTS synthesis requires a database of 250,000 to a million or more correctly labeled half phonemes. Each word consists of a sequence of phonemes that corresponds to the pronunciation of the word. A phoneme is a speaker-independent and context-independent unit of meaningful sound contrast. A half phoneme may refer to a portion of a phoneme. The synthesis of a human voice generally involves receiving text to be “spoken”, such as “how may I help you?”, analyzing the text, selecting the appropriate phonemes, concatenating them, and then producing audio that sounds like a human speaking the words.
Building a TTS voice also involves processing an audio file of words or sentences and labeling the file (manually or automatically). Labeling means determining and noting the start and stop point of each phoneme within the audio file. Since speech is a continuum, it is impossible for humans to label audio consistently. For many years, Automatic Speech Recognition (ASR) has been used to automatically label phonemes. This approach works fairly well, but ASR, even under ideal conditions, has an error rate of a few percent. There are many reasons for this error rate, but the biggest contributors are speaking errors by the people whose voices are recorded to create the audio file, idiosyncratic pronunciations, and natural variation, both free and context-sensitive.
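By way of illustration only, the labeling described above can be pictured as a list of phoneme segments, each with a start and stop point. The following is a minimal sketch under stated assumptions: the names `PhonemeLabel` and `check_labels`, the phoneme symbols, and the times are hypothetical and are not taken from any particular toolkit.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PhonemeLabel:
    """One labeled segment: a phoneme and its start/stop points in seconds."""
    phoneme: str
    start: float
    stop: float


def check_labels(labels: List[PhonemeLabel]) -> bool:
    """Return True if every segment has positive duration and none overlap."""
    previous_stop = 0.0
    for label in labels:
        if label.start < previous_stop or label.stop <= label.start:
            return False
        previous_stop = label.stop
    return True


# Made-up labels for the word "help" (illustrative values only).
example = [
    PhonemeLabel("hh", 0.00, 0.06),
    PhonemeLabel("eh", 0.06, 0.15),
    PhonemeLabel("l", 0.15, 0.22),
    PhonemeLabel("p", 0.22, 0.30),
]
```

A check of this kind catches only structurally impossible labels, such as overlapping segments; it cannot detect a boundary that is plausible but misplaced.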
An example of context-free variation is the optional articulation of word-final /t/, as in “can't” versus “can'”. An example of context-sensitive variation is when word-final /t/ becomes a “flap” when the following word starts with an unstressed vowel and the speaker is speaking in a conversational style. The crux of the problem for voice building is that even if ASR is 99% accurate, in a database of a million phonemes, there will be 10,000 errors. Using traditional methods of voice building, the inventors have seen ASR accuracy on the order of 95-99%, so a voice database built by these methods has so many errors that the overall quality of the finished TTS voice is noticeably degraded. The key to high ASR accuracy is using good speaker-dependent acoustic models and a dictionary that contains all possible variant pronunciations of every word in the lexicon. Then, the ASR is given the exact text that is being read along with every possible variant of every word in the text.
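The arithmetic above, and the idea of supplying the ASR with every variant of every word in a known text, can be sketched as follows. This is an illustrative sketch only: the function names, the `VARIANTS` dictionary, and the phoneme symbols are assumptions made for the example, and an actual recognizer would typically consume such variants as a pronunciation network rather than an enumerated list.

```python
from itertools import product
from typing import Dict, List, Tuple


def expected_label_errors(num_units: int, accuracy_percent: int) -> int:
    """Expected number of mislabeled units at a given accuracy (integer math)."""
    return num_units * (100 - accuracy_percent) // 100


# Hypothetical variant dictionary; phoneme symbols are illustrative only.
VARIANTS: Dict[str, List[str]] = {
    "can't": ["k ae n t", "k ae n"],  # word-final /t/ articulated or dropped
    "help": ["hh eh l p"],
}


def candidate_pronunciations(
    text: str, variants: Dict[str, List[str]]
) -> List[Tuple[str, ...]]:
    """Enumerate every combination of per-word variants for a known text.

    Raises KeyError if a word is missing from the dictionary.
    """
    per_word = [variants[word] for word in text.lower().split()]
    return list(product(*per_word))


print(expected_label_errors(1_000_000, 99))  # 10000
print(expected_label_errors(1_000_000, 95))  # 50000
print(candidate_pronunciations("can't help", VARIANTS))
```

Because the number of combinations grows multiplicatively with the number of words, full enumeration as shown here is practical only for short texts.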
A voice building project involves managing thousands of audio files, text files and dictionaries. Traditionally, a TTS voice is built from 3000-20000 audio and text files. Traditional toolsets are not integrated. A method is needed whereby more than one person can work on a TTS voice building project. As voice building progresses, each utterance goes through a series of states. Any change management system can track states; however, there is no voice building toolkit that integrates change management so that one can request the “next item that needs to be done” and several people can work in parallel.
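One way to picture a request for the “next item that needs to be done” is a shared tracker that hands each worker an unclaimed utterance and advances its state on completion. The following is a minimal sketch under stated assumptions: the class `VoiceBuildTracker`, its methods, and the state names in `STATES` are hypothetical, since the text above says only that each utterance goes through a series of states.

```python
import threading
from typing import Dict, Iterable, List, Optional

# Hypothetical state sequence (assumption; not specified in the text above).
STATES: List[str] = ["recorded", "aligned", "reviewed", "done"]


class VoiceBuildTracker:
    """Tracks the state of each utterance and hands out unclaimed work."""

    def __init__(self, utterance_ids: Iterable[str]) -> None:
        self._lock = threading.Lock()
        self._state: Dict[str, str] = {u: STATES[0] for u in utterance_ids}
        self._claimed_by: Dict[str, str] = {}

    def next_item(self, worker: str) -> Optional[str]:
        """Claim and return the next unclaimed, unfinished utterance, or None."""
        with self._lock:
            for utt, state in self._state.items():
                if state != STATES[-1] and utt not in self._claimed_by:
                    self._claimed_by[utt] = worker
                    return utt
            return None

    def complete(self, utt: str, worker: str) -> str:
        """Advance a claimed utterance to its next state and release the claim."""
        with self._lock:
            if self._claimed_by.get(utt) != worker:
                raise ValueError(f"{utt} is not claimed by {worker}")
            new_state = STATES[STATES.index(self._state[utt]) + 1]
            self._state[utt] = new_state
            del self._claimed_by[utt]
            return new_state
```

The lock ensures that two workers asking at the same moment never receive the same utterance, which is the property that allows several people to work in parallel.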
No matter how good the alignment process is, there will be errors in the final database, and human testers must listen to TTS synthesis to find these errors. Traditionally, this testing was hit-or-miss, and involved listening to hundreds or even thousands of hours of synthesized speech. Accordingly, further improvements in the process of generating a TTS voice are needed.