1. Technical Field
This invention relates to the field of speech synthesis, and more particularly to debugging and tuning of synthesized speech.
2. Description of the Related Art
Synthetic speech generation via text-to-speech (TTS) applications is a critical facet of any human-computer interface that utilizes speech technology. One predominant technology for generating synthetic speech is a data-driven approach which splices samples of actual human speech together to form a desired TTS output. This splicing technique for generating TTS output can be referred to as a concatenative text-to-speech (CTTS) technique.
CTTS techniques require a set of phonetic units that can be spliced together to form TTS output. A phonetic unit can be a recording of a portion of any defined speech segment, such as a phoneme, a sub-phoneme, an allophone, a syllable, a word, a portion of a word, or a plurality of words. A large sample of human speech called a TTS speech corpus can be used to derive the phonetic units that form a TTS voice. Due to the large quantity of phonetic units involved, automatic methods are typically employed to segment the TTS speech corpus into a multitude of labeled phonetic units. Building a phonetic data store from these labeled units can produce the TTS voice. Each TTS voice has the acoustic characteristics of the particular human speaker from whose speech the TTS voice was generated.
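By way of illustration only, the relationship between labeled phonetic units and spliced TTS output can be sketched as follows. This is a minimal sketch under stated assumptions, not the format of any particular TTS engine: the PhoneticUnit fields, the recording names, and the representation of audio as plain lists of samples are all hypothetical, and the splice is a bare concatenation without the smoothing a practical system would apply.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PhoneticUnit:
    """Hypothetical record for one labeled unit cut from the speech corpus."""
    unit_id: int
    label: str       # e.g. a phoneme label assigned during segmentation
    recording: str   # name of the corpus recording that contains the unit
    start: int       # first sample of the unit (inclusive)
    end: int         # sample just past the unit (exclusive)


def concatenate(units: List[PhoneticUnit],
                recordings: Dict[str, List[int]]) -> List[int]:
    """Splice the samples of each selected unit, in order, into one output."""
    output: List[int] = []
    for unit in units:
        samples = recordings[unit.recording]
        output.extend(samples[unit.start:unit.end])
    return output


# Toy corpus: two "recordings" with made-up sample values.
recordings = {
    "rec_a": [0, 1, 2, 3, 4, 5],
    "rec_b": [10, 11, 12, 13],
}
selected = [
    PhoneticUnit(unit_id=1, label="HH", recording="rec_a", start=1, end=3),
    PhoneticUnit(unit_id=2, label="AY", recording="rec_b", start=0, end=2),
]
spliced = concatenate(selected, recordings)
```

Because each unit carries the name of its source recording and its position within that recording, any portion of the spliced output can in principle be traced back to the human speech from which it was taken.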
A TTS voice is built by having a speaker read a pre-defined text. The most basic task of building the TTS voice is computing the precise alignment between the sounds produced by the speaker and the text that was read. At a very simplistic level, the concept is that once a large database of sounds is tagged with phone labels, the correct sound for any text can be found during synthesis. Automatic methods exist for performing the CTTS technique using the phonetic data. However, considerable effort is required to debug and tune the voices generated. Typical problems when synthesizing with a newly built TTS voice include incorrect phonetic alignments, incorrect pronunciations, spectral discontinuities, unnatural prosody and poor audio quality in the pre-recorded segments. These deficiencies can result in poor quality synthesized speech.
Thus, methods have been developed to identify and correct the source of problems in the TTS voices and thereby improve speech quality. These are typically iterative methods that consist of synthesizing sample text and correcting the problems found.
The process for correcting the encountered problems can be very cumbersome. For example, one must first identify the time offset where the speech defect occurs in the synthesized audio. Once the location of the problem has been determined, the log file generated by the TTS engine can be searched to identify the phonetic unit that was used to generate the speech at the specific time offset. From the phonetic unit identifier obtained from this log file, one can determine which recording contains this segment. By consulting the phonetic alignment files, the location of the phonetic unit within the actual recording also can be determined.
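The chain of lookups described above, from a time offset in the synthesized audio, to a phonetic unit identifier, to a location in a corpus recording, can be sketched as follows. This is an illustrative sketch only: the LogEntry structure, the in-memory alignment table, and the file names are assumptions made for the example, since actual log and alignment file formats vary by TTS engine.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical log record: where a unit begins in the synthesized audio."""
    start_sec: float
    unit_id: int


# Hypothetical alignment record: (recording name, start_sec, end_sec).
Alignment = Tuple[str, float, float]


def unit_at_offset(log: List[LogEntry], offset_sec: float) -> int:
    """Return the id of the unit playing at offset_sec.

    Assumes log is sorted by start_sec. The sketch stores no end times,
    so an offset past the last entry maps to the last unit.
    """
    starts = [entry.start_sec for entry in log]
    index = bisect_right(starts, offset_sec) - 1
    if index < 0:
        raise ValueError("offset precedes the first synthesized unit")
    return log[index].unit_id


def trace_defect(log: List[LogEntry],
                 alignment: Dict[int, Alignment],
                 offset_sec: float) -> Alignment:
    """Map a defect's time offset to the recording span that produced it."""
    return alignment[unit_at_offset(log, offset_sec)]


# Made-up data standing in for parsed log and alignment files.
log = [LogEntry(0.00, 17), LogEntry(0.12, 42), LogEntry(0.30, 8)]
alignment = {
    17: ("rec_001.wav", 3.50, 3.62),
    42: ("rec_014.wav", 1.10, 1.28),
    8: ("rec_007.wav", 0.40, 0.55),
}
located = trace_defect(log, alignment, 0.20)
```

Even in this simplified form, the trace requires two separate data sources to be parsed and cross-referenced before the problematic audio can be inspected, and in practice each lookup is performed by hand.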
At this point, the recording containing this problematic audio segment can be displayed using an appropriate audio editing application. For instance, a user can first launch the audio editing application and then load the appropriate file. The defective audio segment at the location obtained from the phonetic alignment files can then be analyzed. If the audio editing application supports the display of labels, labels such as phonetic labels, voicing labels, and the like can be displayed, depending on the nature of the problem. If a correction to the TTS voice is required, accessing, searching and editing additional data files may be required.
It should be appreciated that identifying and correcting the source of problems in synthesized speech using the method described above is very laborious, tedious and inefficient. Thus, what is needed is a method of simplifying the debugging and tuning process so that this process can be performed much more quickly and with fewer steps.