This relates to speech synthesis and, more particularly, to databases from which sound units are obtained to synthesize speech.
While good quality speech synthesis is attainable using concatenation of a small set of controlled units (e.g. diphones), the availability of large speech databases permits a text-to-speech system to more easily synthesize natural sounding voices. When employing an approach known as unit selection, the available large variety of basic units with different prosodic characteristics and spectral variations reduces, or entirely eliminates, the prosodic modifications that the text-to-speech system may need to carry out. By removing the necessity of extended prosodic modifications, a higher naturalness of the synthetic speech is achieved.
While having many different instances of each basic unit is strongly desired, variable voice quality is not. If it exists, it will not only make the concatenation task more difficult but will also result in synthetic speech whose voice quality changes even within the same sentence. Depending on the variability of the voice quality of the database, a synthetic sentence can be perceived as being "rough," even if a smoothing algorithm is used at each concatenation instant, and perhaps even as if different speakers uttered various parts of the sentence. In short, inconsistencies in voice quality within the same unit-selection speech database can degrade the overall quality of the synthesis. Of course, the unit selection procedure can be made highly discriminative to disallow mismatches in voice quality but, then, the synthesizer will use only part of the database, even though time (and money) were invested to make the complete database available (recording, phonetic labeling, prosodic labeling, etc.).
Recording large speech databases for speech synthesis is a very long process, ranging from many days to months. The duration of each recording session can be as long as 5 hours (including breaks, instructions, etc.) and the time between recording sessions can be more than a week. Thus, the probability of variations in voice quality from one recording session to another (inter-session variability) as well as during the same recording session (intra-session variability) is high.
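The distinction between intra-session and inter-session variability can be illustrated with a standard variance decomposition. The following is a minimal sketch only, not a method described here: it assumes that each recorded utterance has already been reduced to a single hypothetical scalar feature (for example, an average spectral tilt in dB), and the function name `session_variability` and the session labels are illustrative assumptions.

```python
from statistics import mean


def session_variability(sessions):
    """Split the total variance of a per-utterance feature into two parts.

    `sessions` maps a session label to a list of scalar feature values,
    one per utterance (a hypothetical feature, assumed for illustration).
    Returns (intra, inter): the within-session and between-session
    components, which sum to the population variance of all values.
    """
    all_values = [v for values in sessions.values() for v in values]
    total_count = len(all_values)
    grand_mean = mean(all_values)

    intra = 0.0
    inter = 0.0
    for values in sessions.values():
        session_mean = mean(values)
        # Spread of utterances around their own session's mean.
        intra += sum((v - session_mean) ** 2 for v in values)
        # Offset of the session's mean from the mean of the whole database.
        inter += len(values) * (session_mean - grand_mean) ** 2

    return intra / total_count, inter / total_count


if __name__ == "__main__":
    # Toy values, invented purely to exercise the function.
    toy = {"session_1": [1.0, 3.0], "session_2": [5.0, 7.0]}
    intra, inter = session_variability(toy)
    print(f"intra-session: {intra}, inter-session: {inter}")
```

Under these assumptions, a large inter-session component relative to the intra-session component would suggest that voice quality drifts between recording sessions rather than within them.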
The detection of voice quality differences in the database is a difficult task because the database is large. A listener has to remember the quality of the voice from different recording sessions, not to mention the sheer time that checking a complete store of recordings would take.
The problem of assessing and correcting voice quality has some similarity to speaker adaptation problems in speech recognition. In the latter, "data oriented" compensation techniques have been proposed that attempt to filter noisy speech feature vectors to produce "clean" speech feature vectors. However, in the recognition problem, it is the recognition score that is of interest, regardless of whether the adapted speech feature vector really matches that of "clean" speech or not.
The above discussion clearly shows the difficulty of our problem: not only is automatic detection of quality desired, but any modification or correction of the signal has to result in speech of very high quality. Otherwise the overall attempt to correct the database has no meaning for speech synthesis. While consistency of voice quality in a unit-selection speech database is, therefore, important for high-quality speech synthesis, no method for automatic voice quality assessment and correction has been proposed yet.