A text-to-speech (TTS) synthesis system converts text inputs (e.g., words, characters, syllables, or morae expressed as Unicode strings) into synthesized speech waveforms, which can be reproduced by a machine, such as a data processing system. A typical text-to-speech synthesis system consists of two components: a text processing step that converts the text input into a symbolic linguistic representation, and a sound synthesizer that converts the symbolic linguistic representation into actual sound output. The text processing step typically assigns a phonetic transcription to each word and divides the text input into prosodic units. Together, the phonetic transcriptions and the prosodic information form the symbolic linguistic representation of the text input.
There are two main synthesizer technologies for generating synthetic speech waveforms. Concatenative synthesis is based on the concatenation of segments of recorded speech, and generally gives the most natural-sounding synthesized speech. The other technology is formant synthesis, in which the output speech is generated using an acoustic model with time-varying parameters such as fundamental frequency, voicing, and noise level. Other synthesis methods include articulatory synthesis, which is based on a computational model of the human vocal tract; hybrid synthesis, which combines concatenative and formant synthesis; and Hidden Markov Model (HMM)-based synthesis.
In concatenative text-to-speech synthesis, the speech waveform corresponding to a given sequence of phonemes is generated by concatenating pre-recorded segments of speech. These segments are often extracted from carefully selected sentences uttered by a professional speaker, and stored in a database known as a voice table. Each such segment is typically referred to as a unit. A unit may be a phoneme, a diphone (the span between the middle of a phoneme and the middle of another), or a sequence thereof. A phoneme is a phonetic unit in a language that corresponds to a set of similar speech realizations (like the velar \k\ of cool and the palatal \k\ of keel) perceived to be a single distinctive sound in the language.
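As an illustration of the diphone unit described above, the following minimal sketch splits a phoneme sequence into the diphone labels that would have to be looked up in a voice table. The function name to_diphones and the boundary symbol "sil" are assumptions introduced here for illustration only; they do not come from any particular system.

```python
from typing import List, Tuple

# Hypothetical symbol marking silence at the start and end of an utterance.
SILENCE = "sil"


def to_diphones(phonemes: List[str]) -> List[Tuple[str, str]]:
    """Pair each phoneme with its successor, padding both ends with silence.

    Each pair stands for the span from the middle of the first phoneme
    to the middle of the second.
    """
    padded = [SILENCE] + list(phonemes) + [SILENCE]
    return list(zip(padded, padded[1:]))


# "cool" written with illustrative phoneme labels.
diphones = to_diphones(["k", "uw", "l"])
```

A sequence of n phonemes yields n + 1 diphones under this padding convention, since each phoneme boundary, including the two utterance edges, becomes one unit.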
In a typical concatenative synthesis system, a text phrase input is first converted into an input phonetic data sequence, which is a symbolic linguistic representation of the text phrase input. A unit selector then retrieves from the speech segment database (voice table) descriptors of candidate speech units that can be concatenated to realize the target phonetic data sequence. The unit selector creates an ordered list of candidate speech units and assigns a target cost to each candidate. Candidate-to-target matching is based on symbolic feature vectors, such as phonetic context and prosodic context, together with numeric descriptors, and determines how well each candidate fits the target specification. The unit selector also determines which candidate speech units can be concatenated without causing disturbing quality degradations, such as clicks or pitch discontinuities, based on a quality degradation cost function. This function uses candidate-to-candidate matching with frame-based information, such as energy, pitch, and spectral information, to determine how well the candidates can be joined together. The job of the selection algorithm is thus to find units in the database that both best match the target specification and join together smoothly. The best sequence of candidate speech units is selected for output to a speech waveform concatenator. The speech waveform concatenator requests the output speech units (e.g., diphones and/or polyphones) from the speech unit database and concatenates them to form the output speech that represents the input text phrase.
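The interplay of the target cost and the quality degradation (join) cost can be sketched as a dynamic programming search over the candidate lists. The following is a minimal sketch under stated assumptions, not the method of any particular system: the Candidate fields, the function select_units, and the two cost functions in the usage example are all hypothetical, and each unit is reduced to a single pitch and energy value rather than frame-based information.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Candidate:
    unit_id: str      # hypothetical identifier of a voice-table entry
    pitch_hz: float   # simplification: one pitch value per unit
    energy: float     # simplification: one energy value per unit


def select_units(
    candidates: Sequence[Sequence[Candidate]],
    target_cost: Callable[[int, Candidate], float],
    join_cost: Callable[[Candidate, Candidate], float],
) -> List[Candidate]:
    """Return the candidate sequence with the lowest summed target and join cost."""
    if not candidates or any(len(slot) == 0 for slot in candidates):
        raise ValueError("every target position needs at least one candidate")

    # best[i][j]: cheapest total cost of any path ending in candidates[i][j].
    best = [[target_cost(0, c) for c in candidates[0]]]
    # back[i][j]: index of the predecessor chosen for candidates[i][j].
    back: List[List[int]] = [[-1] * len(candidates[0])]

    for i in range(1, len(candidates)):
        row_cost, row_back = [], []
        for cand in candidates[i]:
            # Pick the predecessor that minimizes path cost plus join cost.
            prev_j = min(
                range(len(candidates[i - 1])),
                key=lambda j: best[i - 1][j] + join_cost(candidates[i - 1][j], cand),
            )
            row_cost.append(
                best[i - 1][prev_j]
                + join_cost(candidates[i - 1][prev_j], cand)
                + target_cost(i, cand)
            )
            row_back.append(prev_j)
        best.append(row_cost)
        back.append(row_back)

    # Trace back from the cheapest candidate at the final position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    return path[::-1]


# Usage with made-up numbers: two target positions, two candidates each.
target_pitch = [120.0, 125.0]
slots = [
    [Candidate("a1", 118.0, 0.5), Candidate("a2", 140.0, 0.5)],
    [Candidate("b1", 180.0, 0.5), Candidate("b2", 126.0, 0.6)],
]
chosen = select_units(
    slots,
    target_cost=lambda i, c: abs(c.pitch_hz - target_pitch[i]),
    join_cost=lambda a, b: abs(a.pitch_hz - b.pitch_hz) + abs(a.energy - b.energy),
)
```

In this toy example the search picks a1 followed by b2, because both units lie close to their target pitch and the pitch step between them is small. Passing the costs in as functions keeps the search independent of how a given system actually measures fit and smoothness.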
The quality of the synthetic speech resulting from concatenative text-to-speech (TTS) synthesis is heavily dependent on the underlying inventory of units, i.e., the voice table database. A great deal of attention is typically paid to issues such as coverage (i.e., whether all possible units are represented in the voice table), consistency (i.e., whether the speaker adheres to the same style throughout the recording process), and recording quality (i.e., whether the signal-to-noise ratio is as high as possible at all times).
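Coverage, as defined above, can be made concrete with a small calculation. The sketch below assumes diphone units and treats every ordered pair of phonemes as a required unit, which overstates the requirement for a real language, where many pairs never occur. The function name diphone_coverage and the toy phoneme set are hypothetical.

```python
from itertools import product
from typing import Iterable, Set, Tuple

Diphone = Tuple[str, str]


def diphone_coverage(
    phoneme_set: Iterable[str],
    voice_table_units: Iterable[Diphone],
) -> Tuple[float, Set[Diphone]]:
    """Return the fraction of required diphones present, and the missing ones.

    Assumption: every ordered phoneme pair counts as a required unit.
    """
    required = set(product(set(phoneme_set), repeat=2))
    if not required:
        raise ValueError("phoneme_set must not be empty")
    present = required & set(voice_table_units)
    return len(present) / len(required), required - present


# Toy inventory: two phonemes, three of the four possible diphones recorded.
ratio, missing = diphone_coverage(
    ["a", "b"],
    [("a", "a"), ("a", "b"), ("b", "a")],
)
```

Here three of the four required diphones are present, so the coverage ratio is 0.75 and the pair ("b", "b") is reported as missing.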
The issue of coverage is particularly salient because speech quality inevitably degrades when an alternative unit must be substituted for the optimal one that is missing from the voice table. Conversely, the availability of many unit candidates permits prosodic and other linguistic variations in the speech output stream. Achieving higher coverage usually means recording a larger corpus, especially when the basic unit is polyphonic, as in the case of words. Voice tables with a footprint close to 1 GB are now routine in server-based applications. The next generation of TTS systems could easily bring an order-of-magnitude increase in the size of the typical database, as more and more acoustico-linguistic events are included in the corpus to be recorded. The following prior art describes speech synthesis systems: U.S. Patent Application Publication No. 2005/0182629; "Impact of Durational Outliers Removal from Unit Selection Catalogs," by John Kominek and Alan W. Black, 5th ISCA Speech Synthesis Workshop, Pittsburgh; and "Automatically Clustering Similar Units for Unit Selection in Speech Synthesis," by Alan W. Black and Paul Taylor, 1997.
Unfortunately, such large sizes are not practical for deployment in certain data processing environments. Even after applying standard file compression techniques, the resulting TTS system may be too big to ship as part of the distribution of a software package, such as an operating system.
It would therefore be desirable to develop a fully unsupervised, fully scalable solution for pruning a voice table, reducing the size of the database while maintaining coverage.