The present invention relates to synthesis-based pre-selection of suitable units for concatenative speech and, more particularly, to the utilization of a table containing many thousands of synthesized sentences for selecting units from a unit selection database.
A current approach to concatenative speech synthesis is to use a very large database for recorded speech that has been segmented and labeled with prosodic and spectral characteristics, such as the fundamental frequency (F0) for voiced speech, the energy or gain of the signal, and the spectral distribution of the signal (i.e., how much of the signal is present at any given frequency). The database contains multiple instances of speech sounds. This multiplicity permits the possibility of having units in the database that are much less stylized than would occur in a diphone database (a xe2x80x9cdiphonexe2x80x9d being defined as the second half of one phoneme followed by the initial half of the following phoneme, a diphone database generally containing only one instance of any given diphone). Therefore, the possibility of achieving natural speech is enhanced with the xe2x80x9clarge databasexe2x80x9d approach.
For good quality synthesis, this database technique relies on being able to select the xe2x80x9cbestxe2x80x9d units from the databasexe2x80x94that is, the units that are closest in character to the prosodic specification provided by the speech synthesis system, and that have a low spectral mismatch at the concatenation points between phonemes. The xe2x80x9cbestxe2x80x9d sequence of units may be determined by associating a numerical cost in two different ways. First, a xe2x80x9ctarget costxe2x80x9d is associated with the individual units in isolation, where a lower cost is associated with a unit that has characteristics (e.g., F0, gain, spectral distribution) relatively close to the unit being synthesized, and a higher cost is associated with units having a higher discrepancy with the unit being synthesized. A second cost, referred to as the xe2x80x9cconcatenation costxe2x80x9d, is associated with how smoothly two contiguous units are joined together. For example, if the spectral mismatch between units is poor, there will be a higher concatenation cost.
Thus, a set of candidate units for each position in the desired sequence can be formulated, with associated target costs and concatenative costs. Estimating the best (lowest-cost) path through the network is then performed using, for example, a Viterbi search. The chosen units may then concatenated to form one continuous signal, using a variety of different techniques.
While such database-driven systems may produce a more natural sounding voice quality, to do so they require a great deal of computational resources during the synthesis process. Accordingly, there remains a need for new methods and systems that provide natural voice quality in speech synthesis while reducing the computational requirements.
The need remaining in the prior art is addressed by the present invention, which relates to synthesis-based pre-selection of suitable units for concatenative speech and, more particularly, to the utilization of a table containing many thousands of synthesized sentences as a guide to selecting units from a unit selection database.
In accordance with the present invention, an extensive database of synthesized speech is created by synthesizing a large number of sentences (large enough to create millions of separate phonemes, for example). From this data, a set of all triphone sequences is then compiled, where a xe2x80x9ctriphonexe2x80x9d is defined as a sequence of three phonemesxe2x80x94or a phoneme xe2x80x9ctripletxe2x80x9d. A list of units (phonemes) from the speech synthesis database that have been chosen for each context is then tabulated.
During the actual text-to-speech synthesis process, the tabulated list is then reviewed for the proper context and these units (phonemes) become the candidate units for synthesis. A conventional cost algorithm, such as a Viterbi search, can then be used to ascertain the best choices from the candidate list for the speech output. If a particular unit to be synthesized does not appear in the created table, a conventional speech synthesis process can be used, but this should be a rare occurrence,
Other and further aspects of the present invention will become apparent during the course of the following discussion and by reference to the accompanying drawings.