The present invention relates to a system and method for increasing the speed of a unit selection synthesis system for concatenative speech synthesis and, more particularly, to predetermining a universe of phonemesxe2x80x94selected on the basis of their triphone contextxe2x80x94that are potentially used in speech. Real-time selection is then performed from the created phoneme universe.
A current approach to concatenative speech synthesis is to use a very large database for recorded speech that has been segmented and labeled with prosodic and spectral characteristics, such as the fundamental frequency (F0) for voiced speech, the energy or gain of the signal, and the spectral distribution of the signal (i.e., how much of the signal is present at any given frequency). The database contains multiple instances of speech sounds. This multiplicity permits the possibility of having units in the database that are much less stylized than would occur in a diphone database (a xe2x80x9cdiphonexe2x80x9d being defined as the second half of one phoneme followed by the initial half of the following phoneme, a diphone database generally containing only one instance of any given diphone). Therefore, the possibility of achieving natural speech is enhanced with the xe2x80x9clarge databasexe2x80x9d approach.
For good quality synthesis, this database technique relies on being able to select the xe2x80x9cbestxe2x80x9d units from the databasexe2x80x94that is, the units that are closest in character to the prosodic specification provided by the speech synthesis system, and that have a low spectral mismatch at the concatenation points between phonemes. The xe2x80x9cbestxe2x80x9dsequence of units may be determined by associating a numerical cost in two different ways. First, a xe2x80x9ctarget costxe2x80x9d is associated with the individual units in isolation, where a lower cost is associated with a unit that has characteristics (e.g., F0, gain, spectral distribution) relatively close to the unit being synthesized, and a higher cost is associated with units having a higher discrepancy with the unit being synthesized. A second cost, referred to as the xe2x80x9cconcatenation costxe2x80x9d, is associated with how smoothly two contiguous units are joined together. For example, if the spectral mismatch between units is poor, perhaps even corresponding to an audible xe2x80x9cclickxe2x80x9d, there will be a higher concatenation cost.
Thus, a set of candidate units for each position in the desired sequence can be formulated, with associated target costs and concatenative costs. Estimating the best (lowest-cost) path through the network is then performed using a Viterbi search. The chosen units may then be concatenated to form one continuous signal, using a variety of different techniques.
While such database-driven systems may produce a more natural sounding voice quality, to do so they require a great deal of computational resources during the synthesis process. Accordingly, there remains a need for new methods and systems that provide natural voice quality in speech synthesis while reducing the computational requirements.
The need remaining in the prior art is addressed by the present invention, which relates to a system and method for increasing the speed of a unit selection synthesis system for concatenative speech and, more particularly, to predetermining a universe of phonemes in the speech database, selected on the basis of their triphone context, that are potentially used in speech, and performing real-time selection from this precalculated phoneme universe.
In accordance with the present invention, a triphone database is created where for any given triphone context required for synthesis, there is a complete list, precalculated, of all the units (phonemes) in the database that can possibly be used in that triphone context. Advantageously, this list is (in most cases) a significantly smaller set of candidates units than the complete set of units of that phoneme type. By ignoring units that are guaranteed not to be used in the given triphone context, the selection process speed is significantly increased. It has also been found that speech quality is not compromised with the unit selection process of the present invention.
Depending upon the unit required for synthesis, as well as the surrounding phoneme context, the number of phonemes in the preselection list will vary and may, at one extreme, include all possible phonemes of a particular type. There may also arise a situation where the unit to be synthesized (plus context) does not match any of the precalculated triphones. In this case, the conventional single phoneme approach of the prior art may be employed, using the complete set of phonemes of a given type. It is presumed that these instances will be relatively infrequent.