1. Field of Invention
The invention relates to a method and apparatus for performing concatenative speech synthesis using half-phonemes. In particular, a technique is provided for combining two methods of speech synthesis to achieve a level of quality that is superior to either technique used in isolation.
2. Description of Related Art
There are two categories of speech synthesis techniques frequently used today, diphone synthesis and unit selection synthesis. In diphone synthesis, a diphone is defined as the second half of one phoneme followed by the initial half of the following phoneme. At the cost of having N.times.N (capital N being the number of phonemes in a language or dialect) speech recordings, i.e., diphones, in a database, one can achieve high quality synthesis. An appropriate sequence of diphones are concatenated into one continuous signal using a variety of techniques (e.g., time-domain Pitch Synchronis Overlap and Add (TD-PSOLA)). For example, in English, N would equal between 40-45 phonemes depending on regional accent and the phoneme set definition.
This approach does not, however, completely solve the problem of providing smooth concatenation, nor does it solve the problem of providing natural-sounding synthetic speech. There is generally some spectral envelope mismatch at the concatenation boundaries. For severe cases, depending on the treatment of the signals, a signal may exhibit glitches or there may be degradation in the clarity of the speech. Consequently, a great deal of effort is often spent on choosing appropriate diphone units that will not have these defects irrespective of which other units they are matched with. Thus, in general, much effort is devoted to preparing a diphone set and selecting sequences that are suitable for recording and in verifying that the recordings are suitable for the diphone set.
Another approach to concatenative synthesis is to use a very large database for recorded speech that has been segmented and labeled with prosodic and spectral characteristics, such as the fundamental frequency (F0) for voiced speech, the energy or gain of the signal, and the spectral distribution of the signal (i.e., how much of the signal is present at any given frequency). The database contains multiple instances of speech sounds. This permits the possibility of having units in the database which are much less stylized than would occur in a diphone database where generally only one instance of any given diphone is assumed. Therefore, the possibility of achieving natural speech is enhanced.
For good quality synthesis, this technique relies on being able to select units from the database, currently only phonemes or a string of phonemes, that are close in character to the prosodic specification provided by the speech synthesis system, and that have a low spectral mismatch at the concatenation points. The "best" sequence of units is determined by associating a numerical cost in two different ways. First, a cost (target cost) is associated with the individual units (in isolation) so that a lower cost results if the unit has approximately the desired characteristics, and a higher cost results if the unit does not resemble the required unit. A second cost (concatenation cost) is associated with how smoothly units are joined together. Consequently, if the spectral mismatch is bad, there is a high cost associated, and if the spectral mismatch is low, a low cost is associated.
Thus, a set of candidate units for each position in the desired sequences (with associated costs), and a set of costs associated with joining any one to its neighbors, results. This constitutes a network of nodes (with target costs) and links (with concatenation costs). Estimating the best (lowest-cost) path through the network is done using a technique called Viterbi search. The chosen units are then concatenated to form one continuous signal using a variety of techniques.
This technique permits synthesis that sounds very natural at times but more often sounds very bad. In fact, intelligibility can be lower than for diphone synthesis. For the technique to adequately work, it is necessary to do extensive searching for suitable concatenation points even after the individual units have been selected. This is because phoneme boundaries are frequently not the best place to try to concatenate two segments of speech.