1. Field of the Invention
The invention relates to a method for defining a sequence of sound modules for synthesis of a speech signal in a tonal language, corresponding to a predetermined sequence of speech modules.
2. Description of the Related Art
Automatic, computer-implemented methods for synthesis of tonal languages, such as Chinese (in particular Mandarin) or Thai, normally use sound modules which each represent one syllable, since tonal languages generally have relatively few syllables. These sound modules are concatenated to form a speech signal, in which process it is necessary to take into account the fact that the meaning of the syllables is dependent on the pitch.
Since these known methods have a set of sound modules which must include all the syllables in various variants and contexts, a considerable amount of computation power is required in a computer to carry out this process automatically. This computation power is often not available in mobile telephone applications.
In applications with a high level of computation power, the known methods for synthesis of tonal languages have the disadvantage that the given set of syllables does not allow correct synthesis of specific expressions which contain syllables that are not stored in this set, even though sufficient computation power may be available.
These known methods have been proven in practice. However, they are not very flexible, since they frequently cannot be adapted to applications where there is little computation power, and they do not fully utilize the capabilities provided by high computation power.
A method for speech synthesis, which relates to synthesis of European languages, is explained in the thesis “Konkatenative Sprachsynthese mit großen Datenbanken” [Concatenative speech synthesis using large databanks], Martin Holzapfel, TU Dresden, 2000. In this method, individual sounds are stored in their specific left and right context as sound modules. Based on “The HTK book, version 2.2”, Steve Young, Dan Kershaw, Julian Odell, Dave Ollason, Valtcho Valtchev and Phil Woodland, Entropic Ltd., Cambridge 1999, these sound modules are referred to as triphones. In this sense, a triphone is a sound module of an individual phone, in which the context of a preceding phone and of a subsequent phone is taken into account.
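The notion of a sound in its left and right context can be illustrated with a minimal Python sketch. The function name `to_triphones`, the boundary symbol `sil`, and the example phone sequence are assumptions made for illustration only; the `left-phone+right` label format follows the convention commonly used with HTK and is not taken from the cited thesis.

```python
def to_triphones(phones, boundary="sil"):
    """Label each phone with its preceding and subsequent neighbour.

    `boundary` is a hypothetical placeholder used as context at the
    start and end of the sequence.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Illustrative phone sequence (not from the source)
labels = to_triphones(["h", "a", "l", "o"])
print(labels)
```

Each label describes exactly one phone; the neighbours only qualify the context in which that phone was recorded.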
In this known method, a group of sound modules (triphones) is stored in a databank for each speech module, which generally comprises one letter. Suitability functions are used to determine suitability distances for sound modules in the respective speech modules, with the suitability distances quantitatively describing the suitability of the respective sound module for representation of the speech module, or of the sequence of the speech modules. The suitability distances can in this case be determined using the following criteria:
    representativeness of the sound modules;
    manipulation of the sound duration;
    manipulation of the sound energy;
    manipulation of the fundamental frequency.
When determining the representativeness of the sound modules, a typical spectral centroid of the group of sound modules is defined, and a value which is inversely proportional to the spectral distance between the respective sound module and the centroid is defined as the suitability distance.
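The representativeness criterion can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the cited thesis: each sound module is assumed to be summarized by a fixed-length spectral feature vector, the spectral distance is assumed to be Euclidean, and the small constant `eps` is added only to avoid division by zero. All names and numbers are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a group of feature vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def spectral_distance(a, b):
    """Euclidean distance (an assumed choice of spectral distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def suitability_distances(vectors, eps=1e-9):
    """Value inversely proportional to the distance from the centroid."""
    c = centroid(vectors)
    return [1.0 / (spectral_distance(v, c) + eps) for v in vectors]


# Three hypothetical sound modules of one group
group = [[1.0, 2.0], [1.2, 2.1], [3.0, 0.5]]
scores = suitability_distances(group)
closest = max(range(len(group)), key=scores.__getitem__)
print(scores, closest)
```

With this definition, the sound module lying closest to the centroid receives the largest value.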
When sound modules are concatenated, the fundamental frequency must be manipulated, as a result of which the sound duration and sound energy are also influenced. The corresponding suitability functions are used to determine a measure of the discrepancy from the original state of the sound module as a result of the manipulation.
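One possible way to quantify such a discrepancy is sketched below. The source does not specify the form of these suitability functions, so the absolute log ratio used here, the function name `manipulation_discrepancy`, and the example values are purely assumptions chosen for illustration.

```python
import math


def manipulation_discrepancy(original, manipulated):
    """Absolute log ratio between the original and the manipulated value.

    Returns 0.0 when nothing was changed and grows symmetrically for
    raising or lowering the value by the same factor. Both arguments
    must be positive.
    """
    return abs(math.log(manipulated / original))


# Hypothetical sound module: fundamental frequency raised from 100 Hz
# to 120 Hz, duration shortened from 80 ms to 70 ms.
f0_cost = manipulation_discrepancy(100.0, 120.0)
duration_cost = manipulation_discrepancy(80.0, 70.0)
print(f0_cost, duration_cost)
```

The same function could be applied to the sound energy; how the individual measures are weighted against one another is left open here.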
A method for determining a sound module which is representative of the speech module is known from DE 197 36 465.9. In this document, the suitability functions are referred to as association functions, and the suitability distance is referred to as the selection measure. Otherwise, this method corresponds to the method described in the thesis cited above.