1. Field of the Invention
The invention relates to a digital speech-synthesis process.
2. The Prior Art
Essentially three processes are known for the synthetic generation of speech with computers.
In formant synthesis, the resonance properties of the human vocal tract and their variations in the course of speaking, which are caused by the movements of the speech organs, are simulated by a filtered excitation signal. Such resonances are characteristic of the structure of vowels and their perception. To limit the computing expenditure, only the first three to five formants of a speech signal are generated synthetically with the excitation source. Therefore, with this type of synthesis, the memory requirements in a computer are low. Furthermore, the duration and the fundamental frequency of the excitation waveforms can be changed easily. However, the drawback is that an extensive rule set is needed for speech output, which often requires the use of digital processors. A further disadvantage is that the speech output sounds unnatural and metallic, and that it is particularly weak for nasals and obstruents, i.e., plosives /p, t, k, b, d, g/, affricates /pf, ts, tS/ and fricatives /f, v, s, z, S, Z, C, j, x, h/.
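The filtered-excitation principle described above can be illustrated with a minimal sketch: an impulse train at the fundamental frequency is passed through a cascade of two-pole digital resonators, one per formant. The function names, the sample rate, and the formant frequencies and bandwidths are illustrative assumptions and are not taken from this document.

```python
import math

def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients of a two-pole resonator y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    c = -r * r
    b = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to one
    return a, b, c

def apply_resonator(signal, coeffs):
    """Filter a list of samples with one resonator."""
    a, b, c = coeffs
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(f0_hz, duration_s, sample_rate):
    """Simplified voiced excitation: one unit impulse per fundamental period."""
    n = int(duration_s * sample_rate)
    period = int(round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def synthesize_vowel(f0_hz, formants, duration_s=0.2, sample_rate=16000):
    """Pass the excitation through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq, bw in formants:
        signal = apply_resonator(signal, resonator_coeffs(freq, bw, sample_rate))
    return signal

# Hypothetical formant values (Hz) chosen only for illustration.
EXAMPLE_FORMANTS = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
samples = synthesize_vowel(120.0, EXAMPLE_FORMANTS)
```

In this sketch only a few numbers per sound need to be stored, and changing `f0_hz` or `duration_s` alters pitch and duration directly, which reflects the low memory requirements and easy modification noted above.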
In the present text, the letters shown between slashes (//) represent sound symbols according to the SAMPA notation; cf. Wells, J.; Barry, W. J.; Grice, M.; Fourcin, A.; Gibbon, D. [1992]; Standard Computer-Compatible Transcription, in: ESPRIT PROJECT 2589 (SAM); Multi-Lingual Speech Input/Output Assessment, Methodology and Standardization; Final Report; Doc. SAM-UCL-037, pp 29 ff.
In articulatory synthesis, the acoustic conditions in the vocal tract are modeled, so that the articulatory gestures and movements during speaking are simulated mathematically. Thus an acoustic model of the vocal tract is computed, which leads to substantial computing expenditure and which requires a high computing capacity. However, the automatic speech generated this way still sounds unnatural and technical.
Furthermore, concatenation synthesis is known, in which parts of actually spoken utterances are concatenated in such a way that new utterances are generated. The individual speech segments thus form units for the generation of speech. The size of the segments may range from words and phrases down to parts of sounds, depending on the field of application. Demi-syllables or smaller demi-units can be used for speech synthesis with an unlimited vocabulary. Larger units are useful only if a limited vocabulary is to be synthesized.
In systems which do not use resynthesis, the choice of the correct cutting point of the speech components is decisive for the synthesis quality, and melodic and spectral jumps have to be avoided. Concatenative synthesis processes then achieve, especially with larger units, a more natural sound than the other methods. Furthermore, the controlling expenditure for generating the sounds is quite low. The limitations of this process lie in the relatively high memory requirements for the speech components needed. Another limitation is that components, once recorded, can be changed (e.g. in duration or frequency) in the known systems only by costly resynthesis methods, which, furthermore, have an adverse effect on the sound of the speech and its comprehensibility. For this reason, a number of different realizations of a speech unit are also recorded, which, however, increases memory requirements.
The concatenation synthesis processes essentially comprise four synthesis methods that permit speech synthesis without limitation of the vocabulary.
A concatenation of sounds or phones is carried out in phone synthesis. For Western European languages with a sound inventory of about 30 to 50 sounds and an average sound duration of about 150 ms, the memory requirements are acceptably low. However, these speech signal units lack the perceptually important transitions between the individual sounds, which, furthermore, can be recreated only incompletely by fading over individual sounds or by even more complicated resynthesis methods. The quality of synthesis is, therefore, not satisfactory. Even storing allophonic variants of sounds in separate speech signal units, as in so-called allophone synthesis, does not significantly enhance the speech result, because the articulatory-acoustic dynamics are disregarded.
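Fading over between two stored sounds can be sketched as a linear crossfade at the joint. The function name, the overlap length, and the toy sample lists below are assumptions made for illustration; as stated above, such fading recreates the natural transitions only incompletely.

```python
def crossfade_concat(left, right, overlap):
    """Join two sample lists, fading `left` out and `right` in over `overlap` samples."""
    if overlap < 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must fit inside both units")
    if overlap == 0:
        return list(left) + list(right)
    head = list(left[:len(left) - overlap])
    tail = list(right[overlap:])
    mixed = []
    for i in range(overlap):
        # Weight of the incoming unit, strictly between 0 and 1.
        w = (i + 1) / (overlap + 1)
        mixed.append((1.0 - w) * left[len(left) - overlap + i] + w * right[i])
    return head + mixed + tail

unit_a = [1.0] * 8   # stand-in for the samples of one stored sound
unit_b = [0.0] * 8   # stand-in for the samples of the next stored sound
joined = crossfade_concat(unit_a, unit_b, overlap=4)
```

The joined signal is shorter than the two units laid end to end by the overlap length, because the overlapping samples are mixed rather than appended.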
The most widely applied form of concatenation synthesis is diphone synthesis, which employs speech signal units reaching from the middle of an acoustically defined speech sound up to the middle of the next speech sound. The perceptually important transitions from one sound to the next, which appear in the acoustic signal as a result of the movements of the speech organs, are taken into account in this way. Furthermore, the speech signal units are thus concatenated at spectrally relatively constant places, which reduces potential disturbances of the signal flow at the joints between the individual diphones. The sound inventory of Western European languages consists of 35 to 50 sounds. For a language with 40 sounds, this theoretically results in 1600 diphones, a number that is reduced in practice to about 1000 by phonotactic constraints. In natural speech, unstressed and stressed sounds differ in sound quality and duration. In some systems, different diphones are recorded for stressed and unstressed sound pairs in order to take these differences adequately into account in the synthesis. Therefore, 1000 to 2000 diphones with an average duration of about 150 ms are required depending on the projected configuration, resulting in a memory requirement for the speech signal units of up to 23 MB, depending on the requirements with respect to dynamics and signal bandwidth. A common value is approximately 8 MB.
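The inventory size and the order of magnitude of the memory requirement follow from simple arithmetic, sketched below. The sample rates and the sample width are illustrative assumptions, since the text states only that the result depends on dynamics and signal bandwidth; the function names are likewise hypothetical.

```python
def diphone_count(num_sounds):
    """Theoretical number of ordered sound pairs before phonotactic constraints."""
    return num_sounds * num_sounds

def inventory_bytes(num_units, avg_duration_s, sample_rate_hz, bytes_per_sample):
    """Uncompressed storage needed for an inventory of speech signal units."""
    return round(num_units * avg_duration_s * sample_rate_hz * bytes_per_sample)

theoretical = diphone_count(40)   # 1600, as stated in the text

# Assumed recording formats (not taken from the text):
small = inventory_bytes(1000, 0.150, 16_000, 2)   # 16 kHz, 16 bit
large = inventory_bytes(2000, 0.150, 40_000, 2)   # 40 kHz, 16 bit
print(theoretical, small / 1e6, large / 1e6)
```

Under these assumed formats the estimate spans a few million to a few tens of millions of bytes, which is the same order of magnitude as the figures quoted above.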
The triphone and the demi-syllable syntheses are based on a principle similar to that of diphone synthesis. In this case too, the cutting point is located in the middle of the sounds. However, larger units are covered, which permits larger phonetic contexts to be taken into account. The number of combinations, however, increases correspondingly. In demi-syllable synthesis, one cutting point for the units used is in the middle of the vowel of a syllable. The other cutting point is at the beginning or at the end of a syllable, so that, depending on the syllable structure, speech signal units can consist of sequences of several consonants. In German, about 52 different sound sequences exist in starting syllables of morphemes, and about 120 sound sequences for medial or final syllables of morphemes, resulting in a theoretical number of 6240 demi-syllables for the German language, of which some are uncommon. As demi-syllables are mostly longer than diphones, the memory requirements for the speech signal units considerably exceed those for diphones.
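How quickly the inventory grows with larger units can be checked with a short calculation. The function name is hypothetical, and the 40-sound inventory is used only as an example; the German figures are those quoted in the text, combined as a simple product, which is what the stated total implies.

```python
def theoretical_units(num_sounds, context_length):
    """Number of ordered sound sequences of a given length, ignoring phonotactics."""
    return num_sounds ** context_length

diphones = theoretical_units(40, 2)    # 1600
triphones = theoretical_units(40, 3)   # 64000

# Figures for German quoted in the text.
initial_sequences = 52
medial_or_final_sequences = 120
demi_syllables = initial_sequences * medial_or_final_sequences  # 6240
```

Each additional sound of context multiplies the theoretical inventory by the size of the sound inventory, before uncommon or impossible sequences are removed.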
The substantial memory requirements therefore pose the greatest problem in connection with a high-quality speech synthesis system. To reduce these requirements, it has been proposed, for example, to exploit the silence in the closure of plosives by using it for all plosive closures. A speech synthesis system is known from EP 0 144 731 B1, where segments of diphones are used for several sounds. Said document describes a speech synthesizer which stores speech signal units generated by dividing a pair of sounds and associates such units with defined expression symbols. A synthesizing device reads the standard speech signal units from the memory in accordance with the output symbols of the converted sequence of expression symbols. Depending on the speech segment of the input symbols, two read standard speech signal units are either connected directly, when the input speech segment is voiceless, or joined by a preset first interpolation process, when the input speech segment is voiced; the same standard signal unit is used both for a voiced sound /g, d, b/ and for its corresponding voiceless sound /k, t, p/. Furthermore, standard speech signal units representing the vowel segment after a consonant or the vowel segment preceding a consonant are to be stored in the memory as well. The transition ranges from a consonant to a vowel, or from a vowel to a consonant, can be equated in each case for the consonants /k/ and /g/, /t/ and /d/, as well as /p/ and /b/. The memory requirements are reduced in this way; however, the aforementioned interpolation process requires a not insignificant computing expenditure.
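The unit sharing described for EP 0 144 731 can be pictured as a lookup in which a voiced plosive is redirected to the stored unit of its voiceless counterpart, while the voicing decides how units are joined. This is only a schematic sketch: all names are hypothetical, and the interpolation process itself is not modeled.

```python
# Voiced plosives reuse the stored unit of the corresponding voiceless plosive
# (SAMPA symbols, pairs as given in the text).
SHARED_UNIT = {"g": "k", "d": "t", "b": "p"}
VOICED_PLOSIVES = set(SHARED_UNIT)

def lookup_key(consonant, vowel):
    """Key of the stored consonant-vowel transition unit (hypothetical layout)."""
    return (SHARED_UNIT.get(consonant, consonant), vowel)

def joining_method(consonant):
    """Direct connection for voiceless segments, interpolation for voiced ones."""
    return "interpolate" if consonant in VOICED_PLOSIVES else "direct"

# /ga/ and /ka/ resolve to the same stored unit but are joined differently.
print(lookup_key("g", "a"), joining_method("g"))
print(lookup_key("k", "a"), joining_method("k"))
```

In this picture the saving in memory comes from the shared keys, and the extra computation comes from the branch that selects interpolation.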
A process for the synthesis of speech is known from DE 27 40 520 A1, in which each phone is formed by a phoneme stored in a memory, the periods of sound oscillations being obtained from natural speech or synthesized artificially. The text to be synthesized is grammatically and phonetically analyzed sentence by sentence according to the rules of the language. In addition to the periods of the sound oscillations, each phoneme is assigned certain types and a number of time slices of noise phonemes with the respective duration, amplitude, and spectral distribution. The periods of the sound oscillations and the elements of the noise phonemes are stored in a memory in digital form as a sequence of amplitude values of the respective oscillation, and are changed in the reading process according to the frequency characteristic or in order to increase the naturalness.
Accordingly, a digital speech synthesis process according to the concatenation principle and conforming to the introductory part of patent claim 1 is known from that document.
So as to make memory requirements as low as possible, individual periods of sound oscillations with characteristic formant distribution are stored according to the synthesis process of DE 27 40 520 A1. While maintaining the basic characteristics of the sentence, the types and the number of stored periods of sound oscillations associated with each phoneme are determined and then jointly form the acoustic speech impression. Accordingly, extremely short units of the length of a period of the basic oscillation of a sound are recalled from the memory and successively repeated depending on the number of repetitions previously determined. In order to realize smooth phoneme transitions, synthetic periods with formant distributions which correspond to the transition between phonemes are used, or the amplitudes within the range of the respective transitions are reduced.
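The recall-and-repeat principle can be sketched as follows. A single period is stored once and played back a predetermined number of times. Here the stored period is a synthetic stand-in built from a few harmonics, not a period taken from natural speech, and the function names, sample rate, and harmonic amplitudes are assumptions for illustration.

```python
import math

def single_period(f0_hz, sample_rate, harmonics):
    """One period of a waveform built from (harmonic_number, amplitude) pairs."""
    n = int(round(sample_rate / f0_hz))
    return [
        sum(amp * math.sin(2.0 * math.pi * k * i / n) for k, amp in harmonics)
        for i in range(n)
    ]

def repeat_period(period, repetitions):
    """Recall one stored period and play it back `repetitions` times in succession."""
    return period * repetitions

# 100 Hz fundamental at 8 kHz gives a period of 80 samples.
stored = single_period(100.0, 8000, [(1, 1.0), (2, 0.5), (3, 0.25)])
segment = repeat_period(stored, 10)
```

Only the 80 samples of one period are stored for the whole segment, but every repetition is sample-for-sample identical.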
The drawback is that no adequate naturalness of the speech reproduction is achieved, because of the multiple reproduction of identical period segments, which may be reduced or extended only synthetically, if need be. Moreover, the substantially reduced memory requirements are gained at the expense of increased analysis and interpolation expenditure, costing computing time.
A process similar to the speech-synthesis process of DE 27 40 520 A1 is known from WO 85/04747, which, however, is based on a completely synthetic generation of the speech segments. The speech segments, which represent phonemes or transitions, are generated from synthetic waveforms, which are reproduced repeatedly in a predetermined manner, if necessary reduced in length and/or voiced. Especially at phoneme transitions, an inverted reproduction of certain units is used as well. A drawback of this process, too, is that even though the memory requirements are considerably reduced, a substantial computing capacity is required due to extensive analyzing and synthesizing processes. Furthermore, the speech reproduction lacks natural variance.