The invention relates to methods and devices for speech synthesis. It relates more particularly to synthesis from a dictionary of sound elements (also known as component sounds), in which the text to be synthesized is divided into microframes, each identified by the order number of a corresponding sound element and by prosodic parameters (information concerning the pitch, or sound height, at the beginning and at the end of the sound element, and the duration of the sound element), and in which the sound elements are then adapted and concatenated by an overlap-add procedure.
The sound elements stored in the dictionary will frequently be diphones, i.e. transitions between phonemes, which makes it possible, for the French language, to make do with a dictionary of about 1300 sound elements; different sound elements may however be used, for example syllables or even words. The prosodic parameters are determined as a function of criteria relating to the context: the pitch, which corresponds to the intonation, depends on the position of the sound element in a word and in the sentence, and the duration given to the sound element depends on the rhythm of the sentence.
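By way of illustration only, the microframe description and the overlap-add concatenation mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure of the invention: the names Microframe, synthesis_marks, hann and overlap_add are hypothetical, the pitch is assumed to vary linearly between its start and end values, and a Hann window is assumed for the short-term signals.

```python
import math
from dataclasses import dataclass


@dataclass
class Microframe:
    """One microframe: a dictionary entry plus its prosodic parameters."""
    element_index: int     # order number of the sound element in the dictionary
    pitch_start_hz: float  # pitch at the beginning of the sound element
    pitch_end_hz: float    # pitch at the end of the sound element
    duration_s: float      # duration given to the sound element


def synthesis_marks(frame, sample_rate):
    """Sample positions of successive pitch periods across the microframe.

    Assumption: the pitch is interpolated linearly from start to end.
    """
    total = int(round(frame.duration_s * sample_rate))
    marks = []
    pos = 0.0
    while pos < total:
        marks.append(int(round(pos)))
        frac = pos / total
        f0 = frame.pitch_start_hz + frac * (frame.pitch_end_hz - frame.pitch_start_hz)
        pos += sample_rate / f0  # advance by one pitch period, in samples
    return marks


def hann(n):
    """Periodic Hann window of length n (assumed window shape)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def overlap_add(segments, positions):
    """Sum each segment into the output, starting at its sample position."""
    length = max(pos + len(seg) for seg, pos in zip(segments, positions))
    out = [0.0] * length
    for seg, pos in zip(segments, positions):
        for i, value in enumerate(seg):
            out[pos + i] += value
    return out


if __name__ == "__main__":
    frame = Microframe(element_index=42, pitch_start_hz=100.0,
                       pitch_end_hz=120.0, duration_s=0.05)
    print(synthesis_marks(frame, 8000))
```

In this sketch a rising pitch simply brings the marks closer together, so that the windowed segments placed on them overlap more; how the segments themselves are obtained from the stored sound elements is left outside the example.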
It should be recalled that speech synthesis methods are divided into two groups. Those which use a mathematical model of the vocal tract (linear prediction synthesis, formant synthesis and fast Fourier transform synthesis) rely on a deconvolution of the source and of the transfer function of the vocal tract and generally require about 50 arithmetic operations per digital sample of the speech before digital-to-analog conversion and restoration.
This deconvolution of the source and the vocal tract makes it possible to modify the fundamental frequency of voiced sounds, namely sounds which have a harmonic structure and are caused by vibration of the vocal cords, and to compress the data representing the speech signal.
Those which belong to the second group of processes use time-domain synthesis by concatenation of waveforms. This solution has the advantage of flexibility in use and the possibility of considerably reducing the number of arithmetic operations per sample. On the other hand, it is not possible to reduce the bit rate required for transmission as much as in the methods based on a mathematical model. This drawback is of no consequence, however, when good restoration quality is essential and there is no requirement to transmit data over a narrow channel.
Speech synthesis according to the present invention belongs to the second group. It finds a particularly important application in the transformation of an orthographic string (formed for example by the text delivered by a printer) into a speech signal, for example restored directly or transmitted over a normal telephone line.
A speech synthesis process from sound elements using a short-term signal overlap-add technique is already known ("Diphone synthesis using an overlap-add technique for speech waveforms concatenation", Charpentier et al., ICASSP 1986, IEEE-IECEJ-ASJ International Conference on Acoustics, Speech and Signal Processing, pp. 2015-2018). But it relates to short-term synthesis signals with normalization of the overlap of the synthesis windows, obtained by a very complex procedure:
analysis of the original signal by windowing synchronous with the voicing;
Fourier transform of the short-term signal;
envelope detection;
homothetic transformation (scaling) of the frequency axis on the spectrum of the source;
weighting of the modified source spectrum by the envelope of the original signal;
inverse Fourier transform.
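To give an idea of the amount of processing involved, the six steps listed above can be sketched for a single short-term signal as follows. This is a rough illustration under stated assumptions and not the algorithm of the cited paper: the function name modify_short_term_signal and its parameters are hypothetical, the envelope is detected here by a simple moving average of the magnitude spectrum, the frame is assumed to be already centered on a pitch mark, and the frequency axis is scaled by linear interpolation of the complex source spectrum.

```python
import numpy as np


def modify_short_term_signal(frame, scale, smooth_bins=9):
    """Scale the frequency axis of the source while keeping the envelope.

    frame       : samples of one analysis frame (assumed centered on a pitch mark)
    scale       : homothetic factor applied to the frequency axis of the source
    smooth_bins : width of the moving average used as envelope detector (assumption)
    """
    n = len(frame)

    # Steps 1 and 2: windowing, then Fourier transform of the short-term signal.
    spectrum = np.fft.rfft(frame * np.hanning(n))

    # Step 3: envelope detection (assumed here: smoothed magnitude spectrum).
    kernel = np.ones(smooth_bins) / smooth_bins
    envelope = np.convolve(np.abs(spectrum), kernel, mode="same") + 1e-12

    # The source spectrum is what remains once the envelope is divided out.
    source = spectrum / envelope

    # Step 4: homothetic transformation of the frequency axis of the source.
    # Output bin k takes the source value found at bin k / scale.
    bins = np.arange(len(source), dtype=float)
    positions = bins / scale
    stretched = (np.interp(positions, bins, source.real, right=0.0)
                 + 1j * np.interp(positions, bins, source.imag, right=0.0))

    # Step 5: weighting of the modified source spectrum by the original envelope.
    modified = stretched * envelope

    # Step 6: inverse Fourier transform back to a short-term synthesis signal.
    return np.fft.irfft(modified, n=n)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.standard_normal(256)
    print(modify_short_term_signal(frame, scale=1.2).shape)
```

Even in this simplified form, each short-term signal requires two Fourier transforms and several passes over the spectrum, which gives a sense of why the procedure is described as very complex.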