The present invention relates to the field of speech synthesis and concerns a new technique to synthesize speech from unrestricted written text.
Text-to-speech synthesis is usually obtained by computing, for each sentence to be synthesized, an intonation contour and the sequence of spectral features that represents the phonetic information to be synthesized. Correct spectral representation of speech is a major issue in speech synthesis. The prior art methods stem from two general approaches: concatenation synthesis and synthesis by rules.
Concatenation synthesis is based on a suitable representation, usually Linear Predictive Coding (LPC), of prerecorded segments of speech, which are stretched and joined together in order to construct the desired synthetic speech.
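By way of illustration only, the stretching and joining of prerecorded segments described above can be sketched as follows. The sketch assumes that each segment is stored as a list of parameter frames (for example, one vector of LPC coefficients per frame) and that stretching is done by simple frame repetition or dropping. The function names `stretch` and `concatenate`, the frame values, and the durations are hypothetical and are not taken from any particular prior art system.

```python
from typing import List, Sequence

Frame = List[float]  # one parameter vector per analysis frame (assumed layout)


def stretch(frames: Sequence[Frame], target_len: int) -> List[Frame]:
    """Resample a segment to target_len frames by repeating or dropping frames."""
    if not frames or target_len < 1:
        raise ValueError("need a non-empty segment and a positive target length")
    n = len(frames)
    # Map each output index to the nearest earlier input frame.
    return [frames[min(n - 1, (i * n) // target_len)] for i in range(target_len)]


def concatenate(segments: Sequence[Sequence[Frame]],
                durations: Sequence[int]) -> List[Frame]:
    """Stretch each segment to its requested duration and join the results."""
    if len(segments) != len(durations):
        raise ValueError("one duration is required per segment")
    out: List[Frame] = []
    for frames, duration in zip(segments, durations):
        out.extend(stretch(frames, duration))
    return out


# Two toy segments with one-dimensional frames (illustrative values only).
seg_a = [[0.1], [0.2]]
seg_b = [[0.5], [0.6], [0.7], [0.8]]
utterance = concatenate([seg_a, seg_b], [4, 2])
```

In this toy case the first segment is lengthened from two frames to four and the second is shortened from four frames to two. A practical system would also smooth the parameters at each junction, which the sketch omits.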
Synthesis by rules, also known as formant synthesis, provides a spectral description of the steady state of each phoneme. The spectra between two adjacent phonemes are then interpolated on the basis of rules drawn up by human phoneticians.
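The interpolation step can be illustrated with a minimal sketch. It assumes that each phoneme's steady state is described by a short vector of formant frequencies and it uses plain linear interpolation as a stand-in for the rules written by phoneticians, which in practice are considerably more elaborate. The function name `interpolate_spectra` and the formant targets are hypothetical, rounded values chosen only for the example.

```python
from typing import List, Sequence


def interpolate_spectra(start: Sequence[float],
                        end: Sequence[float],
                        n_frames: int) -> List[List[float]]:
    """Return n_frames spectra moving linearly from start to end (inclusive)."""
    if len(start) != len(end):
        raise ValueError("steady-state descriptions must have the same size")
    if n_frames < 2:
        raise ValueError("at least two frames are needed for a transition")
    frames: List[List[float]] = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # 0.0 at the first phoneme, 1.0 at the second
        frames.append([(1.0 - t) * a + t * b for a, b in zip(start, end)])
    return frames


# Assumed steady-state formant targets in Hz (F1, F2, F3), illustrative only.
TARGET_A = [700.0, 1200.0, 2600.0]
TARGET_I = [300.0, 2300.0, 3000.0]

transition = interpolate_spectra(TARGET_A, TARGET_I, 5)
```

The first and last frames of `transition` equal the two steady-state targets, and the intermediate frames lie on the straight line between them. Replacing the linear weighting with hand-tuned, phoneme-specific trajectories is where the phonetic knowledge mentioned above comes in.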
The drawbacks of the prior art are that the first method requires a large set of segments (hundreds or more), which must be extracted from natural speech, and that the second method requires a high degree of phonetic knowledge. These requirements, together with the intrinsic complexity of the rules, have limited the dissemination of synthesizers using the above methods. Furthermore, a text-to-speech synthesizer is generally strictly language dependent. Phonetic rules vary from one language to another, as do the speech segments to be used in concatenation synthesis, so that the complexity of customizing a synthesizer to another language is close to that of designing a completely new synthesizer.