Various types of speech synthesizers are known. Most operate using a repertoire of phonemes or allophones, which are generated in sequence to synthesise corresponding utterances. A review of some types of speech synthesizers may be found in A. Breen “Speech Synthesis Models: A Review”, Electronics and Communication Engineering Journal, pages 19–31, February 1992. Some types of speech synthesizers attempt to model the production of speech by using a source-filter approximation utilising, for example, linear prediction. Others record stored segments of actual speech, which are output in sequence.
A major difficulty with synthesised speech is to make the speech sound natural. There are many reasons why synthesised speech may sound unnatural. However, a particular problem with the latter class of speech synthesizers, utilising recorded actual speech, is that the same recording of each vowel or allophone is used on each occasion where the vowel or allophone in question is required. This becomes even more noticeable in those synthesizers where, to generate a sustained sound, a short segment of the phoneme or allophone is repeated several times in sequence.
The present invention, in one aspect, provides a speech synthesizer in which a speech waveform is directly synthesised by selecting a synthetic starting value and then selecting and outputting a sequence of further values, the selection of each further value being based jointly upon the value which preceded it and upon a model of the dynamics of actual recorded human speech.
Thus, a synthesised sequence of any required duration can be generated. Furthermore, since the progression of the sequence depends upon its starting value, different sequences corresponding to the same phoneme or allophone can be generated by selecting different starting values.
The present inventors have previously reported (“Speech characterisation by non-linear methods”, M. Banbrook and S. McLaughlin, submitted to IEEE Transactions on Speech and Audio Processing, 1996; “Speech characterisation by non-linear methods”, M. Banbrook and S. McLaughlin, presented at IEEE Workshop on non-linear signal and image processing, pages 396–400, 1995) that voiced speech, with which the present invention is primarily concerned, appears to behave as a low dimensional, non-linear, non-chaotic system. Voiced speech is essentially cyclical, comprising a time series of pitch pulses of similar, but not identical, shape. Therefore, in a preferred embodiment, the present invention utilises a low dimensional state space representation of the speech signal, in which successive pitch pulse cycles are superposed, to estimate the progression of the speech signal within each cycle and from cycle to cycle.
This estimate of the dynamics of the speech signal is useful in enabling the synthesis of a waveform which does not correspond to the recorded speech on which the analysis of the dynamics was based, but which consists of cycles of a similar shape and exhibiting a similar variability to those on which the analysis was based.
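By way of illustration only, the following Python sketch shows one way in which a waveform might be generated from a starting value and an estimate of the dynamics. It is a simplified stand-in and not the embodiment itself: the dynamics are approximated by averaging the one-step displacements of the `k` nearest recorded points in a three-dimensional delay-coordinate state space, and the names `embed` and `synthesise`, the parameters `delay` and `k`, and the synthetic "recording" are all assumptions made for this example.

```python
import numpy as np

def embed(x, delay, dim=3):
    """Stack delayed copies of x so each row is one state-space point."""
    n = len(x) - (dim - 1) * delay
    return np.stack([x[i * delay:i * delay + n] for i in range(dim)], axis=1)

def synthesise(recorded, start, n_samples, delay=10, k=5):
    """Generate n_samples values from a starting point (hypothetical helper).

    Each new point is the previous point plus the mean one-step displacement
    of its k nearest recorded neighbours: a crude local model of the dynamics.
    """
    states = embed(np.asarray(recorded, dtype=float), delay)
    current, following = states[:-1], states[1:]
    point = np.asarray(start, dtype=float)
    out = np.empty(n_samples)
    for i in range(n_samples):
        dist = np.linalg.norm(current - point, axis=1)
        nearest = np.argsort(dist)[:k]
        point = point + (following[nearest] - current[nearest]).mean(axis=0)
        out[i] = point[0]  # first coordinate is the output waveform value
    return out

# Synthetic stand-in for recorded voiced speech: a periodic pulse with a
# little cycle-to-cycle jitter (assumed values, not real speech data).
rng = np.random.default_rng(0)
n = np.arange(400)
recorded = np.sin(2 * np.pi * n / 80) + 0.3 * np.sin(4 * np.pi * n / 80)
recorded = recorded + 0.01 * rng.standard_normal(n.size)

states = embed(recorded, delay=10)
out_a = synthesise(recorded, states[0], n_samples=200)   # one starting value
out_b = synthesise(recorded, states[20], n_samples=200)  # a different one
```

The output length is set by `n_samples` rather than by the length of the recording, and the two calls start from different points and so produce different sequences from the same recorded data.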
For example, the state space representation may be based on Takens' Method of Delays (F. Takens, “Dynamical Systems and Turbulence”, Vol. 898 of Lecture Notes in Mathematics, pages 366–381, Berlin: Springer, 1981). In this method, the different axes of the state space consist of waveform values separated by predetermined time intervals, so that a point in state space is defined by a set of values at t1, t2, t3 (where t2−t1=Δ1 and t3−t2=Δ2, which are both constants and may be equal).
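The construction of such state space points may be sketched in Python as follows. This is a minimal example under stated assumptions: the function name `delay_embed`, the delay of 10 samples for both Δ1 and Δ2, the 8 kHz sampling rate and the synthetic test signal are illustrative choices and are not taken from the method described above.

```python
import numpy as np

def delay_embed(x, delays=(10, 10)):
    """Return rows [x(t1), x(t2), x(t3)] with t2 - t1 = delays[0] and
    t3 - t2 = delays[1], the delays being expressed in samples."""
    x = np.asarray(x, dtype=float)
    offsets = np.concatenate(([0], np.cumsum(delays)))
    n_points = len(x) - offsets[-1]
    if n_points <= 0:
        raise ValueError("signal too short for the requested delays")
    return np.stack([x[o:o + n_points] for o in offsets], axis=1)

# Synthetic periodic signal standing in for voiced speech (assumed values).
fs = 8000
t = np.arange(400) / fs
signal = np.sin(2 * np.pi * 100 * t) + 0.3 * np.sin(2 * np.pi * 200 * t)

points = delay_embed(signal, delays=(10, 10))
```

Each row of `points` is one point in a three-dimensional state space. Because the test signal is periodic, successive cycles trace the same closed curve, which corresponds to the superposition of pitch pulse cycles referred to above.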
Another current problem with synthesised speech is that where different sounds are concatenated together into a sequence, the “join” is sometimes audible, giving rise to audible artifacts such as a faint modulation at the phoneme rate in the synthesised speech.
Accordingly, in another aspect the present invention provides a method and apparatus for synthesising speech in which an interpolation is performed between state space representations of the two speech sounds to be concatenated, or, in general, between correspondingly aligned portions of each pitch period of the two sounds. Thus, one pitch pulse shape is gradually transformed into another.
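A simplified Python sketch of such an interpolation is given below. It assumes, for illustration only, that the two sounds have already been reduced to single pitch cycles of equal length, so that corresponding samples are aligned, and that the interpolation weight is held constant within each cycle and stepped linearly from cycle to cycle. The names `cyclic_embed` and `morph_cycles`, the delay of 10 samples and the two synthetic pulse shapes are hypothetical.

```python
import numpy as np

def cyclic_embed(cycle, delay, dim=3):
    """Delay-embed one pitch cycle, wrapping around at the cycle boundary."""
    return np.stack([np.roll(cycle, -i * delay) for i in range(dim)], axis=1)

def morph_cycles(cycle_a, cycle_b, n_cycles, delay=10):
    """Interpolate between aligned state-space points of two pitch cycles,
    moving from cycle_a to cycle_b over n_cycles cycles."""
    a = cyclic_embed(np.asarray(cycle_a, dtype=float), delay)
    b = cyclic_embed(np.asarray(cycle_b, dtype=float), delay)
    if a.shape != b.shape:
        raise ValueError("cycles must be aligned to the same length")
    weights = np.linspace(0.0, 1.0, n_cycles)
    # The first state-space coordinate of each interpolated point is output.
    return np.concatenate([((1 - w) * a + w * b)[:, 0] for w in weights])

# Two synthetic pitch pulse shapes with the same period (assumed values).
period = 80
phase = 2 * np.pi * np.arange(period) / period
cycle_a = np.sin(phase) + 0.3 * np.sin(2 * phase)
cycle_b = np.sin(phase) + 0.3 * np.sin(3 * phase)

morphed = morph_cycles(cycle_a, cycle_b, n_cycles=5)
```

In this sketch the first output cycle reproduces `cycle_a`, the last reproduces `cycle_b`, and the intermediate cycles have intermediate shapes, so that no abrupt join occurs between the two sounds.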