The invention relates to a method of coding an audio equivalent signal. The invention also relates to an apparatus for coding an audio equivalent signal. The invention further relates to a method of synthesising an audio equivalent signal from encoded signal fragments.
The invention also relates to a system for synthesising an audio equivalent signal from encoded audio equivalent input signal fragments. The invention further relates to a synthesiser.
The invention relates to a parametric production model for coding an audio equivalent signal. A widely used coding technique based on a parametric production model is the so-called Linear Predictive Coding (LPC) technique. This technique is particularly used for coding speech. The coded signal may, for instance, be transferred via a telecommunications network and decoded (resynthesised) at the receiving station, or may be used in a speech synthesis system to synthesise speech output representing, for instance, textual input. According to the LPC model, the spectral energy envelope of an audio equivalent signal is described in terms of an optimum all-pole filter and a gain factor that matches the filter output to the input level. For speech, a binary voicing decision determines whether a periodic impulse train or white noise excites the LPC synthesis filter. For running speech, the model parameters, i.e. voicing, pitch period, gain and filter coefficients, are updated every frame, with a typical duration of 10 msec. This reduces the bit rate drastically. Although a classical LPC vocoder can produce intelligible speech, it often sounds rather buzzy. LPC is based on autocorrelation analysis and simply ignores the phase spectrum. The synthesis is minimum phase. A limitation of the classical LPC is the binary selection of either a periodic or a noise source. In natural speech both sources often act simultaneously, not only in voiced fricatives but also in many other voiced sounds. An improved LPC coding technique is known from “A mixed excitation LPC vocoder model for low bit rate speech coding”, McCree and Barnwell, IEEE Transactions on Speech and Audio Processing, Vol. 3, No. 4, July 1995. According to this coding technique, a filter bank is used to split the input signal into a number of, for instance five, frequency bands.
For each band, the relative pulse and noise power is determined by an estimate of the voicing strength at that frequency in the input speech. The voicing strength in each frequency band is chosen as the larger of the correlation of the bandpass filtered input speech and the correlation of the envelope of the bandpass filtered speech. The LPC synthesis filter is excited by a frequency-weighted sum of a pulse train and white noise.
In general the quality obtained by LPC is relatively low and therefore LPC is mainly used for communication purposes at low bitrates (e.g. 2400/4800 bps). Even the improved LPC coding is not suitable for systems, such as speech synthesis (text-to-speech), where a high quality output is desired. Using the LPC coding methods a great deal of naturalness is still lacking. This has hampered large scale application of synthetic speech in e.g. telephone services or automatic traffic information systems in a car environment.
It is an object of the invention to provide a parametric coding/synthesis method and system which enables the production of more natural speech.
To meet the object of the invention, the method of coding an audio equivalent signal comprises:
determining successive pitch periods/frequencies in the signal;
forming a sequence of mutually overlapping or adjacent analysis segments by positioning a chain of time windows with respect to the signal and weighting the signal according to an associated window function of the respective time window;
for each of the analysis segments:
determining an amplitude value and a phase value for a plurality of frequency components of the analysis segment, including a plurality of harmonic frequencies of the pitch frequency corresponding to the analysis segment,
determining a noise value for each of the frequency components by comparing the phase value for the frequency component of the analysis segment to a corresponding phase value for at least one preceding or following analysis segment; the noise value for a frequency component representing a contribution of a periodic component and an aperiodic component to the analysis segment at the frequency; and
representing the analysis segment by the amplitude value and the noise value for each of the frequency components.
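By way of illustration only, the segment-forming step of the method above may be sketched as follows. The sketch assumes a Hann window spanning two local pitch periods and window centres that are already known; the window shape, the window span and the names `hann`, `analysis_segments`, `centres` and `periods` are assumptions made for the example and are not prescribed by the method.

```python
import math

def hann(length):
    """Symmetric Hann window of the given length (assumed window shape)."""
    if length == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]

def analysis_segments(signal, centres, periods):
    """Cut one windowed segment per window centre.

    centres: sample index of each window centre (hypothetical input,
             e.g. derived from the detected pitch periods).
    periods: local pitch period in samples at each centre.
    Each window spans two pitch periods, so neighbouring segments overlap.
    Samples outside the signal are treated as zero.
    """
    segments = []
    for centre, period in zip(centres, periods):
        window = hann(2 * period + 1)
        segment = []
        for offset, weight in enumerate(window):
            index = centre - period + offset
            sample = signal[index] if 0 <= index < len(signal) else 0.0
            segment.append(weight * sample)
        segments.append(segment)
    return segments
```

With windows of this shape displaced by exactly one pitch period, the weights of neighbouring windows sum to one, so overlap-adding the unmodified segments reproduces the original signal away from its ends.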
The inventor has found that an accurate estimate of the ratio between noise and the periodic component is achieved by pitch synchronously analysing the phase development of the signal, instead of (or in addition to) analysing the amplitude development. This improved detection of the noise contribution can be used to improve the prior art LPC encoding. Advantageously, the coding is used for speech synthesis systems.
In an embodiment according to the invention as described in the dependent claim 2, the analysis window is very narrow. In this way, the relatively quick change of ‘noisiness’ which can occur in speech can be accurately detected.
In an embodiment according to the invention as described in the dependent claim 3, the pitch development is accurately determined using a two-step approach. After obtaining a rough estimate of the pitch, the signal is filtered to extract the frequency components near the detected pitch frequency. The actual pitch is detected in the pitch filtered signal.
In an embodiment according to the invention as described in the dependent claim 4, the filtering is based on convolution with a sine/cosine pair within a segment, which allows for an accurate determination of the pitch frequency component within the segment.
In an embodiment according to the invention as described in the dependent claim 5, interpolation is used for increasing the resolution for sampled signals.
In an embodiment according to the invention as described in the dependent claim 6, the amplitude and/or phase value of the frequency components are determined by a transformation to the frequency domain using the accurately determined pitch frequency as the fundamental frequency of the transformation. This allows for an accurate description of the periodic part of the signal.
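By way of illustration only, such a transformation may be sketched as a discrete Fourier sum evaluated at multiples of the determined pitch frequency rather than at the fixed bin frequencies of a conventional FFT. The function name `harmonic_analysis`, its parameters and the cosine phase convention are assumptions made for the example; the result is exact only when the segment spans a whole number of pitch periods.

```python
import cmath
import math

def harmonic_analysis(segment, pitch_hz, sample_rate, num_harmonics):
    """Amplitude and phase of harmonics 1..num_harmonics of pitch_hz.

    Each harmonic k is modelled as a * cos(2*pi*k*pitch_hz*t + phase).
    Returns a list of (amplitude, phase) tuples, one per harmonic.
    """
    n = len(segment)
    result = []
    for k in range(1, num_harmonics + 1):
        # Complex step per sample for a probe at k times the pitch frequency.
        step = -2j * math.pi * k * pitch_hz / sample_rate
        acc = sum(x * cmath.exp(step * i) for i, x in enumerate(segment))
        result.append((2.0 * abs(acc) / n, cmath.phase(acc)))
    return result
```

Because the probe frequencies follow the pitch, each harmonic of the periodic part falls exactly on one analysed component instead of leaking across neighbouring bins.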
In an embodiment according to the invention as described in the dependent claim 7, the noise value is derived from the difference of the phase value for the frequency component of the analysis segment and the corresponding phase value of at least one preceding or following analysis segment. This is a simple way of obtaining a measure for how much noise is present at that frequency in the signal. If the signal is highly dominated by the periodic signal, with a very low contribution of noise, the phase will substantially be the same. On the other hand, for a signal dominated by noise, the phase will change ‘randomly’. As such, the comparison of the phase provides an indication of the contribution of the periodic and aperiodic components to the input signal. It will be appreciated that the measure may also be based on phase information from more than two segments (e.g. the phase information from both neighbouring segments may be compared to the phase of the current segment).
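By way of illustration only, a phase-difference measure of this kind may be sketched as follows. The scaling of the wrapped phase difference to the range 0 to 1 (division by pi) and the names `wrap_phase` and `noise_values` are assumptions made for the example; the phases are assumed to have been measured pitch-synchronously, so that a purely periodic component shows the same phase in successive segments.

```python
import math

def wrap_phase(angle):
    """Wrap an angle in radians to the interval (-pi, pi]."""
    wrapped = math.fmod(angle + math.pi, 2.0 * math.pi)
    if wrapped <= 0.0:
        wrapped += 2.0 * math.pi
    return wrapped - math.pi

def noise_values(phases, prev_phases):
    """Per-component noise value in [0, 1] from the phase difference.

    0 means the phase is unchanged between the two segments (periodic);
    1 means the phase has flipped by pi (assumed maximum of the scale).
    """
    return [abs(wrap_phase(p - q)) / math.pi
            for p, q in zip(phases, prev_phases)]
```

A single pair of segments gives only a crude estimate, since a noisy component may by chance show a small phase difference; averaging over several neighbouring segments, as the text suggests, would smooth this.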
In an embodiment according to the invention as described in the dependent claim 8, the noise value is based on a difference of a derivative of the phase value for the frequency component of the analysis segment and of the corresponding phase value of at least one preceding or following analysis segment. This provides a more robust measure.
To meet the object of the invention, the method of synthesising an audio equivalent signal from encoded audio equivalent input signal fragments, such as diphones, comprises:
retrieving selected ones of coded signal fragments, where the signal fragments have been coded according to the described coding method; and
for each of the retrieved coded signal fragments creating a corresponding signal fragment by transforming the signal fragment to a time domain, where for each of the coded frequency components an aperiodic signal component is added in accordance with the respective noise value for the frequency component.
In this way a high quality synthesis signal can be achieved. So far, reasonable quality synthesis speech has been achieved by concatenating recorded actual speech fragments, such as diphones. With these techniques a high level of naturalness of the output can be achieved within a fragment. The speech fragments are selected and concatenated in a sequential order to produce the desired output. For instance, a text input (sentence) is transcribed to a sequence of diphones, followed by obtaining the speech fragments (diphones) corresponding to the transcription. Normally, the recorded speech fragments do not have the pitch frequency and/or duration corresponding to the desired prosody of the sentence to be spoken, so the fragments have to be manipulated. The manipulation may be performed by breaking the basic speech signal into segments. The segments are formed by positioning a chain of windows along the signal. Successive windows are usually displaced over a duration similar to the local pitch period. In the system of EP-A 0527527 and EP-A 0527529, referred to as the PIOLA system, the local pitch period is automatically detected and the windows are displaced according to the detected pitch duration. In the so-called PSOLA system of EP-A 0363233 the windows are centred around manually determined locations, so-called voice marks. The voice marks correspond to periodic moments of strongest excitation of the vocal cords. The speech signal is weighted according to the window function of the respective windows to obtain the segments. An output signal is produced by concatenating the signal segments. A lengthened output signal is obtained by repeating segments (e.g. repeating one in four segments to get a 25% longer signal). Similarly, a shortened output signal can be achieved by suppressing segments. The pitch of the output signal is raised or lowered by respectively increasing or decreasing the overlap between the segments.
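By way of illustration only, the segmental manipulation described above may be sketched as follows. The function names `repeat_segments` and `overlap_add` and the use of a single fixed displacement `hop` for all segments are assumptions made for the example; the cited PIOLA and PSOLA systems displace the windows according to the local pitch period.

```python
def repeat_segments(segments, every):
    """Lengthen the sequence by repeating one segment in each group of `every`."""
    out = []
    for i, seg in enumerate(segments):
        out.append(seg)
        if (i + 1) % every == 0:
            out.append(seg)  # repeated segment adds one extra pitch period
    return out

def overlap_add(segments, hop):
    """Sum the segments after placing them `hop` samples apart.

    A smaller hop means more overlap, a shorter output period and
    therefore a higher pitch; a larger hop lowers the pitch.
    """
    if not segments:
        return []
    total = max(i * hop + len(seg) for i, seg in enumerate(segments))
    out = [0.0] * total
    for i, seg in enumerate(segments):
        start = i * hop
        for j, value in enumerate(seg):
            out[start + j] += value
    return out
```

For example, `repeat_segments(segments, 4)` turns eight segments into ten, which corresponds to the 25% lengthening mentioned above when the result is overlap-added with an unchanged hop.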
Applied to running speech, the quality of speech manipulated in this way can be very high, provided the range of the pitch changes is not too large. Complications arise, however, if the speech is built from relatively short speech fragments, such as diphones. The harmonic phase courses of the voiced speech parts may be quite different and it is difficult to generate smooth transitions at the borders between successive fragments, which reduces the naturalness of the synthesised speech. In such systems the coding technique according to the invention can advantageously be applied. Instead of operating on the actual audio equivalent fragments with uncontrollable phase, fragments are created from the encoded fragments according to the invention. Any suitable technique may be used to decode the fragments, followed by a segmental manipulation according to the PIOLA/PSOLA technique. Using a suitable decoding technique, the phase of the relevant frequency components can be fully controlled, so that uncontrolled phase transitions at fragment boundaries can be avoided. Preferably, sinusoidal synthesis is used for decoding the encoded fragments.
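By way of illustration only, sinusoidal decoding of one pitch period from coded amplitude and noise values may be sketched as follows. The way the aperiodic component is introduced here, as a random phase offset of up to the noise value times pi that is redrawn for every period, is an assumption made for the example, as are the names `synthesise_period`, `base_phases` and `rng`; other ways of adding an aperiodic component in accordance with the noise value are equally possible.

```python
import math
import random

def synthesise_period(amplitudes, noise, period, rng, base_phases=None):
    """One pitch period of output as a sum of harmonics.

    amplitudes[k-1], noise[k-1]: coded values for harmonic k.
    period: pitch period in samples.
    rng: a random.Random instance supplying the phase jitter.
    base_phases: phases chosen freely by the synthesiser (default 0),
                 which is what allows smooth fragment boundaries.
    """
    if base_phases is None:
        base_phases = [0.0] * len(amplitudes)
    # Assumed noise model: per-harmonic random phase offset scaled by the noise value.
    phases = [p + n * rng.uniform(-math.pi, math.pi)
              for p, n in zip(base_phases, noise)]
    return [sum(a * math.cos(2.0 * math.pi * (k + 1) * i / period + ph)
                for k, (a, ph) in enumerate(zip(amplitudes, phases)))
            for i in range(period)]
```

With all noise values at zero the output is strictly periodic and its phases are exactly those chosen by the synthesiser; with non-zero noise values successive periods differ, more so for the components coded as noisy.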