The invention relates to a method for lengthening an audio equivalent input signal, the method comprising:
positioning a first chain of mutually overlapping or adjacent time windows with respect to the signal, each time window being associated with a respective window function;
forming a first sequence of signal segments by weighting the signal according to the associated window function of a respective window of the first chain of windows; and
synthesising a lengthened audio signal by systematically maintaining or repeating respective signal segments of the first sequence of segments.
The invention further relates to an apparatus for lengthening an audio equivalent input signal, the apparatus comprising:
positioning means for positioning a first chain of mutually overlapping or adjacent time windows with respect to the signal, each time window being associated with a respective window function;
segmenting means for forming a first sequence of signal segments by weighting the signal according to the associated window function of a respective window of the first chain of windows; and
synthesising means for synthesising a lengthened audio signal by systematically maintaining or repeating respective signal segments of the first sequence of segments.
identification means for identifying a signal section in the lengthened audio signal which is synthesised from one of the signal segments, referred to as the source signal segment, by maintaining and at least once repeating the source signal segment; the source signal segment substantially having no periodic component; and
means for breaking periodicity in the signal section caused by repeating the source signal segment by:
positioning a second chain of mutually overlapping or adjacent time windows with respect to the signal section; at least some of the time windows of the second chain having a duration not equal to a duration of the source signal segment and not equal to a multiple of the duration of the source signal segment;
From EP-A 0527527, EP-A 0527529 and EP-A 0363233 a method and apparatus are known for lengthening an audio equivalent signal. The method and apparatus are typically used for speech synthesis. For speech synthesis usually a text is converted to speech by selecting speech fragments, representing sampled speech, from a set of stored speech fragments and concatenating the selected speech fragments to form a basic speech signal. The speech fragments may, for instance, represent diphones. Since the speech fragments have a given duration and pitch, the duration and usually also the pitch of the obtained basic speech signal is manipulated to obtain natural sounding speech with a given prosody. The manipulation is performed by breaking the basic speech signal into segments. The segments are formed by positioning a chain of windows along the signal. Successive windows are usually displaced over a duration similar to the local pitch period. In the system of EP-A 0527527 and EP-A 0527529, referred to as the PIOLA system, the local pitch period is automatically detected and the windows are displaced according to the detected pitch duration. In the so-called PSOLA system of EP-A 0363233 the windows are centred around manually determined locations, so-called voice marks. The voice marks correspond to periodic moments of strongest excitation of the vocal cords. The speech signal is weighted according to the window function of the respective windows to obtain the segments. A lengthened signal is obtained by repeating segments (e.g. repeating one in four segments to get a 25% longer signal). Similarly, a shortened signal can be achieved by suppressing segments. The same technique can be used for manipulating the duration of other forms of audio equivalent signals, such as music. For music, the displacement of windows may be based on the dominant local frequency component, similar to using the pitch or voice marks for speech signals. 
The duration of a music or music/speech signal may be manipulated in order to fit the signal to a given framework, for instance to fit one or more soundtracks to a video track.
For manipulating the length of an audio signal, the window function may have a block form. This effectively cuts the input signal into non-overlapping neighbouring segments. Particularly for manipulating the prosody of a speech signal, it is preferred to use windows which are wider than the displacement of the windows (i.e. the windows overlap). Preferably each window extends to the centre of the next window. In this way each point in time of the speech signal is covered by two windows. The window function varies as a function of the position in the window, where the function approaches zero near the edge of the window. Preferably, the window function is "self-complementary" in the sense that the sum of the two window functions covering the same time point in the signal is independent of the time point (an example of such a window function is a bell-shaped function formed by the square of a cosine with its argument running proportionally to time from minus ninety degrees at the beginning of the window to plus ninety degrees at the end of the window). Using windows which are wider than the displacement results in overlapping segments. The self-complementary property of the window function ensures that the original signal is retrieved by superposing the segments in the same time relation as that in which they were derived. A pitch change of locally periodic signals (for example voiced speech or music) can be obtained by placing the segments at different relative time points before superposing them. To form, for example, an output signal with increased pitch, the segments are superposed with a compressed mutual centre-to-centre distance as compared to the distance of the segments as derived from the original signal. The length of the segments is kept the same.
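The windowing and superposition described above can be illustrated with a minimal Python sketch. It is not taken from the cited documents: the function names (cos2_window, segment, overlap_add), the fixed displacement hop and the sine test signal are all assumptions made for the illustration. Each window is twice the displacement wide, so every interior sample is covered by two squared-cosine windows whose weights sum to one.

```python
import math

def cos2_window(length):
    """Squared cosine whose argument runs from -90 to +90 degrees over the window."""
    return [math.cos(math.pi * (n / length - 0.5)) ** 2 for n in range(length)]

def segment(signal, hop):
    """Weight the signal with windows of width 2*hop displaced by hop (hypothetical helper)."""
    win = cos2_window(2 * hop)
    starts = range(0, len(signal) - 2 * hop + 1, hop)
    return [[signal[s + n] * win[n] for n in range(2 * hop)] for s in starts]

def overlap_add(segments, spacing):
    """Superpose segments at the given start-to-start spacing; segment length is unchanged."""
    seg_len = len(segments[0])
    out = [0.0] * (spacing * (len(segments) - 1) + seg_len)
    for k, seg in enumerate(segments):
        for n, value in enumerate(seg):
            out[k * spacing + n] += value
    return out

hop = 40                                    # assumed local period, in samples
signal = [math.sin(2 * math.pi * n / hop) for n in range(400)]
segments = segment(signal, hop)

# Same spacing as at analysis: interior samples of the original are retrieved.
same = overlap_add(segments, hop)

# Compressed spacing: shorter local period (higher pitch), same segment length.
raised = overlap_add(segments, 32)
```

With the original spacing, the samples between the first and last half-window match the input up to rounding, which is the self-complementary property at work. With the compressed spacing the output is shorter and has a shorter local period; in practice segments would additionally be repeated to restore the duration.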
Changing the time position of the segments results in an output signal which differs from the input signal in that it has a different local period, but the envelope of its spectrum remains approximately the same. Perception experiments have shown that this yields a very good perceived speech quality even if the pitch is changed by more than an octave.
The segmenting technique can also be used to manipulate the duration of parts of the audio equivalent signal which do not have a periodic component. For a speech signal this relates, for instance, to predominantly voiceless parts and for music to predominantly noise parts. For these parts of the signal the windows are displaced, for instance, by the displacement used for the last segment with a distinguishable periodic component or by an average displacement value, such as 10 msec. for a male voice. In principle, the spectral content of the signal may also be analysed to identify fragments wherein the spectral content does not significantly change. If it is then desired to lengthen the signal by a given factor a/b (e.g. a factor 5/4), the fragment may be broken into b segments (or a multiple of b) and, by repeating segments, each group of b input segments can yield a output segments (e.g. by repeating one in four segments).
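As an illustration of lengthening by a factor a/b, the following Python sketch decides how often each segment in a group of b is emitted so that the group yields a output segments. The even distribution rule and the names repetition_counts and lengthen are assumptions made for this example; the text only requires that b input segments give a output segments. Segments are treated as opaque items, as they would be with block-form windows.

```python
def repetition_counts(a, b):
    """How many times each of b consecutive segments is emitted so that a come out.

    Assumed rule: spread the repeats as evenly as integer arithmetic allows.
    """
    return [(i + 1) * a // b - i * a // b for i in range(b)]

def lengthen(segments, a, b):
    """Maintain or repeat segments so that every b inputs give a outputs (hypothetical helper)."""
    counts = repetition_counts(a, b)
    out = []
    for i, seg in enumerate(segments):
        out.extend([seg] * counts[i % b])
    return out

# Factor 5/4: one segment in every four is repeated once.
counts_5_4 = repetition_counts(5, 4)
longer = lengthen(list("ABCDEFGH"), 5, 4)
```

For a factor of two (a=2, b=1) every segment is emitted twice, which is the regime in which the repetition becomes audible in non-periodic parts.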
In practice, it has been found that lengthening non-periodic parts in this way produces audible artefacts if the duration of the signal is substantially increased, e.g. by a factor of two or more. Although the segments themselves do not contain identifiable periodic components, the repeating of the segments introduces periodicity. This is observed as a sound similar to a person blowing along the end of a tube. To avoid such artefacts, non-periodic parts of the input signal are usually not lengthened. Particularly for speech synthesis it is desired to be able to significantly increase the length of a speech signal. For a natural sounding audio signal it is desired that the voiceless parts of the signal can also be lengthened.
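The periodicity introduced by repetition can be made visible with a small Python experiment. This is only a sketch under stated assumptions: Gaussian noise stands in for a voiceless segment, a block-form window is assumed so that repetition is plain concatenation, and the helper normalised_autocorrelation, the segment length of 200 samples and the repetition count of four are chosen for the illustration.

```python
import math
import random

def normalised_autocorrelation(x, lag):
    """Correlation between the signal and a copy of itself shifted by lag samples."""
    head, tail = x[:-lag], x[lag:]
    num = sum(p * q for p, q in zip(head, tail))
    den = math.sqrt(sum(p * p for p in head) * sum(q * q for q in tail))
    return num / den

rng = random.Random(0)                      # fixed seed, for repeatability only
seg_len = 200                               # assumed segment duration in samples

# A noise segment without any periodic component of its own.
noise_segment = [rng.gauss(0.0, 1.0) for _ in range(seg_len)]

# Lengthening by plain repetition of that one segment.
repeated = noise_segment * 4

# Reference: noise of the same total length that is not built by repetition.
fresh = [rng.gauss(0.0, 1.0) for _ in range(4 * seg_len)]

r_repeated = normalised_autocorrelation(repeated, seg_len)
r_fresh = normalised_autocorrelation(fresh, seg_len)
```

At a lag equal to the segment duration the repeated signal correlates perfectly with itself, whereas the unrepeated noise shows only a small residual correlation. That strong peak at the segment duration corresponds to the artificial period described above.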