The invention relates to a method for manipulating an audio equivalent signal. Such a method involves positioning a chain of mutually overlapping time windows with respect to the audio equivalent signal; deriving segment signals from the audio equivalent signal, each of the segment signals being derived from the audio equivalent signal by weighting the audio equivalent signal as a function of position in a respective window; and synthesizing, by chained superposition, the segment signals.
The invention also relates to a method for manipulating a concatenation of a first and a second audio equivalent signal. Such a method comprises the steps of:
(a) locating the second audio equivalent signal at a position in time relative to the first audio equivalent signal, the position in time being such that, over time, during a first time interval only the first audio equivalent signal is active and in a subsequent second time interval only the second audio equivalent signal is active,
(b) positioning a chain of mutually overlapping time windows with respect to the first and the second audio equivalent signal, and
(c) synthesizing an output audio signal by chained superposition of segment signals derived from the first and/or the second audio equivalent signal by weighting the first and/or the second audio equivalent signal as a function of position in the time windows.
(a) locating the second audio equivalent signal at a position in time relative to the first audio equivalent signal, the position in time being such that, over time, during a first time interval only the first audio equivalent signal is active and in a subsequent second time interval only the second audio equivalent signal is active,
(b) positioning a chain of mutually overlapping time windows with respect to the first and the second audio signal, and
(c) synthesizing an output audio signal by chained superposition of segment signals derived from the first and/or the second audio equivalent signal by weighting the first and/or the second audio equivalent signal as a function of position in the time windows,
(i) the windows are positioned incrementally, a positional displacement between adjacent windows in the first and the second time interval being substantially equal to a local pitch period length of the first and the second audio equivalent signal; and
(ii) the position in time of the second audio equivalent signal is selected to minimize a transition phenomenon representative of an audible effect in the output signal between where the output signal is formed by superposing segment signals derived from either the first or the second time interval exclusively.
The invention further relates to an apparatus for manipulating an audio equivalent signal. Such a device comprises:
(a) a positioning unit for locating a position for a time window with respect to the audio equivalent signal, the positioning unit feeding the position to
(b) a segmenting unit for deriving a segment signal from the audio equivalent signal by weighting the audio equivalent signal as a function of position in the window, the segmenting unit feeding the segment signal to
(c) a superposing unit for superposing the segment signal with a further segment signal to form an output signal of the device.
The invention still further relates to an apparatus for manipulating a concatenation of a first and a second audio equivalent signal. Such a device comprises:
(a) a combining unit for forming a combination of the first and the second audio equivalent signal, wherein there is formed a relative time position of the second audio equivalent signal with respect to the first audio equivalent signal such that, over time, in the combination during a first time interval only the first audio equivalent signal is active and during a subsequent second time interval only the second audio equivalent signal is active,
(b) a positioning unit for locating window positions for time windows with respect to the combination of the first and the second audio equivalent signal, the positioning unit feeding the window positions to
(c) a segmenting unit for deriving segment signals from the first and the second audio equivalent signal by weighting the first and the second audio equivalent signal as a function of position in the corresponding windows, the segmenting unit feeding the segment signals to
(d) a superposing unit for superposing selected segment signals to form an output signal of the device.
Such methods and apparatus are known from European Patent Application No. 0363233. That application describes a speech synthesis system in which an audio equivalent signal, representing sampled speech, is used to produce an output (speech) signal. In order to obtain a prescribed prosody for synthesized speech, the pitch of the output signal and the durations of stretches (i.e., portions) of speech are manipulated. This is done by deriving segment signals from the audio equivalent signal; in the prior art, these typically extend over two basic periods between periodic moments of the strongest excitation of the vocal cords.
To form, for example, an output signal with increased pitch, the segment signals are superposed, but not in their original timing relation. Rather, their mutual center-to-center distance is compressed compared to the original audio equivalent signal (leaving the length of each segment signal the same, but raising the pitch). To manipulate the length of a stretch, some segment signals are repeated or skipped during superposition.
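The interplay of compressed spacing and repeated or skipped segments can be illustrated with a small scheduling sketch. It is only an illustration under stated assumptions: the function name `schedule`, the parameters `pitch_factor` and `duration_factor`, the constant source period, and the floor-based choice of source segment are all hypothetical and are not taken from the cited application, which states only that segment signals are repeated or skipped.

```python
def schedule(num_segments, src_period, pitch_factor, duration_factor):
    """Plan a superposition as (source_segment_index, output_center) pairs.

    Assumptions (hypothetical, for illustration only): the source segments
    are centered at 0, src_period, 2*src_period, ...; the output spacing is
    src_period / pitch_factor; the segment used at an output center is the
    one whose source period contains the corresponding source time.
    """
    out_period = src_period / pitch_factor
    out_length = num_segments * src_period * duration_factor
    plan = []
    t = 0.0
    while t < out_length:
        src_time = t / duration_factor            # map output time back to source time
        src_index = min(int(src_time // src_period), num_segments - 1)
        plan.append((src_index, round(t)))
        t += out_period
    return plan


# Doubling the pitch at unchanged duration: spacing halves, each segment is used twice.
raised = schedule(4, 100, pitch_factor=2.0, duration_factor=1.0)
# Halving the duration at unchanged pitch: every other segment is skipped.
shortened = schedule(4, 100, pitch_factor=1.0, duration_factor=0.5)
```

In this sketch the pitch change comes entirely from the output spacing, while repetition and skipping fall out of the mapping from output time back to source time.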
The segment signals are obtained from windows placed over the audio equivalent signal. Each window in the prior art preferably extends to the center of the next window. In this case, each time point in the audio equivalent signal is covered by two windows.
To derive the segment signals, the audio equivalent signal in each window is weighted with a window function, which varies as a function of position in the window and which approaches zero toward the edges of the window. Moreover, the window function is "self complementary" in the sense that the sum of the two window functions covering each time point in the audio equivalent signal is independent of the time point. (An example of a window function that meets this condition is the square of a cosine with its argument running proportionally to time from minus ninety degrees at the beginning of the window to plus ninety degrees at the end of the window.)
As a consequence of this self complementary property of the window function, one would recover the original audio equivalent signal if the segment signals were superposed in the same time relation as that in which they were derived. If, however, the segment signals are placed at different relative time points before superposition, in order to change the pitch of locally periodic signals (for example, voiced speech or music), the output signal will differ from the audio equivalent signal. In particular, it will have a different local period, but the envelope of its frequency spectrum will be approximately the same. Perception experiments have shown that this yields a very good perceived speech quality even if the pitch is changed by more than an octave.
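The self complementary property of the squared-cosine example can be checked numerically. The following is a minimal sketch, assuming discrete samples and windows that each extend to the center of the next; the names `cos2_window` and `overlap_sum` and the window length of 64 samples are illustrative choices, not part of the cited application.

```python
import math


def cos2_window(n, length):
    """Squared-cosine weight at sample n of a window spanning `length` samples.

    The cosine argument runs from -90 degrees at n = 0 to +90 degrees at n = length.
    """
    angle = -math.pi / 2 + math.pi * n / length
    return math.cos(angle) ** 2


def overlap_sum(t, length):
    """Sum of the two window weights covering time t.

    Windows are assumed to start every length // 2 samples, so that each
    window extends to the center of the next.
    """
    hop = length // 2
    pos_in_later = t % hop               # position in the window that started most recently
    pos_in_earlier = pos_in_later + hop  # position in the window that started one hop before
    return cos2_window(pos_in_later, length) + cos2_window(pos_in_earlier, length)


length = 64
sums = [overlap_sum(t, length) for t in range(3 * length)]
```

Every entry of `sums` equals one up to rounding, because the two overlapping weights reduce to a squared sine and a squared cosine of the same angle.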
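Both cases described above can be reproduced with a short overlap-add sketch. It is an illustration under assumptions, not the implementation of the cited application: the test signal is an exactly periodic synthetic waveform, the window centers are simply placed one period apart rather than at voice marks, and the names `window`, `extract_segments` and `superpose` are hypothetical.

```python
import math


def window(n, half):
    """Squared-cosine weight at offset n (from -half to +half) from the window center."""
    return math.cos(math.pi * n / (2 * half)) ** 2


def extract_segments(signal, centers, half):
    """Derive one weighted segment of 2 * half + 1 samples around each center."""
    segments = []
    for c in centers:
        seg = []
        for n in range(-half, half + 1):
            idx = c + n
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            seg.append(sample * window(n, half))
        segments.append(seg)
    return segments


def superpose(segments, centers, half, out_len):
    """Add each segment into the output, centered at the given position."""
    out = [0.0] * out_len
    for seg, c in zip(segments, centers):
        for k, value in enumerate(seg):
            idx = c - half + k
            if 0 <= idx < out_len:
                out[idx] += value
    return out


period = 40  # assumed local period of the synthetic test signal, in samples
signal = [math.sin(2 * math.pi * n / period) + 0.5 * math.sin(4 * math.pi * n / period)
          for n in range(400)]
centers = [period * k for k in range(1, 9)]  # each window reaches the center of the next
segments = extract_segments(signal, centers, half=period)

# Same time relation: the original signal is recovered between the first and last center.
same = superpose(segments, centers, period, len(signal))

# Compressed spacing: the output takes on the new, shorter local period.
new_period = 30
new_centers = [centers[0] + new_period * k for k in range(len(centers))]
shifted = superpose(segments, new_centers, period, len(signal))
```

Here `same` matches `signal` between the first and last window centers, while `shifted` repeats every 30 samples in its interior even though each segment still spans two of the original 40-sample periods.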
The above-mentioned European patent application describes placing the centers of the windows at "voice marks", which are said to coincide with the moments of excitation of the vocal cords. That patent publication is silent as to how these voice marks should be found, although it states that a dictionary of diphone speech sounds with a corresponding table of voice marks is available from its applicant.
It is a disadvantage of the known method that voice marks, representing moments of excitation of the vocal cords, are required for placing the windows. Automatic determination of these moments from the audio equivalent signal is not robust against noise and may fail altogether for some voices (e.g., hoarse voices) or under some circumstances (e.g., reverberated or filtered voices). Irregularly placed voice marks cause audible errors in the output signal. Manual determination of moments of excitation is a labor-intensive process, economically viable only for speech signals that are used often, as, for example, in a dictionary. Moreover, moments of excitation usually do not occur in an audio equivalent signal representing music.