Time-scale modification (TSM) of a signal refers to compression or expansion of the time scale of that signal. Applied to speech, TSM expands or compresses the time scale of the speech signal while preserving the identity of the speaker (pitch, formant structure). As such, it is typically explored for purposes where alteration of the pronunciation speed is desired. Such applications of TSM include text-to-speech synthesis, foreign language learning and film/soundtrack post-synchronisation.
Many techniques for high-quality TSM of speech signals are known; examples are described in E. Moulines, J. Laroche, “Non parametric techniques for pitch scale and time scale modification of speech”, Speech Communication (Netherlands), Vol. 16, No. 2, pp. 175-205, 1995.
Another potential application of TSM techniques, although much less reported, is speech coding. Within this application, the basic intention is to compress the time scale of a speech signal prior to coding, reducing the number of speech samples that need to be encoded, and to expand it by a reciprocal factor after decoding, to reinstate the original time scale. This concept is illustrated in FIG. 1. Since the time-scale compressed speech remains a valid speech signal, it can be processed by an arbitrary speech coder. For example, speech coding at 6 kbit/s could now be realised with an 8 kbit/s coder, preceded by 25% time-scale compression and succeeded by 33% time-scale expansion.
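The arithmetic behind the 6 kbit/s example can be checked with a short sketch. The function name `companding_rates` and its parameters are hypothetical and introduced only for illustration; the sketch assumes the coder's bit rate per second of coded signal is unaffected by the compression.

```python
from fractions import Fraction

def companding_rates(coder_rate_kbps, compression_pct):
    """Return (effective bit rate, reciprocal expansion in percent).

    Illustrative helper (not from the source text): a coder running at
    coder_rate_kbps on speech shortened by compression_pct percent.
    """
    # Fraction of the original duration that remains after compression.
    keep = 1 - Fraction(compression_pct, 100)
    # Bits spent per second of ORIGINAL speech.
    effective_rate = coder_rate_kbps * keep
    # Expansion (in percent) needed to restore the original duration.
    expansion_pct = (1 / keep - 1) * 100
    return effective_rate, expansion_pct

# 8 kbit/s coder with 25% compression: 8 * 3/4 = 6 kbit/s,
# and the reciprocal expansion is 100/3, i.e. roughly 33%.
rate, expansion = companding_rates(8, 25)
print(rate, expansion)
```

The expansion percentage is larger than the compression percentage because it is measured relative to the shortened signal: restoring a signal cut to 3/4 of its length requires a factor of 4/3.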
The use of TSM in this context has been explored in the past, and fairly good results were claimed using several TSM methods and speech coders [1]-[3]. Recently, improvements have been made both to TSM and to speech coding techniques, although the two have mostly been studied independently of each other.
As detailed in Moulines and Laroche, as referenced above, one widely used TSM algorithm is synchronised overlap-add (SOLA), which is an example of a waveform approach algorithm. Since its introduction [4], SOLA has evolved into a widely used algorithm for TSM of speech. Being a correlation method, it is also applicable to speech produced by multiple speakers or corrupted by background noise, and to some extent to music.
With SOLA, an input speech signal s is analysed as a sequence of N-sample-long overlapping frames xi (i=0, . . . , m), consecutively delayed by a fixed analysis period of Sa samples (Sa<N). The starting idea is that s can be compressed or expanded by outputting these frames while successively shifting them by a synthesis period Ss, which is chosen such that Ss<Sa for compression or Ss>Sa for expansion (in both cases Ss<N). The overlapping segments are first weighted by two amplitude-complementary functions and then added up, which is a suitable way of waveform averaging. FIG. 2 illustrates such an overlap-add expansion technique. The upper part shows the location of the consecutive frames in the input signal. The middle part demonstrates how these frames would be re-positioned during the synthesis, employing in this case two halves of a Hanning window for the weighting. Finally, the resulting time-scale expanded signal is shown in the lower part.
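The unsynchronised overlap-add step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the cited references: the function name `ola`, the use of NumPy, and the exact sampling of the half-Hanning weights are choices made here.

```python
import numpy as np

def ola(s, N, Sa, Ss):
    """Plain overlap-add (no synchronisation): frames taken every Sa
    samples from s are re-positioned every Ss samples in the output.
    Hypothetical helper written for illustration only."""
    s = np.asarray(s, dtype=float)
    assert 0 < Sa < N and 0 < Ss < N and len(s) >= N
    m = (len(s) - N) // Sa            # index of the last complete frame
    out = np.zeros(m * Ss + N)
    out[:N] = s[:N]                   # frame x_0 is copied unchanged
    L = N - Ss                        # overlap between consecutive frames
    # Rising half of a Hanning window; (1 - fade_in) is the falling half,
    # so the two weights are amplitude complementary (they sum to 1).
    fade_in = 0.5 - 0.5 * np.cos(np.pi * (np.arange(L) + 0.5) / L)
    for i in range(1, m + 1):
        frame = s[i * Sa : i * Sa + N]
        start = i * Ss
        # Cross-fade the overlap, then append the rest of the frame.
        out[start:start + L] = ((1 - fade_in) * out[start:start + L]
                                + fade_in * frame[:L])
        out[start + L:start + N] = frame[L:]
    return out

# Expansion example (Ss > Sa) on a short test signal.
y = ola(np.ones(1000), N=200, Sa=100, Ss=150)
```

Because the two weights sum to one, a constant input stays constant. For real speech the overlapping waveforms are generally not aligned, which is the problem the synchronisation step of SOLA addresses.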
The actual synchronisation mechanism of SOLA consists of additionally shifting each xi during the synthesis, to yield similarity of the overlapping waveforms. Explicitly, a frame xi will now start contributing to the output signal at position iSs+ki, where ki is found such that the normalised cross-correlation given by Equation 1 is maximal for k=ki.
$$R_i[k] = \frac{\sum_{j=0}^{L-1} \tilde{s}[iS_s + k + j] \cdot s[iS_a + j]}{\left( \sum_{j=0}^{L-1} s^2[iS_a + j] \cdot \sum_{j=0}^{L-1} \tilde{s}^2[iS_s + k + j] \right)^{1/2}} \qquad (0 \le k \le N/2) \qquad \text{(Equation 1)}$$
In this equation, $\tilde{s}$ denotes the output signal while L denotes the length of the overlap corresponding to a particular lag k in the given range [1]. Having found the synchronisation parameters ki, the overlapping signals are averaged as before. With a large number of frames the ratio of the output and input signal lengths will approach the value Ss/Sa, hence defining the scale factor α.
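The synchronisation search of Equation 1 can be sketched as below. This is a hedged illustration rather than a reference implementation. The function name `sola` and the parameter `min_overlap` are assumptions introduced here: a minimum overlap is imposed because the normalised cross-correlation becomes unreliable when L shrinks to a few samples, a safeguard the text above does not specify. The sketch also lets each new frame define the end of the output, which is one possible convention.

```python
import numpy as np

def sola(s, N, Sa, Ss, min_overlap=None):
    """SOLA sketch. Returns (output signal, list of lags k_i).
    Hypothetical helper; min_overlap is an assumption of this sketch."""
    s = np.asarray(s, dtype=float)
    assert 0 < Sa < N and 0 < Ss < N and len(s) >= N
    if min_overlap is None:
        min_overlap = max(1, N // 8)
    m = (len(s) - N) // Sa                  # index of the last full frame
    out = np.zeros(m * Ss + N + N // 2)     # room for the largest lag
    out[:N] = s[:N]
    end = N                                 # valid output samples so far
    ks = [0]
    for i in range(1, m + 1):
        frame = s[i * Sa : i * Sa + N]
        best_k, best_r = 0, -np.inf
        for k in range(N // 2 + 1):         # lag range 0 <= k <= N/2
            start = i * Ss + k
            L = min(end - start, N)         # overlap length for this lag
            if L < min_overlap:
                break                       # L only shrinks as k grows
            a = out[start:start + L]        # existing output (s tilde)
            b = frame[:L]                   # incoming input frame (s)
            denom = np.sqrt(np.dot(b, b) * np.dot(a, a))
            r = np.dot(a, b) / denom if denom > 0 else 0.0
            if r > best_r:
                best_k, best_r = k, r
        start = i * Ss + best_k
        L = min(end - start, N)
        # Half-Hanning cross-fade over the overlap, as in plain overlap-add.
        w = 0.5 - 0.5 * np.cos(np.pi * (np.arange(L) + 0.5) / L)
        out[start:start + L] = (1 - w) * out[start:start + L] + w * frame[:L]
        out[start + L:start + N] = frame[L:]
        end = start + N
        ks.append(best_k)
    return out[:end], ks

# Expansion of a periodic test signal with scale factor Ss/Sa = 1.5.
n = np.arange(4000)
y, ks = sola(np.sin(2 * np.pi * n / 40), N=400, Sa=100, Ss=150)
```

For a strictly periodic input such as this sinusoid, a lag that aligns the phases always exists within the search range, so the expanded output remains the same sinusoid. Real speech is only locally periodic, and the chosen lags then give the best available local alignment.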
When SOLA compression is cascaded with the reciprocal SOLA expansion, several artefacts are typically introduced into the output speech, such as reverberation, artificial tonality and occasional degradation of transients.
The reverberation is associated with voiced speech, and can be attributed to waveform averaging. Both compression and the succeeding expansion average similar segments. However, similarity is measured locally, implying that the expansion does not necessarily insert additional waveform in the region where it was “missing”. This results in waveform smoothing, possibly even introducing new local periodicity. Furthermore, frame positioning during expansion is designed to re-use the same segments in order to create additional waveform. This introduces correlation in unvoiced speech, which is often perceived as an artificial “tonality”.
Artefacts also occur in speech transients, i.e. regions of voicing transition, which usually exhibit an abrupt alteration of the signal energy level. As the scale factor increases, so does the distance between ‘iSa’ and ‘iSs’ which may impede alignment of similar parts of a transient for averaging. Hence, overlapping distinct parts of a transient causes its “smearing”, endangering proper perception of its strength and timing.
In [5], [6], it was reported that a companded speech signal of good quality can be achieved by employing the ki's that are obtained during SOLA compression. So, quite opposite to what is done by SOLA, the N-sample-long frames $\hat{x}_i$ would now be excised from the compressed signal $\tilde{s}$ at time instants iSs+ki and re-positioned at the original time instants iSa (while averaging the overlapping samples as before). The maximal cost of transmitting/storing all ki's is given by Equation 2, where Ts is the speech sampling period and ⌈ ⌉ represents the operation of rounding up to the nearest integer.
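$$BR_k = \left( \frac{1}{S_a \cdot T_s} \, \frac{\text{frames}}{\text{sec}} \right) \cdot \left( \left\lceil \log_2\left( \frac{N}{2} \right) \right\rceil \frac{\text{bits}}{\text{frame}} \right) \qquad \text{(Equation 2)}$$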
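Equation 2 can be evaluated numerically with a short sketch. The function name `side_info_bit_rate` and the example values (8 kHz sampling, N = 320, Sa = 160) are assumptions chosen for illustration and do not come from the text.

```python
import math

def side_info_bit_rate(N, Sa, fs):
    """Maximal bit rate (bit/s) for transmitting all k_i per Equation 2.
    fs is the sampling frequency in Hz, so Ts = 1/fs and 1/(Sa*Ts) = fs/Sa.
    Hypothetical helper written for illustration only."""
    frames_per_sec = fs / Sa                        # 1 / (Sa * Ts)
    bits_per_frame = math.ceil(math.log2(N / 2))    # ceil(log2(N/2))
    return frames_per_sec * bits_per_frame

# Assumed example: 8 kHz speech, 40 ms frames (N = 320), Sa = 160.
# 50 frames/s times ceil(log2(160)) = 8 bits/frame gives 400 bit/s.
br = side_info_bit_rate(N=320, Sa=160, fs=8000)
print(br)
```

Under these assumed values the side information is small compared with the multi-kbit/s coder rates discussed earlier, but it is not negligible at low bit rates.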
It has also been reported [7] that exclusion of transients from high (i.e. >30%) SOLA compression or expansion yields improved speech quality.
It will be appreciated therefore that several techniques and approaches presently exist that can be employed successfully (i.e. with good quality) for compressing or expanding the time scale of signals. Although the description above refers specifically to speech signals, speech is only an exemplary signal type, and the problems associated with speech signals also apply to other signal types. When used for coding purposes, where the time-scale compression is followed by time-scale expansion (time-scale companding), the performance of prior art techniques degrades considerably. The best performance for speech signals is generally obtained from time-domain methods, among which SOLA is widely used, but problems still exist with these methods, some of which have been identified above. There is, therefore, a need to provide an improved method and system for time-scale modifying a signal in a manner specific to the components making up that signal.