Audio signals, and especially audio signals containing music or speech, have two properties that share a tight inter-dependency. These two properties are the tempo, or duration of an audio signal, and the pitch. If the tempo of an audio signal is changed by altering the playback sampling rate of the audio signal, such that there is a change in the speed of playback of the audio signal, then there will be a corresponding change to the pitch of the audio signal.
Time-scale modification (TSM) of audio allows for the alteration of the tempo, or duration, of an audio signal without changing the pitch of any tonal components in the relevant part of the audio signal. This ability to alter the tempo independently of the pitch allows the tight inter-dependency of these two properties to be relaxed. The overall effect is that of speeding up or slowing down the perceived playback rate, or tempo, of a recorded audio signal without affecting the perceived pitch or timbre of the original audio signal.
TSM allows the duration, or tempo, of the original signal to be increased or decreased while the perceptually important features of the original signal remain substantially unchanged. For example, in the case of speech, the time-scaled audio signal can sound as if the original speaker has spoken at a quicker or slower rate, or in the case of music, the time-scaled signal can sound as if the musicians have played at a different tempo but with unaltered pitches throughout the audio signal.
TSM algorithms can also be used to achieve key shifting, or a change in the pitch of an audio signal, without altering the tempo, or perceived playback rate, of the voice or music. Key shifting, or the change of pitch, can be achieved by changing the playback sampling rate of the tempo-changed, TSM-processed audio signal so that there is no significant net change in tempo but the pitch and formants are shifted.
For example, an original 1.0 second long audio signal with a pitch frequency of 800 Hz is sampled at 8 kHz. TSM can be used to shorten the audio signal by 20% so that the output audio signal is 0.8 second long and has a pitch frequency that stays at 800 Hz if the playback sampling rate remains 8 kHz. However, if the playback sampling rate is slowed down to 6.4 kHz, then the output audio signal would be back to 1.0 second long but the pitch frequency would be lowered to 640 Hz. The pitch would be perceptibly lower than in the original audio signal. In this example the slower playback sampling rate can be achieved by either physically changing the sampling rate of a digital-to-analog converter (DAC) to 6.4 kHz, or resampling (or stretching) the signal digitally by a 1:1.25 ratio (8 kHz / 6.4 kHz) while keeping the DAC at 8 kHz.
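The arithmetic of this example can be checked with a short sketch. It is illustrative only: the function names (time_scale, change_playback_rate, key_shift) are hypothetical, TSM is treated as an ideal operation that scales duration and leaves pitch untouched, and exact fractions are used so that the results are not blurred by floating-point rounding.

```python
from fractions import Fraction


def time_scale(duration_s, pitch_hz, duration_ratio):
    """Idealised TSM (assumption): duration is scaled, pitch is unchanged."""
    return duration_s * duration_ratio, pitch_hz


def change_playback_rate(duration_s, pitch_hz, rate_ratio):
    """Play the same samples at rate_ratio times the original sampling rate.

    Duration and pitch are coupled: a slower rate lengthens the signal
    and lowers the pitch by the same factor.
    """
    return duration_s / rate_ratio, pitch_hz * rate_ratio


def key_shift(duration_s, pitch_hz, pitch_ratio):
    """Shift pitch by pitch_ratio while restoring the original duration."""
    # Step 1: TSM changes the duration only.
    d, p = time_scale(duration_s, pitch_hz, pitch_ratio)
    # Step 2: changing the playback rate undoes the duration change
    # and moves the pitch.
    return change_playback_rate(d, p, pitch_ratio)


# Values from the example: 1.0 s signal, 800 Hz pitch, 8 kHz sampling rate.
fs = Fraction(8000)
ratio = Fraction(4, 5)                      # 20% shorter after TSM

after_tsm = time_scale(Fraction(1), Fraction(800), ratio)
after_shift = key_shift(Fraction(1), Fraction(800), ratio)
playback_fs = fs * ratio                    # 6400 Hz

print(after_tsm)     # duration 4/5 s, pitch 800 Hz
print(after_shift)   # duration 1 s, pitch 640 Hz
print(playback_fs)   # 6400
```

A real TSM algorithm only approximates the first step, so the sketch describes the intended outcome of key shifting, not the audio quality that any particular algorithm would achieve.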
The key-shifting feature is popular in applications such as karaoke, where a singer can move the pitch range of a song so that he or she can follow the song more easily.
Transforming audio to an alternative time-scale is a digital audio effect that has become a standard tool within many audio multi-processing applications. For example, Donnellan et al (“Speech-adaptive time-scale modification for computer assisted language-learning”, The 3rd IEEE International Conference on Advanced Learning Technologies, pp. 165-169, July 2003) describes applying a time-scale modification algorithm to natural-speed, native speech to aid students in learning a foreign language. This document discusses the merits of slowing down speech samples for use in computer-assisted language-learning. It describes using a TSM algorithm called synchronised overlap-add to extend the duration of sounds within the audio signal. This document also details varying the scaling of the speech within the audio signal such that the speech sounds natural after extension.
Amir et al (“Using audio time scale modification for video browsing”, Proceedings of the 33rd Annual Hawaii International Conference on System Sciences, pp. 1117-1126, January 2000) details converting video/audio files to provide fast video browsing and uses a TSM algorithm to increase the speed of the speech present in the video/audio file. This increased speed audio content is then combined with a slide show of individual frames from the video content to enable the user to review the video/audio file in a shortened amount of time, whilst still understanding all of the audio present in the file.
Wong et al (“Fast time scale modification using envelope-matching technique (EM-TSM)”, Proceedings of the IEEE International Symposium on Circuits and Systems, vol. 5, pp. 550-553, 1998) describes modifying the synchronized overlap-and-add TSM algorithm to include envelope matching with the intention of decreasing the computational complexity of the algorithm. The envelope-matching TSM algorithm used in this document is tested on an audio clip including a male voice and a song with background music.
In Macon et al (“Speech Concatenation and Synthesis Using an Overlap-add Sinusoidal Model”, IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 361-364, May 1996) a TSM algorithm is used in a text-to-speech system. The speech audio signal is generated from the concatenation of short speech units taken from a pre-recorded library. A TSM algorithm is used to adjust the duration and pitch of these short speech units so that they can be joined together smoothly to imitate natural speech.
It would be desirable to employ TSM algorithms in a more conventional mobile playback situation.