Sometimes it is desirable to control the speed at which a sound recording is played. Examples include messages played back using an answering machine or service; messages received using a network device (e.g., Internet-based audio streaming); speech learning tools for the hard of hearing and hearing aids; and tape recorders and the like.
Conventional methods for processing sound signals whose speed has been altered are based on either time-domain or frequency-domain techniques. In general, time-domain techniques are used to process sounds generated from conversations or speech, while frequency-domain techniques are used to process sounds generated from music. Efforts to use time-domain techniques on music have produced less than satisfactory results because music is “polyphonic” and, therefore, cannot be modeled using a single pitch, which is the underlying basis for time-domain techniques. Likewise, efforts to use frequency-domain techniques to process speech have also been less than satisfactory because they add a reverberant quality, among other things, to speech-based signals.
Attempts have been made to minimize the side-effects of frequency-domain techniques, but they have yielded only limited improvements in sound quality. See, for example, J. Laroche, “Improved phase vocoder time-scale modification of audio,” IEEE Trans. on Speech and Audio Proc., Vol. 7, no. 3, pp. 323-332, May 1999.
Other advances, mainly in time-domain time-scaling techniques, have exploited the fact that speech signals can be separated into different types of signal “portions”: “non-stationary” portions (sounds such as ‘p’, ‘t’, and ‘k’) and “stationary” portions (vowels such as ‘a’, ‘u’, ‘e’ and sounds such as ‘s’, ‘sh’). Conventional time-domain systems process each of these portions in a different manner (e.g., no time-scaling for short non-stationary portions). See, for example, E. Moulines, J. Laroche, “Non-parametric techniques for pitch-scale and time-scale modification of Speech”, Speech Commun., vol 16, pp. 175-205, February 1995. However, similar alterations of the time-scaling process based on the stationary features of a sound signal have not yet found their way into frequency-domain systems. As in time-domain systems, frequency-domain systems should process non-stationary signal portions in a different manner than stationary portions in order to achieve improvements in sound quality.
For example, time-domain systems process non-stationary portions in small increments (i.e., the entire portion is broken up into smaller segments so that it can be analyzed and processed), while stationary portions are processed in large increments. The phrase “frame-size” is used to describe the number of signal samples that are processed together at a given time.
Conventional frequency-domain techniques use a fixed frame-size and do not alter the frame-size based on signal characteristics. Because such techniques neither alter the frame-size nor otherwise vary the type of time-scaling used to process non-stationary signal portions, sound quality is sacrificed.
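By way of illustration only, the contrast between a fixed frame-size and a signal-dependent frame-size may be sketched as follows. The sketch is not taken from any cited reference or from the invention itself. It assumes a simple energy-ratio test as the measure of non-stationarity, and the names `frame_energy` and `choose_frame_sizes`, the frame-sizes of 64 and 256 samples, and the threshold of 2.0 are all hypothetical choices made for the example.

```python
import math


def frame_energy(samples):
    """Mean squared amplitude of a block of samples (0.0 for an empty block)."""
    return sum(s * s for s in samples) / len(samples) if samples else 0.0


def choose_frame_sizes(signal, small=64, large=256, ratio_threshold=2.0):
    """Return (start, size) pairs covering the signal.

    Assumption for illustration: a block is treated as non-stationary when
    the energy of one half of a look-ahead window differs from the energy of
    the other half by more than ratio_threshold. Such blocks receive the
    small frame-size; all other blocks receive the large frame-size.
    """
    frames = []
    pos = 0
    n = len(signal)
    eps = 1e-12  # avoids division by zero when one half is silent
    while pos < n:
        window = signal[pos:pos + large]  # look-ahead window
        half = len(window) // 2
        e1 = frame_energy(window[:half])
        e2 = frame_energy(window[half:])
        ratio = max(e1, e2) / (min(e1, e2) + eps)
        size = small if ratio > ratio_threshold else large
        size = min(size, n - pos)  # do not run past the end of the signal
        frames.append((pos, size))
        pos += size
    return frames


def tone(i):
    """One sample of a unit-amplitude sine with a period of 32 samples."""
    return math.sin(2 * math.pi * i / 32)


# Test signal: a steady tone, then silence, then an abrupt onset of the tone.
signal = [tone(i) for i in range(512)] + [0.0] * 128 + [tone(i) for i in range(128)]
frames = choose_frame_sizes(signal)
for start, size in frames:
    print(start, size)
```

In this sketch the steady tone is covered by large frames, while the region around the onset is covered by small frames. A fixed frame-size system corresponds to always selecting the large frame-size regardless of the energy ratio.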
Accordingly, it is desirable to provide methods and devices for selectively generating time-scaled sound signals in order to provide improvements in sound quality.
It is a further desire of the present invention to provide methods and devices for selectively generating sound signals which combine the advantages of both time-domain and frequency-domain processed signals.
It is yet an additional desire of the present invention to provide methods and devices for removing unwanted reverberant sound qualities in frequency-domain processing.
Further desires of the present invention will be apparent from the drawings, detailed description of the invention and claims which follow.