1. Field of the Invention
The present invention relates generally to the field of electronic audio effects. In particular, it relates to methods and systems for adjusting the pitch and sound of audio signals.
2. Discussion of the Related Art
Pitch shifting has a wide variety of applications in audio. For polyphonic music, pitch shifting can be used to change the key of a musical passage by one or more semitones up or down. Pitch shifting can also be used on a scale smaller than one semitone in order to adjust intonation. This technique is valuable for mixing together different previously recorded segments of music which may be detuned from each other, or for correcting intonation problems in a performance. For monophonic (single pitch) musical sources, including speech, pitch shifting can be used for both of these applications as well as for adding harmonization to a melodic line.
The most common pitch shifting algorithms for audio signals are based on resampling. Resampling pitch shifters sample the input audio stream at one sampling rate, and output the sampled data at a different sampling rate. For shifting pitch upwards, the output sampling rate is higher than the input sampling rate; for shifting pitch downwards, the output sampling rate is lower than the input sampling rate. In order to preserve the time length of the signal, resampling pitch shifters divide the audio stream into short separate time segments (on the order of 200 ms) and recombine those segments with varying degrees of overlap after resampling the segments. For a given input sample rate, to preserve the time length of a signal, the amount of overlap between time segments will increase as the output sample rate decreases. Resampling pitch shifters can be used with previously recorded audio or in real time with some latency between input and output.
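The resampling of a single time segment might be sketched as follows. This is a simplified illustration, not an implementation taken from any particular product: it uses linear interpolation, omits the recombination of overlapping segments, and the name `resample_segment` is hypothetical.

```python
def resample_segment(segment, ratio):
    """Read `segment` at `ratio` times its original speed.

    ratio > 1 raises the pitch and shortens the segment;
    ratio < 1 lowers the pitch and lengthens it.
    Linear interpolation is an assumption chosen for brevity.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if not segment:
        return []
    last = len(segment) - 1
    out_len = int(last / ratio) + 1
    out = []
    for n in range(out_len):
        pos = n * ratio              # fractional read position in the input
        i = int(pos)
        frac = pos - i
        nxt = segment[min(i + 1, last)]
        out.append((1.0 - frac) * segment[i] + frac * nxt)
    return out
```

Because the resampled segment is shorter or longer than the original, segments must be recombined with a different amount of overlap if the overall time length is to be preserved.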
For single pitch harmonic musical sources, the pitch of a particular signal is associated with a fundamental frequency of the note, which is defined as 1/T, where T is the time length of the signal's period. For example, the pitch known as A above middle C has a fundamental frequency of 440 Hz. The timbre of a musical note is associated with the harmonic structure of the note. Timbre is perceptually related to the "character" or "sound" of a note. It is timbre which distinguishes a man's voice from a woman's singing the same note, or the sound of a French horn from the sound of a trumpet. The relative weights of the harmonics present in a periodic signal are known collectively as its spectral envelope, and determine its timbre. For the case of human voice signals, if the spectral envelope of a signal retains its shape but is compressed along the frequency axis, the resulting signal will sound "deeper" or "bigger" than the original, but will have the same vowel sound. If, on the other hand, the spectral envelope is stretched while keeping the same shape, the resulting signal will sound "thinner" or "smaller" than the original, again with the vowel sound retained.
Resampling pitch shifters scale every frequency present in a signal by a constant factor. For example, if a signal is shifted up an octave by a resampling pitch shifter, every frequency present in the original signal will appear at double the frequency in the output signal. This means that not only will the pitch of the output signal be an octave higher than the original signal, but the spectral envelope will be stretched by a factor of two with respect to the original. Similarly a signal which is pitch shifted down will have its spectral envelope compressed. Thus, the timbre of a signal is altered by a resampling pitch shifter.
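The scaling described above can be illustrated with a signal represented as a list of partials. The representation as (frequency, amplitude) pairs and the example values are assumptions made purely for illustration.

```python
def resample_partials(partials, ratio):
    """Scale the frequency of every (frequency_hz, amplitude) partial by `ratio`.

    The amplitudes travel with their partials, so the spectral envelope
    is stretched (ratio > 1) or compressed (ratio < 1) by the same factor
    as the pitch.
    """
    return [(freq * ratio, amp) for freq, amp in partials]


# Hypothetical harmonic signal with a spectral peak at its third harmonic (660 Hz).
source = [(220.0, 1.0), (440.0, 0.5), (660.0, 0.8)]
octave_up = resample_partials(source, 2.0)
```

In this example the fundamental moves from 220 Hz to 440 Hz, and the spectral peak moves with it from 660 Hz to 1320 Hz.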
FIG. 1 shows time domain waveforms for a harmonic signal. As can be seen, signal 106 is a time-stretched version of signal 102. The period 108 of signal 106 is longer than the period 104 of signal 102. Thus the pitch of signal 106 is lower than the pitch of signal 102. Since the features of signal 106 are time-stretched compared to those of signal 102, the timbre of signal 106 is "deeper" than that of signal 102. Signal 110 is a time-compressed version of signal 102. The period 112 of signal 110 is shorter than the period 104 of signal 102. Thus the pitch of signal 110 is higher than the pitch of signal 102. Since the features of signal 110 are time-compressed relative to those of signal 102, the signal 110 has a "thinner" timbre than signal 102. For both altered signals, the spectrum is compressed or stretched by an amount determined by the amount of pitch modification.
For many audio signal processing applications it is desirable for the timbre of a sound to change as its pitch changes. For example, a trumpet sound shifted down by an octave will fall in the musical pitch range common for a trombone. If the pitch shift is accomplished with a resampling pitch shifter, the spectral envelope will be compressed by a factor of two, which will result in a timbre similar to that of a trombone. The overall effect of the resampling pitch shift will then be to "transform" the sound of the trumpet note to a sound that resembles a trombone tone both in pitch and timbre. This same fortunate circumstance applies to many musical instruments. A notable exception is the human voice.
The human voice has the unique feature that over a wide range of pitch, the timbre of the voice remains similar. Moreover, the human ear is attuned to human voice signals, so small changes in timbre have a large perceptual effect when dealing with human voices. Changes in the shape of the spectral envelope are perceived as changes in vowel sounds, while, as mentioned above, stretching of the spectral envelope is perceived as a change in deepness of the voice. Unfortunately, the scale by which the spectral envelope of a human voice signal can be stretched and still sound human is small. As a result, pitch shifting by more than a small musical interval using a resampling pitch shifter results in an unnatural sound for human voice signals. For example, a human voice which is shifted down by half an octave using a resampling pitch shifter might be described as having a "Darth Vader" quality, while a voice which is shifted up by half an octave using resampling might have a "chipmunk" quality.
Further compromising the usefulness of resampling pitch shifters for voice signals are the artifacts introduced by the recombination of the overlapping time segments. As each segment begins and ends, the amplitude of the output signal is increased and decreased. This results in amplitude modulation in the output. Also, while overlapping segments are added together, there are two sources of correlated data which are being combined. This results in comb filtering at the output. Thus, there are various kinds of distortion introduced by resampling pitch shifters, some of which are perceived as time domain artifacts and some of which appear as frequency domain filtering. Also, as mentioned above, resampling pitch shifters cannot work in real time without latency between the input and output signals.
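The comb filtering mentioned above can be illustrated with a simplified model in which the two overlapping segments carry identical content offset by a fixed delay and are summed with equal gain. Both conditions are assumptions made for illustration; in an actual shifter the segments are only partially correlated and are weighted by their fade curves.

```python
import cmath
import math


def comb_gain(freq_hz, delay_s):
    """Magnitude response of y(t) = x(t) + x(t - delay_s) at `freq_hz`.

    Equals 2 * |cos(pi * freq_hz * delay_s)|: nulls fall at odd multiples
    of 1 / (2 * delay_s), and peaks of gain 2 at multiples of 1 / delay_s.
    """
    return abs(1 + cmath.exp(-2j * math.pi * freq_hz * delay_s))


# With a 1 ms offset, 500 Hz is cancelled while 1000 Hz is doubled.
null_gain = comb_gain(500.0, 0.001)
peak_gain = comb_gain(1000.0, 0.001)
```

The regularly spaced nulls and peaks give the filter its comb-like frequency response.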
Other processes exist for changing the pitch of an audio signal without changing the signal's spectral envelope. When applied to human voice signals, these processes are referred to as fixed-formant pitch shifters. The most popular algorithm for fixed-formant shifting is known as the Lent algorithm, or the pitch-synchronous overlap-add algorithm. The Lent algorithm requires the ability to periodically window the input signal in a synchronous fashion, i.e., the window length must be related to the pitch period of the input signal. This in turn requires that the input signal have a single pitch. In other words, Lent shifting is possible only for monophonic (single-pitch) sources.
The Lent pitch shifter, when applied to human voice, results in an output which has a different pitch than the input, but the same timbral characteristics. Harmonies generated by the Lent shifter will sound as though they were sung by the same person who sang the original notes, preserving the human quality of the voice. This is desirable in many circumstances.
The Lent shifter works as follows: The input signal is first applied to a pitch detector. There are several known methods of pitch detection, including autocorrelation methods and low-pass filter/zero crossing detector methods. A pitch detector suitable for use in a Lent shifter is available from Aureal Semiconductor, Inc. of Fremont, Calif. The pitch detector provides the period T of the harmonic input signal. The signal is then periodically windowed by a Hanning window or other suitable window of length greater than or equal to 2T. The exact window function used is not critical, but it is desirable to use a window with small sidelobes. FIGS. 2a-2c show the windowing process. FIG. 2a shows a continuous, infinite-length time signal. FIG. 2b shows a window function whose length is equal to two periods of the signal in FIG. 2a. FIG. 2c shows the windowed signal, which is the product of the window function and the time signal. This signal has finite length, since the window function is nonzero only for a finite time.
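The pitch detection and windowing steps might be sketched as follows. This is a minimal illustration under stated assumptions: the autocorrelation search is unnormalized and exhaustive, the period is taken as a whole number of samples, and the function names are hypothetical. It is not the commercial pitch detector mentioned above.

```python
import math


def estimate_period(signal, min_lag, max_lag):
    """Return the lag, in samples, with the largest autocorrelation.

    The sum is left unnormalized, which mildly favors shorter lags and so
    helps avoid reporting a multiple of the true period.
    """
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def hann(length):
    """Periodic Hann window of `length` samples."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def windowed_grain(signal, center, period):
    """Two periods of `signal` centered on `center`, multiplied by a Hann window.

    Requires period <= center <= len(signal) - period.
    """
    start = center - period
    window = hann(2 * period)
    return [window[n] * signal[start + n] for n in range(2 * period)]
```

The resulting grain corresponds to the finite-length windowed signal of FIG. 2c for a window length of exactly 2T.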
The window spectrally smooths the signal, eliminating the fine structure of the spectrum. This removes any pitch associated with the input signal, and leaves only the spectral envelope or timbral information. The windowed data segments are recombined at a rate 1/T', where T' is the desired output period for the signal. This impresses the desired pitch on the windowed data. If T' is set to a constant, the output signal will have a fixed musical pitch. If on the other hand T' is computed as a fixed (fractional) multiple of T, the output pitch will be a fixed musical interval from the input pitch.
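The recombination of windowed segments at the output period T' might be sketched as follows. The sketch assumes a constant input period that is a whole number of samples and places the input segments on a fixed grid rather than on detected pitch marks; the name `lent_shift` is hypothetical.

```python
import math


def hann(length):
    """Periodic Hann window of `length` samples."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def lent_shift(signal, period, out_period):
    """Overlap-add two-period grains of `signal` at a spacing of `out_period`.

    `period` is the input period T and `out_period` the desired output
    period T', both in samples. Each output grain is copied from the input
    grain nearest in time, so the time length of the signal is preserved.
    """
    if len(signal) < 2 * period:
        raise ValueError("signal must span at least two periods")
    window = hann(2 * period)
    out = [0.0] * len(signal)
    max_k = len(signal) // period - 1      # last grid index whose grain fits
    out_center = period
    while out_center + period <= len(signal):
        k = min(max(1, round(out_center / period)), max_k)
        in_start = k * period - period
        out_start = out_center - period
        for n in range(2 * period):
            out[out_start + n] += window[n] * signal[in_start + n]
        out_center += out_period
    return out
```

When `out_period` equals `period`, the overlapping windows sum to one and the interior of the signal is reproduced unchanged. For other values the output repeats every `out_period` samples while each grain keeps its original shape, although the output level varies with the amount of overlap.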
The resampling pitch shifter changes the pitch of a signal and stretches its spectral envelope, both by the same factor. The Lent shifter changes the pitch of a signal without changing the spectral envelope, or timbre. For some applications it is desirable to be able to process an audio signal to change its pitch and timbre independently. An example would be creating harmonies for a vocal melody whose timbre is similar but not identical to the timbre of the original melody. This would result in the accompanying harmony voices sounding like a different person, but still sounding human. The resampling pitch shifter and Lent shifter can be combined to create a device that gives independent control over the pitch and timbre of an input audio signal. Such a device is shown in FIG. 3. An audio input signal 301 is first routed to a resampler 303 where the timbre and pitch are adjusted, producing an intermediate signal 305. Since a resampler is used, the fundamental frequency is modified by the same factor by which the spectral envelope is stretched. This intermediate signal is then sent through a Lent shifter 307 for adjusting the pitch of the signal. However, an output signal 309 from such a device retains the artifacts of both the resampler and the Lent shifter. In addition, each of the two pitch shifters in the system requires separate memory and processing power, which makes the entire algorithm computationally expensive.
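The division of labor between the two stages of FIG. 3 can be expressed as simple arithmetic. In the following sketch, the function name and the convention that a ratio greater than one means "higher" or "stretched" are assumptions made for illustration.

```python
def cascade_factors(pitch_ratio, timbre_ratio):
    """Split a desired pitch and timbre change across the two stages.

    The resampler scales pitch and spectral envelope together, so it is
    set to the desired envelope factor. The Lent shifter then moves the
    pitch alone by whatever factor remains.
    Returns (resample_ratio, lent_ratio).
    """
    if pitch_ratio <= 0 or timbre_ratio <= 0:
        raise ValueError("ratios must be positive")
    resample_ratio = timbre_ratio
    lent_ratio = pitch_ratio / timbre_ratio
    return resample_ratio, lent_ratio


# A harmony a fifth above (ratio 1.5) with the envelope compressed by 10 percent.
resample_ratio, lent_ratio = cascade_factors(1.5, 0.9)
```

The net pitch factor is the product of the two stage factors, while the net envelope factor is that of the resampler alone.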
Therefore it would be desirable to have a pitch and timbre adjusting mechanism that does not have the overhead or expense of having a resampling step followed by a Lent shifting step. It would also be desirable to reduce artifacts introduced by signal processing. Finally, it would be desirable to minimize the latency from input to output of the algorithm. Small latencies are essential for any application which is used for real time performance, since any perceptible latency from input to output would be frustrating to a performer.