A need has long been recognized for a method which can take normal speech signals as input, and produce high-quality pitch-modified speech signals as output. Such a system could be used by the author of a screenplay, soundtrack, or other type of script to convey stress and intonational patterns that cannot easily be conveyed by text, and to highlight distinctions between interacting voices that make a script more easily understood. Other related uses include localized editing of pitch contours in motion picture dialog, voice quality control in foreign language dubbing, cartoon characters' speech, books on tape, and voice mail over the internet.
In other applications it is necessary to play back recordings with a variety of intonational characteristics, without unduly inflating the storage requirements of a system. A computer game that contains recordings of animated characters, for example, must be distributed over network connections of limited bandwidth or on media of limited capacity. Whether or not an audio recording has been compressed, providing separate storage for modified copies of an original is inefficient. Other examples include systems for hearing evaluation, public address, voice response over a telephone network, and dictation playback.
Pitch modification is particularly useful in the area of text-to-speech synthesis, or simply constrained-vocabulary speech synthesis. Since parametric speech synthesizers often produce a robotic, monotonous cadence that is difficult for many listeners to follow, there is a need in the art for alternative methods of controlling the intonational characteristics of synthesized speech. A system such as an information retrieval engine or automotive navigational aid can produce a more natural-sounding output by mapping the pitch and timing characteristics of concatenated recordings onto suprasegmental contours derived from the text.
In still other applications, low-delay systems for voice signal modification permit real-time interaction with an audience. Examples of this include karaoke or other live musical performances, comedic acts, adjustments to the voices of television and radio announcers, and the disguising of protected witnesses' voices in courtroom proceedings or television interviews. Examples of one-to-one interaction include standardized voter polling and public opinion surveys, concealment of identity by law enforcement personnel, hearing aids, and restoration of helium speech.
In all of these applications, the most important factor for commercial acceptance is the subjective quality of the output signal. Previously developed techniques have produced a wide range of objectionable subjective qualities: reverberation, squeaking sounds, noiselike effects such as buzz and hiss, clicking sounds, hoarseness, irregularities in pitch, etc. Solutions which appear to work well in the music industry succeed because of the nature of musical signals. Musical signals tend to be highly periodic, high in amplitude compared to background aperiodic components, and sustained over relatively long periods of time. In addition, the fundamental frequencies of singing voices are often well into the range of the first two formants. By contrast, normal speech signals have strong unvoiced components, higher rates of articulation, and relatively low pitch. Thus, there is a need in the art for a method of high-quality signal modification that works on normal speech inputs.
Previous methods for signal modification can be classified into three categories: 1) time-domain methods, 2) transform-domain methods which do not use matching processes, and 3) parametric or model-based methods.
Time-domain methods for "pitch shifting" generally perform the operation of frequency scaling, which does not correspond to the action of a person modulating his or her pitch frequency. To a first approximation, the short-time spectrum of a voice signal is the product of two components: the spectrum envelope, which is a smoothly varying outline of the various peaks and valleys in the spectrum, and the source spectrum, which contains finer-scale detail. FIG. 1A represents the log magnitude of a short-time spectrum and its corresponding log magnitude envelope. In theory, the fine spectral features correspond to an acoustic excitation, i.e., vibrating vocal cords or air passing turbulently through a constriction of the vocal tract, while the spectrum envelope corresponds to the acoustic filtering action of the vocal tract. FIG. 1A represents the spectrum of an approximately periodic region of the speech wave, where the excitation is voiced.
A frequency scaling operation scales the entire spectrum in frequency, including the spectrum envelope. In contrast, the action of a person modulating his or her pitch frequency scales the spectrum envelope by only a small amount, due to minor changes in vocal tract length. For a speech signal, the term "pitch shifting" implies that the spectrum envelope remains approximately in place, while the characteristic size of fine spectral features is either increased or decreased. Frequency scaling does not hold the envelope in place, which results in the familiar "Mickey Mouse" or "Alvin the Chipmunk" effect. Nevertheless, some frequency scaling methods are able to produce remarkably noise-free outputs.
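The difference between the two operations can be written compactly under the source-filter approximation described above. The following is only an illustrative sketch: the symbols E (spectrum envelope), P (source spectrum), and the scale factor α are introduced here for convenience, amplitude factors are ignored, and the small envelope change that accompanies natural pitch modulation is neglected.

```latex
\begin{aligned}
|S(\Omega)| &\approx E(\Omega)\,P(\Omega)
  && \text{(short-time spectrum: envelope times source)}\\
|S_{\mathrm{fs}}(\Omega)| &\approx E(\Omega/\alpha)\,P(\Omega/\alpha)
  && \text{(frequency scaling: envelope and fine detail both move)}\\
|S_{\mathrm{ps}}(\Omega)| &\approx E(\Omega)\,P(\Omega/\alpha)
  && \text{(pitch shifting: envelope approximately held in place)}
\end{aligned}
```

With α greater than one, both operations raise the spacing of the fine spectral features, but only frequency scaling also moves the formant peaks of E upward, which is the source of the cartoon-voice effect.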
This success in producing noise-free outputs is due to the use of matching processes. In general, a time-domain matching process is a computation which measures the degree of similarity or dissimilarity between a reference time-domain segment and a set of candidate time-domain segments. A common example is the cross-correlation function, which can be defined as ##EQU1## where r[m] = y[n+m] w_2[m] is the reference segment, and c_i[m] = x[n-i+m] w_1[m] is the ith candidate segment. w_2[m] is a windowing function that selects a region of the signal y[n] in the vicinity of time index n, and w_1[m] is a windowing function that selects a region of the signal x[n] in the vicinity of time index n-i.
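A time-domain matching process of this kind can be sketched in a few lines. The equation itself is not reproduced in this text, so the sketch assumes the conventional form, a sum over m of the products r[m] c_i[m], and assumes that m runs from 0 to the window length minus one. The function names, the Hann-shaped window, and the three-harmonic test signal are all hypothetical choices made for illustration.

```python
import math

def hann(length):
    """Symmetric, finite-duration window with `length` samples."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * (m + 0.5) / length)
            for m in range(length)]

def cross_correlation(x, y, n, i, w1, w2):
    """Assumed form: C_n[i] = sum over m of r[m] * c_i[m]."""
    total = 0.0
    for m in range(len(w1)):
        r = y[n + m] * w2[m]        # reference segment sample
        c = x[n - i + m] * w1[m]    # i-th candidate segment sample
        total += r * c
    return total

def best_lag(x, y, n, lags, w1, w2):
    """Return the candidate index i with the largest cross-correlation."""
    return max(lags, key=lambda i: cross_correlation(x, y, n, i, w1, w2))

def harmonic_signal(k, period=50):
    """Hypothetical test signal: three harmonics of a 50-sample period."""
    return sum(a * math.sin(2 * math.pi * h * k / period)
               for h, a in ((1, 1.0), (2, 0.5), (3, 0.25)))

x = [harmonic_signal(k) for k in range(400)]
y = [harmonic_signal(k - 7) for k in range(400)]   # y lags x by 7 samples
window = hann(200)
print(best_lag(x, y, 100, range(-10, 21), window, window))  # prints 7
```

The search range is kept shorter than one period on purpose: for a periodic input, candidates one full period apart match equally well, so a practical matching process has to bound the set of candidates.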
The reference segment is sometimes obtained from the same signal as the candidate segments, in which case a cross-correlation function can be defined as ##EQU2## where r[m] = x[n+m] w_2[m] is the reference segment, and c_i[m] is the ith candidate segment as before. The means of x[n] and y[n], or alternatively the means of c_i[m] and r[m], are sometimes removed prior to the computation of C_n[i]. A normalized cross-correlation can be obtained by dividing C_n[i] by the square root of the energy product E_r E_ci, where ##EQU3## FIG. 3A represents two windowing functions being applied to a time-domain input signal x[n]. FIG. 3B represents the corresponding candidate segment c_i[m], and FIG. 3C represents the corresponding reference segment r[m]. Here, the windowing functions are symmetric and finite-duration. Agnello, U.S. Pat. No. 4,464,784 describes a signal modification method in which reference and candidate segments are obtained from the same signal.
FIG. 3D represents a windowing function being applied to a time-domain output signal y[n]. FIG. 3F represents the corresponding reference segment r[m], while FIG. 3E remains the same as FIG. 3B. Hejna et al., U.S. Pat. No. 5,175,769 describes a time-scale modification method in which the reference segment is obtained from a partially constructed output signal (methods for time-scale modification and frequency scaling can be interconverted by interpolation).
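The normalized, same-signal variant can be illustrated as follows. This is a sketch under stated assumptions rather than the method of either cited patent: the energies are assumed to be the sums of squared segment samples, the windows are rectangular for simplicity, and the function name, the `remove_mean` option, and the slowly growing test signal are hypothetical.

```python
import math

def normalized_cross_correlation(x, n, i, length, remove_mean=False):
    """C_n[i] / sqrt(E_r * E_ci), using rectangular windows of `length` samples."""
    r = [x[n + m] for m in range(length)]        # reference segment
    c = [x[n - i + m] for m in range(length)]    # i-th candidate segment
    if remove_mean:
        r_mean, c_mean = sum(r) / length, sum(c) / length
        r = [v - r_mean for v in r]
        c = [v - c_mean for v in c]
    corr = sum(a * b for a, b in zip(r, c))
    energy_r = sum(a * a for a in r)             # assumed definition of E_r
    energy_c = sum(b * b for b in c)             # assumed definition of E_ci
    if energy_r == 0.0 or energy_c == 0.0:
        return 0.0
    return corr / math.sqrt(energy_r * energy_c)

# Hypothetical input: a 50-sample periodic waveform whose level grows over time.
x = [1.01 ** k * (math.sin(2 * math.pi * k / 50) + 0.5 * math.sin(4 * math.pi * k / 50))
     for k in range(400)]

lags = range(20, 80)
best = max(lags, key=lambda i: normalized_cross_correlation(x, 200, i, 100))
print(best)  # prints 50
```

Because the candidate one period back is a scaled copy of the reference, the normalized value at that lag is essentially 1 even though the two segments differ in level, which is the practical reason for dividing by the energy product.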
The second major category of previous methods contains transform-domain methods which do not use matching processes. A transform-domain representation of a signal can be obtained in a variety of ways. For audio signals, a commonly used method is the Short-Time Fourier Transform (STFT), which can be defined (from Rabiner and Schafer, "Digital Processing of Speech Signals," which is incorporated by reference) as ##EQU4## where x[n] is a time-domain input signal, w[n] is a windowing function such as a Hanning window, e^{-jΩm} = cos(Ωm) - j sin(Ωm) is a complex exponential basis function of frequency Ω, and X(n,Ω) is the transform-domain representation. If n is considered fixed and Ω is considered variable, X(n,Ω) is the normal Fourier transform of the sequence w[n-m] x[m], and this is known as the "block transform" method. If Ω is considered fixed and n is considered variable, X(n,Ω) is the convolution of w[n] with x[n] e^{-jΩn}, and this is known as the "filter bank" method. Both methods sample the same function, X(n,Ω), and both provide transform-domain representations. In the absence of signal modifications, and given sufficiently high sampling rates in n and Ω, both representations give back the original signal after inverse-transformation.
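The equivalence of the two views can be checked numerically. The sketch below evaluates X(n,Ω) once as the Fourier transform of the windowed sequence w[n-m] x[m] and once as a convolution of the window with the modulated signal. The function names, the zero-centered Hann-shaped window, and the test tone are assumptions made for illustration only.

```python
import cmath
import math

def hann_centered(half):
    """Symmetric window w[k], defined for k = -half..half and zero elsewhere."""
    return {k: 0.5 * (1 + math.cos(math.pi * k / half))
            for k in range(-half, half + 1)}

def stft_block(x, w, n, omega):
    """Block-transform view: Fourier transform of w[n-m] x[m] at fixed n."""
    total = 0j
    for m in range(len(x)):
        k = n - m
        if k in w:
            total += w[k] * x[m] * cmath.exp(-1j * omega * m)
    return total

def stft_filter_bank(x, w, n, omega):
    """Filter-bank view: w convolved with x[m] e^{-j omega m}, at fixed omega."""
    modulated = [x[m] * cmath.exp(-1j * omega * m) for m in range(len(x))]
    return sum(w[k] * modulated[n - k] for k in w if 0 <= n - k < len(x))

N = 64                                                     # frequency samples
x = [math.cos(2 * math.pi * 8 * m / N) for m in range(256)]  # tone at bin 8
w = hann_centered(32)

magnitudes = [abs(stft_block(x, w, 100, 2 * math.pi * k / N)) for k in range(N // 2 + 1)]
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
print(peak_bin)  # prints 8
```

Both functions compute the same sum and differ only in which variable is held fixed while the other is swept, so the choice between them is one of implementation structure rather than of representation.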
One problem with STFT representations, particularly in audio coding applications, is that they are not critically sampled: the total number of transform-domain samples needed for exact reconstruction is greater than the number of time-domain samples being represented. This has led to a variety of alternative transform-domain representations, including polyphase-structured filter banks, modulated filter banks, and tree-structured subband representations such as wavelets (Vetterli and Kovacevic, "Wavelets and Subband Coding," which is incorporated by reference). All of these representations can also be oversampled.
A transform-domain matching process is similar to a time-domain matching process, except that it measures the degree of similarity or dissimilarity between a reference transform-domain section and a set of candidate transform-domain sections. If X[n,k] is an STFT sampled at frequencies Ω=2πk/N, k=0 . . . N-1, a matching function that is useful in block transform methods is ##EQU5## where r[b] is the reference section, c_i[j] is the ith candidate section, and "dot" signifies the dot product between two complex values. w_2[b] is a windowing function that selects a region of Y[n,k] in the vicinity of frequency index k, and w_1[j] is a windowing function that selects a region of X[n,k] in the vicinity of frequency index k-i. In this case, C_{n,k}[i] compares a reference section to candidate sections of variable center frequency index k-i, at fixed time index n. FIG. 4A shows the region of X[n,k] that is used in forming candidate sections c_i[j]. Each circle in the figure represents a sample of X[n,k], and circles with a line through them are used in forming one or more candidate sections. Open circles represent the centers [n,k-i]. Each of the open circles corresponds to a sum of dot products according to Eq. (5), and to one particular value of i.
Another matching function that is useful in block transform methods is ##EQU6## where W = e^{-j2π/N}, c_i[j] is the ith candidate section, and w_1[i] is a windowing function that selects a region of X[n,k] in the vicinity of time index n-i. As is well known in the art, a modulation by W^{-kn} converts X[n,k], the fixed-time-reference quantity, into a sliding-time-reference quantity. In this case, C_{n,k}[i] compares a reference section to candidate sections of variable time index n-i, at fixed center frequency k. FIG. 4B shows the region of X[n,k] that is used in forming candidate sections c_i[j]. Open circles represent the centers [n-i,k]. Each of the open circles corresponds to a sum of dot products according to Eq. (6), and to one particular value of i.
A matching function that is useful in filter bank methods is ##EQU7## where w_2[m] is a windowing function that selects a region of Y[n,k] in the vicinity of time index n, and w_1[m] is a windowing function that selects a region of X[n,k] in the vicinity of time index n-i. In this case, C_{n,k}[i] compares a reference section to candidate sections of variable center time index n-i, at fixed frequency k. FIG. 4C shows the region of X[n,k] that is used in forming candidate sections c_i[m]. Open circles represent the centers [n-i,k]. Each of the open circles corresponds to a sum of dot products according to Eq. (7), and to one particular value of i.
Another matching function that is useful in filter bank methods is ##EQU8## where c_i[m] is the ith candidate section, and w_1[m] is a windowing function that selects a region of X[n,k] in the vicinity of frequency index k-i. In this case, C_{n,k}[i] compares a reference section to candidate sections of variable frequency k-i, at fixed center time index n. FIG. 4D shows the region of X[n,k] that is used in forming candidate sections c_i[m]. Open circles represent the centers [n,k-i]. Each of the open circles corresponds to a sum of dot products according to Eq. (8), and to one particular value of i.
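As a concrete illustration of a sum of dot products between complex transform-domain samples, the following sketch implements one plausible frequency-direction match at a fixed time index, in the spirit of Eq. (5). The matching functions themselves are not reproduced in this text, so the exact form is an assumption: rectangular windows are used, and the function names, the section half-width, and the synthetic frames are hypothetical.

```python
import cmath
import math

def dot(a, b):
    """Dot product of two complex values treated as 2-D real vectors."""
    return a.real * b.real + a.imag * b.imag

def frequency_match(X_frame, Y_frame, k, i, half_width):
    """Assumed form: sum over j of dot(Y[n,k+j], X[n,k-i+j]) at one fixed n."""
    total = 0.0
    for j in range(-half_width, half_width + 1):
        r = Y_frame[k + j]        # reference section sample
        c = X_frame[k - i + j]    # i-th candidate section sample
        total += dot(r, c)
    return total

# Hypothetical frames: a complex spectral bump in X, moved up by 3 bins in Y.
X_frame = [math.exp(-((q - 20) ** 2) / 8.0) * cmath.exp(0.7j * q) for q in range(64)]
Y_frame = [0j] * 3 + X_frame[:-3]

best = max(range(-5, 6), key=lambda i: frequency_match(X_frame, Y_frame, 23, i, 4))
print(best)  # prints 3
```

Because the dot product depends on phase as well as magnitude, a candidate section only scores highly when both its shape and its phase pattern agree with the reference, which distinguishes this measure from a comparison of magnitude spectra.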
As with time-domain matching processes, other measures of similarity or dissimilarity are possible. One measure that has been used in some methods is a cross-correlation between magnitude spectra or power spectra. If X[n,k] and Y[n,k] represent transform-domain magnitudes, such a measure can be obtained from any of the above forms by removing any unit-magnitude modulations and replacing the dot product with scalar multiplication. In the case of positive-valued functions like magnitude and power spectra, another possibility is to take logarithms, remove the mean logarithm, and then use an absolute magnitude difference function (AMDF) or cross-correlation.
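The logarithmic variant can be sketched briefly. The example assumes the AMDF is taken as the mean absolute difference between two mean-removed log-magnitude sections. The function names, the small floor that guards against the logarithm of zero, and the sample values are hypothetical.

```python
import math

def log_mean_removed(magnitudes, floor=1e-12):
    """Take logarithms of positive values and subtract their mean."""
    logs = [math.log(max(v, floor)) for v in magnitudes]
    mean = sum(logs) / len(logs)
    return [v - mean for v in logs]

def amdf(a, b):
    """Absolute magnitude difference between two equal-length sections."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

reference = [1.0, 4.0, 9.0, 4.0, 1.0]
louder_copy = [10.0 * v for v in reference]     # same shape, different gain
different = [9.0, 4.0, 1.0, 4.0, 9.0]           # different shape

ref_log = log_mean_removed(reference)
print(amdf(ref_log, log_mean_removed(louder_copy)))  # essentially zero
print(amdf(ref_log, log_mean_removed(different)))    # clearly nonzero
```

Removing the mean logarithm makes the measure insensitive to overall gain, so two sections with the same spectral shape match even when their levels differ.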
Several transform-domain methods which do not use matching processes have been described in the literature. In an article entitled "Phase Vocoder", J. L. Flanagan and R. M. Golden describe a method of constructing time-domain output from frequency-scaled phase derivative signals. This method causes phase relationships between different bands to be arbitrarily altered, and produces a characteristic type of reverberation. In an article entitled "System to Independently Modify Excitation and/or Spectrum of Speech Waveform Without Explicit Pitch Extraction," S. Seneff describes a method which divides out a spectrum envelope in the frequency domain, and then restores this envelope after frequency-scaling the excitation spectrum using phase vocoder methods. In an article entitled "A New Speech Modification Method By Signal Reconstruction," M. Abe et al. use the iterative procedure of D. W. Griffin and J. S. Lim to approximate a magnitude spectrum condition obtained through homomorphic analysis. None of these methods utilize matching processes.
The third major category of previous methods is parametric or model-based methods. A typical model-based approach is to deconvolve a signal using a model filter, such as the model filter defined by a set of Linear Prediction coefficients, frequency-scale the resulting source signal, and then form a time-domain output by passing the frequency-scaled source signal through the model filter. This approach produces many objectionable subjective qualities.
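The deconvolve-and-resynthesize structure can be sketched with a small Linear Prediction example. It uses the autocorrelation method with the Levinson-Durbin recursion, which is one common way to obtain the coefficients and is an assumption here, not something the text specifies. The frequency-scaling step applied to the source signal is deliberately omitted, and all function names and the two-pole test signal are hypothetical.

```python
def autocorrelation(x, order):
    """Autocorrelation values r[0..order] of a finite sequence."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(order + 1)]

def levinson_durbin(r, order):
    """Predictor coefficients a_1..a_p such that x[n] ~ sum_k a_k x[n-k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)               # updated prediction error
    return a[1:]

def inverse_filter(x, a):
    """Deconvolution: source e[n] = x[n] - sum_k a_k x[n-k]."""
    return [x[n] - sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
            for n in range(len(x))]

def synthesis_filter(e, a):
    """Model filter: y[n] = e[n] + sum_k a_k y[n-k]."""
    y = []
    for n in range(len(e)):
        y.append(e[n] + sum(a[k] * y[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0))
    return y

# Hypothetical input: a pulse train passed through a two-pole resonance.
pulses = [1.0 if n % 40 == 0 else 0.0 for n in range(200)]
x = synthesis_filter(pulses, [1.6, -0.8])

coeffs = levinson_durbin(autocorrelation(x, 2), 2)
source = inverse_filter(x, coeffs)
rebuilt = synthesis_filter(source, coeffs)
print(max(abs(p - q) for p, q in zip(x, rebuilt)))  # tiny rounding error only
```

Without any modification of the source signal the round trip returns the input, so whatever degradation a method of this type introduces arises in the omitted step, where the source is frequency-scaled before being passed back through the model filter.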
Another parametric method is described by D. W. Griffin and J. S. Lim in an article, "A New Model-Based Speech Analysis/Synthesis System". In this method, the pitch and spectrum envelope for each frame are determined by a matching process which compares model magnitude spectra to the observed magnitude spectrum. Another parametric method which uses a matching process is described by T. E. Quatieri and R. J. McAulay in an article entitled "Speech Transformations Based on a Sinusoidal Representation", and in U.S. Pat. No. 4,885,790. In this method, input signals are approximated by a set of sinusoids having time-varying amplitudes, frequencies, and phases. For each input frame, the frequencies of the model sinusoids are determined by peak-picking methods. Such peaks are then connected from frame to frame using a matching process which incorporates a birth-death algorithm. In order to produce an output signal which approximates an unmodified input, the amplitudes, frequencies, and phases of the model sinusoids are interpolated from frame to frame. Pitch scaling and frequency scaling are provided by scaling the model sinusoids in frequency, with or without envelope compensation respectively.
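The peak-picking and frame-to-frame matching steps can be illustrated with a deliberately simplified sketch. It is not the algorithm of the cited article or patent: it uses a greedy nearest-frequency rule with a fixed maximum jump, and the function names, the `max_jump` threshold, and the example frequencies are hypothetical.

```python
def pick_peaks(magnitudes):
    """Indices of local maxima in one frame's magnitude spectrum."""
    return [k for k in range(1, len(magnitudes) - 1)
            if magnitudes[k] > magnitudes[k - 1] and magnitudes[k] >= magnitudes[k + 1]]

def match_tracks(prev_freqs, next_freqs, max_jump):
    """Greedy matching: continue a track to the nearest unused peak within max_jump.

    Returns (continued, deaths, births): continued tracks as (old, new) pairs,
    old peaks with no successor, and new peaks with no predecessor.
    """
    continued, deaths = [], []
    unused = list(next_freqs)
    for f in prev_freqs:
        nearest = min(unused, key=lambda g: abs(g - f)) if unused else None
        if nearest is not None and abs(nearest - f) <= max_jump:
            continued.append((f, nearest))
            unused.remove(nearest)
        else:
            deaths.append(f)               # track ends in the previous frame
    return continued, deaths, unused       # leftover peaks start new tracks

print(pick_peaks([0, 1, 3, 1, 0, 2, 5, 2, 0]))           # prints [2, 6]
print(match_tracks([100, 200, 300], [104, 310, 450], 15))
# prints ([(100, 104), (300, 310)], [200], [450])
```

A practical tracker would resolve competing claims on the same peak more carefully than this greedy pass does, but the three outcomes shown (continuation, death, and birth) are the ones the matching process has to distinguish.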