An audio signal representing speech may convey information allowing a listener to identify certain characteristics of the speaker. For example, male speakers are commonly associated with lower-pitched voices than female speakers. Similarly, some listeners may draw inferences about a speaker's race, age, emotional state, or physical attractiveness from listening to an audio signal representing the speaker's voice. In certain situations, it may be desirable to prevent the listener from drawing such inferences. For example, when a recruiter listens to a prospective applicant speaking over a voice connection, it may increase the objectivity of the recruiting process if the recruiter is prevented from forming conclusions based on characteristics of the applicant's voice.
Because such inferences may be drawn on a subconscious level by some listeners, it may be difficult for those listeners to refrain from drawing such inferences even when they consciously wish to do so. Accordingly, a system that prevents the listener from drawing such inferences, without significantly impeding effective verbal communication between the speaker and the listener, is desirable.
While techniques for adjusting pitch without affecting the duration of a signal are well known, simple pitch shifting provides poor results for voice masking because certain patterns that human listeners rely on to understand the speech content of the signal may be disrupted.
Source-Filter Model
Without being limited by theory, it is believed that the sound of a speaker's voice is significantly determined by a fundamental frequency, and harmonics thereof, produced in the speaker's larynx. A variation of this fundamental frequency is generally perceived as a change of pitch in the voice. The fundamental and harmonic frequencies produced in the larynx are filtered by the speaker's vocal tract. Depending on the spoken phoneme, the speaker's vocal tract will emphasize some frequencies and attenuate others.
The human vocal system may thus be conceptualized using a source-filter model, wherein the source corresponds to the larynx and the filter corresponds to the vocal tract. The frequencies which are most strongly emphasized by the vocal tract during a particular period of vocalization are referred to as formant frequencies. When the vocal tract is viewed as a filter, these formant frequencies may be considered the peaks of the filter's transmission function.
The fundamental frequency of a human's larynx varies individually, and is correlated with the speaker's age, sex, and possibly with other characteristics. The formant frequencies and, more generally, the shape of the vocal tract's transmission function are believed to vary depending both on the spoken phoneme and individual characteristics of the speaker. Accordingly, both the fundamental frequency and the formant frequencies convey information about certain attributes of the speaker, and are interpreted by listeners accordingly.
One empirical study found that a typical male speaker speaking the sound “u”, as pronounced in “soot” or “tomb”, would exhibit a fundamental frequency of 141 Hz, with formants at 300 Hz, 870 Hz, and 2240 Hz, respectively. By comparison, a typical female speaker pronouncing the same phoneme would have a fundamental frequency of 231 Hz, with formant frequencies at 370 Hz, 950 Hz, and 2670 Hz. A child pronouncing the same phoneme would have yet another set of typical fundamental and formant frequencies. Peterson et al., Control Methods Used in a Study of the Vowels, Journal of the Acoustical Society of America, Vol. 24, No. 2 (1952).
To transform a signal representing one speaker's voice into a signal with characteristics approximating a different speaker's voice, various methods have been proposed to adjust both the pitch (corresponding to the fundamental frequency of the source) and the formant frequencies (corresponding to the peaks of the filter's transmission function). E.g., Tang, Voice Transformations: From Speech Synthesis To Mammalian Vocalizations, EUROSPEECH 2001 (Aalborg, Denmark, Sep. 3-7, 2001). Some of these methods determine approximations of the source and filter components of a recorded voice signal, separately adjust them, and reconvolve them. For example, according to Tang, the vocal source and vocal filter can be modeled as a convolution in the time domain representation of a recorded signal, which is equivalent to a multiplication in the frequency domain. By converting the signal into a frequency-domain representation using a discrete Fourier transform, and then converting the frequency-domain representation to polar coordinates, a magnitude spectrum can be determined. By then determining an envelope of the magnitude spectrum and dividing the magnitude spectrum by its envelope, an “excitation spectrum”, which can be viewed as an approximation of the source spectrum in the source-filter model, can be determined. Tang's approach is one of many frequency domain voice transformation techniques that rely on the basic “Phase Vocoder.” See Flanagan et al., Phase Vocoder, Bell System Technical Journal, November 1966.
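As an illustration, the separation step just described can be sketched in Python with NumPy. The sketch is a minimal approximation rather than Tang's actual implementation: in particular, the spectral envelope is estimated here with a simple moving average standing in for the low-pass filtering Tang describes, and the function name and parameters are illustrative.

```python
import numpy as np

def separate_source_filter(frame, envelope_width=8):
    """Approximate source/filter separation of one audio frame.

    The windowed frame is converted to the frequency domain and to
    polar form; the magnitude spectrum is smoothed with a moving
    average to obtain a crude spectral envelope (the "filter"), and
    dividing the magnitude spectrum by that envelope yields an
    "excitation spectrum" approximating the source.
    """
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    # Polar coordinates: magnitude and phase.
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    # Crude spectral envelope: moving-average smoothing of the magnitudes.
    kernel = np.ones(envelope_width) / envelope_width
    envelope = np.convolve(magnitude, kernel, mode="same")
    envelope = np.maximum(envelope, 1e-12)  # avoid division by zero
    excitation = magnitude / envelope       # approximate source spectrum
    return magnitude, envelope, excitation, phase
```

Reconvolution then amounts to multiplying a (possibly modified) envelope by the excitation spectrum and applying an inverse transform using the retained phase.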
Other literature has recognized that the use of discrete Fourier analysis and subsequent Fourier synthesis in the context of processing audible signals may require steps to compensate for the inherent discretization artifacts introduced by the methods. Specifically, Fourier analysis may introduce “frequency smearing”—discretization errors that occur when the signal includes frequencies that do not fully align with any frequency bin. This may lead to a number of effects undesirable in the context of audio processing, including, for example, interference effects between adjacent channels. The literature has also recognized that these effects can be reduced by appropriately relating the phase of the signal to the frequency of the frequency bin. Puckette describes the sound resulting from interference between adjacent frequency bins as “reverberant” and proposes a technique described as “phase locking”, or modifying the phase of the reconstructed signal so as to maximize the difference in phase between adjacent frequencies. Puckette, Phase-locked Vocoder, Proceedings of the 1995 IEEE ASSP Conference on Applications of Signal Processing to Audio and Acoustics (Mohonk, N.Y., Oct. 15-18, 1995).
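The frequency smearing described above is straightforward to reproduce with a discrete Fourier transform. The following minimal sketch compares a sinusoid whose frequency falls exactly on a frequency bin with one whose frequency falls halfway between two bins; in the latter case the energy spreads across neighboring bins rather than being concentrated in one.

```python
import numpy as np

N = 64
t = np.arange(N)

# A sinusoid whose frequency aligns exactly with bin 8: its energy is
# concentrated in that single bin.
aligned = np.abs(np.fft.rfft(np.sin(2 * np.pi * 8 * t / N)))
# A sinusoid falling halfway between bins 8 and 9: its energy "smears"
# across many neighboring bins.
smeared = np.abs(np.fft.rfft(np.sin(2 * np.pi * 8.5 * t / N)))

# Fraction of total energy captured by bin 8 in each case.
aligned_fraction = aligned[8] ** 2 / np.sum(aligned ** 2)
smeared_fraction = smeared[8] ** 2 / np.sum(smeared ** 2)
```

In the aligned case essentially all of the energy sits in bin 8, while in the misaligned case less than half does, with the remainder leaked into adjacent bins.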
Various methods have been proposed to determine the magnitude spectral envelope that is used when separating source and filter components of an input signal as described above. Tang suggests simple low-pass filtering. Robel suggests that it may be desirable to use alternative methods that give a more accurate representation of the spectral envelope. Robel et al., Efficient Spectral Envelope Estimation and Its Application to Pitch Shifting and Envelope Preservation, Proceedings of the Eighth International Conference on Digital Audio Effects (Madrid, Spain, Sep. 20-22, 2005). Robel specifically identifies a discrete cepstrum method and a true envelope method. According to Robel, the discrete cepstrum method may require extrinsic knowledge of the fundamental frequency. This may make utilizing the proposed method difficult for a system that is to be compatible with multiple users, since the fundamental frequency varies with the speaker's anatomy, and thus additional steps would have to be performed to determine the fundamental frequency before processing can be performed. The true envelope method does not require such knowledge but, as proposed, is an iterative algorithm that requires a Fourier analysis and a Fourier synthesis in each iteration.
Robel relies on a cepstrum, which is a Fourier transformation applied to the log of a spectrum; the independent variable of the cepstrum is commonly referred to as “quefrency”. By analyzing the cepstrum, it is possible to separate out the effects of the fundamental frequency and its harmonics generated by the larynx and the filtering from the vocal tract. Such separation may be explained by the harmonic peaks in the acoustic spectrum emitted by the larynx being closely spaced compared to the peaks of the vocal tract's transmission function. Accordingly, peaks in the low-quefrency range of the cepstrum can be attributed to filtering by the vocal tract, and peaks in the high-quefrency range can be attributed to the source signal from the larynx.
Robel specifically discusses applying various types of filtering to the cepstrum, and explains that if the cepstrum from the recorded signal is subjected to low-pass filtering, it will approximate the cepstrum of the spectral envelope, and thus the cepstrum of the transmission function of the vocal tract. However, Robel also identifies problems related to inaccuracies introduced when using the low-pass filtered cepstrum to determine the spectral envelope. Robel therefore proposes an algorithm that, by iteratively refining the low-pass filtered cepstrum, may provide a better representation of the spectral envelope. But as Robel acknowledges, the proposed method requires “rather extensive computation, particularly where the FFT size is large”. This may make the proposed method difficult to implement as a real-time system, particularly on hardware with modest computational resources.
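As an illustration of the non-iterative starting point of the methods Robel discusses, a spectral envelope can be estimated by low-pass filtering (“liftering”) the cepstrum of the log-magnitude spectrum. The sketch below is a simplified illustration, not Robel's true envelope algorithm; the cutoff of 20 cepstral coefficients is an arbitrary, illustrative choice, and the function name is likewise illustrative.

```python
import numpy as np

def cepstral_envelope(magnitude, cutoff=20):
    """Estimate a spectral envelope by low-pass liftering the cepstrum.

    The cepstrum of the log-magnitude spectrum is truncated to its
    low-quefrency coefficients, which retain the smooth vocal-tract
    contribution while discarding the closely spaced harmonic fine
    structure; exponentiating returns to the spectral domain.
    """
    log_mag = np.log(np.maximum(magnitude, 1e-12))
    cepstrum = np.fft.irfft(log_mag)
    # Low-pass "lifter": keep only the low-quefrency coefficients
    # (symmetrically, since the real cepstrum is symmetric).
    lifter = np.zeros_like(cepstrum)
    lifter[:cutoff] = 1.0
    lifter[-cutoff + 1:] = 1.0
    smooth_log_mag = np.fft.rfft(cepstrum * lifter).real
    return np.exp(smooth_log_mag)
```

The result is a smooth, strictly positive curve following the broad shape of the magnitude spectrum; the inaccuracies Robel identifies arise because this liftered estimate tends to track the mean of the spectrum rather than its harmonic peaks, which is what his iterative refinement addresses.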
The techniques described above provide useful tools for voice transformation, but they are subject to constraints that may limit their utility for certain potential use cases. For example, it may be useful to provide high-quality voice masking in real-time communications over the Internet or other computer networks for purposes such as reducing bias when recruiting employees, as noted earlier. However, the literature described above does not address the implementation challenges that interfere with such use cases. Thus, there is a need for improvements that overcome those challenges.