The present invention relates to audio processing in general and more particularly to a method and apparatus for modifying the sound of a human voice.
There are several methods of modifying the perception of the human voice. One of the most common is performed in television and radio programs where an interviewee's voice is disguised so as to conceal the identity of the interviewee. Such voice modification is typically done with a static filter that acts upon the analog voice signal that is input to a microphone or similar input device. The filter modifies the voice by adding noise, increasing pitch, etc. Another method of modifying one's voice (specifically over a telephone) is to use a filter similar to the one described above or, more primitively, to cover the mouthpiece of the phone with a handkerchief or plastic wrap.
Applications, such as those on the Internet, are increasingly using voice for communication (separate from or in addition to text and other media). Normally this is done by digitizing the signal generated by the originator speaking into a microphone and then formatting that digitized signal for transmission over the Internet. At the receiving end, the digital signal is converted back to an analog signal and played through a speaker. Within limits, the voice played at the receiving end sounds like the voice of the speaker. However, in many instances there is a desire that the speaker's voice be disguised. On the other hand, the listener, even if not hearing the speaker's natural voice, wants to know the general characteristics of the person to whom he is talking. To disguise one's voice in an Internet application or the like, a static filter such as the one described above can be used. However, such modification usually results in a voice that sounds unhuman. Furthermore, it gives the listener no information concerning the person to whom he is listening.
Various systems for analyzing and generating speech have been developed. In terms of speech analysis, automatic speech recognition systems are known. These can include an analog-to-digital (A/D) converter for digitizing the analog speech signal, a speech analyzer and a language analyzer. Initially, the system stores a dictionary including a pattern (i.e., digitized waveform) and textual representation for each of a plurality of speech segments (i.e., vocabulary). These speech segments may include words, syllables, diphones, etc. The speech analyzer divides the speech into a plurality of segments, and compares the pattern of each input segment to the segment patterns in the known vocabulary using pattern recognition or pattern matching in an attempt to identify each segment.
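The dictionary lookup described above can be illustrated with a minimal sketch. It assumes fixed-length segments and a Euclidean distance measure, neither of which is specified in the text; the names `DICTIONARY`, `split_segments`, `distance` and `recognize`, and the sample values, are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical dictionary: each vocabulary entry pairs a textual
# representation with a stored pattern (a short list of samples standing
# in for a digitized waveform).
DICTIONARY = {
    "yes": [0.0, 0.8, 0.9, 0.1],
    "no": [0.7, 0.2, -0.5, -0.6],
}


def split_segments(signal, length):
    """Divide the digitized signal into consecutive fixed-length segments."""
    return [signal[i:i + length]
            for i in range(0, len(signal) - length + 1, length)]


def distance(a, b):
    """Euclidean distance between two equal-length patterns (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(signal, dictionary=DICTIONARY, length=4):
    """Label each input segment with the closest stored pattern's text."""
    return [
        min(dictionary, key=lambda label: distance(segment, dictionary[label]))
        for segment in split_segments(signal, length)
    ]
```

A practical recognizer would segment on acoustic boundaries and compare extracted features rather than raw samples; the sketch only shows the compare-against-vocabulary step.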
The language analyzer uses a language model, which is a set of principles describing language use, to construct a textual representation of the analog speech signal. In other words, the speech recognition system uses a combination of pattern recognition and sophisticated guessing based on some linguistic and contextual knowledge. For example, certain word sequences are much more likely to occur than others. The language analyzer may work with the speech analyzer to identify words or resolve ambiguities between different words or word spellings. However, due to a limited vocabulary and other system limitations, a speech recognition system can guess incorrectly. For example, a speech recognition system receiving a speech signal having an unfamiliar accent or unfamiliar words may incorrectly guess several words, resulting in a textual output which can be unintelligible.
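As an illustration of how a language model may resolve ambiguities between similar-sounding words, the following sketch scores candidate word sequences with word-pair counts. The bigram counts, the add-one scoring and the names `BIGRAM_COUNTS`, `sequence_score` and `best_sequence` are assumptions made for this example, not details taken from any particular system.

```python
from itertools import product

# Hypothetical word-pair counts standing in for a language model: higher
# counts mean the sequence is more likely to occur.
BIGRAM_COUNTS = {
    ("there", "is"): 40,
    ("their", "is"): 1,
    ("is", "a"): 60,
    ("a", "problem"): 15,
}


def sequence_score(words, counts=BIGRAM_COUNTS):
    """Multiply add-one smoothed counts over each adjacent word pair."""
    score = 1
    for pair in zip(words, words[1:]):
        score *= counts.get(pair, 0) + 1
    return score


def best_sequence(candidates):
    """Pick the most likely sequence, given candidate words per position."""
    return list(max(product(*candidates), key=sequence_score))
```

Given the candidates `[["their", "there"], ["is"], ["a"], ["problem"]]`, the sketch prefers "there" because the pair ("there", "is") is more frequent in the assumed counts. A word pair absent from the counts scores poorly, which is one way such a system can guess incorrectly on unfamiliar input.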
One proposed speech recognition system is disclosed in Alex Waibel, "Prosody and Speech Recognition," Research Notes in Artificial Intelligence, Morgan Kaufmann Publishers, 1988 (ISBN 0-934613-70-2). Waibel discloses a speech-to-text system (such as an automatic dictation machine) that extracts prosodic information or parameters from the speech signal to improve the accuracy of text generation. Prosodic parameters associated with each speech segment may include, for example, the pitch (fundamental frequency F0) of the segment, duration of the segment, and amplitude (or stress or volume) of the segment. Waibel's speech recognition system is limited to the generation of an accurate textual representation of the speech signal. After generating the textual representation of the speech signal, any prosodic information that was extracted from the speech signal is discarded. Therefore, a person or system receiving the textual representation output by a speech-to-text system will know what was said, but will not know how it was said (i.e., pitch, duration, rhythm, intonation, stress).
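The three prosodic parameters named above can be estimated from a digitized segment in many ways. The sketch below is not Waibel's method; it assumes a simple autocorrelation pitch estimate and a root-mean-square amplitude measure, and the function name `prosodic_parameters` and its default search range of 50 to 400 Hz are hypothetical.

```python
import math


def prosodic_parameters(samples, sample_rate, min_hz=50.0, max_hz=400.0):
    """Estimate pitch, duration and amplitude of one speech segment.

    Assumes the segment is longer than one period of the lowest pitch
    searched (sample_rate / min_hz samples).
    """
    duration = len(samples) / sample_rate
    # Root-mean-square level as a stand-in for amplitude (stress or volume).
    amplitude = math.sqrt(sum(s * s for s in samples) / len(samples))

    def autocorrelation(lag):
        # Similarity of the segment to a copy of itself shifted by `lag`.
        return sum(samples[i] * samples[i + lag]
                   for i in range(len(samples) - lag))

    # The lag with the strongest self-similarity approximates one pitch period.
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    best_lag = max(range(min_lag, max_lag + 1), key=autocorrelation)

    return {
        "pitch_hz": sample_rate / best_lag,
        "duration_s": duration,
        "amplitude_rms": amplitude,
    }
```

For a 50 ms, 200 Hz sine tone sampled at 8000 Hz, the estimate recovers a pitch close to 200 Hz and a duration of 0.05 s. A speech-to-text system that outputs only text would compute values like these and then discard them.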
Speech synthesis systems also exist for converting text to synthesized speech, and can include, for example, a language synthesizer, a speech synthesizer and a digital-to-analog (D/A) converter. Speech synthesizers use a plurality of stored speech segments and their associated representation (i.e., vocabulary) to generate speech by, for example, concatenating the stored speech segments. However, because no information is provided with the text as to how the speech should be generated (i.e., pitch, duration, rhythm, intonation, stress), the result is typically unnatural or robotic-sounding speech. As a result, automatic speech recognition (speech-to-text) systems and speech synthesis (text-to-speech) systems may not be effectively used for the encoding, storing and transmission of natural-sounding speech signals. Moreover, the areas of speech recognition and speech synthesis are separate disciplines. Speech recognition systems and speech synthesis systems are not typically used together to provide for a complete system that includes both encoding an analog signal into a digital representation and then decoding the digital representation to reconstruct the speech signal. Rather, speech recognition systems and speech synthesis systems are employed independently of one another, and therefore, do not typically share the same vocabulary and language model.
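Synthesis by concatenating stored speech segments can be sketched as follows. The stored vocabulary `STORED_SEGMENTS`, its sample values and the function `synthesize` are hypothetical and serve only to show why text alone yields the same output regardless of how the words were originally spoken.

```python
# Hypothetical stored vocabulary: each text unit maps to one stored speech
# segment (a short list of samples standing in for a recorded waveform).
STORED_SEGMENTS = {
    "hel": [0.1, 0.3, 0.2],
    "lo": [0.4, 0.0, -0.2],
}


def synthesize(units, segments=STORED_SEGMENTS):
    """Generate speech samples by concatenating the stored segments."""
    output = []
    for unit in units:
        # The text carries no pitch, duration or stress information, so each
        # unit is always rendered with the same stored samples.
        output.extend(segments[unit])
    return output
```

Because the same text units always produce identical samples, the sketch has no means of varying pitch, duration or stress, which corresponds to the unnatural quality noted above.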
Accordingly, there is a need for a method and apparatus that allows for the modification of voice that results in a natural-sounding output that conceals the identity of the person speaking. There is also a need for a method and apparatus that allows for detection of user-specific and non-user-specific qualities of the person speaking.