This invention relates to identifying a presence of a voice in audio signals, for example, in a telephone network.
An audio signal can be any electronic transmission that conveys audio information. In a telephone network, audio signals include tones (for example, dual tone multifrequency (DTMF) tones, dial tones, or busy signals), noise, silence, or speech signals. Voice detection differentiates a speech signal from tones, noise, or silence.
One use for voice detection is in automated calling systems used for telemarketing. In the past, for example, a company trying to sell goods or services typically used several different telemarketing operators. Each operator would call a number and wait for an answer before taking further action such as speaking to the person on the line or hanging up and calling another prospective buyer. In recent years, however, telemarketing has become more efficient because telemarketers now use automatic calling machines that can call many numbers at a time and notify the telemarketer when someone has picked up the receiver and answered the call. To perform this function, the automatic calling machines must detect a presence of human speech on the receiver amid other audio signals before notifying the telemarketer. The detection of human speech in audio signals can be achieved using digital signal processing techniques.
FIG. 1 is a block diagram of a voice detector 10 that detects a presence of a voice in an audio signal. A time-varying input signal 12 is received, and a coder/decoder (CODEC) 14 may be used for analog-to-digital (A/D) conversion if the input signal is an analog signal; that is, a signal continuous in time. During A/D conversion, the CODEC 14 samples the analog signal at regular intervals in time and outputs a digital signal 16 that includes a sequence of the discrete samples. The CODEC 14 optionally may perform other coding/decoding functions (for example, compression/decompression). If, however, the input signal 12 is digital, then no A/D conversion is needed and the CODEC 14 may be bypassed.
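By way of illustration only, the periodic sampling performed during A/D conversion may be sketched as follows. The function names, the 440 Hz test tone, and the 8 kHz sampling rate are assumptions chosen for the example and are not taken from the description; an actual CODEC performs this step in hardware and also quantizes each sample.

```python
import math

def sample_signal(analog, period, n_samples):
    """Evaluate a continuous-time signal at t = 0, period, 2*period, ...

    `analog` is any function of time in seconds (hypothetical stand-in for
    the analog input signal); the return value is the sequence of discrete
    samples that makes up the digital signal.
    """
    return [analog(k * period) for k in range(n_samples)]

def tone(t):
    """Assumed test input: a 440 Hz sine tone of unit amplitude."""
    return math.sin(2.0 * math.pi * 440.0 * t)

# Assumed sampling period of 1/8000 s (8 kHz), a common telephony rate.
PERIOD = 1.0 / 8000.0
samples = sample_signal(tone, PERIOD, 8)
```

Each entry of `samples` is the value of the analog signal at one sampling instant, so a shorter period yields a finer discrete representation of the same signal.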
In either case, the digital signal 16 is provided to a digital signal processor (DSP) 18 which extracts information from the signal using frequency domain techniques such as Fourier analysis. Such frequency-domain representation of audio signals greatly facilitates analysis of the signal. A memory section 20 coupled to the DSP 18 is used by the DSP for storing and retrieving data and instructions while analyzing the digital audio signal 16.
FIG. 2A shows an example of a human speech audio signal 22 represented as an analog signal that may be input into the voice detector 10 of FIG. 1. Furthermore, FIG. 2B shows a digital signal 24 that corresponds to the input analog signal after it has been processed by the CODEC 14. In FIG. 2B, the analog signal of FIG. 2A has been sampled at a period Γ 26. Voiced sounds, such as those illustrated in region 28 of FIGS. 2A and 2B, generally result in a vibration of the human vocal tract and cause an oscillation in the audio signal. In contrast, unvoiced speech sounds, such as those illustrated in region 30 of FIGS. 2A and 2B, generally result in a broad, turbulent (that is, non-oscillatory), and low amplitude signal. The frequency domain representation of the human speech signal of FIG. 2B, for example, displays both voiced and unvoiced characteristics of human speech that may be used in the voice detector 10 to distinguish the speech signal from other audio signals such as tones, noise, or silence.
FIG. 3 is a flow chart of operation of the voice detector of FIG. 1. The voice detector 10 initially determines if the incoming audio signal 12 is digital in format (step 32). If the audio signal is digital, the voice detector 10 performs a discrete Fourier transform (DFT) analysis on the digitized signal (step 36). If, however, the audio signal is not digital, then the CODEC 14 samples the audio signal at a specified period to obtain a digital representation 16 of the audio signal (step 34). Then the voice detector 10 performs a DFT at step 36.
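For illustration, the DFT of step 36 may be sketched directly from its definition, X[m] = sum over k of x[k]·exp(-2πi·m·k/N). This is a minimal example under stated assumptions: the function name `dft` and the 16-sample test frame are hypothetical, and a practical DSP would typically use a fast Fourier transform (FFT), which computes the same result more efficiently.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) discrete Fourier transform of a sequence of samples."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]

# Assumed test frame: a cosine completing exactly 2 cycles in 16 samples.
N = 16
frame = [math.cos(2.0 * math.pi * 2 * k / N) for k in range(N)]
spectrum = dft(frame)
magnitudes = [abs(value) for value in spectrum]
```

For this test frame the magnitude spectrum is concentrated in bin 2 and its mirror image, bin 14, which illustrates how a periodic component of the audio signal appears as a maximum in the frequency domain.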
Parameters, such as frequency-domain maxima, are extracted from the signal (step 38) and are compared to predetermined thresholds (step 40). If the parameters exceed the thresholds, the voice detector 10 determines that the audio signal corresponds to a human voice, in which case the voice detector 10 reports the presence of the voice in the audio signal (step 42).
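Steps 38 through 42 may be sketched, by way of example only, as a search for local maxima in a magnitude spectrum followed by a threshold comparison. The function names, the threshold value, and the rule requiring at least three maxima above the threshold are assumptions made for this illustration; the description does not specify the thresholds or how the individual comparisons are combined.

```python
def spectral_peaks(magnitudes):
    """Return (bin, magnitude) for every strict local maximum (step 38)."""
    return [
        (i, magnitudes[i])
        for i in range(1, len(magnitudes) - 1)
        if magnitudes[i] > magnitudes[i - 1] and magnitudes[i] > magnitudes[i + 1]
    ]

def voice_present(magnitudes, threshold, min_peaks=3):
    """Compare the extracted maxima to a threshold (step 40).

    Hypothetical decision rule: report a voice (step 42) when at least
    `min_peaks` maxima exceed `threshold`.
    """
    strong = [peak for peak in spectral_peaks(magnitudes) if peak[1] > threshold]
    return len(strong) >= min_peaks

# Assumed magnitude spectrum with maxima at bins 1, 3, 5, and 7.
example_magnitudes = [0.0, 1.0, 0.0, 5.0, 0.0, 6.0, 0.0, 7.0, 0.0]
detected = voice_present(example_magnitudes, threshold=2.0)
rejected = voice_present(example_magnitudes, threshold=6.5)
```

With the lower threshold, three maxima qualify and `detected` is true; with the higher threshold only one maximum qualifies and `rejected` is false.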
In step 38, the parameters extracted from the audio signal, such as the frequency-domain maxima, may, for example, correspond to formant frequencies in speech signals. Formants are natural frequencies or resonances of the human vocal tract that occur because of the tubular shape of the tract. There are three main resonances (formants) of significance in human speech, the locations of which are identified by the voice detector 10 and used in the voice detection analysis. Other parameters may be extracted and used by the voice detector 10.
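As a simplified illustration of locating the three main resonances, the strongest frequency-domain maxima may be converted from DFT bin indices to frequencies in hertz using f = bin × (sampling rate) / N. The function names, the 8 kHz sampling rate, and the example spectrum below are assumptions; in addition, selecting the three largest raw spectral maxima is only a rough stand-in for formant estimation, which in practice is often performed on a smoothed spectral envelope.

```python
def bin_to_hz(bin_index, sample_rate, n_points):
    """Convert a DFT bin index to its center frequency in hertz."""
    return bin_index * sample_rate / n_points

def formant_candidates(magnitudes, sample_rate, n_points, count=3):
    """Return the frequencies of the `count` strongest local maxima, ascending.

    `magnitudes` holds the magnitude spectrum for bins 0 .. len-1 of an
    `n_points`-point DFT (hypothetical input layout).
    """
    peaks = [
        (magnitudes[i], i)
        for i in range(1, len(magnitudes) - 1)
        if magnitudes[i] > magnitudes[i - 1] and magnitudes[i] > magnitudes[i + 1]
    ]
    strongest = sorted(peaks, reverse=True)[:count]
    return sorted(bin_to_hz(i, sample_rate, n_points) for _, i in strongest)

# Assumed example: bins 0..8 of a 16-point DFT at 8 kHz (500 Hz per bin).
example_spectrum = [0.0, 4.0, 1.0, 9.0, 2.0, 7.0, 1.0, 3.0, 0.0]
candidates = formant_candidates(example_spectrum, sample_rate=8000, n_points=16)
```

In this example the maxima at bins 1, 3, and 5 are the three strongest, so `candidates` holds their frequencies of 500 Hz, 1500 Hz, and 2500 Hz, while the weaker maximum at bin 7 is discarded.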
Voice detection analysis is complicated by the fact that formant frequencies are sometimes difficult to identify for low-level voiced sounds. Moreover, defining the formants for unvoiced regions (for example, region 30 in FIGS. 2A and 2B) is impossible.