Voice processing, storage, and transmission often require identification of periods of silence. In a telephone answering system, for example, it may be necessary to determine when a caller stops talking in order to offer the caller additional options, to hang up on the caller, or to delimit a segment of the caller's speech before sending the speech segment to a voice (speech) recognition processor. As another example, consider the use of a speakerphone or similar multi-party conferencing equipment. Silence has to be detected so that the speakerphone can switch from a mode in which it receives audio signals from a remote caller and reproduces them to the local caller, to a mode in which the speakerphone receives sounds from the local caller and sends the sounds to the remote caller, and vice versa. Silence detection is also useful when compressing speech before storing it, or before transmitting the speech to a remote location. Because silence generally carries no useful information, a predetermined symbol or token can be substituted for each silence period. Such substitution saves storage space and transmission bandwidth. When lengths of the silent periods need to be preserved during reproduction—as may be the case when it is desirable to reproduce the speech authentically, including meaningful pauses—each token can include an indication of duration of the corresponding silent period. Generally, the savings in storage space or transmission bandwidth are little affected by accompanying silence tokens with indications of duration of the periods of silence.
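The token substitution described above can be sketched as a run-length scheme over audio frames. This is a minimal illustration only, not a definitive implementation: the token names `"SILENCE"` and `"AUDIO"`, the functions `compress_silence` and `expand_silence`, and the caller-supplied `is_silent` predicate are all hypothetical and do not come from the text.

```python
def compress_silence(frames, is_silent):
    """Replace each run of silent frames with one ("SILENCE", run_length) token.

    `frames` is a list of sample lists; `is_silent` is any caller-supplied
    predicate that decides whether a single frame counts as silence.
    """
    tokens = []
    run = 0  # number of consecutive silent frames seen so far
    for frame in frames:
        if is_silent(frame):
            run += 1
            continue
        if run:
            tokens.append(("SILENCE", run))  # duration kept as a frame count
            run = 0
        tokens.append(("AUDIO", frame))
    if run:
        tokens.append(("SILENCE", run))
    return tokens


def expand_silence(tokens, frame_len):
    """Rebuild the frame sequence, restoring each pause at its original length."""
    frames = []
    for kind, payload in tokens:
        if kind == "SILENCE":
            frames.extend([0] * frame_len for _ in range(payload))
        else:
            frames.append(payload)
    return frames
```

Because each silence token carries only one extra integer, storing the duration adds little to the compressed size, consistent with the observation above.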
In an ideal environment, a silence detector can simply look at the energy content or amplitude of the audio signal. Indeed, many silence detection methods rely on comparing the energy or amplitude of the signal to one or more thresholds. The comparison can be performed on either a broadband or a band-limited signal. Ideal environments, however, are hard to come by: noise is practically omnipresent. Noise makes simple energy detection methods less reliable because it becomes difficult to distinguish between low-level speech and noise, particularly loud noise. Proliferation of mobile communication equipment, such as cellular telephones, has aggravated this problem, because telephone calls originating from cellular telephones tend to be made from noisy environments, such as automobiles, streets, and shopping malls. Engineers have therefore looked at other sound characteristics to distinguish between “noisy” silence and speech.
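A simple energy comparison of the kind described above might look like the following sketch. The function names, the use of mean squared amplitude as the energy measure, and the default threshold of `1e-4` (for samples scaled to the range -1.0 to 1.0) are assumptions made for illustration and are not values given in the text.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)


def is_silent_by_energy(frame, threshold=1e-4):
    """Treat the frame as silence when its energy falls below the threshold.

    The threshold is an illustrative assumption; in practice it would have
    to be tuned to the signal scale and the expected noise floor.
    """
    return frame_energy(frame) < threshold
```

The weakness noted above is visible here: background noise with energy above the threshold is classified as speech, however the threshold is chosen.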
One characteristic helpful in identifying periods of silence is the average number of signal zero crossings in a given time period, also known as the zero-crossing rate. A zero crossing takes place when the signal's waveform crosses the time axis. The zero-crossing rate is a relatively good spectral measure for narrowband signals: a sinusoid of frequency f, for instance, crosses zero 2f times per second, so the rate tracks the dominant frequency. While speech energy is concentrated at low frequencies, e.g., below about 2.5 kHz, noise energy resides predominantly at higher frequencies. Although speech cannot be strictly characterized as a narrowband signal, a low zero-crossing rate has been observed to correlate well with voiced speech, and a high zero-crossing rate has been observed to correlate well with noise. Consequently, some systems rely on zero-crossing rate algorithms to detect silence. For a fuller description of the use of zero-crossing algorithms in silence detection, see LAWRENCE R. RABINER & RONALD W. SCHAFER, DIGITAL PROCESSING OF SPEECH SIGNALS 130-35 (1978).
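The zero-crossing rate of a sampled frame can be computed by counting sign changes between adjacent samples, as in the sketch below. The function name and the convention of treating a sample equal to zero as non-negative are assumptions of this illustration, and the result is normalized per sample pair rather than per unit of time.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    Samples equal to zero are treated as non-negative (an assumption of this
    sketch). Frames with fewer than two samples have no pairs, so return 0.0.
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:])
        if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)
```

For example, `zero_crossing_rate([1, -1, 1, -1])` returns 1.0, the maximum possible rate, while a frame of constant sign returns 0.0. Multiplying the result by the sampling rate gives an approximate number of crossings per second.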
Other systems combine energy detection with a zero-crossing algorithm. Still other systems use different spectral measures, either alone or in combination with monitoring signal energy and amplitude characteristics. But whatever the nature of the specific silence detector implementation, it generally reflects some compromise, minimizing either the probability of non-detection of silence, or the probability of false detection of silence. None appears to be a perfect replacement for the human ear and human judgment.
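One way such a combination could work is sketched below. The text does not specify how the two measures are combined, so the decision rule, the function names, and both thresholds are assumptions chosen purely for illustration: a frame is treated as silence if its energy is low, or if its energy is not low but its zero-crossing rate is high enough to suggest noise rather than voiced speech.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:])
        if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def is_silence(frame, energy_threshold=1e-4, zcr_threshold=0.25):
    """Hypothetical combined rule; both thresholds are illustrative assumptions."""
    if frame_energy(frame) < energy_threshold:
        return True  # quiet frame: silence regardless of spectral content
    # Energetic frame: a high crossing rate suggests noise ("noisy" silence).
    return zero_crossing_rate(frame) > zcr_threshold
```

The compromise described above shows up in the thresholds: raising `zcr_threshold` reduces false detection of silence at the cost of missing more noisy pauses, and lowering it does the opposite. A rule this simple would also misclassify unvoiced speech sounds, which tend to have a high zero-crossing rate.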
In many applications, reliable and robust detection of silence is an important performance parameter. In a telephone answering system, for example, it is important not to cut off a caller prematurely, but to allow the caller to leave a complete message and exercise other options made available by the answering system. False silence detection can lead to prematurely dropped telephone calls, resulting in loss of sales, loss of goodwill, missed appointments, embarrassment, and other undesirable consequences.
A need thus exists for reliable and robust silence detection methods and silence detectors. Another need exists for telephone answering systems with reliable and robust silence detectors. A further need exists for voice recognition and other voice processing systems with improved silence detectors.