1. Field of the Invention
This invention relates to data compression schemes for digital speech processing systems. More particularly, it relates to the minimization of voice storage requirements for a voice messaging system by improving the efficiency of the speech compression.
2. Background of Related Art
Voice processing systems that record digitized voice messages generally require significant amounts of storage capacity. The amount of memory required for a given time unit of a voice message typically depends on the sampling rate. For instance, a sampling rate of 8,000 eight-bit samples per second yields 480,000 bytes of data for each minute of a voice message using linear, μ-law or A-law encoding or compression. Because of these large amounts of data, storage of linear, μ-law or A-law compressed speech samples is impractical in most instances. Accordingly, most digital voice messaging systems employ speech compression or speech coding techniques to reduce the storage requirements of voice messages.
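The storage figure above follows directly from the sampling parameters. As a minimal sketch using the example rate of 8,000 eight-bit samples per second (the function name is illustrative):

```python
# Storage required for uncompressed (linear, mu-law, or A-law) voice samples.
# Example values from the discussion above: 8,000 samples/s, 8 bits each.
def storage_bytes(sample_rate_hz: int, bits_per_sample: int, seconds: float) -> int:
    """Bytes needed to store `seconds` of audio at the given rate and depth."""
    return int(sample_rate_hz * (bits_per_sample // 8) * seconds)

one_minute = storage_bytes(8000, 8, 60)  # 480,000 bytes per minute of voice
```

At these rates, even a short mailbox of messages consumes tens of megabytes, which is why the compression techniques below are employed.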
A common speech encoding/compression algorithm used for speech storage is code excited linear predictive (CELP) based coding. CELP-based algorithms reconstruct speech signals based on a digital model of the human vocal tract. They produce frames of an encoded, compressed bit stream containing short-term spectral linear predictor coefficients, voicing information and gain information (frame- and subframe-based), from which the speech signal can be reconstructed using the vocal tract model. Whether speech compression can or should be employed often depends on the desired quality of the speech upon reproduction, the sampling rate of the real-time speech, and the available processing capacity to handle speech compression and other associated tasks on-the-fly before storage to voice message memory. CELP bit rates vary, e.g., 6.8 Kb/s or more.
One technique used to further increase the data compression of voice messages eliminates the encoding of portions of the real-time voice message corresponding to silence, pauses or mere background noise. In the past, compression of silence periods in stored speech has been attained by removing each frame of compressed speech determined on-the-fly to contain only silence, pauses or background noise. This analysis requires a significant portion of processing capability, and it must occur simultaneously with other processes such as the encoding of the voice message.
Unfortunately, removal of frames of silence on-the-fly may undesirably introduce clipping of initial or final portions of spoken words. Clipped speech is irretrievably lost because the on-the-fly decisions made by these conventional systems are irreversible. Also, the processor has only a finite look-ahead capacity relative to the incoming voice signal, e.g., a look-ahead of only the current CELP frame of approximately 20 to 25 milliseconds (mS). As a result, the quality of reproduced speech which was silence compressed on-the-fly may be undesirably decreased.
A digital signal processor (DSP) or other processor is conventionally used to compress a voice signal into compressed digital samples in real-time or near real-time to reduce the amount of storage required to store the voice message. In some conventional systems, the DSP also performs speech analysis to ascertain and suppress silence or pause periods in the speech signal before encoding and storage of the voice message. However, in prior art systems the speech analysis is performed in real-time along with the compression of the voice message, requiring a powerful processor to handle the tasks of both speech compression and speech analysis simultaneously.
FIG. 3 illustrates the clipping of a portion of a real-time speech signal in more detail. It shows a real-time speech signal 402 with respect to a threshold noise level 400 determined by a conventional, real-time, time domain-based speech analysis. The threshold noise level 400 represents the maximum level of background noise or other unwanted information in the speech signal 402, determined on a real-time basis from past speech only. Those portions of the speech signal 402 having levels above the threshold noise level 400 are encoded and stored. However, speech samples that would otherwise be generated during silence periods or pauses, where the real-time speech signal 402 lies below the threshold noise level 400, are discarded and replaced with a stored variable indicating the length and level of the silence period or pause.
Encoding and storage of compressed samples of the voice message resumes after it is determined that the silence period or pause has been interrupted by a signal above the threshold noise level 400. The threshold level 400 is adaptive to account for varying background noise levels. An analysis of the real-time speech signal 402 and determination of the exact point in time to resume encoding and storage of samples after a silence period or pause requires a certain amount of processing time. Because the look-ahead range is limited during real-time processing to avoid introducing excessive delays and buffering, the voice messaging system might not encode and store a portion of the analog real-time speech signal 402 between the points t₁ and t₂ immediately after the analog real-time speech signal 402 exceeds the threshold noise level 400. Thus, a portion of the analog real-time speech signal 402 may be undesirably clipped from the stored voice message and replaced with silence.
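The clipping mechanism described for FIG. 3 can be sketched as a simple frame-level energy gate with no look-ahead; the frame representation, energy measure, threshold value and function names below are illustrative assumptions, not the actual system's implementation:

```python
# Illustrative time-domain silence gate with no look-ahead: each frame is
# kept or discarded based only on its own energy versus a fixed threshold.
# Frame contents and the threshold value are hypothetical example data.
def gate_frames(frames, threshold):
    """Return (kept_frames, silence_run_lengths)."""
    kept, runs, run = [], [], 0
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)
        if energy > threshold:
            if run:
                runs.append(run)  # a pause is stored only as its length
                run = 0
            kept.append(frame)    # above threshold: encode and store
        else:
            run += 1              # below threshold: discard, count as silence
    if run:
        runs.append(run)
    return kept, runs

# A soft word onset immediately after a pause can fall below the threshold
# and be discarded, producing the clipping between t1 and t2 in FIG. 3.
kept, runs = gate_frames([[0, 1], [0, 1], [2, 3], [9, 9], [9, 9]], threshold=5.0)
```

Because the decision for each frame is made before any later frames are seen, a discarded onset cannot be recovered once louder speech confirms that a word has begun.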
Because the extent of processor loading to perform encoding or compression varies according to the nature of the voice signal and other factors, it is possible that at times the performance of both the compression and speech analysis processes may exceed processor capacity. When this happens, the system may forego speech analysis functions such as silence compression entirely, resulting in a lessened efficiency of the compression routines and an increased storage requirement for the compressed voice message.
FIG. 4 shows a conventional silence compression technique wherein real-time speech is analyzed and compressed on-the-fly based on the time-based detection of periods of silence.
In FIG. 4, real-time analog speech is analyzed in the time domain in a time domain analysis module 320, then presented to a speech/silence decision module 300. Speech/silence decision module 300 determines if the current real-time speech is above or below a particular noise threshold, which is determined by conventional on-the-fly time-domain techniques. If the current real-time speech is above the noise threshold, it is presumed that the speech is non-silence, and if it is below the noise threshold, it is presumed that the current speech signal is related to a period of silence. However, the on-the-fly time domain analysis of speech to determine periods of silence, background noise or pauses in speech performed in conventional systems suffers from poor performance under poor signal-to-noise (S/N) ratio conditions.
In particular, the real-time speech is input to speech encoder 302 for compression into CELP frames, which are stored in memory 304 of the voice messaging system. When the real-time speech signal contains voice or other audible sounds above the noise threshold level, the voice is compressed into frames of CELP encoded data by speech encoder 302, which are then stored in memory 304. However, when the speech/silence decision module 300 determines that the real-time speech contains only a pause or is otherwise below the currently determined noise threshold level, encoding by speech encoder 302 is paused and a counter is started which represents the number of CELP frames containing only silence. Once voice or other audible sounds above the threshold level appear in the real-time speech signal, the last value of the silence frame counter and level is stored in memory 304, speech encoder 302 is re-activated, and the storage of CELP encoded data frames in memory 304 resumes. The threshold of the background noise is updated in the update background noise level module 306. The speech/silence decision module 300, the speech encoder 302, and the update background noise level module 306 are all included within a DSP.
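The frame-counter bookkeeping of FIG. 4 can be sketched as follows; the stored record format (CELP frames interleaved with (count, level) silence markers) and all names are assumptions for illustration only:

```python
# Sketch of the FIG. 4 storage loop: frames judged to contain speech are
# CELP-encoded and stored, while a run of silent frames is replaced by a
# single ("silence", count, level) marker. is_silent() stands in for the
# speech/silence decision module 300 and encode() for speech encoder 302;
# both are hypothetical callables supplied by the caller.
def store_stream(frames, is_silent, encode):
    stored, count, level = [], 0, 0
    for frame in frames:
        if is_silent(frame):
            count += 1               # encoder paused; just count frames
            level = max(frame)       # crude level estimate for playback
        else:
            if count:
                stored.append(("silence", count, level))
                count = 0            # resume encoding after the pause
            stored.append(("celp", encode(frame)))
    if count:
        stored.append(("silence", count, level))
    return stored

records = store_stream([[0], [0], [7], [8]], lambda f: max(f) < 5, tuple)
```

On playback, each ("silence", count, level) record would be expanded back into the corresponding number of frames of comfort noise at the recorded level.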
It is important to note that in conventional techniques, the noise threshold is determined based on current and past conditions, usually in the time domain, of the real-time analog speech signal, and can only affect future (not past) encoding of the real-time speech. Although spectral analysis methods are known, they require a significant amount of processing power and typically are not practical to implement in real-time, on-the-fly applications. Thus, if the noise floor suddenly drops, the speech/silence decision module 300 may not respond immediately and portions of non-silence real-time speech may be clipped. Similarly, if the noise floor suddenly rises, the determination of silence periods in the real-time speech may not be fully optimized.
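The lag described above can be illustrated with a simple exponential average of past frame levels; the smoothing factor, margin and function name are hypothetical values for illustration, not the actual adaptation rule of module 306:

```python
# Illustrative adaptive noise threshold computed only from past frames.
# Because the estimate trails the signal, it responds slowly when the
# noise floor changes suddenly; alpha and margin are hypothetical values.
def track_threshold(levels, alpha=0.9, margin=2.0):
    """Yield the threshold in effect when each frame level arrives."""
    noise = levels[0]
    for level in levels:
        yield noise * margin                         # applied to this frame
        noise = alpha * noise + (1 - alpha) * level  # updated from the past

# The noise floor drops from 10 to 1 at frame 5; the threshold remains far
# above the new floor for several frames, so soft speech there is clipped.
thresholds = list(track_threshold([10] * 5 + [1] * 5))
```

The converse failure also follows from this sketch: after a sudden rise in the noise floor, the trailing estimate stays too low and background noise is needlessly encoded as speech.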
There is a need for an efficient silence compression technique which properly and accurately discriminates speech from silence, particularly when the noise floor suddenly changes, and which does not overburden the processing ability of the voice messaging system.