1. Technical Field
This invention relates to the field of speech systems and more particularly to a method and apparatus for dynamically adjusting audio input gain according to conditions sensed in an audio input signal to a speech system.
2. Description of the Related Art
Speech systems are systems which can receive an analog audio input signal representative of speech and subsequently digitize and process the audio input signal into a digitized speech signal. Speech signals, unlike general audio signals, contain both speech data and silence data. That is, in any given sample of audio data representative of speech, a portion of the signal actually represents speech while other portions of the signal represent background noise and silence. Hence, in performing digital processing on an audio signal, a speech system must be able to differentiate between speech data and background and silence data. Accordingly, speech systems can be sensitive to the quality of an audio input signal in performing this necessary differentiation.
The effect of audio input signal quality can be particularly apparent in a handheld, portable speech system. Specifically, users of portable speech systems often provide speech input to the speech system in varying environmental conditions. For example, a user of a portable speech system can dictate speech in a car, in an office, at home in front of the television, in a restaurant, or even outside. Consequently, many environmental factors can affect the quality of speech input. When in a car, interior cabin noise can be included in the speech signal. When in an office, a ringing telephone can be included in the speech signal. When outside, the honking of a passing car can be included in the speech signal. As a result, the portion of a speech input which is to be interpreted as speech data can vary depending on what is to be interpreted as background "silence": car honking, television programming, telephone ringing, interior cabin noise, or true silence.
The problem of speech signal quality in identifying speech data in a speech system can be compounded by the process of speech recognition. Speech recognition is the process of converting an acoustic signal, captured by a transducer, for instance a microphone or a telephone, to a set of words. The recognized words can be the final results, as for applications such as command and control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding. Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal.
First, the acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear. These phonetic variables are exemplified by the acoustic differences of the phoneme /t/ in two, true, and butter in American English. Second, differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variables. Third, acoustic variables can result from changes in the environment as well as in the position and characteristics of the transducer. Finally, speaker variables can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality.
The speech recognition accuracy of a speech-to-text conversion system depends directly upon the quality of an audio input signal containing the speech data to be converted to text. Specifically, it is desirable for the amplitude of an audio input signal to fall within an optimal range. While the specific limits of the desired range can vary from speech recognition engine to speech recognition engine, all speech recognition engines can experience imperfect speech recognition performance when the amplitude of an audio input signal falls outside of an acceptable range.
Specifically, an audio input signal having an amplitude falling below an extremely low level (an insufficient signal) can cause the degradation of speech recognition performance of a speech recognition engine. Correspondingly, an audio input signal having an amplitude exceeding an extremely high level can result in a saturated signal, a clipping condition, as well as signal distortion. An insufficient or excessive audio signal can arise in response to a variety of conditions. For example, when providing speech input to a speech system, the speaker can move either the speaker's head with respect to the microphone or the microphone with respect to the speaker's head. Also, the speaker inadvertently can change the volume of the speaker's voice or the input volume controlled by the audio circuitry used to receive the speech input audio signal.
When configuring a speech system, speech systems typically measure the characteristics of an audio input signal for a particular speaker using a particular microphone. Using these measured characteristics, the speech system can set system parameters to optimize the amplification and conditioning of the audio signal. Thus, in the case where different speakers provide audio input to the same speech system at different times, the speech system parameters can prove inadequate to accommodate the subsequent speaker for which the parameters had not been optimized. Likewise, in the case where different microphones are used at different times to provide speech audio input to the same speech system, the speech system parameters can prove inadequate to accommodate the second microphone for which the parameters had not been optimized. As a result, in either case, an insufficient or excessive audio signal condition can arise.
Present speech systems have yet to adequately address the problem of varying amplitudes of speech audio input signals. Specifically, what is needed is a method for monitoring the amplitude of a speech audio input signal during a speech session and adjusting the amplitude of the speech audio input signal accordingly. Hence, there exists a present need for dynamically adjusting audio input gain in a speech system.
A method for adjusting audio input signal gain in a speech system can include seven steps. First, an upper and a lower threshold can be predetermined in which the upper and lower threshold define an optimal range of audio data signal amplitude measurements. Second, a frame of unpredicted digital audio data samples can be received. In particular, the unpredicted digital audio data samples can be acquired by audio circuitry in a computer system. Significantly, the digital audio data samples received are not pre-scripted and are unknown to the computer system at the time of reception with regard to speech content.
Each sample can indicate an amplitude measurement of the audio data signal at a particular point in time. As such, third, a maximum signal amplitude can be calculated for a configurable measurement percentile of the unpredicted digital audio data samples. A measurement percentile is a selected percentage of samples in the digital audio data upon which computations are to be performed. For example, the calculation of the maximum signal amplitude for the ninety-eighth (98th) measurement percentile means the maximum signal amplitude for the first ninety-eight (98) percent of all samples in the frame.
Subsequent to the calculation of the maximum signal amplitude for the configured measurement percentile, fourth, the audio input signal gain can be incrementally adjusted downward if the maximum signal amplitude exceeds the upper threshold. Conversely, fifth, the audio input signal gain can be incrementally adjusted upward if the maximum signal amplitude falls below the lower threshold. Sixth, additional frames of unpredicted digital audio data samples can be received. Finally, seventh, each of the third through the sixth steps can be repeated with the received additional frames until the calculated maximum signal amplitude falls within the optimal range of audio signal amplitude.
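By way of non-limiting illustration, the third through fifth steps described above can be sketched as follows. The function names, the fixed step size, and the treatment of gain as a plain number are hypothetical and do not appear in the description. The sketch also assumes that a measurement percentile covers the lowest-amplitude fraction of the samples in a non-empty frame, which is one possible reading of the text.

```python
def percentile_max(frame, percentile=98):
    """Return the largest absolute amplitude within the measurement percentile.

    Assumption: the percentile covers the lowest-amplitude fraction of the
    frame, so the few loudest outliers are ignored.
    """
    ranked = sorted(abs(sample) for sample in frame)
    keep = max(1, int(len(ranked) * percentile // 100))
    return ranked[keep - 1]


def adjust_gain(gain, frame, lower, upper, percentile=98, step=1):
    """Return the gain after at most one incremental adjustment."""
    peak = percentile_max(frame, percentile)
    if peak > upper:
        return gain - step  # too loud: step the gain down
    if peak < lower:
        return gain + step  # too quiet: step the gain up
    return gain  # within the optimal range: leave the gain alone
```

Calling `adjust_gain` once for each additional frame received corresponds to repeating the third through sixth steps until the calculated maximum falls within the optimal range.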
In one embodiment, in addition to the upper and lower thresholds, a full scale threshold can be predetermined above which a clipping condition is considered to have occurred. A clipping condition can be detected by first calculating a maximum signal amplitude for the digital audio data samples in the received frame. If the calculated maximum signal amplitude exceeds the full scale threshold, a downward adjustment can be calculated if necessary to bring the maximum signal amplitude within the optimal range. Subsequently, the audio input signal gain can be adjusted downward by the calculated downward adjustment. A clipping condition can also be determined by calculating a hypothetical signal peak amplitude. If the calculated hypothetical signal peak amplitude exceeds the full scale threshold, again, a downward adjustment can be calculated and performed if necessary to bring the hypothetical peak amplitude within the optimal range.
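The clipping check can be illustrated with a short sketch. The description does not state how the downward adjustment is calculated. The sketch therefore assumes, purely for illustration, that gain is expressed in decibels and that the adjustment is the ratio of the measured peak to the upper threshold. The function name and parameters are hypothetical.

```python
import math


def clipping_adjustment_db(frame, full_scale, upper):
    """Return a downward gain adjustment in dB, or 0.0 if no clipping is seen.

    Assumption: the adjustment is 20*log10(peak / upper), which would bring
    the measured peak down to the upper threshold of the optimal range.
    """
    peak = max(abs(sample) for sample in frame)
    if peak <= full_scale:
        return 0.0  # peak does not exceed the full scale threshold
    return 20.0 * math.log10(peak / upper)
```

The same function could be applied to a hypothetical peak amplitude by passing that value as a single-sample frame.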
Notably, in another embodiment, a silence threshold can be calculated below which a quantity of digital audio data samples are interpreted as silence samples and above which a quantity of digital audio data samples are interpreted as speech samples. As a result of the calculation of a silence threshold, signal gain adjustments can occur only if the calculated maximum signal amplitude exceeds the silence threshold. Furthermore, in yet another embodiment, a silence timeout condition can be detected, the silence timeout condition occurring when no silence samples are received in a predetermined number of received frames. Responsive to detecting the silence timeout condition, the silence threshold can be increased by a proportional factor. Also, upon receiving an unpredicted frame of digital audio data samples having a maximum signal amplitude below the silence threshold, where as a result the frame of digital audio data samples is interpreted as a frame of silence samples, a new silence threshold can be calculated based upon the maximum amplitude measurements of previously received silence samples. The new silence threshold can be calculated by first, storing a data set of previously received frames of silence samples, second, averaging the maximum amplitudes for each stored frame in the data set, and, third, multiplying the average by a proportional factor.
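The silence threshold handling described above can be sketched as follows. This is an illustrative sketch only: the class name, the default values, the size of the stored data set, and the use of two separate proportional factors (`scale` for recalculation and `timeout_scale` for the timeout) are assumptions that the description does not specify.

```python
from collections import deque


class SilenceTracker:
    """Track a silence threshold from frame peaks (hypothetical helper)."""

    def __init__(self, threshold, scale=2.0, timeout_scale=1.5,
                 history=8, timeout_frames=50):
        self.threshold = threshold
        self.scale = scale
        self.timeout_scale = timeout_scale
        self.timeout_frames = timeout_frames
        self.silence_peaks = deque(maxlen=history)  # peaks of stored silence frames
        self.frames_since_silence = 0

    def update(self, frame):
        """Classify one frame; return True if it is treated as silence."""
        peak = max(abs(sample) for sample in frame)
        if peak < self.threshold:
            # Silence frame: store its peak, then average and scale.
            self.silence_peaks.append(peak)
            self.frames_since_silence = 0
            average = sum(self.silence_peaks) / len(self.silence_peaks)
            self.threshold = average * self.scale
            return True
        # Speech frame: count toward the silence timeout.
        self.frames_since_silence += 1
        if self.frames_since_silence >= self.timeout_frames:
            self.threshold *= self.timeout_scale
            self.frames_since_silence = 0
        return False
```

In such a sketch, gain adjustment would be attempted only for frames where `update` returns `False`, consistent with adjusting gain only when the maximum amplitude exceeds the silence threshold.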
Notably, two conditions can exist which have a bearing upon the calculation of a silence threshold in response to receiving silence samples in a frame. First, a clipping condition can exist in which samples exceeding the full scale threshold have been detected. Second, an initial condition can exist in which an adequate number of silence samples have not yet been received in order to properly set the silence threshold. In either circumstance, a new silence threshold can be calculated based upon maximum amplitude measurements of a second configurable measurement percentile of previously received speech samples. Specifically, the step of calculating a new silence threshold based upon maximum amplitude measurements of previously received speech samples can include storing a data set of previously received frames of speech samples and identifying a maximum amplitude for the second configurable measurement percentile of speech samples in each stored frame in the data set.
Significantly, the present invention can include histogram analysis techniques to identify whether the upper, lower and full scale thresholds have been breached. As a result, in a preferred embodiment of the present invention, an audio data histogram can be established. The audio data histogram can include a plurality of bins, each bin associated with a range of amplitude measurements and each bin having a corresponding counter. Each corresponding counter can be incremented in response to receiving a digital audio data sample having an amplitude measurement falling within an amplitude range associated with the corresponding bin. Thus, in response to receiving a digital audio data sample having an amplitude measurement falling within an amplitude range associated with a bin in the histogram, the counter associated with the bin can be incremented. Furthermore, the incrementing step can be repeated for each digital audio data sample in the frame, the repeating step populating the audio data histogram with histogram data derived from amplitude measurements of the digital audio data samples.
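Populating the audio data histogram can be illustrated as follows. The sketch assumes signed 16-bit integer samples, equally sized bins, and binning by absolute amplitude. The function name and default values are hypothetical.

```python
def build_histogram(frame, num_bins=16, full_scale=32768):
    """Count the samples of one frame into equal-width amplitude bins."""
    bin_width = full_scale // num_bins
    counters = [0] * num_bins
    for sample in frame:
        # Clamp so that an amplitude equal to full scale lands in the top bin.
        index = min(abs(sample) // bin_width, num_bins - 1)
        counters[index] += 1
    return counters
```

Each element of the returned list plays the role of the counter associated with one bin.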
The audio data histogram can be used in the adjusting steps of the preferred embodiment. Specifically, the step of incrementally adjusting downward can include first specifying a measurement percentile of digital audio data samples in the histogram upon which an adjustment is determined. Second, a cumulative sum of counters in the histogram can be obtained. Specifically, the summation can begin with the zero-th bin in the histogram and can continue until reaching the i-th bin below which the cumulative sum, when compared to all samples in the histogram, corresponds to the specified measurement percentile. Third, a maximum signal amplitude corresponding to samples in the i-th bin can be calculated. The calculation can be based upon only those samples in the i-th bin which are included in the specified measurement percentile of digital audio data samples. Finally, fourth, the audio input signal gain can be incrementally adjusted downward if the calculated maximum signal amplitude corresponding to the samples in the i-th bin exceeds the upper threshold.
Similarly, the step of incrementally adjusting upward the audio input signal gain can include first specifying a measurement percentile of digital audio data samples in the histogram upon which an adjustment is determined. Second, a cumulative sum of counters in the histogram can be obtained. Specifically, the summation can begin with the zero-th bin in the histogram and can continue until reaching the i-th bin below which the cumulative sum, when compared to all samples in the histogram, corresponds to the specified measurement percentile. Third, a maximum signal amplitude corresponding to samples in the i-th bin can be calculated. The calculation can be based upon only those samples in the i-th bin which are included in the specified measurement percentile of digital audio data samples. Finally, fourth, the audio input signal gain can be incrementally adjusted upward if the calculated maximum signal amplitude corresponding to the samples in the i-th bin falls below the lower threshold.
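The cumulative-sum search used by both adjusting steps can be sketched as follows. The description does not say how the amplitude is derived from the portion of the i-th bin that falls within the measurement percentile. The sketch assumes that samples are spread evenly across a bin and interpolates linearly. All names are hypothetical.

```python
def percentile_amplitude(counters, bin_width, percentile):
    """Estimate the maximum amplitude within the measurement percentile."""
    total = sum(counters)
    if total == 0:
        return 0.0
    target = total * percentile / 100.0  # number of samples in the percentile
    cumulative = 0
    for i, count in enumerate(counters):
        if count and cumulative + count >= target:
            # Only part of bin i belongs to the percentile; assume its
            # samples are spread evenly across the bin's amplitude range.
            fraction = (target - cumulative) / count
            return (i + fraction) * bin_width
        cumulative += count
    return float(len(counters) * bin_width)


def gain_step(counters, bin_width, percentile, lower, upper, step=1):
    """Return -step, +step, or 0 according to the two thresholds."""
    amplitude = percentile_amplitude(counters, bin_width, percentile)
    if amplitude > upper:
        return -step
    if amplitude < lower:
        return step
    return 0
```

For example, with counters of 50, 30 and 20 samples in three bins of width 100, the ninetieth percentile falls halfway through the third bin.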
Preferably, a data set of audio data histograms can be stored upon which histogram computations can be performed. Advantageously, by basing histogram computations on an average of histogram computations for all histograms in a data set, anomalous measurements can be diluted. In consequence, it can be determined if the data set has been populated with audio data histograms prior to the gain adjusting steps. If it is determined that the data set has not been populated, the gain adjusting steps preferably are not performed. Moreover, all audio data histograms in the data set can be discarded responsive to an audio gain adjustment.
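A data set of histograms with the populated check and discard behavior described above might be sketched as follows. The class name, the capacity, and the choice to treat the data set as populated only when it is full are assumptions made for illustration.

```python
from collections import deque


class HistogramDataSet:
    """Hold recent histograms and average a computation over them."""

    def __init__(self, capacity=4):
        self.histograms = deque(maxlen=capacity)

    def add(self, counters):
        self.histograms.append(list(counters))

    def is_populated(self):
        # Assumption: "populated" means the data set is full.
        return len(self.histograms) == self.histograms.maxlen

    def averaged(self, compute):
        """Average compute(histogram) over the set, or None if not populated."""
        if not self.is_populated():
            return None
        results = [compute(h) for h in self.histograms]
        return sum(results) / len(results)

    def discard(self):
        """Drop all stored histograms, for example after a gain adjustment."""
        self.histograms.clear()
```

Here `compute` stands for any per-histogram computation, such as a percentile amplitude estimate. A caller would skip the gain adjusting steps whenever `averaged` returns `None`.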
In yet another embodiment, a silence data histogram can be incorporated. Like the audio data histogram, the silence data histogram can include a plurality of bins, each bin associated with a range of amplitude measurements and each bin having a corresponding counter. The corresponding counter can be incremented in response to receiving a silence sample having an amplitude measurement falling within an amplitude range associated with the corresponding bin. Furthermore, in response to receiving a silence sample having an amplitude measurement falling within an amplitude range associated with a bin in the histogram, the counter associated with the bin can be incremented. The incrementing step can be repeated for each silence sample in the frame, the repeating step populating the silence data histogram with histogram data derived from amplitude measurements of the silence samples.
Advantageously, the silence data histogram can be used in the step of calculating a new silence threshold. In that case, the calculating step can include storing a silence data set of silence data histograms and averaging maximum amplitudes for each histogram in the silence data set. Finally, the average can be multiplied by a proportional factor. The resulting value can be the new silence data threshold. As in the case of the data set of audio data histograms, however, it can be determined if the silence data set has been populated with silence data histograms prior to the silence threshold calculating step. If it is determined that the silence data set has not been populated, the silence threshold calculating step preferably is not performed. Moreover, all silence data histograms in the silence data set can be discarded in response to either an audio gain adjustment or the calculation of a new silence threshold.