Speech intelligibility is usually expressed as the percentage of words, sentences or phonemes correctly identified by a listener or a group of listeners. It is an important measure of the effectiveness or adequacy of a communication system, or of the ability of people to communicate effectively in noisy environments. Quality, by contrast, is a subjective measure that reflects the individual preferences of listeners. The two measures are not correlated. In fact, it is well known that intelligibility can be improved if one is willing to sacrifice quality. It is also well known that improving the quality of a signal does not necessarily improve its intelligibility. On the contrary, quality improvement is usually associated with a loss of intelligibility relative to that of the original signal, because of the distortion that the signal undergoes during enhancement.
Communication devices such as mobile phones, headsets, telephones and so forth may be used in vehicles or in other areas where there is often a high level of background noise. A high level of local background noise can make it difficult for a user of the communication device to understand the speech being received from the receiving side in the communication network. The ability of the user to effectively understand the speech received from the receiver side is obviously essential and is referred to as the intelligibility of the received speech.
In the past, the most common solution to overcome the background noise was to increase the volume at which the speaker of the communication device outputs speech. One problem with this solution is that the maximum output sound level that a phone's speaker can generate is limited. Due to the need to produce cost-competitive cell phones, companies often use low-cost speakers with limited power handling capabilities. The maximum sound level that such phone speakers can generate is often insufficient to overcome high local background noise.
Attempts to overcome the local background noise by simply increasing the volume of the speaker output can also result in overloading the speaker. Overloading the loudspeaker introduces distortion to the speaker output and further decreases the intelligibility of the outputted speech. A technology that increases the intelligibility of speech received irrespective of the local background noise level is needed.
Several attempts to improve the intelligibility in communication devices are known in the related art. The requirements for such a system include naturalness of the enhanced signal, short signal delay and computational simplicity.
During the past two decades, Linear Predictive Coding (LPC) has become one of the most prevalent techniques for speech analysis. In fact, this technique is the basis of many of the sophisticated algorithms that are used for estimating speech parameters, for example, pitch, formants, spectra, vocal tract and low bit rate representations of speech. The basic principle of linear prediction states that speech can be modeled as the output of a linear time-varying system excited by either periodic pulses or random noise. The most general predictor form in linear prediction is the Auto Regressive Moving Average (ARMA) model, where a speech sample s(n) is predicted from p past speech samples s(n−1), . . . , s(n−p) with the addition of an excitation signal u(n) according to the following equation 1:
$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G \sum_{l=0}^{q} b_l\, u(n-l) \qquad \text{Equation 1}$$
where G is the gain factor for the input speech and $a_k$ and $b_l$ are filter coefficients. The related transfer function H(z) is given by the following equation 2:
$$H(z) = \frac{S(z)}{U(z)} \qquad \text{Equation 2}$$
For an all-pole or Autoregressive (AR) model, the transfer function is given by the following equation 3:
$$H(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)} \qquad \text{Equation 3}$$
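As an illustration of the all-pole model of Equation 3, the following minimal sketch runs the corresponding time-domain recursion s(n) = Σ a_k s(n−k) + G u(n). The function name `all_pole_synthesize`, the zero initial state and the single-pole coefficient 0.5 are assumptions chosen for illustration only and are not taken from the related art.

```python
def all_pole_synthesize(a, excitation, gain=1.0):
    """Filter an excitation u(n) through the all-pole model 1/A(z).

    a          : predictor coefficients [a_1, ..., a_p]
    excitation : sequence of excitation samples u(n)
    gain       : gain factor G applied to the excitation
    Samples before n = 0 are assumed to be zero.
    """
    p = len(a)
    s = []
    for n, u in enumerate(excitation):
        acc = gain * u
        # Add the weighted contribution of up to p past output samples.
        for k in range(1, p + 1):
            if n - k >= 0:
                acc += a[k - 1] * s[n - k]
        s.append(acc)
    return s


# A unit impulse through a one-pole model with a_1 = 0.5
# yields a geometrically decaying impulse response.
impulse = [1.0, 0.0, 0.0, 0.0]
print(all_pole_synthesize([0.5], impulse))  # [1.0, 0.5, 0.25, 0.125]
```

In a speech model of this kind, the excitation would typically be a periodic pulse train for voiced sounds or random noise for unvoiced sounds, as described above.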
Estimation of LPC
Two widely used methods for estimating the LP coefficients exist: the autocorrelation method and the covariance method. Both methods choose the LP coefficients $a_k$ in such a way that the residual energy is minimized, using the classical least squares technique. Among the different variations of LP, the autocorrelation method of linear prediction is the most popular. In this method, a predictor (an FIR filter of order m) is determined by minimizing the square of the prediction error, the residual, over an infinite time interval. The popularity of the conventional autocorrelation method of LP is explained by its ability to compute, with a reasonable computational load, a stable all-pole model of the speech spectrum that is accurate enough for most applications while being represented by only a few parameters. The performance of LP in modeling the speech spectrum can be explained by the autocorrelation function of the all-pole filter, which matches exactly the autocorrelation of the input signal for lags 0 through m when the prediction order equals m. The residual energy that is minimized is given by the following equation 4:
$$E = \sum_{n=-\infty}^{\infty} e^2(n) = \sum_{n=-\infty}^{\infty} \Big[ s_n(n) - \sum_{k=1}^{m} a_k\, s_n(n-k) \Big]^2 \qquad \text{Equation 4}$$
where $s_n$ denotes the windowed speech segment.
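The autocorrelation method described above can be sketched as follows. The sketch computes the autocorrelation of a frame and solves the resulting normal equations with the Levinson-Durbin recursion, which is one common way of performing the least squares minimization. The function names, the synthetic test frame and the absence of an explicit window function are assumptions made for illustration; a practical system would normally apply a tapered window to each frame first.

```python
def autocorrelation(x, max_lag):
    """Return [R(0), ..., R(max_lag)] for a frame that is zero outside x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve sum_k a_k R(|i-k|) = R(i), i = 1..order.

    Returns (a, err), where a = [a_1, ..., a_order] uses the sign
    convention s(n) ~ sum_k a_k s(n-k) and err is the residual energy.
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: stop early
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err


# Synthetic frame: impulse response of a one-pole model with a_1 = 0.5.
frame = [0.5 ** n for n in range(50)]
r = autocorrelation(frame, 2)
coeffs, residual_energy = levinson_durbin(r, 2)
print(coeffs, residual_energy)  # close to [0.5, 0.0] and 1.0
```

Because the synthetic frame has decayed to a negligible level by its end, truncating it introduces almost no error here; a frame cut from the middle of an ongoing signal would show the windowing effects discussed in the comparison of the two methods.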
The covariance method is very similar to the autocorrelation method. The basic difference is the length of the analysis window. The covariance method windows the error signal instead of the original signal. The energy E of the windowed error signal is given by the following equation 5:
$$E = \sum_{n=-\infty}^{\infty} e^2(n)\, w(n) \qquad \text{Equation 5}$$
where w(n) is the window applied to the error signal.
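A minimal sketch of the covariance method is given below, assuming a rectangular error window that is 1 for n = order, ..., N−1 and 0 elsewhere, so that no samples outside the frame are needed. The function name `covariance_lpc`, the Gaussian elimination solver and the synthetic second-order test signal are assumptions introduced for illustration.

```python
def covariance_lpc(s, order):
    """Estimate [a_1, ..., a_order] by the covariance method.

    Minimizes sum_{n=order}^{N-1} [s(n) - sum_k a_k s(n-k)]^2, i.e. the
    error is windowed (rectangular window) rather than the signal.
    """
    n_samples = len(s)

    def phi(i, k):
        # Covariance term: sum over the error window of s(n-i) * s(n-k).
        return sum(s[n - i] * s[n - k] for n in range(order, n_samples))

    # Augmented matrix [Phi | psi] of the normal equations.
    m = [[phi(i, k) for k in range(1, order + 1)] + [phi(i, 0)]
         for i in range(1, order + 1)]

    # Gaussian elimination with partial pivoting.
    for col in range(order):
        pivot = max(range(col, order), key=lambda row: abs(m[row][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular for this frame")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, order):
            factor = m[row][col] / m[col][col]
            for c in range(col, order + 1):
                m[row][c] -= factor * m[col][c]

    # Back substitution.
    a = [0.0] * order
    for i in range(order - 1, -1, -1):
        acc = m[i][order] - sum(m[i][j] * a[j] for j in range(i + 1, order))
        a[i] = acc / m[i][i]
    return a


# Short frame taken from the impulse response of a second-order
# all-pole model with a_1 = 1.2 and a_2 = -0.5.
s = [1.0, 1.2]
for n in range(2, 20):
    s.append(1.2 * s[n - 1] - 0.5 * s[n - 2])
print(covariance_lpc(s, 2))  # close to [1.2, -0.5]
```

Because the prediction error of this synthetic signal is exactly zero inside the error window, the sketch recovers the generating coefficients up to rounding error, even though the frame is only 20 samples long.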
Comparing the two methods, the covariance method is quite general and can be used with no restrictions. Its only problem is that the resulting filter is not guaranteed to be stable, although this is generally not a severe problem. In the autocorrelation method, on the other hand, the filter is guaranteed to be stable, but problems of parameter accuracy can arise because of the necessity of windowing the time signal. This is usually a problem if the signal is a portion of an impulse response.
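Since stability is the distinguishing concern for the covariance method, a set of estimated coefficients can be checked before use. The following sketch applies the step-down (backward Levinson) recursion, a standard result under which the all-pole filter 1/A(z) is stable exactly when every reflection coefficient has magnitude below one. The function name `is_stable` and the example coefficient sets are assumptions chosen for illustration.

```python
def is_stable(a):
    """Return True if 1/A(z), A(z) = 1 - sum_k a_k z^-k, has all poles
    strictly inside the unit circle.

    a : predictor coefficients [a_1, ..., a_p]
    Uses the step-down recursion: at stage i the last coefficient is the
    reflection coefficient k_i, and the order is reduced by one.
    """
    cur = list(a)
    for i in range(len(cur), 0, -1):
        k = cur[i - 1]
        if abs(k) >= 1.0:
            return False
        denom = 1.0 - k * k
        # a_j^(i-1) = (a_j^(i) + k_i * a_(i-j)^(i)) / (1 - k_i^2)
        cur = [(cur[j] + k * cur[i - 2 - j]) / denom for j in range(i - 1)]
    return True


print(is_stable([1.2, -0.5]))  # True: complex poles with magnitude ~0.707
print(is_stable([2.5, -1.0]))  # False: real poles at 2.0 and 0.5
```

If a covariance-method estimate fails such a check, one possible remedy is to fall back on the autocorrelation-method estimate for that frame, which is stable by construction.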
Usually in environments with significant local background noise, the signal received from the receiving side becomes unintelligible due to a phenomenon called masking. There are several kinds of masking, including but not limited to, auditory masking, temporal masking, simultaneous masking and so forth.
Auditory masking is a phenomenon in which the perception of one sound is affected by the presence of another sound. Temporal masking is a phenomenon in which a sudden sound makes other sounds that occur immediately before or after it inaudible. Simultaneous masking is the inability to hear a sound in the presence of another sound whose frequency components are very close to those of the desired sound.
In light of the above discussion, techniques are desirable for enhancing receiver intelligibility.