1. Field of the Invention
The present invention relates to a voice activity detector. It has particular utility in relation to an auxiliary voice activity detector comprised in a main voice activity detector and also when comprised in a noise reduction apparatus. A main voice activity detector incorporating such an auxiliary voice activity detector is especially suitable for use in mobile phones which may be required to operate in noisy environments.
2. Description of Related Art
Because of the limited regions of the electromagnetic spectrum which have been made available for use by cellular radio systems, the strong growth in the number of mobile phone users over the last decade has meant that cellular radio equipment suppliers have had to find ways to increase the efficiency with which the available electromagnetic spectrum is utilised.
One way in which this aim can be achieved is to reduce the size of the cells within the cellular radio system. However, it is found that cell size can only be reduced by so much before the level of interference from nearby cells (co-channel interference) becomes unacceptably high. In order to reduce co-channel interference, a technique called discontinuous transmission is used. This technique involves arranging the mobile phone to transmit speech-representing signals only when the mobile phone user is speaking and is based on the observation that, in a given conversation, it is usual for only one of the parties to speak at any one time. By implementing discontinuous transmission, the average level of co-channel interference can be reduced. This, in turn, means that the cell size in the system can be reduced and hence that the system can support more subscribers.
Another advantage of only transmitting sound-representing signals when the mobile phone user is speaking is that the lifetime of the electric battery within the mobile phone handset is increased.
A voice activity detector is used to enable discontinuous transmission. The purpose of such a detector is to indicate whether a given signal consists only of noise, or whether the signal comprises speech. If the voice activity detector indicates that the signal to be transmitted consists only of noise, then the signal is not transmitted.
Many mobile phones today use a voice activity detector similar to that described in European Patent No. 335521. In the voice activity detector described therein, the similarity between the spectrum of an input sound-representing signal and the spectrum of a noise signal is measured. The noise spectrum to be used in this comparison is obtained from earlier portions of the input signal which were determined to be noise. That judgement is made by an auxiliary voice activity detector which forms a component of the main voice activity detector. Since it is important that signals comprising speech are transmitted by the mobile phone and since the decision of the main voice activity detector is based on signals identified as noise by the auxiliary voice activity detector, it is desirable that the auxiliary voice activity detector tends, in borderline situations, towards a determination that the signal comprises speech. The proportion of a conversation which is identified as speech by a voice activity detector is called the voice activity factor (or simply “activity”) of the detector. The proportion of conversation which in fact comprises speech is typically in the range 35% to 40%. So, ideally, a main voice activity detector will have an activity lying within this range or slightly above it, whereas an auxiliary voice activity detector can have a significantly higher activity.
Although the known voice activity detectors exhibit good performance in a variety of environments, their performance has been found to be poor in noisy environments. A mobile phone may be required to operate in cars, in city streets, in busy offices, in train stations or in airports. There is therefore a requirement for a voice activity detector that can operate reliably in noisy environments.
According to the first aspect of the present invention there is provided a voice activity detector comprising:
means arranged in operation to calculate at least one first spectral difference measure indicative of the degree of spectral similarity in a pair of time segments of a signal, one of the time segments of the pair lagging the other by a first time interval;
means arranged in operation to calculate at least one second spectral difference measure indicative of the degree of spectral similarity in a pair of time segments of a signal, one of the time segments of the pair lagging the other by a second time interval which differs from said first time interval;
means arranged in operation to calculate a spectral irregularity measure on the basis of at least said first and second spectral difference measures; and
means arranged in operation to compare said spectral irregularity measure with a threshold measure.
This voice activity detector has the advantage that it provides a reliable determination that an input signal consists of noise. As stated above, this is a desirable property for an auxiliary voice activity detector which is used to identify signals which are used as noise templates in other processes carried out in an apparatus. Also, by combining spectral difference measures derived in relation to different time intervals, a voice activity detector according to the present invention takes into account the degree of stationarity of the signal over different time intervals. For example, if a first spectral difference measure were to be calculated in relation to a first relatively long time interval and a second spectral difference measure were to be calculated in relation to a relatively short time interval, then both the short-term and long-term stationarity of the signal would influence a spectral irregularity measure which combines the first and second spectral difference measures. Since the spectrum of noise, unlike speech, is stationary at least over time intervals ranging from 80 ms to 1 s, the voice activity detector of the present invention provides a robust performance in noisy environments.
Preferably, the predetermined length of time is in the range 400 ms to 1 s. This has the advantage that the relatively rapidly time-varying nature of a speech spectrum can be best discriminated from the relatively slowly time-varying nature of a noise spectrum.
Preferably, said spectral irregularity measure calculating means are arranged in operation to calculate a weighted sum of said spectral difference measures. This has the advantage that, in making a speech/noise decision, more weight can be given to spectral difference measures derived from time intervals over which the difference in stationarity between speech spectra and noise spectra is most pronounced.
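The combination of spectral difference measures described above can be illustrated with a minimal sketch. It is not the detector of the invention: the mean absolute difference metric, the lags expressed in frames, the equal weights, the threshold value and all function names are assumptions chosen for illustration only.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitude of a naive O(n^2) DFT of one frame, bins 0..n/2."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_difference(spec_a, spec_b):
    """Mean absolute difference between two magnitude spectra.

    Assumed metric; the text does not prescribe a particular one.
    """
    return sum(abs(a - b) for a, b in zip(spec_a, spec_b)) / len(spec_a)


def spectral_irregularity(spectra, lags, weights):
    """Weighted sum of the differences between the newest spectrum and
    the spectra lagging it by each interval in `lags` (in frames)."""
    newest = spectra[-1]
    return sum(
        w * spectral_difference(newest, spectra[-1 - lag])
        for lag, w in zip(lags, weights)
    )


def is_noise(spectra, lags=(1, 4), weights=(0.5, 0.5), threshold=1.0):
    """Judge the signal to be noise when its spectrum is stationary
    enough, i.e. the irregularity measure falls below the threshold.

    The default lags, weights and threshold are hypothetical.
    """
    return spectral_irregularity(spectra, lags, weights) < threshold
```

In this sketch `spectra` is a history of per-frame magnitude spectra, oldest first, and it must hold at least one more entry than the largest lag. With a frame spacing of, say, 20 ms (again an assumption), a lag of 4 frames would correspond to the 80 ms lower end of the range mentioned above.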
According to a second aspect of the present invention there is provided a voice activity detector including:
a voice activity detector according to the first aspect of the present invention operable as an auxiliary voice activity detector.
Since the auxiliary voice activity detector has a high activity, a determination that an input signal consists of noise can be relied on to be correct. Furthermore, because the correct functioning of the main voice activity detector relies on the auxiliary voice activity detector correctly identifying a noise signal, a voice activity detector according to the second aspect of the present invention makes a reliable determination of whether a signal comprises speech or consists only of noise.
According to a third aspect of the present invention there is provided a noise reduction apparatus comprising:
a voice activity detector according to the first aspect of the present invention;
means arranged in operation to provide an estimated noise spectrum on the basis of one or more spectra obtained from respective time segments determined to consist of noise by said voice activity detector; and
means arranged in operation to subtract said estimated noise spectrum from spectra obtained from subsequent time segments of said signal.
It is known by those skilled in the art that the technique of spectral subtraction only works well if the noise which is to be subtracted from the signal to be enhanced is stationary in its nature. This means that a combination of a spectral subtraction device and a voice activity detector according to the first aspect of the present invention forms a particularly effective noise reduction apparatus, since the operation of the voice activity detector according to the first aspect of the present invention means that an input signal will be determined to consist of noise only if that noise signal has been largely stationary within the predetermined length of time.
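The noise estimation and subtraction steps of the third aspect can be sketched as follows. This is a simplified illustration under stated assumptions: spectra are plain lists of magnitudes, the noise estimate is a simple average, negative results are clamped to a floor, and the function names are hypothetical.

```python
def estimate_noise_spectrum(noise_spectra):
    """Average, bin by bin, the magnitude spectra of the time segments
    that the voice activity detector determined to consist of noise."""
    count = len(noise_spectra)
    return [sum(bins) / count for bins in zip(*noise_spectra)]


def subtract_noise(spectrum, noise_estimate, floor=0.0):
    """Subtract the estimated noise spectrum from a later spectrum.

    Clamping at `floor` is an assumption made here so that no bin
    ends up with a negative magnitude.
    """
    return [max(s - n, floor) for s, n in zip(spectrum, noise_estimate)]
```

The averaging only yields a useful template if the noise is stationary across the averaged segments, which is the property the voice activity detector of the first aspect is intended to verify.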
Generally, any apparatus which requires a reliable noise template will benefit from the inclusion of a voice activity detector according to the first aspect of the present invention.
According to a fourth aspect of the present invention, there is provided a voice activity detector comprising means arranged in operation to extract feature values from an input signal and neural net means arranged in operation to process a plurality of said feature values to output a value indicative of whether said input signal consists of noise.
An advantage of this apparatus is that a neural net, once trained, can model relationships between the input parameters and the output decision which cannot be easily determined analytically. Although the process of training the neural net is labour intensive, once the neural net has been trained, the computational complexity of the algorithm is less than that found in known algorithms. This is of course advantageous in relation to a product such as a voice activity detector which is likely to be produced in large numbers.
Preferably, the input parameters to the neural net include cepstral coefficients derived from the signal to be transmitted. It has been found that these are useful parameters in making the distinction between speech and noise.
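By way of illustration only, the forward pass of a small neural net operating on a vector of feature values might look like the sketch below. The topology (one hidden layer), the sigmoid activation and every name are assumptions; the text does not specify them, and real weights would come from training rather than being written by hand.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(features, hidden_weights, hidden_biases,
            output_weights, output_bias):
    """Forward pass of a hypothetical one-hidden-layer net.

    `features` could hold, for example, cepstral coefficients.
    The returned score lies in (0, 1); treating a high score as
    "consists of noise" is a convention assumed for this sketch.
    """
    hidden = [
        sigmoid(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(
        sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    )
```

Once the weights are fixed, each decision costs only a handful of multiply-accumulate operations and activation evaluations, which is consistent with the low run-time complexity noted above.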
According to a fifth aspect of the present invention there is provided a method of voice activity detection comprising the steps of:
calculating at least one first spectral difference measure indicative of the degree of spectral similarity in a pair of time segments of a signal, one of the time segments of the pair lagging the other by a first time interval;
calculating at least one second spectral difference measure indicative of the degree of spectral similarity in a pair of time segments of a signal, one of the time segments of the pair lagging the other by a second time interval which differs from said first time interval;
calculating a spectral irregularity measure on the basis of at least said first and second spectral difference measures;
comparing said spectral irregularity measure with a threshold measure; and
determining whether said signal consists of noise on the basis of the comparison.
This method has the advantage that the discrimination between noise and speech signals is robust.
According to a sixth aspect of the present invention there is provided a method of enhancing a spectrum representing the value of a spectral characteristic at a succession of predetermined frequencies, said enhancement comprising the steps of:
for each of said predetermined frequencies, comparing the value of said spectral characteristic at said frequency with the value of said characteristic at neighbouring frequencies and calculating an adjustment to said predetermined frequency spectral value, said calculation being such that the adjustment is increased on said predetermined frequency spectral value being greater than either of said neighbouring frequency spectral values and is decreased on said predetermined frequency spectral value being less than either of said neighbouring frequency spectral values; and
adjusting each of said spectral values within the spectrum in accordance with said calculated adjustment.
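The enhancement steps above can be sketched as follows. This is one possible reading, not a definitive implementation: it assumes that the adjustment is raised or lowered by a fixed `step` for each neighbour compared, that the adjustment is added to the spectral value, and that the first and last values have a single neighbour. The function name and the `step` parameter are hypothetical.

```python
def enhance_spectrum(values, step=1.0):
    """Sharpen a spectrum by comparing each value with its neighbours.

    For every neighbour that the value exceeds, the adjustment grows
    by `step`; for every neighbour that exceeds the value, it shrinks
    by `step`. All adjustments are computed from the original values
    before any of them is applied.
    """
    adjustments = []
    for i, value in enumerate(values):
        adjustment = 0.0
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                if value > values[j]:
                    adjustment += step
                elif value < values[j]:
                    adjustment -= step
        adjustments.append(adjustment)
    return [v + a for v, a in zip(values, adjustments)]
```

Under these assumptions, peaks in the spectrum are raised and troughs are lowered, while a flat spectrum is left unchanged.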