1. Field of the Invention
The present invention relates to a voice activity detecting device for discriminating between an active voice segment and a non-active voice segment of an aural signal, and to a voice activity detecting method applied to the voice activity detecting device.
2. Description of the Related Art
In recent years, digital signal processing technologies have advanced considerably, and in mobile communication systems and other communication systems these technologies are applied to perform various kinds of real-time signal processing on an aural signal which is transmission information.
Furthermore, at a transmitting end of such a communication system, a voice activity detecting device is mounted which detects an active voice segment and a non-active voice segment of the aforesaid aural signal and allows transmission to a transmission channel only in the active voice segment, for the purpose of compressing the transmission band, utilizing radio frequencies effectively, and saving power.
FIG. 12 is a block diagram showing a configuration example of a radio terminal equipment in which the voice activity detecting device is mounted.
In FIG. 12, a microphone 41 is connected to an input of a voice activity detecting device 42 and a modulation input of a receiving/transmitting part 43, and a feeding point of an antenna 44 is connected to an antenna terminal of this receiving/transmitting part 43. An output of the voice activity detecting device 42 is connected to a transmission control input of the receiving/transmitting part 43, and to a control input/output of this receiving/transmitting part 43, a corresponding input/output port of a controlling part 45 is connected. A specific output port of the controlling part 45 is connected to a control input of the voice activity detecting device 42 and a demodulation output of the receiving/transmitting part 43 is connected to an input of a receiver 46.
In the radio terminal equipment as configured above, the receiving/transmitting part 43 radio-interfaces aural signals, which are transmission information to be transmitted/received via the microphone 41 and the receiver 46, with a radio transmission channel (not shown) which is accessible via the antenna 44.
The controlling part 45 plays a leading role in channel control which is required for forming this radio transmission channel by operating in association with the receiving/transmitting part 43.
The voice activity detecting device 42 samples the aforesaid aural signals at a predetermined cycle to generate a sequence of active voice frames. Moreover, the voice activity detecting device 42 discriminates, based on the characteristic of the aural signal, which of an active voice segment and a non-active voice segment each of the active voice frames corresponds to, and outputs a binary signal indicating the result of the discrimination.
Note that the aforesaid characteristic includes, for example, the following items: the aural signal has a dynamic range of approximately 55 decibels; its amplitude distribution can be approximated by a standard probability density function; and the values of energy density and zero crossing frequency in the active voice segment differ from those in the non-active voice segment.
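As an illustration of the energy and zero crossing features mentioned above, the following Python sketch computes both values for a single frame. The function names, the representation of a frame as a list of floating-point samples, and the sample values are assumptions introduced here for illustration only; they are not taken from the device described.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame (assumed: list of float samples)."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)


def zero_crossings(frame):
    """Number of sign changes between consecutive samples of the frame."""
    return sum(
        1
        for a, b in zip(frame, frame[1:])
        if (a >= 0.0) != (b >= 0.0)
    )


# Hypothetical frames: a strong slowly varying signal and weak alternating noise.
voiced = [0.8, 0.9, 0.7, -0.8, -0.9, -0.7, 0.8, 0.9]
noise = [0.01, -0.01, 0.02, -0.02, 0.01, -0.01, 0.02, -0.01]
```

In this hypothetical data the strong frame has the larger energy and the weak alternating frame has the larger zero crossing count, which is the kind of difference a discriminator can exploit.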
The receiving/transmitting part 43 refrains from transmitting during a period when a logical value of the binary signal indicates the aforesaid non-active voice segment.
Therefore, unwanted transmission by the receiving/transmitting part 43 is restricted during a period when no useful information is included as transmission information in the aural signal. Consequently, suppression of interference with other radio channels, effective utilization of radio frequencies, and reduction in power consumption can be realized.
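The frame-by-frame gating described above may be sketched as follows. This is a minimal illustration that assumes a fixed frame length and a simple energy threshold as the discrimination rule; the function names and the threshold are hypothetical and do not represent the actual discrimination performed by the voice activity detecting device 42.

```python
def split_into_frames(samples, frame_len):
    """Cut a sample sequence into consecutive frames; a short tail is dropped."""
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def is_active(frame, energy_threshold):
    """Binary decision (assumed rule): mean energy above a fixed threshold."""
    energy = sum(s * s for s in frame) / len(frame)
    return energy > energy_threshold


def frames_to_transmit(samples, frame_len, energy_threshold):
    """Keep only the frames judged active; the others are not transmitted."""
    return [
        frame
        for frame in split_into_frames(samples, frame_len)
        if is_active(frame, energy_threshold)
    ]


# Hypothetical input: silence, a short burst, silence.
samples = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 4
sent = frames_to_transmit(samples, frame_len=4, energy_threshold=0.01)
```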
In the conventional example described above, however, the difference in a feature value (for example, the aforesaid zero crossing frequency) between the active voice segment and the non-active voice segment becomes small during a period when high-level noise is superimposed on the aural signal given via the microphone 41.
Furthermore, even within the active voice segment, the amplitude of the aural signal in a consonant segment is generally distributed more toward small values than in a vowel segment.
Therefore, it is highly possible that a consonant segment is discriminated as the non-active voice segment. The corresponding active voice frames are then not transmitted even though they belong to the active voice segment, which is very likely to cause unwanted deterioration in speech quality.
Furthermore, when the level of the aforesaid noise is excessively high, transmission of the active voice frames corresponding to most of the aural signal on which the noise is superimposed may be restricted altogether.
Incidentally, these problems can be solved, for example, when a threshold value for the feature value or the like which serves as the basis of the discrimination is set at such a value as to cause the active voice frame to be easily discriminated as the active voice segment.
When such a threshold value is applied, however, the probability increases that an active voice frame is discriminated as the active voice segment even though it corresponds to the non-active voice segment, and the proportion of time occupied by the active voice segment (its hour rate) may become excessively high. Consequently, the reduction in power consumption, suppression of interference, and effective utilization of radio frequencies stated above may not be fully realized.
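The trade-off between the two kinds of error can be illustrated with a small sketch. The frame energies, the speech labels, and the two threshold values below are invented for illustration and are not measurements; the energy threshold rule is likewise an assumption.

```python
def count_errors(energies, is_speech, threshold):
    """Return (missed speech frames, non-speech frames judged active)."""
    missed = sum(
        1 for e, s in zip(energies, is_speech) if s and e <= threshold
    )
    false_active = sum(
        1 for e, s in zip(energies, is_speech) if not s and e > threshold
    )
    return missed, false_active


# Hypothetical frames: vowel, weak consonant, noise, noise, vowel, noise.
energies = [0.50, 0.04, 0.02, 0.03, 0.40, 0.01]
is_speech = [True, True, False, False, True, False]

strict = count_errors(energies, is_speech, threshold=0.05)
lenient = count_errors(energies, is_speech, threshold=0.015)
```

With these invented values, the stricter threshold loses the weak consonant frame, while the more lenient threshold recovers it at the cost of judging two noise frames as active.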
It is an object of the present invention to provide a voice activity detecting device which is flexibly adaptable to various features of an aural signal and to noise superimposed on the aural signal, and which is capable of discriminating between an active voice segment and a non-active voice segment with high accuracy, and also to provide a voice activity detecting method.
It is another object of the present invention that even when an active voice segment includes many segments such as a consonant segment in which the quality of an aural signal is low because of its low amplitude, the segments are determined as a part of an active voice segment with high reliability.
It is still another object of the present invention to determine each active voice frame as a part of an active voice segment with high accuracy.
It is yet another object of the present invention to reduce required throughput or enhance responsiveness.
It is yet another object of the present invention to determine, with high accuracy, even active voice frames on which high-level noise is superimposed and which have a low SN ratio as a part of an active voice segment.
It is yet another object of the present invention that communication equipment and other electronic equipment to which the invention is applied are able to adapt flexibly to an acoustic environment in which an acousto-electric converting section for generating an aural signal is disposed, or to a characteristic and performance of an information source of the active voice signal, and are able to discriminate between an active voice segment and a non-active voice segment of this aural signal with high reliability, so that desired performance suitable for the discrimination result and effective utilization of resources can be achieved.
The above-described objects are achieved by a voice activity detecting device and a voice activity detecting method which are characterized in that a probability that an active voice frame belongs to an active voice segment, and the quality of the active voice frame are determined on an active-voice-frame basis, and the probability is weighted with the quality to output the resultant.
According to the voice activity detecting device and the voice activity detecting method as structured above, the higher the quality of an active voice frame, the higher the probability that it is discriminated as the active voice segment and the lower the probability that it is discriminated as a non-active voice segment.
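A minimal sketch of weighting the probability with the quality is given below. It assumes that both values are normalized to the range 0 to 1 and that the weighting is a simple multiplication followed by a fixed decision threshold; the specification does not state the form of the weighting, so these choices and all names are hypothetical.

```python
def weighted_probability(probability, quality):
    """Weight a per-frame probability by a per-frame quality (both in [0, 1])."""
    if not (0.0 <= probability <= 1.0 and 0.0 <= quality <= 1.0):
        raise ValueError("probability and quality must lie in [0, 1]")
    # Assumed weighting: plain product, so higher quality gives a larger result.
    return probability * quality


def is_active(probability, quality, threshold=0.5):
    """Judge the frame active when the weighted value reaches the threshold."""
    return weighted_probability(probability, quality) >= threshold
```

Under this assumption, two frames with the same probability are judged differently when their qualities differ: the frame of higher quality is the one more readily discriminated as the active voice segment.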
The above-described objects are also achieved by a voice activity detecting device and a voice activity detecting method which are characterized in that a probability that an active voice frame belongs to an active voice segment, and the quality of the active voice frame, are determined on an active-voice-frame basis so that the level of the active voice frame for which the probability is to be determined is set at a lower value as the active voice frame has higher quality.
According to the voice activity detecting device and the voice activity detecting method as structured above, since a heavier weighting is given to instantaneous values of the aural signal included in each of the active voice frames as the active voice frame has lower quality, it is possible to obtain a large value for the accuracy with which the resulting aural signal, given as a sequence of instantaneous values, is determined to belong to the active voice segment.
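The relation between quality and level may be sketched as a gain that does not increase as the quality increases. The linear mapping, the quality range of 0 to 1, and the maximum gain of 2 used below are assumptions chosen only to make the monotone relation concrete.

```python
def gain_for_quality(quality, max_gain=2.0):
    """Gain applied to a frame: max_gain at quality 0, unity at quality 1."""
    q = min(max(quality, 0.0), 1.0)  # clamp to the assumed range [0, 1]
    return max_gain - (max_gain - 1.0) * q


def scale_frame(frame, quality, max_gain=2.0):
    """Scale every instantaneous value of the frame by the quality-based gain."""
    gain = gain_for_quality(quality, max_gain)
    return [gain * s for s in frame]
```

A frame of the lowest quality is thus passed to the probability determination at twice its level in this sketch, while a frame of the highest quality is passed unchanged.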
The above-described objects are also achieved by a voice activity detecting device and a voice activity detecting method which are characterized in that a probability that an active voice frame belongs to an active voice segment, and the quality of the active voice frame, are determined on an active-voice-frame basis so that a gradient in or a threshold value of a companding characteristic is set at a larger value as the active voice frame has higher quality, the companding characteristic being applied to companding processing of the active voice frame for which the probability is to be determined.
According to the voice activity detecting device and the voice activity detecting method as structured above, the companding processing is performed such that the lower the quality of an aural signal, the more heavily the instantaneous values of the aural signal included in each of the active voice frames are weighted.
The above-described objects are also achieved by a voice activity detecting device which is characterized in that a feature of an active voice segment and/or a feature of a non-active voice segment is/are determined for each active voice frame, and these features are employed as quality.
According to the voice activity detecting device as structured above, it is possible to obtain the quality of an aural signal with stability under application of various technologies which realize active voice analysis or speech analysis.
The above-described objects are also achieved by a voice activity detecting device and a voice activity detecting method which are characterized in that assessed noise-power is determined for each active voice frame and the assessed noise-power is employed as quality.
According to the voice activity detecting device as structured above, the assessed noise-power is generally calculated by a simple arithmetic operation.
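One simple arithmetic operation of this kind is a recursive average of the frame power over frames judged to be non-active. The following sketch assumes that estimator; the smoothing coefficient alpha, the update rule, and the function names are illustrative assumptions and are not stated in the specification.

```python
def frame_power(frame):
    """Mean squared amplitude of a non-empty frame."""
    return sum(s * s for s in frame) / len(frame)


def update_noise_power(noise_power, frame, frame_is_active, alpha=0.9):
    """Update the noise-power estimate using only frames judged non-active."""
    if frame_is_active:
        # Speech would bias the estimate upward, so the estimate is held.
        return noise_power
    return alpha * noise_power + (1.0 - alpha) * frame_power(frame)
```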
The above-described objects are also achieved by a voice activity detecting device which is characterized in that assessed noise-power and an assessed value for an SN ratio are determined for each active voice frame, and values given as a monotone nonincreasing function of the former and as a monotone nondecreasing function of the latter are employed as quality.
According to the voice activity detecting device as structured above, it is possible to determine with high accuracy, as non-active voice segment, even active voice frames on which high-level noise is superimposed and which have a small SN ratio.
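The two monotone relations can be made concrete with the following sketch. The particular functions (a reciprocal for the noise-power and a clamped linear ramp for the SN ratio), the 30 dB full-scale value, and the combination of the two values by multiplication are all assumptions; the specification requires only that the first function be monotone nonincreasing and the second monotone nondecreasing.

```python
def quality_from_noise(noise_power):
    """Monotone nonincreasing in noise power: 1 at zero noise, falling toward 0."""
    return 1.0 / (1.0 + max(noise_power, 0.0))


def quality_from_snr(snr_db, full_scale_db=30.0):
    """Monotone nondecreasing in the SN ratio: clamped linear ramp in [0, 1]."""
    return min(max(snr_db / full_scale_db, 0.0), 1.0)


def combined_quality(noise_power, snr_db):
    """Assumed combination of the two values: their product."""
    return quality_from_noise(noise_power) * quality_from_snr(snr_db)
```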
The above-described objects are also achieved by a voice activity detecting device which is different from the voice activity detecting devices previously described in that a standardized random variable is employed in place of assessed noise-power.
In the voice activity detecting device as structured above, a large absolute value of the standardized random variable signifies that a peak value of amplitude of an active voice frame is larger than standard amplitude of an aural signal, and that there is a high possibility that noise of a high level is superimposed on this active voice frame, and, that is, ‘the larger the absolute value is, the higher the possibility becomes’. On the other hand, when the absolute value is smaller than the standard amplitude, it signifies that the peak value of the amplitude of the active voice frame is smaller than the standard amplitude of an aural signal, and the level of the noise superimposed on this active voice frame is low, and, that is, ‘the smaller the absolute value, the smaller the peak value and the lower the level of noise’.
Therefore, the standardized random variable can substitute for the aforesaid assessed noise-power.
The above-described objects are also achieved by a voice activity detecting device which is characterized in that a standardized random variable is calculated approximately based on amplitude distribution of an active voice frame and the maximum value of the amplitude distribution.
According to the voice activity detecting device as structured above, the aforesaid standardized random variable can be calculated by a simple arithmetic operation.
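A standardized random variable is, in general, a value from which the mean has been subtracted and which has then been divided by the standard deviation. The sketch below applies that definition to the peak magnitude of a frame, taking the mean and standard deviation from the magnitudes of the same frame. Which statistics the device actually uses as the reference is not stated here, so this choice and the function name are assumptions.

```python
import statistics


def standardized_peak(frame):
    """Peak magnitude of the frame expressed in standard deviations from the mean."""
    magnitudes = [abs(s) for s in frame]
    mean = statistics.fmean(magnitudes)
    std = statistics.pstdev(magnitudes)
    if std == 0.0:
        # All magnitudes are equal, so the peak does not stand out at all.
        return 0.0
    return (max(magnitudes) - mean) / std
```

A frame whose samples all have the same magnitude yields 0 in this sketch, whereas a frame containing a single large excursion yields a clearly positive value.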
The above-described objects are also achieved by a voice activity detecting device which is characterized in that previously obtained qualities on an active-voice-frame basis are integrated in order of time sequence to employ the resultant as quality.
According to the voice activity detecting device as structured above, it is possible to reduce or suppress steeply fluctuating components that may accompany the quality of aural signals obtained in order of time sequence.
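Integration in order of time sequence can be sketched as a first-order recursive (leaky) integrator applied to the per-frame qualities. The recursive form and the coefficient alpha are assumptions; other integration schemes, such as a moving average over a fixed number of frames, would serve the same purpose.

```python
def smooth_qualities(qualities, alpha=0.5):
    """Recursively integrate per-frame qualities to suppress steep fluctuation."""
    smoothed = []
    state = None
    for q in qualities:
        # The first frame initializes the integrator; later frames are blended in.
        state = q if state is None else alpha * state + (1.0 - alpha) * q
        smoothed.append(state)
    return smoothed
```

For a hypothetical quality sequence of 1.0, 0.0, 1.0 and alpha of 0.5, the integrated values are 1.0, 0.5, 0.75, so the abrupt drop in the second frame is halved.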
The above-described objects are also achieved by a voice activity detecting device which is characterized in that previously obtained qualities on an active-voice-frame basis are integrated in order of time sequence to employ the resulting values as quality, the values being obtained by weighting the integration result with a smaller value as the integration result is larger.
According to the voice activity detecting device as structured above, subsequently given active voice frames are determined as the active voice segment with higher accuracy as previously given active voice frames have higher quality and as the high quality is attained for a larger proportion of time.
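The weighting of the integration result may be sketched as follows. The running sum used as the integration, the weight 1 / (1 + total), and the restriction to non-negative qualities are assumptions chosen so that the weight becomes smaller as the integration result becomes larger; the specification does not state the weighting function.

```python
def weighted_integral(qualities):
    """Integrate non-negative qualities over time and compress large results."""
    total = 0.0
    outputs = []
    for q in qualities:
        total += q                    # integration in order of time sequence
        weight = 1.0 / (1.0 + total)  # smaller weight for a larger integral
        outputs.append(weight * total)
    return outputs
```

With this assumed weight the output still grows as high quality persists, but it saturates below 1 instead of increasing without bound.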