The present invention relates to a method and circuit arrangement for automatically recognizing speech activity in transmitted signals.
For digital mobile telephone or speech memory systems, and in many other applications, it is advantageous to transmit speech encoding parameters discontinuously. In this way, the bit rate can be reduced considerably during pauses in speech or time periods dominated by background noise. Advantages of discontinuous transmission in mobile terminals include lower energy consumption. Such lower energy consumption may be due to a higher mean bit rate for simultaneous services such as data transmission or to a higher memory chip capacity.
The extent of the benefit afforded by discontinuous transmission depends on the proportion of pauses in the speech signal and the quality of the automatic voice activity detection device needed to detect such periods. While a low speech activity rate is advantageous, active speech should not be cut off so as to adversely affect speech quality. This tradeoff is a basic challenge in devising automatic voice activity detection systems, especially in the presence of high background noise levels.
Known methods of automatic voice activity detection typically employ decision parameters based on average time values over constant-length windows. Examples include autocorrelation coefficients, zero crossing rates or basic speech periods. These parameters afford only limited flexibility for selecting time/frequency range resolution. Such resolution is normally predefined by the frame length of the respective speech encoder/decoder.
In contrast, the known wavelet transformation technique computes an expansion in the time/frequency range. The calculation results in low frequency range resolution but high time range resolution at high frequencies, and low time range resolution but high frequency range resolution at low frequencies. These properties, which are well suited to the analysis of speech signals, have been used for the classification of active speech into the categories voiced, voiceless and transitional. See German Offenlegungsschrift 195 38 852 A1 "Verfahren und Anordnung zur Klassifizierung von Sprachsignalen" (Method of and Arrangement for Classifying Speech Signals), 1997, related to U.S. patent application Ser. No. 08/734,657 filed Oct. 21, 1996, which U.S. application is hereby incorporated by reference herein.
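The time/frequency trade-off described above can be illustrated with a minimal sketch. It assumes an orthonormal Haar wavelet, chosen here only for brevity; the text does not name a particular wavelet, and the function names `haar_step` and `haar_dwt` are illustrative. Each decomposition level halves the number of detail coefficients, so the high-frequency band keeps many coefficients per frame (fine time resolution) while the low-frequency bands keep few (coarse time resolution, narrower band).

```python
import math

def haar_step(signal):
    """One level of an orthonormal Haar transform: returns (approximation, detail)."""
    s = 1.0 / math.sqrt(2.0)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    approx = [(a + b) * s for a, b in pairs]   # low-pass half band
    detail = [(a - b) * s for a, b in pairs]   # high-pass half band
    return approx, detail

def haar_dwt(frame, levels):
    """Decompose a frame into `levels` detail bands plus a final approximation."""
    approx = list(frame)
    details = []
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)   # details[0] is the highest-frequency band
    return approx, details

# A 16-sample frame decomposed over 3 levels
approx, details = haar_dwt([float(n) for n in range(16)], 3)
print([len(d) for d in details], len(approx))
```

For a 16-sample frame and three levels, the detail bands hold 8, 4 and 2 coefficients, and the remaining approximation holds 2.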
The known methods and devices discussed are not necessarily prior art to the present invention.
An object of the present invention is therefore to provide a method and a circuit arrangement, based on wavelet transformation, for voice activity detection to determine whether speech or speech sounds are present in a given time segment.
The present invention therefore provides a method of automatic voice activity detection, based on the wavelet transformation, characterized in that a voice activity detection circuit or module (5), controlling a speech encoder (9) and a speech decoder (22), as well as a background noise encoder (10) and a background noise decoder (23), is used to achieve source-controlled reduction of the mean transmission rate; a wavelet transformation is computed for each frame after segmentation of a speech signal, a set of parameters is determined from said wavelet transformations, and a set of binary decision variables is determined from said parameters, using fixed thresholds, in an arithmetic circuit or a processor (32), said decision variables controlling a decision logic (42), whose result provides a "speech present/no speech" statement after time smoothing for each frame.
The present invention also provides a circuit arrangement for performing a method of automatic voice activity detection, based on wavelet transformation. The circuit arrangement is characterized in that the input speech signals go to the input (1) of a transfer switch (4). A voice activity detection circuit or module (5) is connected to the input (1), and the output of said voice activity detection circuit controls said transfer switch (4) and another transfer switch (13), and is connected to a transmission channel (16). The output of the transfer switch (4) is connected, via lines (7, 8), to a speech encoder (9) and a background noise encoder (10), whose outputs are connected, via lines (11, 12) to the inputs of the transfer switch (13), whose output is connected, via a line (15), to the input of the transmission channel (16). The transmission channel is connected to both another transfer switch (19) and, via a line (18), to the control of the transfer switch (19) and of a transfer switch (26) arranged at the output (27). A speech decoder (22) and a background noise decoder (23) are arranged between the two transfer switches (19 and 26).
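The switching arrangement described above can be sketched in a few lines. This is an illustration under stated assumptions, not the circuit itself: the four encoder and decoder functions are trivial hypothetical stand-ins (the text does not describe their internals), and the voice activity flag is simply carried alongside the payload to mimic the control path on line (18).

```python
import math

# Hypothetical stand-ins for the speech and background noise codecs.
def speech_encode(frame):
    return list(frame)                          # pass samples through unchanged

def speech_decode(payload):
    return list(payload)

def noise_encode(frame):
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return (len(frame), rms)                    # only a length and a level are sent

def noise_decode(payload):
    length, rms = payload
    return [rms] * length                       # crude constant-level substitute

def encode_frame(frame, speech_active):
    """Transmit side: the VAD flag selects the encoder, as switches (4) and (13) do."""
    encoder = speech_encode if speech_active else noise_encode
    return speech_active, encoder(frame)        # the flag travels with the payload

def decode_frame(packet):
    """Receive side: the received flag selects the decoder, as switches (19) and (26) do."""
    speech_active, payload = packet
    decoder = speech_decode if speech_active else noise_decode
    return decoder(payload)

packet = encode_frame([0.5, -0.5, 0.5, -0.5], speech_active=False)
print(packet, decode_frame(packet))
```

In the sketch, a frame flagged as background noise is reduced to two numbers, which is where the reduction in mean transmission rate would come from.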
The present method of automatic voice activity detection is applicable to speech encoders/decoders to achieve source-controlled reduction of the mean transmission rate. With the present invention, after segmentation of a speech signal, a wavelet transformation is computed for each frame to determine a set of parameters. From these parameters a set of binary decision variables is computed using fixed thresholds. The binary decision variables control a decision logic whose result delivers, after time smoothing, a "speech present/no speech present" statement for each frame. The present invention achieves a source-controlled reduction of the mean transmission rate by determining whether any speech is present in the time segment under consideration. This result can then be used for function control or as a pre-stage for a variable bit rate speech encoder/decoder.
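The processing chain just described (segmentation, wavelet transformation per frame, parameters, fixed thresholds, decision logic, time smoothing) can be sketched as follows. Every specific in this sketch is an assumption made for illustration: the Haar wavelet, per-level detail energies as the parameters, the threshold values, an OR combination as the decision logic, and a simple hangover counter as the time smoothing. The text does not fix any of these choices at this point.

```python
import math

def haar_details(frame, levels):
    """Detail coefficients per level of a Haar transform (an assumed wavelet)."""
    s = 1.0 / math.sqrt(2.0)
    approx, details = list(frame), []
    for _ in range(levels):
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx) - 1, 2)]
        details.append([(a - b) * s for a, b in pairs])
        approx = [(a + b) * s for a, b in pairs]
    return details

def segment(signal, frame_len):
    """Split the signal into non-overlapping frames, dropping a trailing partial frame."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def frame_decision(frame, levels, thresholds):
    """Compare per-level energies with fixed thresholds; OR is a placeholder logic."""
    energies = [sum(c * c for c in d) for d in haar_details(frame, levels)]
    flags = [e > t for e, t in zip(energies, thresholds)]   # binary decision variables
    return any(flags)

def smooth(raw, hangover):
    """Hold the 'speech' state for `hangover` frames after the last active frame."""
    out, remaining = [], 0
    for active in raw:
        if active:
            remaining = hangover
            out.append(True)
        elif remaining > 0:
            remaining -= 1
            out.append(True)
        else:
            out.append(False)
    return out

def detect_speech(signal, frame_len=8, levels=2, thresholds=(0.5, 0.5), hangover=1):
    raw = [frame_decision(f, levels, thresholds) for f in segment(signal, frame_len)]
    return smooth(raw, hangover)

# Silence, one frame of a rapidly alternating signal, then silence again
signal = [0.0] * 16 + [1.0, -1.0] * 4 + [0.0] * 16
print(detect_speech(signal))
```

With the example signal, only the third frame exceeds a threshold, and the hangover extends the "speech present" result by one further frame so that the end of an utterance is not clipped.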
Other advantageous embodiments of the present invention include:
(a) that after the wavelet transformation, a set of energy parameters is determined for each segment from the transformation coefficients and compared with fixed threshold values, whereby binary decision variables are obtained for controlling the decision logic (42), which provides an interim result for each frame at the output;
(b) that the interim result for each frame, determined by the decision logic, is post-processed by means of time smoothing, whereby the final "speech present or no speech" result is formed for the current frame;
(c) that background detectors (36, 37) are controlled using signals for detecting background noise, and the detail coefficients (D) are analyzed in the rough time interval (N) and detail coefficients (D2) are analyzed in the finer time interval (N/P); P represents the number of subframes, and the relationships Q1, Q2 ∈ (1, ..., L) and Q1 > Q2 apply; and
(d) that the input (1) is connected to a segmenting circuit (28), whose output is connected, via a line (29), to a wavelet transformation circuit (30), which is connected to the input of an arithmetic circuit or a processor (32) for calculating the energy values; the output of the processor (32) is connected, via a line (33) and parallel to a pause detector (34), to a circuit for computing the measure of stationarity (35), a first background detector (36), and a second background detector (37); the outputs of said circuits (34 through 37) are connected to a decision logic (42), whose output is connected to a smoothing circuit (44) for time smoothing, and the output of the smoothing circuit (44) is also the output (45) of the voice activity detection device.
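The distinction in item (c) between analysis over the rough time interval and over finer subframes can be illustrated with a small sketch. It assumes that the analysis in question is an energy computed from detail coefficients, in line with the energy parameters of item (a); the function names are hypothetical, and how the background detectors (36, 37) actually use such values is not stated in this passage.

```python
def frame_energy(coeffs):
    """Energy of the detail coefficients over the whole frame (rough interval)."""
    return sum(c * c for c in coeffs)

def subframe_energies(coeffs, p):
    """Energies over P equal subframes of the same coefficients (finer interval)."""
    size = len(coeffs) // p
    return [sum(c * c for c in coeffs[k * size:(k + 1) * size]) for k in range(p)]

# Detail coefficients of one frame containing a short burst
d = [0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0]
print(frame_energy(d), subframe_energies(d, 4))
```

The frame-level value only reports that energy is present somewhere in the frame, while the subframe values localize it to the third of four subframes.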
Further advantages of the voice activity detection method and the respective circuit arrangement are explained in detail below with reference to the embodiments.