1. Field of the Invention
The present invention relates to a technique for extracting only speech spurt parts from a voice signal, and particularly to a technique applicable to voice packet communications, voice store processing, or the like.
2. Description of Related Art
Speech spurt extraction is used in, for example, communications and voice signal storage, because only the speech spurt parts carry the effective information of the signal to be transferred or stored. Applying this technique makes efficient use of communications network facilities and voice storage equipment possible. Many approaches to speech spurt extraction have therefore been proposed.
Conventional voice packet communications transfer only the effective speech parts of voice signals.
FIG. 1 is a block diagram showing the application of the speech spurt extracting technique to voice packet communications. In FIG. 1, the reference numeral 101 designates a device for converting voice into an electric signal (analog signal), that is, generally a telephone. The reference numeral 102 designates a packet transmitter, and 103 designates a packet receiver. The reference numeral 104 designates a device for converting the electric signal into voice, that is, generally a telephone.
The packet transmitter 102 comprises an analog-to-digital (A/D) converter 105 for converting the analog signal into a digital signal, a speech spurt detector 106 for deciding and extracting only speech spurts from the digital voice signal, and a voice packet transmitter 107 for assembling a packet by adding voice packet control information to the extracted speech spurt signal and for transmitting it to the party equipment. On the other hand, the packet receiver 103 comprises a voice packet receiver 108 for extracting a speech spurt signal from the received voice packet, a voice regenerator 109 for regenerating the speech spurt signal and a mute signal, thereby recovering the digital voice signal, and a digital-to-analog (D/A) converter 110 for converting the digital signal into an analog signal.
A voice signal 111 is composed of speech spurt signals denoted by shaded parts and mute signals denoted by unshaded parts. The voice signal is input to the packet transmitter 102, where the speech spurt detector 106 extracts the speech spurt parts. Then, as indicated by the reference numeral 112, voice packets are assembled from the voice signals in the extracted speech spurt parts, and a header is added to each of them. The packet receiver 103 restores the voice signal from the packet signal 112 and outputs it as a voice signal 113.
Thus, the speech spurt detector 106 extracts only the speech spurt parts of the voice uttered by a talker.
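The transmit and receive path of FIG. 1 can be illustrated with a minimal sketch. It is not taken from the specification: the frame size, the `frame_index` header field, the fixed-threshold `is_speech` decision, and all function names are assumptions chosen only to show how speech spurt frames might be packetized and how the receiver might refill the mute parts.

```python
from dataclasses import dataclass
from typing import List

FRAME = 4  # samples per frame (hypothetical; kept tiny for illustration)


@dataclass
class VoicePacket:
    frame_index: int    # hypothetical header field: position of the frame in the stream
    samples: List[int]  # payload: samples of one speech spurt frame


def is_speech(frame: List[int], threshold: int = 100) -> bool:
    # Placeholder decision (assumption): peak amplitude against a fixed threshold.
    return max(abs(s) for s in frame) > threshold


def transmit(signal: List[int]) -> List[VoicePacket]:
    # Role of detector 106 and transmitter 107: keep only speech spurt frames
    # and attach a header to each of them.
    packets = []
    for i in range(0, len(signal), FRAME):
        frame = signal[i:i + FRAME]
        if is_speech(frame):
            packets.append(VoicePacket(i // FRAME, frame))
    return packets


def regenerate(packets: List[VoicePacket], total_frames: int) -> List[int]:
    # Role of receiver 108 and regenerator 109: start from an all-mute signal
    # and write each received speech spurt frame back at its position.
    out = [0] * (total_frames * FRAME)
    for p in packets:
        start = p.frame_index * FRAME
        out[start:start + len(p.samples)] = p.samples
    return out


signal = [0, 2, -1, 3, 500, -400, 300, -200, 1, 0, -2, 1]
packets = transmit(signal)
restored = regenerate(packets, total_frames=3)
```

In this sketch only the middle frame is transmitted, and the receiver replaces the two low-level frames with zeros, which corresponds to regenerating the mute signal.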
An inappropriate technique for extracting the speech spurts from the voice will result in breaks in the extracted voice, or in omissions of the initial and/or final parts of words. This presents a problem of degrading the voice reproduced from the extracted speech spurts.
In addition, it is necessary to take into account that the environment of the source is not always quiet but is constantly subject to interference from external noise. The noise presents another problem in that it may be misidentified as speech spurts and hence increase the amount of signal extracted as speech spurts. This hinders effective use of the equipment, contrary to the purpose of detecting only significant speech spurts. It is further necessary to consider the fact that the noise level fluctuates from moment to moment.
To solve these problems, various proposals have been made which are roughly divided into:
(1) A method of setting a speech spurt level in advance, and identifying signals exceeding the level as speech spurts.
(2) A method of detecting voices from a zero-cross frequency, utilizing the difference in frequencies between a signal and noise to distinguish voices from noise.
(3) A method of detecting voices using a combination of (1) and (2).
a storage for storing an input voice signal;
a decision portion for making a decision of speech spurt sections and mute sections from the input voice signal using a threshold value;
a mute level statistical processor for estimating noise distribution of a signal in the mute sections by statistically processing the mute sections decided by the decision portion;
a speech spurt detecting threshold value decision portion for deciding a speech spurt detecting threshold value considering the noise distribution such that the threshold value is unaffected by noise; and
a speech spurt transfer portion for outputting from the storage the voice signal in the speech spurt sections.
storing an input voice signal;
making a decision of speech spurt sections and mute sections from the input voice signal using a threshold value;
estimating noise distribution of a signal in the mute sections by statistically processing the mute sections;
deciding a speech spurt detecting threshold value considering the noise distribution such that the threshold value is unaffected by noise; and
outputting the voice signal in the speech spurt sections from the stored voice signal.
(a) The present invention dynamically varies the speech spurt detecting threshold value in response to the input signal.
(b) The dynamic variation in the speech spurt detecting threshold value is determined by statistically processing the noise characteristics in mute sections.
(c) Considering the changes in the environment of a sound source, the mute sections to be statistically processed are assumed as a rule to have a level below the speech spurt detecting threshold value in an initial state, whereas they are selected during a hangover time from its latter part, in which it is highly probable that the voice has been nearly extinguished.
(d) An error in the statistical processing is identified, and if it matches a particular condition, the statistical processing is initialized.
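Items (a) and (b) can be pictured with a small sketch. The text above does not give a formula, so the rule used here (mean of the mute-section levels plus `k` standard deviations), the parameter names, and the fallback to an initial threshold when no mute samples are available are all assumptions made for illustration.

```python
import statistics
from typing import Sequence


def update_threshold(mute_levels: Sequence[float],
                     initial_threshold: float = 50.0,
                     k: float = 3.0) -> float:
    """Derive a speech spurt detecting threshold from mute-section levels.

    Assumed rule (not from the specification): threshold = mean + k * std,
    so that ordinary noise fluctuations stay below the threshold.
    """
    if not mute_levels:
        # No statistics yet: fall back to the initial threshold (assumption).
        return initial_threshold
    mean = statistics.fmean(mute_levels)
    std = statistics.pstdev(mute_levels)  # population standard deviation
    return mean + k * std


quiet_room = update_threshold([8.0, 12.0])    # mean 10, std 2
noisy_room = update_threshold([80.0, 120.0])  # mean 100, std 20
```

With these inputs the threshold rises from 16.0 in the quiet case to 160.0 in the noisy case, which is the kind of dynamic variation described in item (a).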
Although the foregoing conventional techniques are effective to some extent in distinguishing voices from noise, sounds with a wide frequency range like musical sounds contained in an audio signal can be misidentified as noise by the foregoing method (3), for example.
In practice, except in the case of speech recognition and production, it is not rare for musical sounds (for example, the call-holding sound of a telephone) to be mixed into the audio signal. In view of this, the speech spurts must be extracted from environmental sounds that include musical sounds.
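The conventional methods (1) to (3) and their weakness with musical sounds can be sketched as follows. This is an illustration only: the frame-level measure, the zero-crossing count, the assumption that noise shows more zero crossings than voice, and every threshold value are hypothetical and not taken from the cited techniques.

```python
from typing import Sequence


def frame_level(frame: Sequence[int]) -> float:
    # Average absolute amplitude of the frame.
    return sum(abs(s) for s in frame) / len(frame)


def zero_cross_count(frame: Sequence[int]) -> int:
    # Number of sign changes between adjacent samples (0 counts as non-negative).
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def is_speech_level(frame: Sequence[int], level_threshold: float) -> bool:
    # Method (1): preset level; frames above it are treated as speech spurts.
    return frame_level(frame) > level_threshold


def is_speech_zero_cross(frame: Sequence[int], max_crossings: int) -> bool:
    # Method (2), under the assumption that noise crosses zero more often than voice.
    return zero_cross_count(frame) <= max_crossings


def is_speech_combined(frame: Sequence[int], level_threshold: float,
                       max_crossings: int) -> bool:
    # Method (3): both conditions must hold.
    return (is_speech_level(frame, level_threshold)
            and is_speech_zero_cross(frame, max_crossings))


voiced = [300, 280, 200, 50, -100, -250, -300, -200]      # slow waveform
musical = [300, -300, 300, -300, 300, -300, 300, -300]    # loud, high-frequency

voiced_ok = is_speech_combined(voiced, level_threshold=100, max_crossings=3)
musical_ok = is_speech_combined(musical, level_threshold=100, max_crossings=3)
```

Here the loud high-frequency frame passes the level test but fails the zero-crossing test, so the combined method rejects it as noise, which mirrors the misidentification of musical sounds described above.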