Speech detection (SD) is a process of cutting out, from a continuously input sound signal, a segment during which a person is speaking. This process is also called voice activity detection (VAD). Hereinafter, speech detection will also be referred to as “segment detection”.
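As a rough illustration of what cutting out a segment from a continuous signal can look like, the following Python sketch marks runs of frames whose short-time energy exceeds a fixed threshold. It is only a minimal example under stated assumptions and is not the method of any of the documents cited in this section: the function names `frame_energies` and `detect_segments`, the frame length of 160 samples, and the threshold of 0.01 are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each full, non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def detect_segments(
    samples: Sequence[float],
    frame_len: int = 160,      # assumed frame length (hypothetical)
    threshold: float = 0.01,   # assumed energy threshold (hypothetical)
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices, end exclusive, of each run of
    consecutive frames whose energy exceeds the threshold."""
    energies = frame_energies(samples, frame_len)
    segments = []
    seg_start = None
    for i, energy in enumerate(energies):
        active = energy > threshold
        if active and seg_start is None:
            seg_start = i * frame_len          # a segment begins here
        elif not active and seg_start is not None:
            segments.append((seg_start, i * frame_len))  # segment ends
            seg_start = None
    if seg_start is not None:
        # The signal ended while a segment was still open.
        segments.append((seg_start, len(energies) * frame_len))
    return segments
```

A fixed energy threshold of this kind cannot tell a loud noise from speech, which is one reason simple detectors wrongly report segments in which no one has spoken.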
Speech detection may be performed together with speech recognition, sound source extraction, and the like. In any case, a high degree of accuracy is required in segment detection.
In many speech recognition devices, for example, processing such as matching is performed on segments cut out through segment detection, and therefore, the accuracy of speech recognition is greatly affected by the accuracy of the segment detection. Specifically, if a segment during which a person has actually spoken differs from a segment detected through a segment detection process, the difference will cause wrong recognition. In other cases, if a speech segment is wrongly detected even though no speech has been uttered, a recognition process is performed on the sound in the segment, and the system wrongly operates in accordance with the wrong recognition result.
Meanwhile, segment detection might also be performed in a sound source extraction process to select and extract one speech from an obtained sound in which different sounds coexist. For example, in a case where a clear speech is to be extracted from a signal in which speech and noise coexist, or in a case where the speech of one person is to be extracted while two or more persons are speaking simultaneously, some sound source extraction systems need the input signal to be divided into a segment during which only noise exists and a segment during which noise and a speech coexist. To divide such an input signal, segment detection is performed.
There are also cases where sound source extraction is performed only when a target speech exists, and segment detection is performed to reduce the amount of calculation and to prevent the extraction from being applied to silent segments. In such speech detection to be performed in conjunction with sound source extraction, operation with a high degree of accuracy is required even if an input signal is formed with a mixture of a speech and noise or a mixture of speeches.
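The idea of running sound source extraction only where a target speech exists can be sketched as follows. This is a minimal illustration, not the arrangement of any cited document: the function name `apply_in_segments` and the callable `extract` are hypothetical, `extract` stands in for whatever extraction process is used and is assumed to return as many samples as it receives, and the segments are assumed to be non-overlapping (start, end) sample indices with the end exclusive.

```python
from typing import Callable, List, Sequence, Tuple


def apply_in_segments(
    samples: Sequence[float],
    segments: Sequence[Tuple[int, int]],
    extract: Callable[[List[float]], List[float]],
) -> List[float]:
    """Apply `extract` only inside the detected segments.

    Samples outside every segment are output as silence (0.0), so the
    costly extraction step is never run on them.
    """
    output = [0.0] * len(samples)
    for start, end in segments:
        chunk = list(samples[start:end])
        result = extract(chunk)
        if len(result) != len(chunk):
            # Assumption of this sketch: extraction preserves length.
            raise ValueError("extract must return one value per input sample")
        output[start:end] = result
    return output
```

In this sketch the amount of calculation scales with the total length of the detected segments rather than with the length of the input, so a segment detected where no speech exists costs computation and also lets unwanted sound through.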
Note that conventional technologies related to speech detection are disclosed in Patent Document 1 (JP 2012-150237 A), Patent Document 2 (JP 4282704 B2), Patent Document 3 (JP 2010-121975 A), Patent Document 4 (JP 4182444 B2), Patent Document 5 (JP 2008-175733 A), and Patent Document 6 (JP 2013-44950 A), for example. Also, a conventional technology related to a sound source extraction process is disclosed in Patent Document 7 (JP 2012-234150 A), for example.