The technique of voice detection, which divides an input signal into voiced intervals and non-voiced intervals, is in widespread use in a variety of technical fields. Several examples are given below.
For example, in mobile communications, voice detection is used to improve voice transmission efficiency, e.g., to improve the compression efficiency for a non-voiced interval, or to refrain from transmitting a non-voiced interval.
In a noise canceller or an echo canceller, voice detection is used to estimate or determine the noise in non-voiced intervals.
Further, in a voice recognition system, voice detection is used to improve performance or to reduce the amount of processing.
FIG. 10 illustrates a configuration of a typical voice detection apparatus (related technique). For this sort of voice detection apparatus, reference may be made to, for example, the disclosure of Patent Document 1.
Referring to FIG. 10, this voice detection apparatus includes
an input signal acquisition unit 1 that slices the input signal on a per frame basis and acquires the so sliced frame-based input signal,
a feature value calculation unit 2 that calculates a feature value, used for voice detection, from the sliced frame-based input signal,
a voice/non-voice decision unit 14 that compares the feature value with its threshold value stored in a threshold value storage unit 13, on a per frame basis, to distinguish between voice and non-voice, and
an interval shaping unit 16 that performs shaping of the decision results, which have been found on a per frame basis across a plurality of frames, based on a shaping rule stored in a shaping rule storage unit 15, to determine the voiced interval and the non-voiced interval.
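The first three units of FIG. 10 can be sketched as follows. This is a minimal illustration only, not the implementation of Patent Document 1: the log-energy feature, the non-overlapping framing, and the names `slice_frames`, `log_energy` and `frame_decisions` are all assumptions made for this example.

```python
import math

def slice_frames(signal, frame_len):
    """Unit 1 (assumed framing): consecutive, non-overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def log_energy(frame):
    """Unit 2 (hypothetical feature): mean squared amplitude in dB."""
    power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(power + 1e-12)  # small offset avoids log(0)

def frame_decisions(signal, frame_len, threshold):
    """Unit 14: compare the feature with a preset threshold, frame by frame.

    Returns one boolean per frame; True means the frame is decided as voice.
    """
    return [log_energy(frame) > threshold
            for frame in slice_frames(signal, frame_len)]

# Usage: a toy signal with one loud frame between two silent frames.
toy_signal = [0.0] * 4 + [1.0] * 4 + [0.0] * 4
decisions = frame_decisions(toy_signal, frame_len=4, threshold=-20.0)
```

The per-frame decisions produced in this way are what the interval shaping unit 16 subsequently operates on.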
A large variety of feature values, calculated by the feature value calculation unit 2, are used for voice detection. An example of the feature value is a smoothed version of variations of the spectral power (see Patent Document 1). Other examples of the feature value may include
a value of SNR (signal-to-noise ratio) (see Non-Patent Document 1 (paragraph 4.3.3)),
a mean value of SNR (see Non-Patent Document 1 (paragraph 4.3.5)),
a zero-crossing number (see Non-Patent Document 2 (paragraph B.3.1.4)),
a likelihood ratio that uses a voice GMM (Gaussian Mixture Model) and a silent GMM (see Non-Patent Document 3), and
a combination of a plurality of feature values (see Non-Patent Document 4).
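As an illustration of one of these feature values, a zero-crossing number for a single frame might be computed as sketched below. This is a simplified count of sign changes and is not claimed to reproduce the exact definition given in Non-Patent Document 2; the function name `zero_crossing_count` and the convention of treating a zero sample as non-negative are assumptions.

```python
def zero_crossing_count(frame):
    """Count sign changes between consecutive samples of one frame.

    Assumption: a sample equal to zero is treated as non-negative.
    """
    count = 0
    for prev, curr in zip(frame, frame[1:]):
        if (prev >= 0) != (curr >= 0):
            count += 1
    return count

# Usage: a rapidly alternating frame versus a frame that never changes sign.
alternating = zero_crossing_count([1, -1, 1, -1])
steady = zero_crossing_count([1, 2, 3])
```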
The interval shaping unit 16 performs interval shaping in order to suppress the voiced intervals or non-voiced intervals of short duration that may be produced when the voice/non-voice decision unit 14 performs the voice/non-voice decision on a per frame basis.
As a shaping rule used for determining the voiced interval and the non-voiced interval, Patent Document 1 discloses the following.
Condition (1): A voiced interval that is shorter than the necessary minimum duration is not recognized as a voiced interval. In the following description, this necessary minimum duration is termed ‘voiced interval duration threshold value’.
Condition (2): A non-voiced interval that is sandwiched between voiced intervals and that is short enough to be treated as part of a continuous voiced interval is combined with the voiced intervals on both sides, and the resulting interval is treated as a single voiced interval. In the present description, the duration below which such a non-voiced interval is treated as part of a continuous voiced interval is termed a ‘non-voiced interval duration threshold value’, because an interval greater than or equal to this duration is decided to be a non-voiced interval.
Condition (3): A pre-defined constant number of frames is appended to each of the leading and trailing ends of a voiced interval. In the present description, the constant numbers of frames appended to the leading and trailing ends of the voiced interval are respectively termed ‘leading and trailing end margins’.
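Conditions (1) to (3) can be sketched as operations on a list of per-frame voice/non-voice decisions. The sketch below is an illustration under stated assumptions and not the procedure of Patent Document 1: the order in which the conditions are applied (Condition (2), then (1), then (3)), the measurement of all durations in frames, and the names `voiced_runs` and `shape_intervals` are all assumptions. Intervals that come to overlap after the margins are appended are not merged in this sketch.

```python
def voiced_runs(decisions):
    """Return [start, end) frame-index pairs of consecutive voice frames."""
    runs, start = [], None
    for i, is_voice in enumerate(decisions):
        if is_voice and start is None:
            start = i
        elif not is_voice and start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(decisions)])
    return runs

def shape_intervals(decisions, min_voiced, min_nonvoiced,
                    lead_margin, trail_margin):
    """Apply the three shaping conditions (assumed order: 2, 1, 3)."""
    # Condition (2): a non-voiced gap shorter than the non-voiced interval
    # duration threshold value is absorbed into the neighbouring voiced runs.
    merged = []
    for run in voiced_runs(decisions):
        if merged and run[0] - merged[-1][1] < min_nonvoiced:
            merged[-1][1] = run[1]
        else:
            merged.append(list(run))
    # Condition (1): drop voiced intervals shorter than the voiced interval
    # duration threshold value.
    kept = [r for r in merged if r[1] - r[0] >= min_voiced]
    # Condition (3): append leading and trailing end margins, clipped to
    # the range of available frames.
    n = len(decisions)
    return [(max(0, s - lead_margin), min(n, e + trail_margin))
            for s, e in kept]

# Usage: T = voice frame, F = non-voice frame.
T, F = True, False
frames = [F, T, T, F, T, T, T, F, F, F, F, T, F, F]
shaped = shape_intervals(frames, min_voiced=3, min_nonvoiced=2,
                         lead_margin=1, trail_margin=1)
```

In this toy input, the one-frame gap at index 3 is absorbed into a single voiced interval, the isolated voice frame at index 11 is discarded as too short, and the surviving interval is widened by one frame on each side.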
In this voice detection apparatus, preset values are used both for the threshold values of the feature values found on a per frame basis and for the parameters relating to the shaping rule.
Patent Document 1:
JP Patent Kokai Publication No. JP-P2006-209069A
Non-Patent Document 1:
ETSI EN 301 708 V7.1.1
Non-Patent Document 2:
ITU-T G.729 Annex B
Non-Patent Document 3:
A. Lee, K. Nakamura, R. Nishimura, H. Saruwatari, K. Shikano, “Noise Robust Real World Spoken Dialogue System using GMM Based Rejection of Unintended Inputs”, ICSLP-2004, Vol. 1, pp. 173-176, October 2004
Non-Patent Document 4:
Yusuke Kida and Tatsuya Kawahara, “Voice Activity Detection based on Optimally Weighted Combination of Multiple Features”, IPSJ SIG Technical Report, 2005-SLP-57(9)
Non-Patent Document 5:
Kenji Kita, ‘Stochastic Language Model’, chapter 6, pp. 155-162, 1999, University of Tokyo Press