1. Field of the Invention
The present invention relates to a signal processing device, a signal processing method, and a program, and particularly to a signal processing device, a signal processing method, and a program that can obtain a feature quantity, for example autocorrelation or YIN, that makes it possible to detect a section having periodicity in an input signal with high accuracy.
2. Description of the Related Art
Autocorrelation is one example of periodicity information indicating the periodicity of an audio signal. Autocorrelation is used as a feature quantity for picking up voiced sound of speech in speech recognition, detection of speech sections, and the like (see for example U.S. Pat. No. 6,055,499 (Patent Document 1 hereinafter); "Using of voicing features in HMM-based speech Recognition", D. L. Thomson, Chengalvarayan, Lucent, 2002 Speech Communication (Non-Patent Document 1); "Robust Speech Recognition in Noisy Environments: The 2001 IBM Spine Evaluation System", B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan and R. Sarikaya, IBM, ICASSP2002 (Non-Patent Document 2); "Extraction Methods for Voicing Feature for Robust Speech Recognition", Andras Zolnay, Ralf Schluter, and Hermann Ney, RWTH Aachen, EUROSPEECH 2003 (Non-Patent Document 3); "USING SPEECH/NON-SPEECH DETECTION TO BIAS RECOGNITION SEARCH ON NOISY DATA", Francoise Beaufays, Daniel Boies, Mitch Weintraub, Qifeng Zhu, Nuance Communications, ICASSP2003 (Non-Patent Document 4); "VOICING FEATURE INTEGRATION IN SRI'S DECIPHER LVCSR SYSTEM", Martin Graciarena, Horacio Franco, Jing Zheng, Dimitra Vergyri, Andreas Stolcke, SRI, ICASSP2004 (Non-Patent Document 5); and "A LINKED-HMM MODEL FOR ROBUST VOICING AND SPEECH DETECTION", Sumit Basu, Microsoft Research, ICASSP2003 (Non-Patent Document 6)). In addition, autocorrelation of an audio signal is used for detection of the fundamental frequency (pitch frequency) of speech (see for example "Analysis, enhancement and evaluation of five pitch determination techniques", Peter Veprek, Michael S. Scordilis, Panasonic, Univ. Miami, Speech Communication 37 (2002), pp. 249 to 270, referred to as Non-Patent Document 7).
In addition to autocorrelation, YIN has recently been proposed as periodicity information (see for example "YIN, a fundamental frequency estimator for speech and music", Alain de Cheveigné, Hideki Kawahara, J. Acoust. Soc. Am. 111 (4), April 2002, referred to as Non-Patent Document 8). YIN is used for detection of the fundamental frequency of speech.
Autocorrelation takes a high value when there is a high degree of periodicity and a value of zero when there is no periodicity. Conversely, YIN takes a value of zero when there is a high degree of periodicity and a high value (1) when there is no periodicity. Description will hereinafter be made of a case where autocorrelation is used as periodicity information. However, when YIN is used as periodicity information, it suffices to use 1-YIN in place of normalized autocorrelation to be described later, or to read a maximum value of normalized autocorrelation as a minimum value of YIN and to read a minimum value of normalized autocorrelation as a maximum value of YIN.
While there are a number of kinds of methods for calculating autocorrelation, description will be made below of one of the methods.
A sample value at time t of the input signal, which is a time series sampled at a predetermined sampling frequency, will be expressed as X(t). A range of T samples for a fixed time T, that is, from a time t to a time t+T−1, will be referred to as a frame, and a time series of T sample values of an nth frame (number-n frame) from a start of the input signal will be described as a frame (or frame data) x(n).
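As an illustration only, the division of the input signal into frames can be sketched in Python as follows. The function name split_into_frames and the hop parameter (the number of samples between the starts of successive frames) are assumptions introduced for this sketch; the description above does not specify how successive frames are positioned relative to each other.

```python
def split_into_frames(signal, frame_length, hop=None):
    """Split a list of samples X(t) into frames x(n) of frame_length samples.

    hop is an assumed parameter: the distance in samples between the
    starts of successive frames. By default frames do not overlap.
    """
    if hop is None:
        hop = frame_length
    if frame_length <= 0 or hop <= 0:
        raise ValueError("frame_length and hop must be positive")
    last_start = len(signal) - frame_length
    # Only complete frames of exactly frame_length samples are returned.
    return [signal[start:start + frame_length]
            for start in range(0, last_start + 1, hop)]


frames = split_into_frames(list(range(10)), frame_length=4)
overlapping = split_into_frames(list(range(10)), frame_length=4, hop=2)
```

In this sketch, trailing samples that do not fill a complete frame are discarded, which is one possible convention among several.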
The autocorrelation R′(x(n),τ) of the frame x(n) of the input signal X(t) can be calculated by Equation (1), for example.
[Equation 1]

$$R'(x(n),\tau)=\frac{1}{T}\sum_{i=t}^{t+T-1-\tau} x[i]\,x[i+\tau] \qquad (1)$$
The autocorrelation of a signal is a value indicating the correlation between the signal and a copy of the same signal shifted by a time τ. The time τ is referred to as a lag.
The autocorrelation R′(x(n),τ) of the frame x(n) may also be obtained after subtracting the average value of the T sample values X(t), X(t+1), . . . , and X(t+T−1) of the frame x(n) from each of the T sample values, so that the average value of the values used in the calculation is zero.
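The calculation of Equation (1), together with the optional subtraction of the average value, might be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function name autocorrelation and the remove_mean flag are assumptions introduced here, and the index i is taken relative to the start of the frame (that is, i−t in the notation of Equation (1)).

```python
def autocorrelation(frame, lag, remove_mean=False):
    """R'(x(n), tau) of Equation (1) for one frame of T samples.

    frame: list of T sample values of the frame x(n)
    lag: the lag tau, in samples, with 0 <= lag < T
    remove_mean: assumed flag; if True, the frame average is subtracted first
    """
    T = len(frame)
    if not 0 <= lag < T:
        raise ValueError("lag must satisfy 0 <= lag < T")
    if remove_mean:
        mean = sum(frame) / T
        frame = [sample - mean for sample in frame]
    # The sum runs over the T - lag products that stay inside the frame
    # and is divided by T, as in Equation (1).
    return sum(frame[i] * frame[i + lag] for i in range(T - lag)) / T


r0 = autocorrelation([1.0, -1.0, 1.0, -1.0], 0)
r1 = autocorrelation([1.0, -1.0, 1.0, -1.0], 1)
r2 = autocorrelation([2.0, 0.0, 2.0, 0.0], 2, remove_mean=True)
```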
Autocorrelation resulting from normalizing the autocorrelation R′(x(n),τ) obtained by Equation (1) is referred to as normalized autocorrelation.
When the normalized autocorrelation is expressed as R(x(n),τ), the normalized autocorrelation R(x(n),τ) is obtained, for example, by normalizing the autocorrelation R′(x(n),τ) of Equation (1) by the autocorrelation R′(x(n),0) at a lag τ of zero, that is, by calculating R(x(n),τ)=R′(x(n),τ)/R′(x(n),0).
A maximum value of magnitude of the normalized autocorrelation R(x(n),τ) when the lag τ is changed is one when the input signal X(t) has perfect periodicity, that is, the input signal X(t) is a time series with a certain cycle T0, and the cycle T0 is equal to or less than the time length (frame length) T of the frame.
The normalized autocorrelation R(x(n),τ) is a value close to zero when the input signal X(t) does not have periodicity and the magnitude of the lag τ is substantially larger than zero. Incidentally, the normalized autocorrelation R(x(n),τ) is one when the lag τ is zero.
From the above, the normalized autocorrelation R(x(n),τ) can assume a value from −1 to +1.
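The normalization and the properties just described can be checked with a short Python sketch. The function names, the treatment of an all-zero frame, and the test frame (a sine wave with a cycle of 40 samples and a frame length T of 400) are assumptions chosen for illustration.

```python
import math


def autocorrelation(frame, lag):
    """Equation (1), with indices taken relative to the start of the frame."""
    T = len(frame)
    return sum(frame[i] * frame[i + lag] for i in range(T - lag)) / T


def normalized_autocorrelation(frame, lag):
    """R(x(n), tau) = R'(x(n), tau) / R'(x(n), 0)."""
    r0 = autocorrelation(frame, 0)
    if r0 == 0.0:
        # Assumption: an all-zero frame is treated as having no periodicity.
        return 0.0
    return autocorrelation(frame, lag) / r0


# Hypothetical test frame: a sine wave with a cycle of 40 samples, T = 400.
frame = [math.sin(2.0 * math.pi * i / 40.0) for i in range(400)]
r_at_zero = normalized_autocorrelation(frame, 0)
r_at_cycle = normalized_autocorrelation(frame, 40)
r_at_half_cycle = normalized_autocorrelation(frame, 20)
```

For this test frame the value at a lag of one cycle is 0.9 rather than exactly one, because the sum in Equation (1) contains only T−τ products while R′(x(n),0) contains T products. The value approaches one as the frame length becomes long relative to the lag.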
Voiced sound of a human has a high degree of, if not perfect, periodicity.
FIG. 1 is a waveform chart showing an audio signal of voiced sound of a human. In FIG. 1, an axis of abscissas indicates time, and an axis of ordinates indicates the amplitude (level) of the audio signal.
It is clear from FIG. 1 that the audio signal of voiced sound of a human has periodicity. Incidentally, the audio signal of FIG. 1 is obtained by sampling at a sampling frequency of 16 kHz. The fundamental frequency of the audio signal of FIG. 1 is about 260 Hz (about 60 samples (≈16 kHz/260 Hz)).
The frequency (reciprocal of the cycle) of voiced sound of a human is referred to as fundamental frequency (pitch frequency). It is generally known that the fundamental frequency falls within a range of about 60 Hz to 400 Hz.
The range within which the fundamental frequency of voiced sound of a human falls will be referred to as a fundamental frequency range. When the normalized autocorrelation R(x(n),τ) is obtained with an audio signal of a human (an audio signal of speech of a human) used as the input signal X(t), a maximum value Rmax(x(n)) of the normalized autocorrelation R(x(n),τ) in a range of the lag τ corresponding to the fundamental frequency range is a value close to one in an audio signal section of voiced sound having periodicity.
Supposing that the sampling frequency of the input signal X(t) is for example 16 kHz and that the fundamental frequency range is for example a range of 60 Hz to 400 Hz as described above, 60 Hz corresponds to about 266 samples (=16 kHz/60 Hz), and 400 Hz corresponds to about 40 samples (=16 kHz/400 Hz).
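The conversion from a fundamental frequency range to a range of the lag τ can be sketched in Python as follows. The function name lag_range is an assumption, as is the choice to truncate to whole samples, which is made here so that 60 Hz yields the 266 samples mentioned above.

```python
def lag_range(sampling_hz, f_min_hz, f_max_hz):
    """Return (min_lag, max_lag) in samples for a fundamental frequency range.

    The highest frequency gives the shortest cycle and therefore the
    smallest lag; the lowest frequency gives the largest lag.
    Truncation to whole samples is an assumption of this sketch.
    """
    if not 0 < f_min_hz <= f_max_hz:
        raise ValueError("expected 0 < f_min_hz <= f_max_hz")
    min_lag = int(sampling_hz // f_max_hz)
    max_lag = int(sampling_hz // f_min_hz)
    return min_lag, max_lag


min_lag, max_lag = lag_range(16000, 60, 400)
```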
Thus, the range of the lag τ corresponding to the fundamental frequency range is substantially larger than zero. Therefore the maximum value Rmax(x(n)) of the normalized autocorrelation R(x(n),τ) in the range of the lag τ corresponding to the fundamental frequency range is a value close to zero in a section without periodicity.
As described above, the maximum value Rmax(x(n)) of the normalized autocorrelation R(x(n),τ) in the range of the lag τ corresponding to the fundamental frequency range theoretically has values significantly different from each other in a section with periodicity and a section without periodicity, and can thus be used as a feature quantity of the audio signal as the input signal X(t) in speech processing such as detection of speech sections, speech recognition, and the like.
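As a rough illustration of this difference, the following Python sketch computes the maximum value Rmax(x(n)) over lags of 40 to 266 samples for two hypothetical frames of 512 samples: a 200 Hz sine wave sampled at 16 kHz, and pseudo-random noise. The function names, the frame length, the test signals, and the treatment of an all-zero frame are all assumptions made for this example.

```python
import math
import random


def normalized_autocorrelation(frame, lag):
    """R(x(n), tau) = R'(x(n), tau) / R'(x(n), 0), following Equation (1)."""
    T = len(frame)
    r0 = sum(sample * sample for sample in frame) / T
    if r0 == 0.0:
        # Assumption: an all-zero frame is treated as having no periodicity.
        return 0.0
    r = sum(frame[i] * frame[i + lag] for i in range(T - lag)) / T
    return r / r0


def max_correlation_in_lag_range(frame, min_lag, max_lag):
    """Rmax(x(n)): the maximum of R(x(n), tau) for min_lag <= tau <= max_lag."""
    return max(normalized_autocorrelation(frame, lag)
               for lag in range(min_lag, max_lag + 1))


T = 512  # assumed frame length; it must exceed the largest lag
periodic = [math.sin(2.0 * math.pi * 200.0 * i / 16000.0) for i in range(T)]
rng = random.Random(0)  # fixed seed so that the example is repeatable
noise = [rng.gauss(0.0, 1.0) for _ in range(T)]

rmax_periodic = max_correlation_in_lag_range(periodic, 40, 266)
rmax_noise = max_correlation_in_lag_range(noise, 40, 266)
```

In this sketch the value for the sine wave comes out much closer to one than the value for the noise, which is the separation the paragraph above relies on.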
FIG. 2 shows the audio signal as the input signal X(t) and various signals (information) obtained by processing the audio signal.
A first row from the top of FIG. 2 is a waveform chart of the audio signal as the input signal X(t). In the first row from the top of FIG. 2, an axis of abscissas indicates time (sample points), and an axis of ordinates indicates amplitude.
Incidentally, the audio signal X(t) in the first row from the top of FIG. 2 is obtained by sampling at a sampling frequency of 16 kHz.
A second row from the top of FIG. 2 shows a frequency spectrum obtained by subjecting the audio signal X(t) to an FFT (Fast Fourier Transform). In the second row from the top of FIG. 2, an axis of abscissas indicates time (frames), and an axis of ordinates indicates numbers for identifying so-called bins (frequency components) of the FFT.
Incidentally, because a 512-point (512-sample) FFT is performed as the FFT, one bin corresponds to about 31 Hz (≈16 kHz/512). In the second row from the top of FIG. 2, the magnitude of each frequency component is represented by shading.
A third row from the top of FIG. 2 shows the maximum value Rmax(x(n)) of the normalized autocorrelation R(x(n),τ) of the input signal X(t) in the first row (the frame x(n) obtained from the input signal X(t) in the first row) in the range of the lag τ corresponding to the fundamental frequency range. In the third row from the top of FIG. 2, an axis of abscissas indicates time (frames), and an axis of ordinates indicates the maximum value Rmax(x(n)).
The maximum value Rmax(x(n)) of the normalized autocorrelation R(x(n),τ) in the range of the lag τ corresponding to the fundamental frequency range will hereinafter be referred to as lag range maximum correlation Rmax(x(n)) as appropriate.
A fourth row from the top of FIG. 2 shows the power of the input signal X(t) in the first row (the frame x(n) obtained from the input signal X(t) in the first row), that is, a value as a log of a sum total of respective squares of the T sample values of the frame x(n) (which value will hereinafter be referred to as frame log power as appropriate). In the fourth row from the top of FIG. 2, an axis of abscissas indicates time (frames), and an axis of ordinates indicates the frame log power.
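The frame log power can be sketched in Python as follows. The function name frame_log_power, the use of the natural logarithm, and the small floor that avoids taking the log of zero for a silent frame are assumptions; the description above does not specify the base of the log or how an all-zero frame is handled.

```python
import math


def frame_log_power(frame, floor=1e-10):
    """Log of the sum of the squares of the T sample values of a frame.

    floor is an assumed lower bound on the sum of squares, used only to
    keep the log defined when every sample in the frame is zero.
    """
    energy = sum(sample * sample for sample in frame)
    return math.log(max(energy, floor))


log_power = frame_log_power([1.0, 2.0, 2.0])
silent_log_power = frame_log_power([0.0, 0.0, 0.0])
```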
Parts enclosed by a rectangle in FIG. 2 represent a speech section. Specifically, parts enclosed by a first rectangle, a second rectangle, and a third rectangle from the left in FIG. 2 represent sections in which the utterances of “stop”, “emergency stop”, and “freeze” were made in Japanese.
The audio signal X(t) in the first row from the top of FIG. 2, the frequency spectrum in the second row, and the frame log power in the fourth row do not noticeably differ between the speech sections and non-speech sections. It is therefore understood that it is difficult to detect speech sections using the audio signal X(t), the frequency spectrum, or the frame log power.
On the other hand, the lag range maximum correlation Rmax(x(n)) in the third row from the top of FIG. 2 is a value close to one in the speech sections, and is a value close to zero, which value is substantially lower than one, in the non-speech sections.
It is thus understood that the lag range maximum correlation Rmax(x(n)) is a feature quantity effective in detecting speech sections.