The present technology relates to a tonal component detection method, a tonal component detection apparatus, and a program.
Components constituting a one-dimensional time signal such as voice or music are broadly classified into three types of representations: (1) a tonal component, (2) a stationary noise component, and (3) a transient noise component. The tonal component corresponds to a component caused by the stationary and periodic vibration of a sound source. The stationary noise component corresponds to a component caused by a stationary but non-periodic phenomenon such as friction or turbulence. The transient noise component corresponds to a component caused by a non-stationary phenomenon such as a blow or a sudden change in a sound condition. Among them, the tonal component is a component that faithfully represents the intrinsic properties of a sound source itself, and thus it is particularly important when analyzing the sound.
The tonal component obtained from an actual sound is often a plurality of sinusoidal components that change gradually over time. The tonal component appears, for example, as a horizontal stripe-shaped pattern on a spectrogram in which the amplitudes of the short-time Fourier transform are arranged in time series, as shown in FIG. 8. FIG. 9 illustrates a spectrum extracted from a frame in the vicinity of 0.2 seconds on the time axis in FIG. 8. In FIG. 9, the true tonal components to be detected are indicated by arrows for reference. Detecting, with high accuracy, the times and frequencies at which tonal components are present in such a spectrum is a fundamental process for many application techniques such as sound analysis, coding, noise reduction, and high-quality sound reproduction.
Tonal component detection has long been studied. A typical technique obtains an amplitude spectrum for each short time frame, detects local peaks of the amplitude spectrum, and regards all of the detected peaks as tonal components. One disadvantage of this method is that it produces a large number of erroneous detections, because not all local peaks are necessarily tonal components.
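The typical technique described above can be sketched as follows. This is only a minimal illustration, not an implementation taken from any cited work: the function names `amplitude_spectrum` and `local_peaks`, the frame length, the zero padding, and the test signal are all assumptions chosen for the example. A single sinusoid is analyzed with a rectangular window, and every bin that exceeds both of its neighbours is reported as a peak.

```python
import cmath
import math


def amplitude_spectrum(frame):
    """Magnitude of a naive DFT for bins 0..N/2 (O(N^2), adequate for a short demo frame)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def local_peaks(spectrum):
    """Indices of bins that are strictly greater than both neighbouring bins."""
    return [
        k
        for k in range(1, len(spectrum) - 1)
        if spectrum[k - 1] < spectrum[k] > spectrum[k + 1]
    ]


# Hypothetical test frame: 32 samples containing exactly 4 cycles of a sinusoid,
# zero-padded to 128 samples so that the window side lobes become visible.
# The single true tonal component therefore lies at bin 16 of the padded spectrum.
samples = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
frame = samples + [0.0] * 96

spectrum = amplitude_spectrum(frame)
peaks = local_peaks(spectrum)

print("detected peak bins:", peaks)
print("bin 16 detected:", 16 in peaks)
print("number of additional (side lobe) peaks:", len(peaks) - 1)
```

Although the input contains only one sinusoid, the list of detected peaks contains more than one entry, because the side lobes of the window also form local maxima. Regarding all of these peaks as tonal components would yield erroneous detections of the kind described above.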
Local peaks occurring in the amplitude spectrum include (1) a peak due to the tonal component, (2) a side lobe peak, (3) a noise peak, and (4) an interference peak. FIG. 10 shows the result of detecting local peaks in the amplitude spectra of the spectrogram of FIG. 8, with the detected peaks indicated by black dots. The black horizontal stripes of FIG. 8, i.e., the tonal components, are also detected as horizontal lines in FIG. 10. However, a large number of peaks are also detected in other portions, such as noise components. FIG. 11 shows the result of similarly detecting local peaks in the spectrum of FIG. 9, with the detected peaks indicated by black dots. Compared with the true tonal components indicated in FIG. 9, FIG. 11 contains a large number of erroneously detected peaks.
Approaches for improving the detection accuracy of the method described above include, for example, (A) a method of setting a threshold on the height of each local peak and not detecting local peaks whose height is smaller than the threshold, and (B) a method of connecting local peaks across multiple frames in the time direction according to a local neighbor rule and excluding components that are not connected more than a certain number of times.
Method (A) assumes that the magnitude of a tonal component is always greater than that of a noise component. However, this assumption is unreasonable and does not hold in many cases, so the resulting performance improvement is limited. In fact, the magnitude of the peak erroneously detected in the vicinity of 2 kHz on the frequency axis of FIG. 11 is almost the same as that of the tonal component in the vicinity of 3.9 kHz, so the assumption does not hold in this example.
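The limitation of method (A) can be illustrated with a small sketch. The spectrum values, the labelling of bins as tonal or noise, and the function name `threshold_peaks` are hypothetical and chosen only to mirror the situation described for FIG. 11, in which a noise peak and a tonal peak have almost the same magnitude.

```python
def local_peaks(spectrum):
    """Indices of bins that are strictly greater than both neighbouring bins."""
    return [
        k
        for k in range(1, len(spectrum) - 1)
        if spectrum[k - 1] < spectrum[k] > spectrum[k + 1]
    ]


def threshold_peaks(spectrum, peaks, threshold):
    """Method (A): discard local peaks whose height is smaller than the threshold."""
    return [k for k in peaks if spectrum[k] >= threshold]


# Hypothetical amplitude spectrum (illustrative values only).
# Assumed ground truth: bins 1 and 5 are tonal, bin 3 is a noise peak.
spectrum = [0.1, 0.9, 0.1, 0.5, 0.1, 0.5, 0.1]
peaks = local_peaks(spectrum)

low = threshold_peaks(spectrum, peaks, 0.4)   # keeps the noise peak at bin 3
high = threshold_peaks(spectrum, peaks, 0.6)  # loses the tonal peak at bin 5

print("all local peaks:   ", peaks)
print("threshold 0.4 keeps:", low)
print("threshold 0.6 keeps:", high)
```

Because the noise peak at bin 3 and the tonal peak at bin 5 have the same height in this example, no threshold value can keep one while rejecting the other.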
Method (B) is disclosed in, for example, R. J. McAulay and T. F. Quatieri, “Speech Analysis/Synthesis Based on a Sinusoidal Representation,” IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 34, No. 4, pp. 744-754 (August 1986), and J. O. Smith III and X. Serra, “PARSHL: An Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation,” Proceedings of the International Computer Music Conference (1987). This method exploits the property that tonal components have temporal continuity (e.g., in the case of music, a tonal component often continues for more than 100 ms). However, because peaks of components other than tonal components may also continue over time, and because a tonal component divided into short segments is not detected, this method does not necessarily achieve sufficient accuracy in many applications.
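The idea behind method (B) can be sketched as follows. This is a deliberately simplified illustration and is not the algorithm of either cited reference: the function name `track_peaks`, the parameters `max_jump` and `min_length`, the greedy nearest-bin matching, and the example peak lists are all assumptions made for the example.

```python
def track_peaks(frames, max_jump=1, min_length=3):
    """Connect peaks across frames and keep only sufficiently long tracks.

    frames: list of lists, each holding the peak bin indices of one frame.
    Returns a list of tracks, each a list of (frame_index, bin) pairs.
    """
    active = []    # tracks extended in the previous frame
    finished = []  # tracks that could not be continued
    for t, frame_peaks in enumerate(frames):
        unused = set(frame_peaks)
        still_active = []
        for track in active:
            last_bin = track[-1][1]
            candidates = [p for p in unused if abs(p - last_bin) <= max_jump]
            if candidates:
                # Greedy neighbor rule: continue with the closest unused peak.
                nearest = min(candidates, key=lambda p: (abs(p - last_bin), p))
                unused.remove(nearest)
                track.append((t, nearest))
                still_active.append(track)
            else:
                finished.append(track)
        # Every peak that was not matched starts a new track.
        for p in sorted(unused):
            still_active.append([(t, p)])
        active = still_active
    finished.extend(active)
    # Exclude components that were not connected often enough.
    return [track for track in finished if len(track) >= min_length]


# Hypothetical peak bins for five consecutive frames: one slowly drifting
# component near bin 10 plus isolated peaks that do not persist.
frames = [[10, 30], [10, 47], [11, 22], [11, 35], [40]]
tracks = track_peaks(frames)
print(tracks)

# A component that lasts only two frames is rejected even if it is tonal.
short = track_peaks([[5], [5]])
print(short)
```

In this example only the component near bin 10 survives, while the isolated peaks are excluded. The second call shows the weakness noted above: a component that persists for fewer frames than `min_length` is discarded regardless of whether it is actually tonal.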