In the above technical field, an audio fingerprint (audio electronic fingerprint), obtained by analyzing a sound signal, is known as a sound identifier that identifies the characteristics of a sound.
For example, the sound processing system in non-patent document 1 cuts overlapping 25 ms frames out of a sampled sound signal while shifting the frame position by 5 to 10 ms. The system then applies fast Fourier transform (FFT) processing, logarithm processing, and discrete cosine transform (DCT) processing to the sound signal in each cut-out frame to generate a mel frequency cepstrum. It takes the lower 12 to 16 dimensions of the mel frequency cepstrum as mel frequency cepstrum coefficients (MFCC) and generates an audio fingerprint from their time differences.
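The chain described for non-patent document 1 (framing, FFT, logarithm, DCT, low-order coefficients, time differences) can be sketched as follows. This is a minimal illustration rather than the cited system's implementation: the function names, the Hann window, the 20 triangular mel filters, the 10 ms shift, the choice of 12 coefficients, and the naive DFT used in place of an FFT are all assumptions made to keep the example self-contained.

```python
import cmath
import math


def split_frames(signal, rate, frame_ms=25.0, shift_ms=10.0):
    """Cut overlapping frames of frame_ms, advancing by shift_ms each time."""
    size = round(rate * frame_ms / 1000)
    step = round(rate * shift_ms / 1000)
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]


def power_spectrum(frame):
    """One-sided power spectrum of a Hann-windowed frame (naive DFT, assumed)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))) ** 2
            for k in range(n // 2 + 1)]


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_energies(power, rate, n_filters=20):
    """Apply triangular filters spaced evenly on the mel scale."""
    top_mel = hz_to_mel(rate / 2)
    last_bin = len(power) - 1
    # Filter edges as (fractional) DFT bin positions.
    edges = [mel_to_hz(top_mel * i / (n_filters + 1)) / (rate / 2) * last_bin
             for i in range(n_filters + 2)]
    energies = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        total = 0.0
        for k, p in enumerate(power):
            if lo < k <= mid:
                total += p * (k - lo) / (mid - lo)
            elif mid < k < hi:
                total += p * (hi - k) / (hi - mid)
        energies.append(total)
    return energies


def dct(values, n_out):
    """Unnormalized DCT-II, keeping only the first n_out coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_out)]


def mfcc(frame, rate, n_coeffs=12):
    """Low-order mel frequency cepstrum coefficients of one frame."""
    energies = mel_energies(power_spectrum(frame), rate)
    # Floor the energies so that silent bands do not produce log(0).
    log_energies = [math.log(max(e, 1e-10)) for e in energies]
    return dct(log_energies, n_coeffs)


def mfcc_time_differences(signal, rate):
    """Differences between the MFCC vectors of consecutive frames."""
    coeffs = [mfcc(f, rate) for f in split_frames(signal, rate)]
    return [[b - a for a, b in zip(prev, curr)]
            for prev, curr in zip(coeffs, coeffs[1:])]
```

For a signal sampled at 8000 Hz, a 25 ms frame holds 200 samples and a 10 ms shift advances by 80 samples, so 800 samples yield 8 frames and therefore 7 difference vectors of 12 values each.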
The sound processing system in non-patent document 2 cuts out overlapping 370 ms frames while shifting by 11.6 ms. It then generates a 32-dimensional audio fingerprint by applying a discrete Fourier transform (DFT), logarithm processing, and time and frequency differences to the average power of each subband.
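One way in which time and frequency differences of subband power could yield a 32-dimensional code is sketched below. The text above does not state how the differences are quantized or how many subbands are used; the sign quantization, the use of 33 subbands to obtain 32 differences, and the function name are assumptions made for illustration.

```python
def difference_bits(prev_power, curr_power):
    """Sign of the time difference of neighboring-subband power differences.

    prev_power and curr_power hold the (log) average subband power of two
    consecutive frames; N subbands give N - 1 bits (assumed quantization).
    """
    if len(prev_power) != len(curr_power):
        raise ValueError("both frames need the same number of subbands")
    bits = []
    for m in range(len(curr_power) - 1):
        freq_diff_now = curr_power[m] - curr_power[m + 1]
        freq_diff_before = prev_power[m] - prev_power[m + 1]
        # Time difference of the two frequency differences, reduced to a sign.
        bits.append(1 if freq_diff_now - freq_diff_before > 0 else 0)
    return bits
```

With 33 subbands per frame, each pair of consecutive frames produces 32 bits, which matches the 32 dimensions mentioned above.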
The sound processing system in non-patent document 3 cuts out overlapping 370 ms frames while shifting by 11.6 ms. It then generates a 32-dimensional audio fingerprint by a discrete wavelet transform, frequency differences, and time differences.
The sound processing system in patent document 1 cuts out overlapping frames of 10 to 30 ms and generates a time-frequency segment through a Fourier transform, division into bands on the mel scale or Bark scale, and mean value calculation using a window function. It then performs a two-dimensional DCT and outputs the lower band of the result as a voice characteristic amount.
Although the sound processing system in patent document 1 generates a voice characteristic amount of, for example, 112 elements, in view of processing speed at the time of use, the 30 elements in the lower band are selected as the voice characteristic amount for voice recognition or speaker recognition.
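The two-dimensional DCT and lower-band selection described for patent document 1 can be illustrated as follows. The text gives only the element counts (112 and 30); the 8 x 14 segment shape, the diagonal ordering used to decide which coefficients count as the lower band, and the function names are hypothetical choices for this sketch.

```python
import math


def dct2(segment):
    """Unnormalized two-dimensional DCT-II of a rows x cols segment."""
    rows, cols = len(segment), len(segment[0])
    coeffs = []
    for u in range(rows):
        coeff_row = []
        for v in range(cols):
            acc = 0.0
            for i in range(rows):
                for j in range(cols):
                    acc += (segment[i][j]
                            * math.cos(math.pi * u * (i + 0.5) / rows)
                            * math.cos(math.pi * v * (j + 0.5) / cols))
            coeff_row.append(acc)
        coeffs.append(coeff_row)
    return coeffs


def lower_band(coeffs, count=30):
    """Keep the `count` coefficients with the smallest index sum u + v.

    Ordering by u + v (lowest combined frequency first) is an assumed
    definition of "lower band", not one taken from the cited document.
    """
    rows, cols = len(coeffs), len(coeffs[0])
    order = sorted(((u, v) for u in range(rows) for v in range(cols)),
                   key=lambda uv: (uv[0] + uv[1], uv[0]))
    return [coeffs[u][v] for u, v in order[:count]]
```

Under these assumptions, an 8 x 14 time-frequency segment gives 112 DCT coefficients, of which `lower_band` retains 30, so any later matching step handles roughly a quarter of the data.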
The sound processing system in patent document 2 performs an FFT on 64 ms frames that overlap by 50% to generate characteristic vectors. It then obtains, for example, the difference for each pair of neighboring bands among M=13 bands and generates an audio fingerprint encoded on the basis of the difference results.
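The frame arithmetic and neighboring-band differences described for patent document 2 can be sketched as follows. The text does not specify the band layout or the encoding rule; the equal-width bands, the one-bit sign encoding, and all function names are assumptions for illustration only.

```python
def hop_samples(rate, frame_ms=64.0, overlap=0.5):
    """Number of samples between the starts of consecutive frames."""
    return round(rate * frame_ms / 1000 * (1.0 - overlap))


def band_means(power, n_bands=13):
    """Average a power spectrum over n_bands equal-width bands (assumed layout).

    Bins left over after the last full band are ignored.
    """
    width = len(power) // n_bands
    if width == 0:
        raise ValueError("spectrum has fewer bins than bands")
    return [sum(power[b * width:(b + 1) * width]) / width
            for b in range(n_bands)]


def neighbor_differences(bands):
    """Difference for each pair of neighboring bands: M bands give M - 1 values."""
    return [bands[m] - bands[m + 1] for m in range(len(bands) - 1)]


def encode_signs(differences):
    """Encode each difference as one bit (assumed encoding rule)."""
    return [1 if d > 0 else 0 for d in differences]
```

At a sampling rate of 8000 Hz, a 64 ms frame holds 512 samples, so a 50% overlap corresponds to a hop of 256 samples, and M=13 bands give 12 neighboring-band differences per frame.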