Conventionally, a mixed audio separation apparatus has been proposed as an apparatus which separates a desired audio from a mixed audio. In mixed audio separation processing, the mixed audio is subjected to a frequency analysis so as to generate a spectrogram in which the y axis represents frequency, the x axis represents time, and the power at each point is shown in gray scale. The desired audio is then separated from the mixed audio on the spectrogram, which yields high audio separation performance. The Fourier transform is generally used as the frequency conversion method from an audio to such a spectrogram; that is, as the audio frequency analysis method. Therefore, the Fourier transform plays an important role in the mixed audio separation processing.
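The spectrogram generation described above can be sketched as follows. This is a minimal illustration only, not the apparatus of any cited reference: the function name `power_spectrogram`, the frame length, the hop size, the sampling rate and the two test tones are all assumptions introduced here, and frequency bins are indexed from zero for convenience.

```python
import cmath
import math


def power_spectrogram(samples, frame_len, hop):
    """Split the signal into frames and return, for each frame (time axis),
    the power at each frequency bin from 0 Hz up to the Nyquist frequency."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        powers = []
        for k in range(frame_len // 2 + 1):
            # Correlate the frame with a complex sinusoid of bin index k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            powers.append(abs(acc) ** 2)
        frames.append(powers)
    return frames


# Hypothetical mixed audio: a 1000 Hz tone throughout, joined by a
# 2000 Hz tone in the second half (sampling rate assumed to be 8000 Hz).
fs = 8000
signal = [math.sin(2 * math.pi * 1000 * n / fs)
          + (math.sin(2 * math.pi * 2000 * n / fs) if n >= 256 else 0.0)
          for n in range(512)]
spec = power_spectrogram(signal, frame_len=64, hop=64)
```

With these assumed values each bin is 125 Hz wide, so the 1000 Hz tone appears at bin 8 in every frame, while bin 16 (2000 Hz) carries power only in the later frames. Differences of this kind on the spectrogram are what the separation processing relies on.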
As conventional arts for performing frequency analyses, the cosine transform (for example, refer to Reference 2) and the wavelet transform (for example, refer to Reference 1) are known in addition to the above-mentioned Fourier transform (for example, refer to References 1 and 2). In these conventional arts, a frequency analysis is performed using a cross-correlation (convolution) between an analysis waveform and each reference waveform which has a predetermined time width.
In the Fourier transform, a frequency analysis is performed using cosine waveforms and sine waveforms each of which has a time width determined based on a temporal resolution (spatial resolution) and a frequency resolution (each of the cosine waveforms and sine waveforms is a reference waveform having a value of zero in a time segment other than the time width).
Here, determining the time width of each reference waveform is equivalent to determining a reference frame width (time width) in the Fourier transform. In addition, a frequency analysis may be performed by multiplying an analysis waveform with a window function which has a value other than zero in a target segment (time segment where a reference waveform is present).
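The window multiplication mentioned above can be sketched as follows. The function name `apply_window` and the two window shapes (rectangular and Hamming) are assumptions chosen for illustration; the text does not specify a particular window function.

```python
import math


def apply_window(waveform, start, width, kind="rectangular"):
    """Multiply the waveform by a window that is non-zero only inside the
    target segment [start, start + width); samples outside become zero."""
    if kind == "hamming" and width < 2:
        raise ValueError("a Hamming window needs at least two points")
    out = [0.0] * len(waveform)
    for i in range(width):
        if kind == "rectangular":
            w = 1.0
        elif kind == "hamming":
            w = 0.54 - 0.46 * math.cos(2 * math.pi * i / (width - 1))
        else:
            raise ValueError("unknown window kind: " + kind)
        out[start + i] = waveform[start + i] * w
    return out


waveform = [1.0] * 10
rect = apply_window(waveform, start=3, width=4)
hamm = apply_window(waveform, start=3, width=4, kind="hamming")
```

In both cases the samples outside the target segment are zeroed, so only the segment where the reference waveform is present contributes to the subsequent cross-correlation.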
FIG. 1 is a diagram illustrating a method of the Fourier transform (discrete Fourier transform). Frequency information (an amplitude spectrum and a phase spectrum) of an analysis waveform is obtained by calculating, using Expression 1, a cross-correlation (convolution) between the analysis waveform shown in FIG. 1(c) and each reference waveform (FIG. 1(b)). The reference waveforms used are a cosine wave and a sine wave each of which has a time width of N sampling points as shown in FIG. 1(a). Here, the index k in Expression 1 is an index indicating a reference frequency, and in the Fourier transform, pieces of frequency information of plural reference frequencies are obtained in parallel. A larger index value indicates that a higher reference frequency is used to obtain the analysis result.
$$X_k = \sum_{n=1}^{N} x_n \, e^{-j\frac{2\pi kn}{N}} \quad (k = 1, 2, \ldots, N) \qquad \text{[Expression 1]}$$
where
$$x_n \quad (n = 1, 2, \ldots, N) \qquad \text{[Expression 2]}$$
is a value obtained by sampling an analysis waveform,
$$X_k \quad (k = 1, 2, \ldots, N) \qquad \text{[Expression 3]}$$
is frequency information corresponding to the analysis waveform, and
$$e^{-j\frac{2\pi kn}{N}} = \cos\left(\frac{2\pi kn}{N}\right) - j\sin\left(\frac{2\pi kn}{N}\right) \qquad \text{[Expression 4]}$$
is a value constituted of a cosine waveform and a sine waveform each of which has a time width including N-points; that is, a value of the reference waveform.
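As an illustration of Expression 1, the following minimal sketch computes the cross-correlation of a sampled analysis waveform with the cosine and sine reference waveforms of Expression 4, keeping the 1-based indices n and k of the text. The function names and the test waveform are assumptions introduced here.

```python
import cmath
import math


def dft_by_correlation(x):
    """Evaluate Expression 1 for k = 1..N by correlating x_n (n = 1..N)
    with a cosine and a sine reference waveform of N points."""
    N = len(x)
    result = []
    for k in range(1, N + 1):
        re = sum(x[n - 1] * math.cos(2 * math.pi * k * n / N)
                 for n in range(1, N + 1))
        im = -sum(x[n - 1] * math.sin(2 * math.pi * k * n / N)
                  for n in range(1, N + 1))
        result.append(complex(re, im))
    return result


def amplitude_and_phase(X):
    """Split the frequency information into amplitude and phase spectra."""
    return [abs(v) for v in X], [cmath.phase(v) for v in X]


# Hypothetical analysis waveform: a cosine that completes two cycles in N = 8 points.
N = 8
x = [math.cos(2 * math.pi * 2 * n / N) for n in range(1, N + 1)]
amplitude, phase = amplitude_and_phase(dft_by_correlation(x))
```

For this test waveform the amplitude spectrum is concentrated at k = 2 and at its mirror index k = 6, and is approximately zero elsewhere.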
In the Fourier transform, when the time width of a reference waveform is set, both the values of a temporal resolution and a frequency resolution are automatically determined. The “temporal resolution” mentioned here means the length of a time segment which is averaged at the time of obtaining the cross-correlation (convolution) between the analysis waveform and each reference waveform. The “frequency resolution” mentioned here means the frequency band width which the frequency components of the analysis waveform pass through, and the band width includes the reference frequency.
FIG. 2 is a diagram indicating the relationship between reference waveforms each having a predetermined time width and the frequency characteristics obtained when a frequency analysis of the analysis waveform is performed using those reference waveforms. FIG. 2 shows the frequency characteristics for three types of temporal resolution; that is, a 1-cycle temporal resolution, a 2-cycle temporal resolution and a 3-cycle temporal resolution, listed from left to right.
It can be seen from FIG. 2 that the frequency resolution is low when a frequency analysis is performed with an increased temporal resolution using the 1-cycle cosine waveform as a reference waveform, and that the frequency resolution is high when a frequency analysis is performed with a lowered temporal resolution using the 3-cycle cosine waveform (whose time width is three times that of the 1-cycle cosine waveform). In this way, in the conventional arts, the temporal resolution (the length of the time segment to be averaged at the time of obtaining the cross-correlation between an analysis waveform and each reference waveform) and the frequency resolution are in a trade-off relationship.
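The trade-off can be checked numerically with the following minimal sketch. It measures the band of input frequencies that pass a reference waveform of a given number of cycles at no less than the half-power level. For simplicity the probe tone and the reference waveform are complex exponentials (a cosine and sine pair), and the sampling rate, reference frequency, scan step and function names are all assumptions introduced here.

```python
import cmath
import math


def response(f_in, f_ref, cycles, fs=8000):
    """Normalized magnitude of the cross-correlation between a probe tone at
    f_in and a reference waveform at f_ref that lasts `cycles` cycles."""
    n_pts = round(cycles * fs / f_ref)  # time width of the reference waveform
    acc = sum(cmath.exp(2j * math.pi * f_in * n / fs)
              * cmath.exp(-2j * math.pi * f_ref * n / fs)
              for n in range(n_pts))
    return abs(acc) / n_pts


def half_power_bandwidth(f_ref, cycles, fs=8000, step=5):
    """Width in Hz of the band whose response is at least 1/sqrt(2) of the
    peak response, which equals 1.0 at f_in == f_ref."""
    threshold = 1.0 / math.sqrt(2.0)
    passed = [f for f in range(0, fs // 2 + 1, step)
              if response(f, f_ref, cycles, fs) >= threshold]
    return max(passed) - min(passed)


bw_1cycle = half_power_bandwidth(1000, 1)  # short time width, high temporal resolution
bw_3cycle = half_power_bandwidth(1000, 3)  # tripled time width, low temporal resolution
```

Under these assumptions the pass band of the 1-cycle reference waveform is roughly three times as wide as that of the 3-cycle reference waveform: shortening the time width improves the temporal resolution but lowers the frequency resolution.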
Note that, in the case of the Fourier transform of an analysis waveform having continuous values, the frequency analysis is performed using a cross-correlation (convolution) between the analysis waveform and each reference waveform expressed as an integral instead of the Σ operation in Expression 1.
In the cosine transform, a frequency analysis is performed using a cosine waveform having a time width determined based on a temporal resolution (spatial resolution) and a frequency resolution (the cosine waveform is a reference waveform having a value of zero in a time segment other than the time width).
FIG. 3 is a diagram illustrating the cosine transform (discrete cosine transform). Frequency information (which is represented as a combination of an amplitude spectrum and a phase spectrum) of an analysis waveform is obtained by calculating, using Expression 5 and Expression 6, a cross-correlation (convolution) between the analysis waveform shown in FIG. 3(c) and each reference waveform (FIG. 3(b)). The reference waveform used is a cosine wave having a time width of N sampling points as shown in FIG. 3(a) (the cosine waveform is a reference waveform having a value of zero in a time segment other than the time width). Here, the index k in Expression 5 and Expression 6 is an index indicating a reference frequency, and in the cosine transform, pieces of frequency information of plural reference frequencies are obtained in parallel. A larger index value indicates that a higher reference frequency is used to obtain the analysis result.
$$X_k = \sum_{n=1}^{N} x_n \, c_k \cos\frac{(2n-1)\pi k}{2N} \quad (k = 1, 2, \ldots, N) \qquad \text{[Expression 5]}$$
$$c_k = 1 \ (k = 0), \quad c_k = \sqrt{2} \ (k = 2, \ldots, N) \qquad \text{[Expression 6]}$$
where
$$x_n \quad (n = 1, 2, \ldots, N) \qquad \text{[Expression 7]}$$
is a value obtained by sampling an analysis waveform, and
$$X_k \quad (k = 1, 2, \ldots, N) \qquad \text{[Expression 8]}$$
is frequency information corresponding to the analysis waveform.
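Expression 5 can be illustrated by the following minimal sketch, which correlates the sampled analysis waveform with a cosine reference waveform for each k. Expression 6 as printed does not state c_k for k = 1, so the sketch assumes the value √2 for every k from 1 to N; this choice, the function name and the test waveform are assumptions and are not taken from the references.

```python
import math


def cosine_transform(x):
    """Evaluate Expression 5 for k = 1..N using 1-based n and k."""
    N = len(x)
    X = []
    for k in range(1, N + 1):
        # Assumption: c_k = sqrt(2) for all k >= 1; the k = 0 case of
        # Expression 6 lies outside the range used in Expression 5.
        c_k = 1.0 if k == 0 else math.sqrt(2.0)
        X.append(sum(x[n - 1] * c_k
                     * math.cos((2 * n - 1) * math.pi * k / (2 * N))
                     for n in range(1, N + 1)))
    return X


# Hypothetical analysis waveform that matches the k = 3 reference waveform.
N = 16
x = [math.cos((2 * n - 1) * math.pi * 3 / (2 * N)) for n in range(1, N + 1)]
X = cosine_transform(x)
strongest_k = max(range(1, N + 1), key=lambda k: abs(X[k - 1]))
```

Because the cosine reference waveforms for different k are mutually orthogonal over the N points, only the coefficient for k = 3 is significantly different from zero for this test waveform.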
In the cosine transform, when the time width of a reference waveform is set, both of a temporal resolution (the length of a time segment to be averaged at the time of obtaining the cross-correlation between an analysis waveform and each reference waveform) and a frequency resolution are automatically determined. This mechanism is the same as that of the Fourier transform (refer to FIG. 2).
In the case of the cosine transform of an analysis waveform having continuous values, the frequency analysis is performed using a cross-correlation (convolution) between the analysis waveform and each reference waveform expressed as an integral instead of the Σ operation in Expression 5.
In the wavelet transform, a frequency analysis is performed using a wavelet basis function having a time width determined based on a temporal resolution (spatial resolution) and a frequency resolution.
FIG. 4 is a diagram illustrating the wavelet transform. In FIG. 4, the frequency information (an amplitude spectrum and a phase spectrum) of an analysis waveform is obtained by calculating the cross-correlation (convolution) between the analysis waveform shown in FIG. 4(c) and the reference waveform shown in FIG. 4(a) according to the expression shown in FIG. 4(b); that is, Expression 9. The reference waveform is a wavelet basis function having the predetermined time width shown in FIG. 4(a) (that is, a reference waveform having a value of zero in a time segment other than the time width).
$$(W_\psi x)(b, a) = \frac{1}{\sqrt{a}} \int x_t \, \overline{\psi\left(\frac{t-b}{a}\right)} \, dt \qquad \text{[Expression 9]}$$
where $x_t$ is an analysis waveform, and
$$\psi\left(\frac{t-b}{a}\right) \qquad \text{[Expression 10]}$$
is a wavelet basis function.
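Expression 9 can be approximated numerically as in the following minimal sketch, which replaces the integral with a Riemann sum over uniformly spaced samples. The unnormalized Mexican Hat form (1 − t²)·exp(−t²/2), the function names, the sampling step and the test waveform are assumptions introduced here; since this basis function is real-valued, the complex conjugate in Expression 9 has no effect.

```python
import math


def mexican_hat(t):
    """Mexican Hat basis function without a normalization constant (assumed form)."""
    return (1.0 - t * t) * math.exp(-t * t / 2.0)


def wavelet_coefficient(x, times, b, a, psi=mexican_hat):
    """Riemann-sum approximation of Expression 9 for a real-valued psi.
    `times` must be uniformly spaced; b is the shift and a > 0 the scale."""
    dt = times[1] - times[0]
    total = sum(x_t * psi((t - b) / a) for x_t, t in zip(x, times))
    return total * dt / math.sqrt(a)


# Hypothetical analysis waveform: a Mexican Hat pulse centered at t = 2.
dt = 0.05
times = [-10.0 + i * dt for i in range(481)]  # covers -10 to 14
x = [mexican_hat(t - 2.0) for t in times]

shifts = [0.5 * i for i in range(9)]  # candidate values of b from 0 to 4
coeffs = [wavelet_coefficient(x, times, b, 1.0) for b in shifts]
best_b = shifts[coeffs.index(max(coeffs))]
```

The coefficient is largest at the shift b that aligns the basis function with the pulse, which illustrates how b localizes the analysis in time while a sets the time width of the basis function.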
In the wavelet transform, when the time width of a wavelet basis function is determined, both of the temporal resolution (the length of a time segment to be averaged at the time of obtaining the cross-correlation between an analysis waveform and each reference waveform) and the frequency resolution are automatically determined. This mechanism is the same as that of the Fourier transform (refer to FIG. 2).
Note that, in the wavelet transform, it is possible to set a temporal resolution (or a frequency resolution) independently for each reference frequency. In the Fourier transform, on the other hand, all the reference frequencies have the same temporal resolution (the time width of the reference time window) and the same frequency resolution, and thus it is impossible to determine a temporal resolution and a frequency resolution independently for each reference frequency. Note that, in the wavelet transform as well, the frequency resolution is automatically determined based on the corresponding temporal resolution, and vice versa.
In the above description, the Mexican Hat function is used as the wavelet basis function, but it should be noted that other wavelet basis functions such as Daubechies, Meyer and Gabor are also available in the wavelet transform.
Reference 1: “Ueiburetto ni yoru Shingo Shori to Gazo Shori (Signal Processing and Image Processing through Wavelet)”, pp. 35 to 39, pp. 49 to 52, Hiroki Nakano and two other authors, Aug. 15, 1999, Kyoritsu Press.
Reference 2: “Patan Joho Shori (Pattern Information Processing)”, pp. 14 to 19, Seiichi Nakagawa, Mar. 30, 1999, Maruzen Co., Ltd.