A phase vocoder technique is known as a technique for compressing and stretching an audio signal on a time axis. A phase vocoder apparatus as disclosed in NPL (Non Patent Literature) 1 performs, in a frequency domain, stretch or compression processing (time stretch processing) in a time direction, and pitch transform processing (pitch shift processing), by applying Fast Fourier Transform (FFT) or Short Time Fourier Transform (STFT) on a digital audio signal.
A pitch is also referred to as a pitch frequency, and represents how high or low a sound is. The time stretch processing is processing for stretching or compressing the time length of an audio signal without changing the pitch of the audio signal. The pitch shift processing is an example of frequency modulation processing and is processing for changing the pitch of an audio signal without changing the time length of the audio signal. The pitch shift processing is also referred to as pitch stretch processing.
When the reproduction rate of an audio signal is simply changed, both the time length and the pitch of the audio signal change. On the other hand, when an audio signal whose time length has been stretched or compressed without changing its original pitch is reproduced at a changed rate, the time length returns to the original time length and only the pitch is transformed. For this reason, pitch shift processing may involve time stretch processing. Likewise, time stretch processing may involve pitch shift processing. In this way, the time stretch processing and the pitch shift processing are closely related.
The time stretch processing makes it possible to change the duration time (reproduction time) of an input audio signal without changing the spectrum characteristics of part of the spectrum signal obtained by performing FFT on the input audio signal. The principle is as described below.
(a) The audio signal processing apparatus which executes time stretch processing first divides the input audio signal into segments of a constant time length (for example, 1024 samples each), and analyses each of the segments. At this time, the audio signal processing apparatus processes the input audio signal such that each segment overlaps with at least one of the other segments, with consecutive segments offset from each other by a time interval (for example, 128 samples) that is shorter than one segment. Here, this time interval between consecutive segments is referred to as a hop size.
In FIG. 30A, the hop size of an input signal is denoted as Ra. Likewise, an audio signal that is calculated by phase vocoder processing and is to be output is an audio signal divided into segments which are overlapped with at least one of the others by a time interval corresponding to a constant number of samples. In FIG. 30B, the hop size of the audio signal to be output is denoted as Rs. Rs>Ra is satisfied when performing a time stretch, and Rs<Ra is satisfied when performing time compression. Here, a description is given of the example of performing the time stretch (Rs>Ra). A time stretch rate r is defined according to Expression 1.
[Math. 1]
$$ r = \frac{R_a}{R_s} \qquad \text{(Expression 1)} $$
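The segmentation in (a) and Expression 1 can be illustrated with a short sketch. The segment length of 1024 samples and the analysis hop size of 128 samples are the example values from the text; the signal length of 4096 samples, the synthesis hop size of 256 samples, and the helper name `frame_starts` are assumptions made only for illustration.

```python
def frame_starts(num_samples, frame_len, hop):
    """Start indices of all full segments of length frame_len spaced hop apart."""
    return list(range(0, num_samples - frame_len + 1, hop))

L = 1024   # segment length (example value from the text)
Ra = 128   # analysis hop size (example value from the text)
Rs = 256   # synthesis hop size (assumed value; Rs > Ra means a time stretch)

starts = frame_starts(4096, L, Ra)       # 4096 samples is an assumed input length
r = Ra / Rs                              # Expression 1
shared = L - Ra                          # samples shared by two consecutive segments
out_len = (len(starts) - 1) * Rs + L     # output length once segments are placed Rs apart

print(len(starts), r, shared, out_len)
```

With these values, consecutive analysis segments share 896 samples, and placing the same number of segments at the larger spacing Rs lengthens the output.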
(b) As described above, each of time block signals divided into segments corresponding to constant time intervals and partly overlapped with at least one of the others has a temporally coherent pattern in many cases. For this reason, the audio signal processing apparatus performs frequency transform on each time block signal. Typically, the audio signal processing apparatus performs frequency transform on each input time block signal to adjust the phase information. Next, the audio signal processing apparatus returns the frequency domain signal to a time domain signal as the time block signal to be output.
According to the above principle, a classical phase vocoder apparatus performs transform into the frequency domain using STFT, and performs the short time inverse Fourier transform after performing various kinds of adjustment processing in the frequency domain. In this way, time transform and pitch shift processing are performed. Next, the STFT-based processing is described.
(1) Analysis
First, the audio signal processing apparatus applies an analysis window function having a window length of L to each time block, where consecutive time blocks overlap and are offset from each other by the hop size Ra. More specifically, the audio signal processing apparatus transforms each of the windowed blocks into a frequency domain block using FFT. For example, the frequency characteristics at the analysis point uRa (where u is a natural number) are calculated according to Expression 2.
[Math. 2]
$$ X(uR_a, k) = \sum_{m=0}^{L-1} x(uR_a, m)\, h(m)\, W_L^{mk} = \left| X(uR_a, k) \right| \cdot e^{j\phi(uR_a, k)} \qquad \text{(Expression 2)} $$
Here, h(m) denotes an analysis window function. Also, k denotes a frequency index, and its range is k = 0, ..., L−1. In addition, W_L^{mk} is calculated according to the following expression.
[Math. 3]
$$ W_L^{mk} = e^{-j 2\pi mk / L} $$
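The analysis step can be sketched as a direct evaluation of Expression 2. This is a minimal sketch, not an efficient FFT: it assumes that x(uRa, m) refers to input sample uRa + m, it uses a periodic Hann window as one possible choice of h(m), and the function names `hann` and `analyze_frame` are hypothetical.

```python
import cmath
import math

def hann(L):
    """Periodic Hann window, used here as one possible analysis window h(m)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * m / L) for m in range(L)]

def analyze_frame(x, start, h):
    """Evaluate Expression 2 for the block beginning at sample index start.

    Returns the amplitudes |X(k)| and phases phi(k) for k = 0, ..., L-1.
    """
    L = len(h)
    mags, phases = [], []
    for k in range(L):
        acc = 0j
        for m in range(L):
            # W_L^{mk} = exp(-j * 2 * pi * m * k / L)
            acc += x[start + m] * h[m] * cmath.exp(-2j * math.pi * m * k / L)
        mags.append(abs(acc))
        phases.append(cmath.phase(acc))
    return mags, phases

# Usage: a cosine lying exactly on frequency index 2 of a 16-point block.
L = 16
x = [math.cos(2 * math.pi * 2 * n / L) for n in range(L)]
mags, phases = analyze_frame(x, 0, hann(L))
peak = max(range(L // 2), key=lambda k: mags[k])
```

In this usage example the largest amplitude in the lower half of the spectrum falls on frequency index 2, with smaller amplitudes on the neighbouring indices caused by the window.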
(2) Adjustment
Let φ(uRa, k) denote the phase information of the frequency signal calculated above, that is, the phase information before being subjected to the adjustment. In the adjustment step, the audio signal processing apparatus calculates a frequency component ω(uRa, k) having a frequency index k according to the following method.
First, in order to calculate the frequency component ω(uRa, k), the audio signal processing apparatus calculates an increment $\Delta\phi_k^u$ between (u−1)Ra and uRa, which are consecutive analysis points, according to Expression 3.
[Math. 4]
$$ \Delta\phi_k^u = \phi(uR_a, k) - \phi((u-1)R_a, k) - R_a \Omega_k \quad \left( \Omega_k = \frac{2\pi k}{L} \right) \qquad \text{(Expression 3)} $$
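Since the increment $\Delta\phi_k^u$ is calculated over a time interval Ra, the audio signal processing apparatus can calculate each frequency component ω(uRa, k) according to Expression 4, in which $\Delta_p$ denotes taking the principal value of an angle, that is, wrapping it into the range [−π, π).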
[Math. 5]
$$ \omega(uR_a, k) = \Omega_k + \frac{\Delta_p \phi_k^u}{R_a} \quad \left( \Delta_p \alpha \in [-\pi, \pi) \right) \qquad \text{(Expression 4)} $$
Next, the audio signal processing apparatus calculates the phase at a synthesis point uRs according to Expression 5.
$$ \psi(uR_s, k) = \psi((u-1)R_s, k) + R_s \cdot \omega(uR_a, k) \qquad \text{(Expression 5)} $$
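The adjustment step (Expressions 3 to 5) can be sketched as follows. The helper names `princarg` and `advance_phase` are hypothetical, and the wrap into [−π, π) is implemented with a modulo operation, which is one common way to take the principal value.

```python
import math

def princarg(a):
    """Wrap an angle into the range [-pi, pi)."""
    return (a + math.pi) % (2 * math.pi) - math.pi

def advance_phase(phi_prev, phi_cur, psi_prev, Ra, Rs, L):
    """Return the synthesis phases psi(uRs, k) for k = 0, ..., L-1.

    phi_prev, phi_cur: analysis phases at (u-1)Ra and uRa
    psi_prev:          synthesis phases at (u-1)Rs
    """
    psi = []
    for k in range(L):
        omega_k = 2 * math.pi * k / L                      # centre frequency of index k
        dphi = phi_cur[k] - phi_prev[k] - Ra * omega_k     # Expression 3
        omega = omega_k + princarg(dphi) / Ra              # Expression 4
        psi.append(psi_prev[k] + Rs * omega)               # Expression 5
    return psi

# Usage: a component at 0.8 rad/sample, close to index 2 (pi/4 rad/sample) for L = 16.
L, Ra, Rs = 16, 4, 8
phi_prev = [0.0] * L
phi_cur = [0.0] * L
phi_prev[2] = 0.3
phi_cur[2] = princarg(0.3 + 0.8 * Ra)   # wrapped phase one analysis hop later
psi = advance_phase(phi_prev, phi_cur, [0.0] * L, Ra, Rs, L)
```

In the usage example, the estimated frequency for index 2 is 0.8 rad/sample, so the synthesis phase advances by Rs times 0.8, that is, by 6.4 rad. The estimate is only unambiguous while the true frequency stays close enough to the centre frequency that the wrapped increment does not alias.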
(3) Reconstruction
The audio signal processing apparatus obtains, for each frequency index, the amplitude |X(uRa, k)| of the frequency signal calculated by FFT and the adjusted phase ψ(uRs, k). Next, the audio signal processing apparatus reconstructs the frequency signal into a time signal using the inverse FFT. The reconstruction is executed according to Expression 6.
[Math. 6]
$$ \hat{x}(uR_s, m) = \sum_{k=0}^{L-1} \left| X(uR_a, k) \right| \cdot e^{j\psi(uR_s, k)} \cdot W_L^{-mk} \cdot h(k) \qquad \text{(Expression 6)} $$
The audio signal processing apparatus inserts the reconstructed time block signal into the synthesis point uRs. Next, the audio signal processing apparatus generates a time-stretched signal by performing overlap addition of a current synthesized output signal and the synthesized output signal for the previous block. The overlap addition with the synthesized output of the previous block is as represented by Expression 7.
[Math. 7]
$$ y(uR_s + m) = y(uR_s + m) + \hat{x}(uR_s, m) \quad (m = 0, \ldots, L-1) \qquad \text{(Expression 7)} $$
These three steps are then performed on the next analysis point (u+1)Ra, and are repeated for every input signal block. As a result, the audio signal processing apparatus can calculate a signal whose time length is stretched by a factor of Rs/Ra.
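The three steps can be combined into one loop, sketched below with a direct (slow) DFT so that the example needs only the standard library. Several points are assumptions rather than statements from the text: x(uRa, m) is read as input sample uRa + m, the inverse transform is scaled by 1/L, a sine window is applied over the time index m at both analysis and synthesis, the phases of the first block are used unchanged, and the output gain is not normalized. All function names are hypothetical.

```python
import cmath
import math

def princarg(a):
    """Wrap an angle into the range [-pi, pi)."""
    return (a + math.pi) % (2 * math.pi) - math.pi

def dft(frame):
    L = len(frame)
    return [sum(frame[m] * cmath.exp(-2j * math.pi * m * k / L) for m in range(L))
            for k in range(L)]

def idft_real(spec):
    """Inverse DFT with 1/L scaling (assumed), keeping the real part."""
    L = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * m * k / L) for k in range(L)).real / L
            for m in range(L)]

def time_stretch(x, L, Ra, Rs):
    h = [math.sin(math.pi * (m + 0.5) / L) for m in range(L)]   # assumed sine window
    n_frames = (len(x) - L) // Ra + 1
    y = [0.0] * ((n_frames - 1) * Rs + L)
    phi_prev, psi = None, None
    for u in range(n_frames):
        # (1) Analysis: windowed block -> amplitude and phase
        X = dft([x[u * Ra + m] * h[m] for m in range(L)])
        mag = [abs(v) for v in X]
        phi = [cmath.phase(v) for v in X]
        # (2) Adjustment: Expressions 3 to 5
        if u == 0:
            psi = phi[:]                                        # first block: keep phases
        else:
            for k in range(L):
                omega_k = 2 * math.pi * k / L
                dphi = phi[k] - phi_prev[k] - Ra * omega_k
                psi[k] += Rs * (omega_k + princarg(dphi) / Ra)
        phi_prev = phi
        # (3) Reconstruction and overlap addition at synthesis point u * Rs
        block = idft_real([mag[k] * cmath.exp(1j * psi[k]) for k in range(L)])
        for m in range(L):
            y[u * Rs + m] += block[m] * h[m]
    return y

# Usage: stretch a 256-sample sinusoid with L = 32, Ra = 8, Rs = 16.
x = [math.sin(0.3 * n) for n in range(256)]
stretched = time_stretch(x, 32, 8, 16)
unchanged = time_stretch(x, 32, 16, 16)   # Ra == Rs: no stretch
```

When Ra equals Rs, the adjusted phases coincide with the analysis phases (up to multiples of 2π), so this sketch returns the input unchanged away from the signal edges, which is a convenient sanity check.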
Here, in order to suppress modulation (temporal fluctuation) in the amplitude direction of the time-stretched signal, a window function h(m) needs to satisfy a power-complementary condition.
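One way to check this condition numerically is sketched below. It assumes that "power-complementary" means the squared window values of all overlapping blocks sum to a constant at every output position, and it uses a sine window with a hop size of half the window length as an example; the function names are hypothetical.

```python
import math

def sine_window(L):
    return [math.sin(math.pi * (m + 0.5) / L) for m in range(L)]

def hann_window(L):
    return [0.5 - 0.5 * math.cos(2 * math.pi * m / L) for m in range(L)]

def overlap_power(h, hop):
    """Sum of h(m)^2 over all blocks covering each position within one hop period."""
    L = len(h)
    return [sum(h[m] ** 2 for m in range(n, L, hop)) for n in range(hop)]

L, hop = 1024, 512
sine_power = overlap_power(sine_window(L), hop)
hann_power = overlap_power(hann_window(L), hop)

# Spread between the largest and smallest summed power; zero means no fluctuation.
sine_ripple = max(sine_power) - min(sine_power)
hann_ripple = max(hann_power) - min(hann_power)
```

For the sine window at this hop size, the summed power is constant, whereas the Hann window at the same hop size leaves a visible ripple when it is applied twice (at analysis and at synthesis).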
Processing related to the time stretch processing includes pitch shift processing. The pitch shift processing is a method for changing the pitch of a signal without changing the duration time of the signal. One simple method for changing the pitch of a digital audio signal is to re-sample (for example, decimate) an input signal. The pitch shift processing can be combined with time stretch processing. For example, the audio signal processing apparatus can re-sample the time-stretched signal so that its time length becomes equal to that of the original input signal.
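The re-sampling step can be sketched with linear interpolation. This is a minimal illustration under stated assumptions: the function name `resample_linear` is hypothetical, the input must contain at least two samples, and no anti-aliasing filter is applied, which a practical decimator would need.

```python
def resample_linear(x, out_len):
    """Re-sample x (at least two samples) to out_len samples by linear interpolation."""
    if out_len == 1:
        return [x[0]]
    step = (len(x) - 1) / (out_len - 1)
    out = []
    for n in range(out_len):
        pos = n * step
        i = min(int(pos), len(x) - 2)    # left neighbour, clamped at the last interval
        frac = pos - i
        out.append((1 - frac) * x[i] + frac * x[i + 1])
    return out

# Usage: a ramp of 10 samples re-sampled to 19 samples.
ramp = [float(n) for n in range(10)]
longer = resample_linear(ramp, 19)
```

Applied after a time stretch, `resample_linear(stretched, len(original))` would return the signal to its original number of samples; reproducing those samples at the original rate then changes the pitch by the ratio of the two lengths while the duration stays the same.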
On the other hand, there is an approach for directly calculating the pitch in pitch shift processing. The method for calculating the pitch in pitch shift processing may produce an adverse effect more serious than that in the re-sampling on the time axis, but the details are not mentioned here.
Here, the time stretch processing may be time compression processing depending on a stretch rate. Accordingly, the term “time stretch” means “a time stretch and/or time compression” including the concept of “time compression”.