1. Field of the Invention
The present invention relates to voice analysis/synthesis apparatus that analyzes a voice waveform and synthesizes a voice waveform using a result of the analysis, and programs for control of the voice waveform analysis/synthesis.
2. Description of the Related Art
Some voice analysis/synthesis apparatuses that analyze a voice waveform and synthesize another voice waveform from the result of the analysis do so by analyzing the frequencies of the original voice waveform. In such apparatuses, synthesizing a voice waveform mainly comprises an analysis process, a modification process and a synthesis process, which are described specifically below.
<Analysis Process>
A voice waveform is sampled at predetermined intervals of time. A predetermined number of sampled waveform values constitute a frame, which is then subjected to short-time Fourier transform (STFT), thereby extracting a frequency component for each frequency channel. The frequency component includes a real part and an imaginary part. The frequency amplitude (or formant component) and phase of each frequency channel are calculated from its frequency component. STFT comprises extracting signal data for a short time and performing discrete Fourier transform (DFT) on the extracted signal data. Accordingly, the term DFT is used hereinafter to include STFT. The DFT is generally computed by fast Fourier transform (FFT).
Pitch scaling, which shifts the pitch of the voice waveform, is performed by interpolating/extrapolating or thinning out the extracted frame and then subjecting the resulting data to FFT.
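The analysis step described above can be sketched as follows. This is only an illustrative sketch using a direct DFT sum from the Python standard library; the function names `analyze_frame` and `split_frames` are hypothetical and do not appear in the source, and no window function is applied here.

```python
import cmath
import math


def analyze_frame(frame):
    """Return (amplitudes, wrapped_phases) for each DFT channel of one frame."""
    n_points = len(frame)
    amplitudes, phases = [], []
    for k in range(n_points):
        # Direct DFT sum; an FFT would give the same values faster.
        component = sum(
            frame[n] * cmath.exp(-2j * math.pi * k * n / n_points)
            for n in range(n_points)
        )
        amplitudes.append(abs(component))
        # Arctangent of the imaginary and real parts: a wrapped phase.
        phases.append(math.atan2(component.imag, component.real))
    return amplitudes, phases


def split_frames(samples, frame_size, hop_size):
    """Cut overlapping frames of frame_size samples, hop_size samples apart."""
    last_start = len(samples) - frame_size
    return [samples[s:s + frame_size] for s in range(0, last_start + 1, hop_size)]
```

Each frame returned by `split_frames` would be passed to `analyze_frame` to obtain the frequency amplitude and wrapped phase of every channel.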
<Modification Process>
Since DFT (or FFT) of the voice waveform is performed in units of a frame, a synthesized voice waveform is also obtained in units of a frame. Phase θ′i,k of frequency channel k in the synthesized voice waveform is calculated by the following expression (1):
$$\theta'_{i,k} = \theta'_{i-1,k} + \rho \cdot \Delta\Theta_{i,k} \qquad (1)$$
where ΔΘi,k represents a phase difference in frequency channel k between the present and preceding frames of the voice waveform, and ρ represents a scaling factor indicative of an extent of pitch scaling. Subscript i represents a frame. The present and preceding frames are represented by i and i−1, respectively. When only time scaling, which changes the duration of the voice, is performed, the frequency amplitude of each frequency channel need not be changed. Thus, expression (1) indicates that phase θ′i,k of frequency channel k in the present frame of the synthesized voice waveform is calculated by adding the product of phase difference ΔΘi,k and factor ρ to the phase of the frequency channel of the preceding frame in the synthesized voice waveform section (or the accumulated phase difference converted according to scaling factor ρ).
Phase difference Δθi,k needs to be unwrapped. In voice waveform synthesis, wrapping and unwrapping the phase are important and are described below in detail. To make it easy to tell whether a phase is wrapped or unwrapped, wrapped and unwrapped phases are denoted by lower-case θ and capital Θ, respectively.
Phase θk,t of any channel k at any particular time t is represented by
$$\theta_{k,t} = \int_0^t \omega_k(\tau)\,d\tau + \theta_{k,0} \qquad (2)$$
As is obvious from expression (2), phase θk,t is obtained by integrating angular velocity ωk. However, when the phase is calculated as the arctangent of the frequency component obtained by DFT, its value is limited to between −π and π; that is, it is obtained as wrapped phase θk,t. Thus, the term 2nπ is missing which is contained in phase Θk,t represented by
$$\Theta_{k,t} = \theta_{k,t} + 2n\pi, \quad n = 0, 1, 2, \ldots \qquad (3)$$
In order to calculate phase θ′i,k from expression (1), the wrapped phase needs to be unwrapped. Unwrapping amounts to estimating n in expression (3), which can be done based on the central frequency of channel k of the DFT.
$$\Delta\theta_{i,k} = \theta_{i,k} - \theta_{i-1,k} \qquad (4)$$
where Δθi,k in expression (4) indicates the phase difference in wrapped phase θi,k of channel k between adjacent frames. Central frequency Ωi,k (or angular velocity) of channel k is obtained by
$$\Omega_{i,k} = (2\pi \cdot f_s / N) \cdot k \qquad (5)$$
where fs is the sampling frequency and N is the order of the DFT. Phase difference ΔZi,k at frequency Ωi,k is calculated from
$$\Delta Z_{i,k} = \Omega_{i,k} \cdot \Delta t \qquad (6)$$
where Δt is the difference in time between the present and preceding frames. Time difference Δt itself is obtained from
$$\Delta t = N / (f_s \cdot \mathrm{OVL}) \qquad (7)$$
where OVL in expression (7) represents an overlap factor, which is the value obtained by dividing the frame size by the hop size (the number of samples by which adjacent frames are offset).
The phase difference of expression (6) is unwrapped and can be expressed as
$$\Delta Z_{i,k} = \Delta\zeta_{i,k} + 2n\pi \qquad (8)$$
Let δ (= Δθi,k − Δζi,k) be the difference between phase difference Δθi,k calculated in expression (4) and phase difference Δζi,k in expression (8). Then
$$\Delta\theta_{i,k} - \Omega_{i,k} \cdot \Delta t = (\Delta\zeta_{i,k} + \delta) - (\Delta\zeta_{i,k} + 2n\pi) = \delta - 2n\pi \qquad (9)$$
Thus, δ can be calculated by deleting the term 2nπ on the right side of expression (9), that is, by limiting the value of expression (9) to between −π and π. The value δ represents the actual phase difference detected in the original voice waveform.
By adding phase difference ΔZi,k (= Ωi,k·Δt) to actual phase difference δ, unwrapped phase difference ΔΘi,k is obtained as follows:
$$\Delta\Theta_{i,k} = \delta + \Omega_{i,k} \cdot \Delta t = \delta + (\Delta\zeta_{i,k} + 2n\pi) = \Delta\theta_{i,k} + 2n\pi \qquad (10)$$
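The unwrapping of expressions (4) to (10) can be sketched as below. This is a minimal sketch, not the apparatus's actual implementation; `princarg` and `unwrapped_phase_difference` are hypothetical names, and the range limitation of expression (9) is assumed to be a wrap into [−π, π).

```python
import math


def princarg(phase):
    """Wrap a phase into the interval [-pi, pi)."""
    return (phase + math.pi) % (2.0 * math.pi) - math.pi


def unwrapped_phase_difference(theta_cur, theta_prev, k, fs, n_dft, ovl):
    """Return the unwrapped phase difference of expression (10) for channel k."""
    omega_k = (2.0 * math.pi * fs / n_dft) * k   # expression (5)
    delta_t = n_dft / (fs * ovl)                 # expression (7)
    expected = omega_k * delta_t                 # expression (6)
    delta_theta = theta_cur - theta_prev         # expression (4)
    delta = princarg(delta_theta - expected)     # expression (9), limited to +/- pi
    return delta + expected                      # expression (10)
```

For example, with fs = 8000, N = 1024 and OVL = 4, a component lying 2 Hz above the central frequency of channel 10 advances by about 16.11 rad between frames, and the function should recover that advance from the two wrapped phases.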
Time-scaled phase θ′i,k is calculated from expressions (1) and (10). Note that the method of phase unwrapping based on the central frequency of the channel requires that actual phase difference δ satisfy |δ| < π. Since the absolute value of maximum value δmax is the limit within which a signal does not transfer to the next channel,
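$$|\delta_{\max}| = \left| (2\pi \cdot f_s / N) \cdot (k + 0.5) \cdot \Delta t - (2\pi \cdot f_s / N) \cdot k \cdot \Delta t \right| = \left( \frac{2\pi \cdot f_s}{2N} \right) \cdot \left( \frac{N}{f_s \cdot \mathrm{OVL}} \right) = \frac{\pi}{\mathrm{OVL}} \qquad (11)$$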
From expression (11) and the relationship |δ| < π, overlap factor OVL must satisfy OVL > 1. Thus, the frames need to be overlapped for phase unwrapping.
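Expression (11) can be checked numerically with a short sketch. The function name `max_phase_deviation` and the sample values (fs = 44100, N = 1024, k = 10) are assumptions chosen for illustration only.

```python
import math


def max_phase_deviation(k, fs, n_dft, ovl):
    """Evaluate expression (11): deviation at the boundary between channels."""
    delta_t = n_dft / (fs * ovl)              # expression (7)
    bin_width = 2.0 * math.pi * fs / n_dft    # angular spacing of the channels
    return abs(bin_width * (k + 0.5) * delta_t - bin_width * k * delta_t)


# The deviation depends only on OVL and equals pi / OVL.
for ovl in (1, 2, 4):
    print(ovl, max_phase_deviation(10, 44100.0, 1024, ovl), math.pi / ovl)
```

With OVL = 1 the deviation reaches π, so the condition |δ| < π is met only when the frames overlap.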
In DFT, a signal in one channel generally excites a plurality of other channels. When no window function is applied to a complex sinusoidal wave fn having an amplitude of 1, a normalized angular frequency ω and an initial phase φ (that is, when a rectangular window is applied as the window function), the DFT is given by
$$F_k = \frac{\sin\dfrac{N\varpi}{2}}{\sin\dfrac{\varpi}{2}} \, e^{-j\left\{ (N-1)\frac{\varpi}{2} - \phi \right\}} \quad \left( \varpi = -\omega + \frac{2\pi}{N} k \right) \qquad (12)$$
The complex sinusoidal wave fn can be expressed as
$$f_n = e^{j(\omega n + \phi)}$$
It will be understood from expression (12) that all the channels are excited unless angular frequency ω is equal to (2π/N)·k. Since some window function is usually applied, the number of excited channels varies with the bandwidth of that window function. When a Hanning window is used as the window function, its DFT values are given by
$$W_0 = \tfrac{1}{2} N, \quad W_1 = -\tfrac{1}{4} N, \quad W_{-1} = -\tfrac{1}{4} N \qquad (13)$$
This is then wrapped into each channel. As will be obvious from expression (13), even when the angular frequency is ω=(2π|N)·k, three channels are excited at a ratio in frequency amplitude value of 1:2:1. When the angular frequency ω is between those in adjacent channels, four channels are excited at a ratio in frequency amplitude value of 1:5:5:1.
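The amplitude ratios stated above can be reproduced with a small numerical sketch. It assumes a periodic Hanning window w[n] = 0.5 − 0.5·cos(2πn/N), whose DFT values match expression (13); the function name `hann_windowed_dft` and the choice N = 64 are illustrative assumptions.

```python
import cmath
import math


def hann_windowed_dft(omega, n_dft):
    """DFT magnitudes of a unit complex sinusoid under a periodic Hanning window."""
    windowed = [
        (0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_dft)) * cmath.exp(1j * omega * n)
        for n in range(n_dft)
    ]
    return [
        abs(sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / n_dft)
                for n in range(n_dft)))
        for k in range(n_dft)
    ]


N = 64
# Sinusoid exactly on channel 8: channels 7, 8, 9 are excited at 1:2:1.
on_channel = hann_windowed_dft(2.0 * math.pi * 8 / N, N)
# Sinusoid midway between channels 8 and 9: channels 7..10 at roughly 1:5:5:1.
between = hann_windowed_dft(2.0 * math.pi * 8.5 / N, N)
print(on_channel[7:10])
print(between[7:11])
```

In the midway case the channels outside the four main ones are not exactly zero, only much smaller, so the 1:5:5:1 ratio describes the dominant channels.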
In order to unwrap the phase correctly in every excited channel, n in expression (8) must have the same value in all the excited channels. This restriction requires that, when a Hanning window is applied to the frame as the window function, overlap factor OVL be 4 or more.
In the above analysis process, a frame is extracted in accordance with overlap factor OVL having such a value, and the window function is applied to the frame, which is then subjected to FFT. In the modification process, the phase of each channel calculated as above is maintained while the frequency amplitude of each channel is manipulated as required.
<Synthesis Process>
In the synthesis process, the frequency components modified (or manipulated) in the modification process are restored to a time-domain signal by IFFT (inverse fast Fourier transform), thereby producing a synthesized voice waveform section for one frame. This section is overlapped with the waveform section of the preceding frame in accordance with a value of overlap factor OVL that is changed according to the value of factor ρ, thereby producing a synthesized, pitch-scaled and time-scaled voice waveform.
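The overlapping of successive frame sections can be sketched as a plain overlap-add. This sketch assumes a fixed hop size; the source states that the overlap in synthesis is changed according to ρ, which is not modeled here. The name `overlap_add` and the values N = 16 and OVL = 4 are illustrative assumptions.

```python
import math


def overlap_add(frames, hop_size):
    """Sum equally sized frames, each shifted hop_size samples from the last."""
    frame_size = len(frames[0])
    output = [0.0] * (hop_size * (len(frames) - 1) + frame_size)
    for index, frame in enumerate(frames):
        start = index * hop_size
        for n, value in enumerate(frame):
            output[start + n] += value
    return output


N = 16
HOP = N // 4  # overlap factor OVL = 4
hann = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / N) for n in range(N)]
# Overlapping periodic Hanning windows at OVL = 4 sum to a constant
# wherever four frames overlap.
summed = overlap_add([hann] * 8, HOP)
```

Because the overlapped windows sum to a constant in the fully overlapped region, the window itself does not modulate the amplitude of the reconstructed waveform there.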
With the conventional voice analysis/synthesis apparatus that obtains a synthesized voice waveform in the manner described above, the synthesized sound undesirably gives the listener an impression of phase discrepancy, called phasiness or reverberance, relative to the original sound. More particularly, this phase discrepancy causes the listener to feel that the source of the synthesized sound is more remote than that of the original sound, which degrades the listener's auditory impression. This occurs even when the pitch shift is very small, as described in detail below.
As described above, the frames need to be overlapped to unwrap the phase correctly. If overlap factor OVL is set to an appropriate value for this purpose, the phase can be unwrapped correctly. Thus, the second term of the right side of expression (1) ensures that phase θ′i,k calculated from expression (1) always has coherence on the time base. Hereinafter, coherence of phase θ′i,k on the time base is referred to as HPC (Horizontal Phase Coherence), whereas coherence of phase between channels or frequency components is referred to as VPC (Vertical Phase Coherence).
The conventional voice analysis/synthesis apparatus gives the listener the impression of phase discrepancy because the VPC is not preserved. The VPC is not preserved because the first term of the right side of expression (1) cannot have a correct value. Let n be a phase unwrapping factor. Then expression (1) can be modified as follows, using expressions (4) and (10):
$$\theta'_{i,k} = \theta'_{i-1,k} + \rho \, (\theta_{i,k} - \theta_{i-1,k} + 2n\pi) \qquad (14)$$
Now, assume that the value of scaling factor ρ is an integer. Then, a phase unwrapping term of 2nπ included in the right side of expression (14) is deletable and expression (14) can be expressed as:
$$\theta'_{i,k} = \theta'_{i-1,k} + \rho \, (\theta_{i,k} - \theta_{i-1,k}) = \theta'_{0,k} + \rho \sum_{j=1}^{i} (\theta_{j,k} - \theta_{j-1,k}) = \theta'_{0,k} + \rho \, (\theta_{i,k} - \theta_{0,k}) \qquad (15)$$
If initial phase θ′0,k is set to ρθ0,k, expression (15) is expressed as:
$$\theta'_{i,k} = \rho \, \theta_{i,k} \qquad (16)$$
Thus, the first term of the right side of expression (1) is erased. Hence, both HPC and VPC are preserved, thereby bringing about scaling giving no impression of phase discrepancy. However, if scaling factor ρ has a value other than an integer, the first term of the right side of expression (1) will remain.
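The effect of an integer scaling factor can be illustrated numerically. The sketch below accumulates expression (14) with arbitrarily chosen unwrapping factors n and compares the result, modulo 2π, with ρ times the wrapped phase of the last frame. The names `accumulate` and `princarg` and all phase values are hypothetical test data, not taken from the source.

```python
import math


def princarg(phase):
    """Wrap a phase into the interval [-pi, pi)."""
    return (phase + math.pi) % (2.0 * math.pi) - math.pi


def accumulate(wrapped_phases, rho, unwrap_counts):
    """Apply expression (14) frame by frame, starting from rho * theta_0."""
    synth = rho * wrapped_phases[0]
    for i in range(1, len(wrapped_phases)):
        diff = (wrapped_phases[i] - wrapped_phases[i - 1]
                + 2.0 * math.pi * unwrap_counts[i])
        synth += rho * diff
    return synth


thetas = [0.3, -2.0, 1.1, 2.9]      # wrapped phases of one channel
counts_a = [0, 1, 0, 2]             # two different choices of n per frame
counts_b = [0, 0, 3, -1]

# Integer rho: both choices agree with rho * theta_i modulo 2*pi.
int_a = accumulate(thetas, 2, counts_a)
int_b = accumulate(thetas, 2, counts_b)
# Non-integer rho: the result depends on the choice of n.
frac_a = accumulate(thetas, 1.5, counts_a)
frac_b = accumulate(thetas, 1.5, counts_b)
```

With ρ = 2 the term ρ·2nπ is a multiple of 2π and drops out, whereas with ρ = 1.5 the two choices of n give different wrapped phases.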
The first term of the right side of expression (1) comprises the accumulated converted values (= ρ·ΔΘi,k) of the unwrapped phase differences. In order to keep this accumulated value correct, it is necessary to cope appropriately with the following points:
1) Influence of the initial phase value,
2) Transition of a frequency component between channels, and
3) Disappearance/production of a frequency component.
With reference to point 1), the accumulated converted value can be maintained at a correct value by setting initial phase θ′0,k to ρθ0,k as described above.
With reference to point 2), the accumulated converted value can be maintained at a correct value if (a) the channel in which the frequency component is present is tracked by picking the peak of the frequency amplitudes, (b) it is detected that the frequency component has transited from its present channel to another channel, and (c) the phase difference across the channels is calculated. When the frequency component (or signal) has transited from channel k to channel k+1, expression (14) can be modified as:
$$\theta'_{i,k+1} = \theta'_{i-1,k} + \rho \, (\theta_{i,k+1} - \theta_{i-1,k} + 2n\pi) \qquad (17)$$
Phase unwrapping factor n is then calculated using central frequency Ωi,k+1. When tracking of the transition of the frequency component fails, the accumulated converted value becomes inaccurate and the VPC is not maintained. When a frequency component transits between channels in a frame, a situation can occur in which the immediately preceding frame has no channel corresponding to the channel in the present frame from which the frequency component transited. In this case, an accurate accumulated converted value cannot be obtained due to the channel discrepancy.
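Steps (a) to (c) and expression (17) can be sketched for the simplest case of a single dominant frequency component. This is a hedged illustration only: `peak_channel` and `synth_phase_with_tracking` are hypothetical names, the unwrapping term 2nπ is assumed to be supplied by the caller, and real signals contain many components that would each need tracking.

```python
def peak_channel(amplitudes):
    """Index of the channel with the largest frequency amplitude."""
    return max(range(len(amplitudes)), key=lambda k: amplitudes[k])


def synth_phase_with_tracking(prev_synth, prev_phases, cur_phases,
                              prev_amps, cur_amps, rho, unwrap_term):
    """Follow the peak from the preceding frame to the present frame.

    Returns the present peak channel and its synthesized phase, computed
    across channels as in expression (17) when the peak has moved.
    """
    k_prev = peak_channel(prev_amps)   # channel holding the component before
    k_cur = peak_channel(cur_amps)     # channel holding the component now
    diff = cur_phases[k_cur] - prev_phases[k_prev] + unwrap_term
    return k_cur, prev_synth[k_prev] + rho * diff
```

When the peak stays in the same channel, the calculation reduces to expression (14).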
With reference to point 3), the disappearance/production of frequency components is considered inevitable in general voices and/or musical sounds, excluding special voices whose waveforms are, for example, stationary. Since disappearance/production of frequency components occurs randomly and very often, especially in noise having no harmonic structure, it is practically impossible to detect and hence avoid them.
Thus, in the conventional voice analysis/synthesis apparatus, maintaining the VPC is practically impossible except when the value of scaling factor ρ is an integer. Hence, synthesis of a voice waveform that gives an impression of phase discrepancy cannot be reliably avoided, and a way to reliably avoid it has been desired.
In the voice analysis/synthesis apparatus disclosed in Japanese Patent 2753716 publication, the phase of a pitch-changed synthesized voice waveform is controlled in accordance with the extent of frame overlapping performed in the synthesis process. It is because this phase control is performed that the accumulated converted value, or first term of the right side of expression (1), cannot have a correct value.