Transform coding technology is designed to code audio signals efficiently. However, the fundamental frequency of a signal representing human speech can vary over time, which causes the energy of the speech signal to spread out over wider frequency bands. Coding a pitch-varying speech signal with a transform codec is therefore inefficient, especially at low bitrates. Conventional techniques use time warping to compensate for the effects of pitch variation, as disclosed in NPL 3 [3] and PTL 1 [4], for example.
FIG. 10 illustrates an example of the idea of shifting the fundamental frequency.
The time-warping technique is used for the pitch shifting. In FIG. 10, (a) illustrates an original spectrum and (b) illustrates the spectrum after pitch shifting.
In (b) of FIG. 10, the fundamental frequency is shifted from 200 Hz to 100 Hz. By shifting the pitch of the next frame to align with the pitch of the previous frame, the pitch is made consistent.
FIG. 11 illustrates the spectrum after pitch shifting.
The energy of the signal converges as shown in FIG. 11.
In FIG. 11, (a) illustrates a sweep signal and (b) illustrates the signal after pitch shifting. The pitch shown in (b) is constant.
In FIG. 11, (c) illustrates the spectrum of the signal shown in (a) and the spectrum of the signal shown in (b). As shown in (c) of FIG. 11, the energy of the signal (b) is confined to a narrow bandwidth.
The pitch shifting is achieved using a re-sampling method. In order to maintain a consistent pitch, the re-sampling rate varies according to the pitch change rate. For an input frame, a pitch contour of this frame is obtained by applying a pitch tracking algorithm.
FIG. 8 illustrates segmentation of one audio frame.
A frame is segmented into small sections for pitch tracking as shown in FIG. 8. The adjacent sections may overlap with each other. For example, in at least one combination of sections, (part of) one section of two adjacent sections may overlap with (part of) the other section.
Existing pitch tracking algorithms include auto-correlation-based methods, disclosed in NPL [1], and frequency-domain pitch detection methods, disclosed in NPL [2].
Each of the sections has a corresponding pitch value.
FIG. 15 illustrates calculation of a pitch contour.
In FIG. 15, (a) illustrates a signal with time-varying pitch. One pitch value is calculated from a section of the signal. A pitch contour is a concatenation of the pitch values.
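As an illustration only, the calculation of a pitch contour from overlapping sections can be sketched in Python as follows. The sketch uses a plain auto-correlation peak search as a stand-in for the cited pitch tracking algorithms. The function names, the search range of 60 Hz to 500 Hz, and the section and hop lengths are assumptions made for this example and are not taken from the cited references.

```python
import math


def estimate_pitch(section, sample_rate, f_min=60.0, f_max=500.0):
    """Return the pitch in Hz whose lag maximizes the auto-correlation.

    f_min and f_max are assumed search bounds, not values from the source.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(section) - 1)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized auto-correlation of the section at this lag.
        score = sum(section[n] * section[n + lag]
                    for n in range(len(section) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def pitch_contour(frame, sample_rate, section_len, hop):
    """Concatenate one pitch value per section of the frame.

    Sections overlap whenever hop is smaller than section_len.
    """
    starts = range(0, len(frame) - section_len + 1, hop)
    return [estimate_pitch(frame[s:s + section_len], sample_rate)
            for s in starts]
```

For a steady 200 Hz tone sampled at 8000 Hz, every section yields a lag of 40 samples, so the contour is flat at 200 Hz. A signal with time-varying pitch would give a different value for each section.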
During time warping, the re-sampling rate is in proportion to the pitch change rate.
Pitch change information is extracted from the pitch contour.
Cents and semitones are often used to measure the pitch change rate.
FIG. 12 shows the measurement of the cents and semitones. A cent is calculated from a pitch ratio between adjacent pitches:
$$\text{cent} = 1200 \times \log_2 \frac{\text{pitch}(i+1)}{\text{pitch}(i)} \qquad \text{[Eq. 1]}$$
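A minimal sketch of Eq. 1 in Python is given below for illustration. The function names are hypothetical. It relies only on the standard relation that one octave (a pitch ratio of 2) is 1200 cents and one semitone is 100 cents.

```python
import math


def cents(pitch_next, pitch_current):
    """Pitch change in cents between adjacent pitches, following Eq. 1."""
    return 1200.0 * math.log2(pitch_next / pitch_current)


def cents_contour(pitches):
    """Pitch change in cents for every adjacent pair in a pitch contour."""
    return [cents(pitches[i + 1], pitches[i])
            for i in range(len(pitches) - 1)]
```

For example, a change from 100 Hz to 200 Hz is one octave upward, which gives 1200 cents, and the reverse change gives -1200 cents.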
Re-sampling is performed on the time-domain signal according to the pitch change rate. The pitches of the other sections are shifted to the reference pitch so that the pitch becomes consistent. For example, when the pitch of a section is higher than the pitch of the previous section, the re-sampling rate is set lower in proportion to the difference in cents between the two pitches. When the pitch of a section is lower than that of the previous section, the re-sampling rate needs to be higher.
With an audio player that allows playback speed adjustment, a higher tone is shifted to a lower frequency by slowing down the playback speed. This is similar to the idea of re-sampling a signal in proportion to the pitch change rate.
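The re-sampling idea can be sketched as follows, as an illustration under stated assumptions rather than the method of the cited techniques. The function name `warp_section`, the use of linear interpolation, and the definition of the read step as the ratio of the reference pitch to the section pitch are all choices made for this example.

```python
def warp_section(section, section_pitch, reference_pitch):
    """Re-sample one section so that its pitch moves to reference_pitch.

    The read step is reference_pitch / section_pitch (an assumed
    convention): a step below 1 stretches the section and lowers its
    pitch, like slowing down playback; a step above 1 raises it.
    """
    step = reference_pitch / section_pitch
    n_out = int((len(section) - 1) / step) + 1
    out = []
    for k in range(n_out):
        pos = k * step
        i = int(pos)
        frac = pos - i
        # Clamp the right neighbour at the end of the section.
        nxt = section[min(i + 1, len(section) - 1)]
        # Linear interpolation between the two nearest input samples.
        out.append((1.0 - frac) * section[i] + frac * nxt)
    return out
```

Shifting a 200 Hz section to a 100 Hz reference gives a step of 0.5, so the output has roughly twice as many samples and each period spans twice as many samples as before. When the section pitch already equals the reference pitch, the section is returned unchanged.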
FIG. 13 and FIG. 14 illustrate a coding system in which a time-warping scheme is integrated.
FIG. 13 is a block diagram of time warping in an encoder (an encoder 13A).
FIG. 14 is a block diagram of time warping in a decoder (a decoder 14A).
The time-domain signal is warped before transform encoding. Pitch information is necessary for the decoder to perform reverse time warping. Therefore, the pitch ratios need to be encoded by the encoder.
In the conventional techniques, a small fixed table is used for coding the pitch ratio information, so that only a small number of bits is spent on the pitch ratios. However, such a small table covers only a limited range of pitch ratios, so the performance of time warping deteriorates when the signal has a large pitch change rate.
On the other hand, a large table requires more bits, which leaves insufficient bits for transform coding, and therefore sound quality also deteriorates. Consequently, the effect of time warping using a fixed table is currently limited. The above processes (such as coding) are, for example, the same as the processes specified by the standards of the International Organization for Standardization (ISO), which will be described in detail below.
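The trade-off between table size and bit cost can be illustrated with the following sketch. The table contents are purely hypothetical: uniformly spaced pitch changes in cents, with ranges and sizes chosen for this example and not taken from any standard or from the cited references.

```python
import math


def make_table(max_cents, size):
    """Hypothetical fixed table: uniform steps from -max_cents to +max_cents."""
    step = 2.0 * max_cents / (size - 1)
    return [-max_cents + k * step for k in range(size)]


def encode(cents_value, table):
    """Index of the table entry nearest to the pitch change to be coded."""
    return min(range(len(table)), key=lambda k: abs(table[k] - cents_value))


def bits_per_index(table):
    """Number of bits needed to send one table index."""
    return math.ceil(math.log2(len(table)))
```

With these assumed values, an 8-entry table spanning ±30 cents costs 3 bits per index but maps a 200-cent change to its largest entry of 30 cents, so the decoder cannot undo the warp accurately. A 64-entry table spanning ±240 cents represents the same change to within a few cents but costs 6 bits per index, which are then no longer available for transform coding.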