Currently, in the field of speech coding at low bit rates of 10 kbps or less, Code Excited Linear Prediction (hereinafter referred to as "CELP") coding is widely used (refer to Non-Patent Document 1). CELP coding models the human speech production mechanism as a sound source component (the vocal cords) and a spectrum envelope component (the vocal tract), and encodes the parameters of each.
On the encoding side, the speech is divided into frames, and each frame is encoded. The spectrum envelope component is calculated with an AR (Auto-Regressive) model of the speech based on linear prediction, and is given as Linear Prediction Coding (hereinafter referred to as "LPC") coefficients. The sound source component is given as the prediction residual. The prediction residual is separated into period information indicating the pitch, noise information serving as the sound source, and gain information indicating the mixing ratio of the pitch and the sound source. Each of these components is represented by code vectors stored in a code book. A code vector is determined by passing candidate code vectors through a synthesis filter and searching for the one whose synthesized speech best approximates the input waveform, i.e., a closed-loop search using the AbS (Analysis by Synthesis) method.
Further, on the decoding side, the encoded information is decoded, and the LPC coefficients, the period information (pitch information), the noise sound source information, and the gain information are restored. The pitch information is added to the noise information, thereby generating an excitation source signal. The excitation source signal is passed through a linear-prediction synthesis filter whose filter coefficients are the LPC coefficients, thereby synthesizing speech.
FIG. 16 is a diagram showing an example of the basic structure of a speech coding apparatus using the CELP coding (Refer to Patent Document 1 and FIG. 9).
An original speech signal is divided into frames each having a predetermined number of samples, and the divided signals are input to an input terminal 101. A linear-prediction coding analyzing unit 102 calculates the LPC coefficients indicating the frequency spectrum envelope characteristic of the original speech signal input to the input terminal 101. Specifically, the autocorrelation function of the frame is obtained, and the LPC coefficients are calculated with the Levinson-Durbin recursion.
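The autocorrelation-to-LPC step can be sketched as follows. This is a minimal illustration of the Levinson-Durbin recursion, not the implementation of any unit in FIG. 16; the function name and the plain-list representation of the autocorrelation are our own.

```python
def levinson_durbin(r, order):
    """Compute LPC coefficients a[0..order] (a[0] = 1) from the autocorrelation
    sequence r[0..order], so that the predictor is x[n] ~ -sum(a[k]*x[n-k])."""
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]  # prediction error energy
    for i in range(1, order + 1):
        # Reflection (PARCOR) coefficient for stage i
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err
        # Update the coefficient set for this order
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

For the autocorrelation of a first-order process, e.g. r = [1.0, 0.5, 0.25], the recursion yields a[1] = -0.5 and a[2] = 0, i.e. the predictor x̂[n] = 0.5·x[n-1].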
An LPC coefficient encoding unit 103 quantizes and encodes the LPC coefficients, thereby generating an LPC coefficient code. In many cases, the quantization is performed after transforming the LPC coefficients into Line Spectrum Pair (LSP) parameters, Partial auto-Correlation (PARCOR) parameters, or reflection coefficients, which have higher quantization efficiency. An LPC coefficient decoding unit 104 decodes the LPC coefficient code and reproduces the LPC coefficients. Based on the reproduced LPC coefficients, the code books are searched so as to encode the prediction residual component (sound source component) of the frame. In many cases, the code books are searched on the basis of a unit (hereinafter referred to as a "subframe") obtained by further dividing the frame.
Herein, the code book comprises an adaptive code book 105, a noise code book 106, and a gain code book 107.
The adaptive code book 105 stores the pitch period and the amplitude of a pitch pulse as pitch period vectors, and expresses the pitch component of the speech. Each pitch period vector has a subframe length and is obtained by repeating, at a preset period, the residual component of the previous frames (the quantized drive sound source vectors of the immediately preceding one to several frames). The adaptive code book 105 selects one pitch period vector corresponding to the period component of the speech from among the stored pitch period vectors, and outputs the selected vector as a candidate of the time-series code vector.
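The construction of one adaptive-codebook candidate can be sketched as follows: the last `lag` samples of the past drive excitation are repeated periodically until a subframe is filled. This is a simplified sketch (real codecs also handle fractional lags and lags shorter than the subframe with interpolation); the function name is our own.

```python
def adaptive_codebook_vector(past_excitation, lag, subframe_len):
    """Build a candidate pitch period vector of subframe length by periodically
    repeating the last `lag` samples of the past (quantized) drive excitation."""
    segment = past_excitation[-lag:]   # one pitch period of past excitation
    out = []
    while len(out) < subframe_len:
        out.extend(segment)            # repeat at the candidate pitch period
    return out[:subframe_len]
```

For example, with a past excitation [0.1, 0.2, 0.3, 0.4], a lag of 2 samples, and a 5-sample subframe, the candidate is [0.3, 0.4, 0.3, 0.4, 0.3].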
The noise code book 106 stores, as excitation vectors, shape excitation source components indicating the waveform that remains after the pitch component is excluded from the residual signal, and expresses the noise component (non-periodic excitation) other than the pitch. Each excitation vector has a subframe length and is prepared in advance based on white noise, independently of the input speech. The noise code book 106 stores a predetermined number of the excitation vectors, selects one excitation vector corresponding to the noise component of the speech from among them, and outputs the selected vector as a candidate of the time-series code vector corresponding to the non-periodic component of the speech.
Further, the gain code book 107 expresses the gain of the pitch component of the speech and the gain of the remaining component.
Gain units 108 and 109 multiply the candidates of the time-series code vectors input from the adaptive code book 105 and the noise code book 106 by pitch gain ga and shape gain gr, respectively. The gains ga and gr are selected and output by the gain code book 107. Further, an adding unit 110 adds the two gained vectors and generates a candidate of the drive sound source vector.
A synthesizing filter 111 is a linear filter that sets the LPC coefficient output by the LPC coefficient decoding unit 104 as a filter coefficient. The synthesizing filter 111 performs filtering of the candidate of the drive sound source vector output from the adding unit 110, and outputs the filtering result as a reproducing speech candidate vector.
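The gain scaling, addition, and all-pole synthesis filtering described above can be sketched together as follows. This is an illustrative model assuming the common convention A(z) = 1 + Σ lpc[k]·z^(-k), so the filter output is y[n] = e[n] − Σ lpc[k]·y[n−k]; the function name and argument layout are our own.

```python
def synthesize(adaptive_vec, noise_vec, ga, gr, lpc, memory=None):
    """Form the drive excitation ga*adaptive + gr*noise (gain units and adder)
    and pass it through the all-pole synthesis filter 1/A(z),
    where A(z) = 1 + sum(lpc[k] * z^-(k+1))."""
    order = len(lpc)
    mem = list(memory) if memory is not None else [0.0] * order
    out = []
    for p, c in zip(adaptive_vec, noise_vec):
        e = ga * p + gr * c                                  # excitation sample
        y = e - sum(lpc[k] * mem[k] for k in range(order))   # all-pole recursion
        mem = [y] + mem[:-1]                                 # shift filter state
        out.append(y)
    return out
```

With lpc = [-0.5] (i.e. y[n] = e[n] + 0.5·y[n-1]) and a unit-impulse excitation, the output decays as 1.0, 0.5, 0.25, …, which is the impulse response of the synthesis filter.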
A comparing unit 112 subtracts the reproducing speech candidate vector from the original speech signal vector, and outputs distortion data. The distortion data is weighted by an auditory weighting filter 113 with coefficients corresponding to the properties of human hearing. In general, the auditory weighting filter 113 is a tenth-order moving-average autoregressive filter, and relatively emphasizes the formant peaks. The weighting is performed so that the encoding reduces quantization noise in the frequency bands where the speech spectrum envelope has small values (the valleys).
A distance minimizing unit 114 selects the period code, noise code, and gain code that minimize the squared error of the distortion data output from the auditory weighting filter 113. The period code, noise code, and gain code are individually sent to the adaptive code book 105, the noise code book 106, and the gain code book 107. The adaptive code book 105 outputs the candidate of the next time-series code vector based on the input period code. The noise code book 106 outputs the candidate of the next time-series code vector based on the input noise code. Further, the gain code book 107 outputs the next gains ga and gr based on the input gain code.
By repeating this AbS loop, the distance minimizing unit 114 determines the period code, noise code, and gain code that minimize the distortion data output from the auditory weighting filter 113, and thereby determines the drive sound source vector of the frame.
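The core of the closed-loop search can be sketched as follows. In practice, the candidates are the codebook vectors after synthesis and auditory weighting, and for each candidate s the error against the weighted target t is minimized by the gain g = ⟨t,s⟩/⟨s,s⟩, so the search maximizes ⟨t,s⟩²/⟨s,s⟩. This is a generic sketch of that criterion, not the procedure of any specific patent document; the function name is our own.

```python
def search_codebook(target, filtered_candidates):
    """Closed-loop (AbS) search: for each filtered candidate s, the optimal
    gain is g = <t,s>/<s,s>; pick the candidate maximizing <t,s>^2/<s,s>,
    which minimizes the weighted squared error."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    best_idx, best_crit = -1, -1.0
    for idx, s in enumerate(filtered_candidates):
        den = dot(s, s)
        if den <= 0.0:
            continue                          # skip degenerate candidates
        crit = dot(target, s) ** 2 / den      # search criterion to maximize
        if crit > best_crit:
            best_idx, best_crit = idx, crit
    s = filtered_candidates[best_idx]
    gain = dot(target, s) / dot(s, s)         # optimal gain for the winner
    return best_idx, gain
```

For example, with target [1, 0] and candidates [[0, 1], [2, 0]], the second candidate is selected with gain 0.5.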
A code sending unit 115 converts the period code, noise code, and gain code determined by the distance minimizing unit 114 and the LPC coefficient code output from the LPC coefficient encoding unit 103 into a bit-series code, further adds error-correcting code as needed, and outputs the resultant code.
FIG. 17 shows an example of the basic structure of a speech decoding apparatus using the CELP encoding (refer to Patent Document 1 and FIG. 11).
The speech decoding apparatus has substantially the same structure as that of the speech coding apparatus, except that it does not search the code books. A code receiving unit 121 receives the LPC coefficient code, period code, noise code, and gain code. The LPC coefficient code is sent to an LPC coefficient decoding unit 122. The LPC coefficient decoding unit 122 decodes the LPC coefficient code and generates the LPC coefficients (filter coefficients).
The adaptive code book 123 stores pitch period vectors. Each pitch period vector has a subframe length and is obtained by repeating, at a preset period, the residual component of the previous frames (the decoded drive sound source vectors of the immediately preceding one to several frames). The adaptive code book 123 selects one pitch period vector corresponding to the period code input from the code receiving unit 121, and outputs the selected vector as the time-series code vector.
The noise code book 124 stores excitation vectors. Each excitation vector has a subframe length and is prepared based on white noise, independently of the input speech. One of the excitation vectors is selected in accordance with the noise code input from the code receiving unit 121, and the selected vector is output as the time-series code vector corresponding to the non-periodic component of the speech.
Further, the gain code book 125 stores the gains (pitch gain ga and shape gain gr) of the pitch component of the speech and of the remaining component. The gain code book 125 selects and outputs the pair of pitch gain ga and shape gain gr corresponding to the gain code input from the code receiving unit 121.
Gain units 126 and 127 multiply the time-series code vectors output from the adaptive code book 123 and the noise code book 124 by the pitch gain ga and the shape gain gr, respectively. Further, an adding unit 128 adds the two gained vectors and generates the drive sound source vector.
A synthesizing filter 129 is a linear filter that sets the LPC coefficients output by the LPC coefficient decoding unit 122 as its filter coefficients. The synthesizing filter 129 performs filtering of the drive sound source vector output from the adding unit 128, and outputs the filtering result as reproduced speech to a terminal 130.
The MPEG standards and audio devices widely use subband coding. With subband coding, a speech signal is divided into a plurality of frequency bands (subbands), and bits are assigned in accordance with the signal energy in each subband, thereby performing the coding efficiently. As technologies for applying subband coding to speech coding, the technologies disclosed in Patent Documents 2 to 4 are well known.
With the speech coding disclosed in Patent Documents 2 to 4, the speech signal is basically encoded by the following signal processing.
First, the pitch is extracted from an input original speech signal. Then, the original speech signal is divided into pitch intervals. Subsequently, the speech signals at the pitch intervals obtained by the division are resampled so that the number of samples per pitch interval is constant. Further, the resampled speech signal at each pitch interval is subjected to orthogonal transformation such as DCT, thereby generating subband data comprising (n+1) pieces of data. The (n+1) pieces of data obtained along the time series are then subjected to filtering that removes components above a predetermined frequency from the time-based change in intensity, thereby smoothing the data and generating (n+1) pieces of acoustic information data. Further, the ratio of the high-frequency components in the subband data is compared with a threshold to determine whether or not the original speech signal is friction sound, and the determination result is output as friction sound information.
Finally, the original speech signal is divided into information (pitch information) indicating the original pitch length of each pitch interval, acoustic information containing the (n+1) pieces of acoustic information data, and the friction sound information, and the divided information is encoded.
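The per-pitch-interval resampling and orthogonal transformation steps above can be sketched as follows. This is an illustrative pipeline assuming linear-interpolation resampling and a DCT-II; the function name and the choice of interpolation are our own, not taken from Patent Documents 2 to 4.

```python
import math

def pitch_interval_to_subbands(samples, fixed_len):
    """Resample one pitch interval to `fixed_len` samples (linear
    interpolation), then apply a DCT-II so that spectrum[k] indicates the
    intensity of the k-th subband (k = 0 is the basic frequency component)."""
    n = len(samples)
    resampled = []
    for i in range(fixed_len):
        pos = i * (n - 1) / (fixed_len - 1)       # map onto the original grid
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, n - 1)
        resampled.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    # Unnormalized DCT-II over the fixed-length pitch waveform
    spectrum = [
        sum(resampled[t] * math.cos(math.pi * k * (2 * t + 1) / (2 * fixed_len))
            for t in range(fixed_len))
        for k in range(fixed_len)
    ]
    return resampled, spectrum
```

For a constant (DC) pitch interval, all the energy falls in spectrum[0] and the higher subbands are zero, as expected of an orthogonal transform.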
FIG. 18 is a diagram showing an example of the structure of a speech coding apparatus (speech signal processing apparatus) disclosed in Patent Document 2. The original speech signal (speech data) is input to a speech data input unit 141. A pitch extracting unit 142 extracts a basic-frequency signal (pitch signal) from the speech data input to the speech data input unit 141, and segments the speech data by the unit period of the pitch signal (one pitch interval as a unit). Further, the speech data at each unit pitch interval is shifted and adjusted so as to maximize the correlation between the speech data and the pitch signal, and the adjusted data is output to a pitch-length fixing unit 143.
The pitch-length fixing unit 143 resamples the speech data at each unit pitch interval so that the number of samples per unit pitch interval is substantially equal, and outputs the resampled speech data as pitch waveform data. Incidentally, the resampling removes the information on the length (pitch period) of each unit pitch interval, and the pitch-length fixing unit 143 therefore outputs information on the original pitch length of each unit pitch interval as the pitch information.
A subband dividing unit 144 performs orthogonal transformation, such as DCT, of the pitch waveform data, thereby generating subband data. The subband data is time-series data containing (n+1) pieces of spectrum intensity data, indicating the intensity of the basic frequency component of the speech and the intensities of n harmonic components of the speech.
A band information limiting unit 145 performs filtering of the (n+1) pieces of spectrum intensity data forming the subband data, thereby removing components above a predetermined frequency from the time-based change in the (n+1) pieces of spectrum intensity data. This processing is performed to remove the influence of the aliasing generated as a result of the resampling by the pitch-length fixing unit 143.
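The band limiting described above operates across time on each subband's intensity track. A minimal sketch, assuming a causal moving-average low-pass as the smoothing filter (the patent documents do not fix a particular filter, and the function name is our own):

```python
def smooth_subband_tracks(frames, window=3):
    """Low-pass the time-based change of each spectrum intensity with a causal
    moving average, suppressing fast fluctuations caused by resampling
    aliasing. `frames` is a list of per-pitch-interval spectra, one spectrum
    ((n+1) intensities) per pitch interval."""
    n_bands = len(frames[0])
    smoothed = []
    for t in range(len(frames)):
        lo = max(0, t - window + 1)          # window over the last few intervals
        avg = [sum(frames[u][b] for u in range(lo, t + 1)) / (t + 1 - lo)
               for b in range(n_bands)]
        smoothed.append(avg)
    return smoothed
```

For example, a single-band track [0, 2, 4] with a 3-interval window becomes [0, 1, 2]: rapid interval-to-interval changes are attenuated while the slow trend survives.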
The subband data filtered by the band information limiting unit 145 is nonlinearly quantized by a non-linear quantizing unit 146, is encoded by a dictionary selecting unit 147, and is output as the acoustic information.
A friction sound detecting unit 149 determines, based on the ratio of the high-frequency components to all spectrum intensities of the subband data, whether the input speech data is voiced sound or unvoiced sound (friction sound), and outputs friction sound information as the determination result.
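The voiced/unvoiced decision can be sketched as a simple energy-ratio test. The split index and threshold below are hypothetical illustration parameters, not values taken from Patent Document 2; the function name is our own.

```python
def is_friction_sound(spectrum, split, threshold=0.5):
    """Classify one pitch interval as friction (unvoiced) sound when the share
    of energy in the subbands at index `split` and above exceeds `threshold`."""
    total = sum(x * x for x in spectrum)
    if total == 0.0:
        return False                          # silent interval: treat as voiced
    high = sum(x * x for x in spectrum[split:])
    return high / total > threshold
```

A spectrum dominated by its high subbands is classified as friction sound; one dominated by the basic frequency and low harmonics is not.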
As mentioned above, the fluctuation of the pitch is removed before the original speech signal is divided into subbands, and the orthogonal transformation is performed for every pitch interval to divide the signal into subbands. Since the time-based change in the spectrum intensity of each subband is accordingly small, a high compression rate is achieved for the acoustic information.
[Patent Document 1]
Japanese Patent Publication No. 3199128
[Patent Document 2]
Japanese Unexamined Patent Application Publication No. 2003-108172
[Patent Document 3]
Japanese Unexamined Patent Application Publication No. 2003-108200
[Patent Document 4]
Japanese Unexamined Patent Application Publication No. 2004-12908
[Non-Patent Document 1]
Manfred R. Schroeder and Bishnu S. Atal, “Code-excited Linear Prediction (CELP): High-Quality Speech at Very Low Bit Rates”, Proceedings of ICASSP '85, pp. 25.1.1 to 25.1.4, 1985.
[Non-Patent Document 2]
Hitoshi KIYA, “Multirate Signal Processing in Series of Digital Signal Processing (Volume 14)”, first edition, Oct. 6, 1995, pp. 34 to 49 and 78 to 79.