This invention relates to a voice coding apparatus for compressing a digital sound signal to a smaller information amount and a voice decoding apparatus for decoding voice code generated by the voice coding apparatus, etc., to reproduce the digital sound signal.
Most voice coding apparatus and voice decoding apparatus in related arts separate input voice into spectrum envelope information and a sound source and code them in frame units to generate voice code, then decode the voice code to combine the spectrum envelope information and the sound source through a combining filter, thereby providing decoded voice.
A voice coding apparatus and a voice decoding apparatus using a code-excited linear prediction (CELP) technique are available as the most representative voice coding apparatus and voice decoding apparatus.
FIG. 15 shows the general configuration of a CELP-based voice coding apparatus. In the figure, numeral 1 denotes input voice, numeral 2 denotes linear prediction analysis means, numeral 3 denotes linear prediction coefficient coding means, numeral 4 denotes adaptive sound source coding means, numeral 5 denotes drive sound source coding means, numeral 6 denotes gain coding means, numeral 7 denotes multiplexing means, and numeral 8 denotes voice code.
FIG. 16 shows the general configuration of a CELP-based voice decoding apparatus. In the figure, numeral 9 denotes demultiplexing means, numeral 10 denotes linear prediction coefficient decoding means, numeral 11 denotes adaptive sound source decoding means, numeral 12 denotes drive sound source decoding means, numeral 13 denotes gain decoding means, numeral 14 denotes a combining filter, and numeral 15 denotes output voice.
The voice coding apparatus and the voice decoding apparatus in the related art perform processing in frame units with about 5 to 50 ms as a frame. The operation of the voice coding apparatus and the voice decoding apparatus in the related art is as follows:
First, in the voice coding apparatus, the input voice 1 is input to the linear prediction analysis means 2 and the adaptive sound source coding means 4. The linear prediction analysis means 2 analyzes the input voice 1 and extracts a linear prediction coefficient of voice spectrum envelope information. The linear prediction coefficient coding means 3 codes the linear prediction coefficient and outputs the code to the multiplexing means 7 and also outputs the coded linear prediction coefficient for coding a sound source.
The adaptive sound source coding means 4, in which past sound sources are previously stored as an adaptive sound source code book, prepares time-series vectors periodically repeating the past sound sources corresponding to the adaptive sound source codes. Next, the adaptive sound source coding means 4 multiplies each time-series vector by an appropriate gain and allows the result to pass through a combining filter using the coded linear prediction coefficient for providing a tentative composite tone. It examines the distance between the tentative composite tone and the input voice 1, selects an adaptive sound source code to minimize the distance, and outputs the time-series vector corresponding to the selected adaptive sound source code as the adaptive sound source. The adaptive sound source coding means 4 also outputs the input voice 1 or a signal provided by subtracting the composite tone based on the adaptive sound source from the input voice 1 to the drive sound source coding means 5 at the following stage.
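The adaptive sound source search described above can be sketched as follows. This is a minimal illustration, not the apparatus itself: the function names, the zero-state all-pole filter with the convention A(z) = 1 + sum of a[k] z^-(k+1), and the use of a least-squares optimal gain as the "appropriate gain" are all assumptions made for the example.

```python
def adaptive_vector(past, lag, n):
    """Periodically repeat the last `lag` samples of the past sound source."""
    seg = past[-lag:]
    return [seg[i % lag] for i in range(n)]


def synthesize(exc, a):
    """Zero-state combining (synthesis) filter 1/A(z); convention is an assumption."""
    out = []
    for n, x in enumerate(exc):
        fb = sum(a[k] * out[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        out.append(x - fb)
    return out


def search_adaptive(target, past, a, lags):
    """Return (lag, gain, err) minimizing the squared distance to `target`."""
    best = None
    for lag in lags:
        tone = synthesize(adaptive_vector(past, lag, len(target)), a)
        energy = sum(x * x for x in tone)
        if energy == 0.0:
            continue  # silent candidate, no gain can help
        gain = sum(t * x for t, x in zip(target, tone)) / energy
        err = sum((t - gain * x) ** 2 for t, x in zip(target, tone))
        if best is None or err < best[2]:
            best = (lag, gain, err)
    return best
```

Here each candidate lag plays the role of an adaptive sound source code, and the tentative composite tone is `gain * tone`.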
The drive sound source coding means 5 first reads time-series vectors sequentially from a drive sound source code book stored in the drive sound source coding means 5 corresponding to drive sound source codes. Next, the drive sound source coding means 5 multiplies each time-series vector and the adaptive sound source by an appropriate gain, adds the results, and allows the addition result to pass through a combining filter using the coded linear prediction coefficient for providing a tentative composite tone. It uses the input voice 1 or the signal provided by subtracting the composite tone based on the adaptive sound source from the input voice 1 as a signal to be coded, examines the distance between the signal to be coded and the tentative composite tone, selects a drive sound source code to minimize the distance, and outputs the time-series vector corresponding to the selected drive sound source code as the drive sound source.
The gain coding means 6 first reads gain vectors sequentially from a gain code book stored in the gain coding means 6 corresponding to gain codes. The gain coding means 6 multiplies the adaptive sound source and the drive sound source by each element of each gain vector, adds the results, and allows the addition result to pass through a combining filter using the coded linear prediction coefficient for providing a tentative composite tone. It examines the distance between the tentative composite tone and the input voice 1 and selects a gain code to minimize the distance.
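The gain search can be illustrated with a short sketch. The gain code book contents, the function names, and the zero-state filter convention A(z) = 1 + sum of a[k] z^-(k+1) are hypothetical and chosen only for the example.

```python
def synthesize(exc, a):
    """Zero-state combining (synthesis) filter 1/A(z); convention is an assumption."""
    out = []
    for n, x in enumerate(exc):
        fb = sum(a[k] * out[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        out.append(x - fb)
    return out


def search_gain(target, adaptive, drive, a, gain_book):
    """Try every gain vector (g_adaptive, g_drive); return (gain_code, err)."""
    best_code, best_err = None, float("inf")
    for code, (g_a, g_d) in enumerate(gain_book):
        # Multiply both sound sources by the gain elements and add the results.
        exc = [g_a * p + g_d * c for p, c in zip(adaptive, drive)]
        tone = synthesize(exc, a)  # tentative composite tone
        err = sum((t - x) ** 2 for t, x in zip(target, tone))
        if err < best_err:
            best_code, best_err = code, err
    return best_code, best_err
```

The index of the winning gain vector is what would be transmitted as the gain code.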
Last, the adaptive sound source coding means 4 multiplies the adaptive sound source and the drive sound source by each element of the gain vector corresponding to the selected gain code and adds the results, thereby preparing a sound source and updating the adaptive sound source code book.
The multiplexing means 7 multiplexes the linear prediction coefficient code, the adaptive sound source code, the drive sound source code, and the gain code and outputs the resulting voice code 8.
In the voice decoding apparatus, the demultiplexing means 9 demultiplexes the voice code 8 into the linear prediction coefficient code, the adaptive sound source code, the drive sound source code, and the gain code.
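As an illustration of the multiplexing and demultiplexing steps, the sketch below packs the four codes into one integer and unpacks them again. The field names and bit widths are hypothetical; the text does not specify a bit allocation.

```python
# Hypothetical field order and bit widths, for illustration only.
FIELDS = [("lpc", 18), ("adaptive", 8), ("drive", 17), ("gain", 7)]


def multiplex(codes):
    """Pack the per-frame codes into a single integer voice code."""
    word = 0
    for name, width in FIELDS:
        value = codes[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} code does not fit in {width} bits")
        word = (word << width) | value
    return word


def demultiplex(word):
    """Split the voice code back into its four codes (inverse of multiplex)."""
    codes = {}
    for name, width in reversed(FIELDS):
        codes[name] = word & ((1 << width) - 1)
        word >>= width
    return codes
```

With these widths, one frame occupies 50 bits; `demultiplex(multiplex(codes))` returns the original codes.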
The linear prediction coefficient decoding means 10 decodes the linear prediction coefficient from the linear prediction coefficient code and sets the linear prediction coefficient as a coefficient of the combining filter 14.
Next, the adaptive sound source decoding means 11, in which past sound sources are previously stored as an adaptive sound source code book, outputs time-series vectors periodically repeating the past sound sources corresponding to the adaptive sound source codes. The drive sound source decoding means 12 outputs the time-series vector corresponding to the drive sound source code. The gain decoding means 13 outputs the gain vector corresponding to the gain code. The two time-series vectors are multiplied by each element of the gain vector and the results are added for preparing a sound source. This sound source is made to pass through the combining filter 14 to prepare an output voice 15.
Last, the adaptive sound source decoding means 11 uses the prepared sound source to update the adaptive sound source code book.
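The decoding steps above can be sketched for one frame as follows. The function names, the representation of the drive sound source as (position, polarity) pulses, and the zero-state filter with A(z) = 1 + sum of a[k] z^-(k+1) are assumptions made for the example; a practical decoder would carry the filter state across frames.

```python
def synthesize(exc, a):
    """Zero-state combining (synthesis) filter 1/A(z); convention is an assumption."""
    out = []
    for n, x in enumerate(exc):
        fb = sum(a[k] * out[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        out.append(x - fb)
    return out


def decode_frame(past, lag, pulses, gains, a, n):
    """Return (output_voice, updated_past) for one frame of n samples."""
    seg = past[-lag:]
    adaptive = [seg[i % lag] for i in range(n)]  # periodic repetition of the past
    drive = [0.0] * n
    for pos, polarity in pulses:                 # drive sound source as signed pulses
        drive[pos] += polarity
    g_a, g_d = gains
    exc = [g_a * p + g_d * c for p, c in zip(adaptive, drive)]
    voice = synthesize(exc, a)
    # Update the adaptive sound source code book with the new sound source.
    new_past = (past + exc)[-len(past):]
    return voice, new_past
```

Because the coder performs the same update, coder and decoder keep identical adaptive sound source code books as long as no transmission error occurs.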
Next, related arts intended for improving the CELP-based voice coding apparatus and voice decoding apparatus will be discussed.
Document 1
KATAOKA Akitoshi, HAYASHI Shinji, MORITANI Takehiro, KURIHARA Shoko, MANO Kazunori “CS-ACELP no kihon algorithm” NTT R&D, Vol. 45, pp. 325-330 (April 1996) discloses CELP-based voice coding apparatus and voice decoding apparatus adopting a pulse sound source for coding a drive sound source for the main purpose of reducing the operation amount and the memory amount. In the configuration in the related art, a drive sound source is represented only by the position information and polarity information of several pulses. Such a sound source, which is called an algebraic sound source, has a good coding characteristic for its simple structure and has been adopted in most recent standards.
FIG. 17 is a table listing position candidates of pulse sound sources used in Document 1. In Document 1, the sound source coding frame length is 40 samples and each drive sound source consists of four pulses. The position candidates of each of the pulse sound sources with sound source numbers 1 to 3 are limited to eight positions as shown in FIG. 17, and each pulse position can be coded in three bits. The position candidates of the pulse sound source with sound source number 4 are limited to 16 positions, and the pulse position can be coded in four bits. The position candidates of the pulse sound sources are limited, whereby the number of code bits and the number of combinations can be reduced for reducing the operation amount while degradation of the coding characteristic is suppressed.
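The bit counts above can be checked with a short sketch. The table below uses the interleaved layout of the ITU-T G.729 (CS-ACELP) standard and is assumed here to correspond to FIG. 17, which is not reproduced in this text; the names `TRACKS`, `position_bits`, and `build_drive` are illustrative.

```python
FRAME_LEN = 40  # sound source coding frame length in samples

# Position candidates per sound source number (assumed G.729-style layout).
TRACKS = [
    list(range(0, FRAME_LEN, 5)),                                  # number 1
    list(range(1, FRAME_LEN, 5)),                                  # number 2
    list(range(2, FRAME_LEN, 5)),                                  # number 3
    list(range(3, FRAME_LEN, 5)) + list(range(4, FRAME_LEN, 5)),   # number 4
]


def position_bits(tracks):
    """Bits needed to index each track's position candidates."""
    return [(len(t) - 1).bit_length() for t in tracks]


def build_drive(indices, polarities):
    """Place one signed unit pulse per sound source number."""
    exc = [0] * FRAME_LEN
    for track, idx, pol in zip(TRACKS, indices, polarities):
        exc[track[idx]] += pol
    return exc
```

Under this layout the four positions take 3 + 3 + 3 + 4 = 13 bits, and the 8 x 8 x 8 x 16 = 8192 position combinations are what the search has to cover, instead of every placement of four pulses among 40 samples.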
The configurations for improving the quality of the algebraic sound source are disclosed in the Unexamined Japanese Patent Application Publication No. Hei 10-232696 and
Document 2
Tadashi Amada, Kimio Miseki and Masami Akamine “CELP SPEECH CODING BASED ON AN ADAPTIVE PULSE POSITION CODEBOOK” 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. I, pp. 13-16 (March 1999), and
Document 3
TUCHIYA, AMADA, MISEKI “Tekiou pulse ichi ACELP onsei fugouka no kaizen” Nihon Onkyou Gakkai 1999 shunki kenkyuu happoukai kouen ronbunshuu I, pp. 213-214.
In the Unexamined Japanese Patent Application Publication No. Hei 10-232696, a plurality of fixed waveforms are provided and are placed at algebraically coded sound source positions, thereby preparing drive sound sources. A plurality of drive sound source preparation means (noise code books) are provided and one of them is selected for use based on coding distortion or the voice analysis result. As the plurality of drive sound source preparation means, the publication discloses means that differ in the number of fixed waveforms, and a configuration in which at least one means prepares a random number sequence and a pulse string different from the algebraic sound source. According to the configurations, a high-quality output voice can be provided.
Document 2 indicates that the position candidates of pulse sound sources are set adaptively for each frame so that they collect where amplitude envelopes of adaptive sound sources are large in size, whereby the coding characteristic can be improved.
Document 3 corresponds to an improvement in Document 2. When a pitch filter is contained in a drive sound source (in Document 3, ACELP sound source) preparation section, there is a tendency to easily select the sound source position in the first one-pitch period section, and the position candidates of pulse sound sources are set adaptively for each frame based on the size of the amplitude envelope of the adaptive sound source undergoing pitch inverse filtering at the time.
The described related arts involve the following problems:
In the voice coding apparatus and the voice decoding apparatus disclosed in Document 1, a fixed number of position candidates for each sound source number exist for each of the divisions into which a frame is equally divided; namely, the candidates are distributed equally within the frame. To lower the bit rate with the configuration intact, the number of bits must be decreased, that is, the position candidates for each sound source number must be thinned out at equal intervals; in this case, however, abrupt characteristic degradation is incurred.
To help resolve the problem, Documents 2 and 3 each disclose an adaptive thinning-out method for suppressing the characteristic degradation. However, when the periodicity of input voice is disordered or changes, adaptive thinning out results in large characteristic degradation; this is a problem. The adaptive thinning-out processing also affects the drive sound source when an error occurs in the adaptive sound source because of a code transmission error on a communication channel; this is also a problem.
In Document 3, when a pitch filter is contained in the drive sound source preparation section, the sound source position candidates are concentrated on the first one-pitch period section, whereby an average characteristic improvement is accomplished. However, the latter half of a frame may be important, for example in the voice rising section, which is the most important for the hearing sense; in that case the latter half of the frame cannot be represented well, so that the coding characteristic is degraded and the quality of the hearing impression is degraded.
In the Unexamined Japanese Patent Application Publication No. Hei 10-232696, a plurality of drive sound source preparation means (noise code books) are provided with the intention of improving the characteristic, but the position candidates themselves where fixed sound sources are placed are not novel (they are the same as in Document 1). As in Document 1, lowering the bit rate incurs abrupt characteristic degradation.
In both Document 1 and the Unexamined Japanese Patent Application Publication No. Hei 10-232696, if the sound source positions provided as the coding result concentrate at the back of the frame, a low-amplitude section of the drive sound source is produced in the first half of the frame, and a discontinuity in amplitude is heard in a section where the amplitude of the adaptive sound source is small, such as a fricative sound; this is a problem. FIG. 18 shows an example of output voice 15 involving the discontinuous sense. Since the first drive sound source position in the frame is at a distance from the top of the frame, a low-amplitude section occurs in the vicinity of the frame top. In the Unexamined Japanese Patent Application Publication No. Hei 10-232696, a mode of coding a sound source in a random number sequence, etc., can also be provided for resolving the problem. However, doing so loses the advantage of the algebraic sound source, namely its small memory amount and operation amount.
It is therefore an object of the invention to provide a voice coding apparatus and a voice decoding apparatus that provide good quality even at a low bit rate.
According to the invention, there is provided a voice coding apparatus comprising drive sound source coding means, gain coding means, and spectrum envelope information coding means, wherein an input voice is separated into spectrum envelope information and a sound source and the spectrum envelope information and the sound source are coded for each predetermined-length section called a frame, characterized in that
the spectrum envelope information coding means codes the spectrum envelope information of the input voice, that
the drive sound source coding means comprises a plurality of algebraic sound source coding means having sound source position tables different in distribution lean of sound source position candidates in a frame, each algebraic sound source coding means for referencing the spectrum envelope information and coding the sound source of the input voice based on a sound source position selected from among the sound source position candidates in the sound source position table and a polarity and selection means for selecting the algebraic sound source coding means with the smallest coding distortion from among the plurality of algebraic sound source coding means and outputting selection information, code representing the drive sound source position output by the selected algebraic sound source coding means, and polarity, and that
the gain coding means selects gain code based on the drive sound source and the spectrum envelope information.
In the voice coding apparatus according to the invention, at least one of the plurality of algebraic sound source coding means comprises the sound source position table having the sound source position candidates distributed leaning to the forward part of the current frame.
In the voice coding apparatus according to the invention, at least one of the plurality of algebraic sound source coding means comprises the sound source position table having the sound source position candidates distributed leaning to the backward part of the current frame.
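The arrangement of several algebraic sound source coding means with differently leaning position tables, followed by selection of the one with the smallest coding distortion, can be sketched as below. The 16-sample frame, the two-pulse tables, the unit pulse gain, the exhaustive search, and the filter convention A(z) = 1 + sum of a[k] z^-(k+1) are all hypothetical choices made to keep the example small.

```python
from itertools import product

# Hypothetical position tables for a 16-sample frame, two pulses each.
FORWARD_TABLE = [[0, 2, 4, 8], [1, 3, 6, 12]]       # candidates lean to the frame top
BACKWARD_TABLE = [[3, 9, 12, 14], [7, 11, 13, 15]]  # candidates lean to the frame end


def synthesize(exc, a):
    """Zero-state combining (synthesis) filter 1/A(z); convention is an assumption."""
    out = []
    for n, x in enumerate(exc):
        fb = sum(a[k] * out[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        out.append(x - fb)
    return out


def algebraic_search(target, table, a):
    """Exhaustively pick positions and polarities; return (err, positions, signs)."""
    best = (float("inf"), None, None)
    for positions in product(*table):
        for signs in product((1.0, -1.0), repeat=len(table)):
            exc = [0.0] * len(target)
            for p, s in zip(positions, signs):
                exc[p] += s
            tone = synthesize(exc, a)
            err = sum((t - x) ** 2 for t, x in zip(target, tone))
            if err < best[0]:
                best = (err, positions, signs)
    return best


def select_coder(target, tables, a):
    """Selection means: run every coder, keep the one with the smallest distortion."""
    results = [algebraic_search(target, t, a) for t in tables]
    selection = min(range(len(tables)), key=lambda i: results[i][0])
    return selection, results[selection]
```

The returned `selection` index corresponds to the selection information, and the winning positions and signs correspond to the sound source position code and polarity.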
According to the invention, there is provided a voice coding apparatus comprising drive sound source coding means, gain coding means, and spectrum envelope information coding means, wherein an input voice is separated into spectrum envelope information and a sound source and the spectrum envelope information and the sound source are coded for each predetermined-length section called a frame, characterized in that
the spectrum envelope information coding means codes the spectrum envelope information of the input voice, that
the drive sound source coding means comprises a plurality of algebraic sound source coding means for coding the sound source of the input voice based on a sound source position selected from among sound source position candidates and a polarity and selection means for selecting one from among the plurality of algebraic sound source coding means and outputting selection information, code representing the sound source position output by the selected algebraic sound source coding means, and a polarity, wherein at least one of the plurality of algebraic sound source coding means selects one or more sound source positions from within the range of a small number of samples starting at the frame top, and that the gain coding means selects gain code based on the drive sound source and the spectrum envelope information.
According to the invention, there is provided a voice coding apparatus comprising drive sound source coding means, gain coding means, and spectrum envelope information coding means, wherein an input voice is separated into spectrum envelope information and a sound source and the spectrum envelope information and the sound source are coded for each predetermined-length section called a frame, characterized in that
the spectrum envelope information coding means codes the spectrum envelope information of the input voice, that
the drive sound source coding means comprises a plurality of algebraic sound source coding means for coding the sound source of the input voice based on a sound source position selected from among sound source position candidates and a polarity and selection means for selecting one from among the plurality of algebraic sound source coding means and outputting selection information, code representing the sound source position output by the selected algebraic sound source coding means, and a polarity, wherein the plurality of algebraic sound source coding means differ in sound source position candidates and the position candidates for one sound source in at least one sound source position candidate are limited within the range of a small number of samples starting at the frame top, and that
the gain coding means selects gain code based on the drive sound source and the spectrum envelope information.
In the voice coding apparatus according to the invention, the selection means selects the algebraic sound source coding means based on a predetermined parameter representing an input voice feature.
In the voice coding apparatus according to the invention, as the predetermined parameter in the selection means, the spectrum envelope information output by the voice coding apparatus provided before the operation of the selection means is used and the selection means outputs only the code representing the sound source position and the polarity.
According to the invention, there is provided a voice coding apparatus comprising drive sound source coding means, gain coding means, and spectrum envelope information coding means, wherein an input voice is separated into spectrum envelope information and a sound source and the spectrum envelope information and the sound source are coded for each predetermined-length section called a frame, characterized in that
the spectrum envelope information coding means codes the spectrum envelope information of the input voice, that
the drive sound source coding means is algebraic sound source coding means for coding the sound source based on a sound source position selected from among sound source position candidates and a polarity and makes a search with a limitation imposed on sound source position combinations only if a predetermined parameter representing an input voice feature satisfies a predetermined condition, and that
the gain coding means selects gain code based on the drive sound source and the spectrum envelope information.
In the voice coding apparatus according to the invention, the limitation imposed on the sound source position combinations is that one or more sound source positions should exist in the range of a small number of samples starting at the frame top.
In the voice coding apparatus according to the invention, the limitation imposed on the sound source position combinations is that when a frame is equally divided into as many divisions as the number of pulses, one pulse should always be contained in each division.
In the voice coding apparatus according to the invention, the range of a small number of samples is only the frame top.
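The two limitations on sound source position combinations can be expressed as simple predicates. The function names are illustrative, and the sketch assumes the frame length is divisible by the number of pulses.

```python
def has_pulse_near_top(positions, top_range):
    """True if at least one position lies within the first `top_range` samples.

    top_range=1 corresponds to the case where the range is only the frame top.
    """
    return any(p < top_range for p in positions)


def one_pulse_per_division(positions, frame_len):
    """True if each of the len(positions) equal divisions holds exactly one pulse."""
    count = len(positions)
    width = frame_len // count  # assumes frame_len is divisible by count
    return sorted(p // width for p in positions) == list(range(count))
```

A search with the limitation imposed would simply skip every combination for which the chosen predicate returns False, for example `[c for c in combos if has_pulse_near_top(c, 1)]`.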
According to the invention, there is provided a voice decoding apparatus comprising drive sound source decoding means, gain decoding means, spectrum envelope information decoding means, and a combining filter, wherein voice code separated into spectrum envelope information and a sound source which are coded is decoded for each predetermined-length section called a frame, characterized in that the spectrum envelope information decoding means decodes the spectrum envelope information from the voice code and sets a coefficient of the combining filter, that
the drive sound source decoding means comprises a plurality of algebraic sound source decoding means having sound source position tables different in distribution lean of sound source position candidates in a frame, each algebraic sound source decoding means for selecting a sound source position among sound source position candidates based on code representing a sound source position in the voice code and decoding the sound source using the sound source position and a polarity, and switch means for outputting the code representing the sound source position in the voice code and the polarity to one of the plurality of algebraic sound source decoding means, that
the gain decoding means outputs a gain vector corresponding to gain code and multiplies the sound source by the gain vector, and that
the combining filter uses the coefficient set by the spectrum envelope information decoding means to prepare an output voice from the sound source multiplied by the gain vector.
In the voice decoding apparatus according to the invention, at least one of the plurality of sound source position candidates that the plurality of algebraic sound source decoding means have is distributed leaning to the forward part of the current frame.
In the voice decoding apparatus according to the invention, at least one of the plurality of sound source position candidates that the plurality of algebraic sound source decoding means have is distributed leaning to the backward part of the current frame.
According to the invention, there is provided a voice decoding apparatus comprising drive sound source decoding means, gain decoding means, spectrum envelope information decoding means, and a combining filter, wherein voice code separated into spectrum envelope information and a sound source which are coded is decoded for each predetermined-length section called a frame, characterized in that
the spectrum envelope information decoding means decodes the spectrum envelope information from the voice code and sets a coefficient of the combining filter, that
the drive sound source decoding means comprises a plurality of algebraic sound source decoding means each for selecting a sound source position among sound source position candidates based on code representing a sound source position in the voice code and decoding the sound source using the sound source position and a polarity, and switch means for outputting the code representing the sound source position in the voice code and the polarity to one of the plurality of algebraic sound source decoding means, wherein the plurality of algebraic sound source decoding means differ in sound source position candidates and the position candidates for one sound source in at least one sound source position candidate are limited within a predetermined range of a small number of samples starting at the frame top, that
the gain decoding means outputs a gain vector corresponding to gain code and multiplies the sound source by the gain vector, and that
the combining filter uses the coefficient set by the spectrum envelope information decoding means to prepare an output voice from the sound source multiplied by the gain vector.
In the voice decoding apparatus according to the invention, the predetermined range of a small number of samples is only the frame top.
In the voice decoding apparatus according to the invention, the received voice code contains selection information and the switch means outputs the code representing the sound source position in the voice code and the polarity to one of the plurality of algebraic sound source decoding means based on the selection information.
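A minimal sketch of the switch means driven by received selection information is given below. The two position tables, the 16-sample frame, the mixed-radix packing of the position code, and the one-bit-per-pulse polarity field are hypothetical; the text does not fix any of them.

```python
# Hypothetical position tables shared by coder and decoder (16-sample frame).
TABLES = [
    [[0, 2, 4, 8], [1, 3, 6, 12]],       # decoder 0: candidates lean forward
    [[3, 9, 12, 14], [7, 11, 13, 15]],   # decoder 1: candidates lean backward
]


def decode_drive(selection, position_code, sign_bits, frame_len=16):
    """Route the codes to the selected table and rebuild the drive sound source."""
    table = TABLES[selection]  # switch means: pick one algebraic decoder
    exc = [0.0] * frame_len
    for pulse_no, track in enumerate(table):
        # Position code is assumed packed least-significant pulse first.
        position_code, idx = divmod(position_code, len(track))
        polarity = -1.0 if (sign_bits >> pulse_no) & 1 else 1.0
        exc[track[idx]] += polarity
    return exc
```

For example, `decode_drive(1, 14, 0b10)` places a positive pulse at sample 12 and a negative pulse at sample 15 using the backward-leaning table.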
In the voice decoding apparatus according to the invention, the switch means finds selection information based on the received voice code or the decoding result and outputs the code representing the sound source position in the voice code and the polarity to one of the plurality of algebraic sound source decoding means based on the selection information.