1. Field of the Invention
The present invention relates to a speech coding apparatus in which pieces of speech information are coded into digital signals having a small information volume and the digital signals are transmitted and decoded to perform an efficient data transmission. Also, the present invention relates to a linear prediction coefficient analyzing apparatus in which a digital speech signal having an analyzing time-length is analyzed to obtain a linear prediction coefficient used in the speech coding apparatus. Also, the present invention relates to a noise reducing apparatus in which noise existing in speech information is reduced to a moderate degree before the speech information is coded in the speech coding apparatus.
2. Description of the Related Art
In the field of digital mobile communication, such as portable telephones, a compression coding method that allows speech signals to be transmitted at a low bit rate is required because the number of subscribers has increased, and research and development on such compression coding methods have been carried out in various research facilities. In Japan, a coding method called vector sum excited linear prediction (VSELP), proposed by the Motorola company, in which signals are transmitted at a bit rate of 11.2 kbits per second (kbps), is adopted as a standard coding method for digital portable telephones. Digital portable telephones manufactured according to the VSELP coding method have been on sale in Japan since the autumn of 1994. Also, another coding method called pitch synchronous innovation code excited linear prediction (PSI-CELP), proposed by the NTT moving communication network Co., LTD., in which signals are transmitted at a bit rate of 5.6 kbps, is adopted in Japan as the next standard coding method for the next portable telephone, and the development of the next portable telephone is now under way. These standard coding methods are obtained by improving the code excited linear prediction (CELP) method disclosed by M. R. Schroeder in "High Quality Speech at Low Bit Rates," Proc. ICASSP '85, pp. 937-940. In this CELP coding method, speech information obtained from an input speech is separated into sound source information based on vibrational sounds of the vocal cords and vocal tract information based on the shape of the vocal tract extending from the vocal cords to the mouth. The sound source information is coded according to a plurality of sound source samples stored in a code book while considering the vocal tract information and is compared with the input speech, and the vocal tract information is coded with a linear prediction coefficient. That is, an analysis by synthesis (A-b-S) method is adopted in the CELP coding method.
2.1. Previously Proposed Art
A fundamental algorithm of the CELP coding method is described.
FIG. 1 is a functional block diagram of a conventional speech coding apparatus according to the CELP coding method.
In FIG. 1, when a voice or speech is given to an input speech receiving unit 102 of a conventional speech coding apparatus 101 as pieces of speech data, an auto-correlation analysis and a linear prediction coefficient analysis for each of the speech data are performed in a linear prediction coefficient (LPC) analyzing unit 103 to obtain a linear prediction coefficient for each of the speech data. Thereafter, in the unit 103, each of the linear prediction coefficients is coded to obtain an LPC code, and the LPC code is decoded to obtain a reproduced linear prediction coefficient.
Thereafter, all of first sound source samples stored in an adaptive code book 104 and all of second sound source samples stored in a probabilistic code book 105 are taken out to an adding unit 106. In the adding unit 106, an optimum gain for each of the first and second sound source samples is calculated, the sound source samples are power-adjusted according to the optimum gains, and a plurality of synthesis sound sources are obtained as a result of all combinations of the power-adjusted first sound source samples and the power-adjusted second sound source samples. That is, each of the synthesis sound sources is obtained by adding one of the power-adjusted first sound source samples and one of the power-adjusted second sound source samples.
Thereafter, in an LPC synthesizing unit 107, the synthesis sound sources are filtered with the reproduced linear prediction coefficient obtained in the LPC analyzing unit 103 to obtain a plurality of synthesis speeches. Thereafter, in a comparing unit 108, a distance between each of the speech data received in the input speech receiving unit 102 and each of the synthesis speeches is calculated, a particular synthesis speech corresponding to a particular distance which is the minimum value among the distances is selected from the synthesis speeches, and a particular first sound source sample and a particular second sound source sample corresponding to the particular synthesis speech are obtained.
Thereafter, in a parameter coding unit 109, the optimum gains calculated in the adding unit 106 are coded to obtain a plurality of gain codes. The LPC code obtained in the LPC analyzing unit 103, index codes indicating the particular sound source samples obtained in the comparing unit 108 and the gain codes are transmitted to a transmission line 110 in a group. Also, a synthesis sound source is generated from a gain code corresponding to the particular first sound source sample and the particular first sound source sample in the unit 109. The synthesis sound source is stored in the adaptive code book 104 as a first sound source sample, and the particular first sound source sample is abandoned.
In addition, in the LPC synthesizing unit 107, acoustic feeling for each of the speech data is weighted with the linear prediction coefficient, a frequency emphasizing filter coefficient and a long-term prediction coefficient obtained by performing a long-term prediction analysis for each of the speech data. Also, the sound source samples are found out from sub-frames obtained by dividing each of analyzing blocks in the adaptive code book 104 and the probabilistic code book 105.
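The search performed by the units 104 to 108 can be illustrated with a minimal Python sketch. It is not the apparatus 101 itself: the function names, the all-pole filter sign convention, and the joint least-squares calculation of the two gains are assumptions made for illustration. Because the synthesis filter is linear, each sample is filtered first and the gains are applied afterwards, which gives the same result as adding the power-adjusted samples and then filtering.

```python
def lpc_synthesize(excitation, lpc):
    """All-pole synthesis filter: s[n] = e[n] + sum_k lpc[k] * s[n-1-k] (assumed sign convention)."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc):
            if n - 1 - k >= 0:
                acc += a * out[n - 1 - k]
        out.append(acc)
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def optimum_gains(target, s1, s2):
    """Least-squares gains (g1, g2) minimising |target - g1*s1 - g2*s2|^2 (assumed gain rule)."""
    a11, a12, a22 = dot(s1, s1), dot(s1, s2), dot(s2, s2)
    b1, b2 = dot(target, s1), dot(target, s2)
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:  # degenerate pair: fall back to a single gain
        return (b1 / a11 if a11 else 0.0), 0.0
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


def search(target, adaptive_cb, stochastic_cb, lpc):
    """Try every pair of samples and keep the pair with the minimum distance to the target."""
    best = None
    for i, c1 in enumerate(adaptive_cb):
        s1 = lpc_synthesize(c1, lpc)
        for j, c2 in enumerate(stochastic_cb):
            s2 = lpc_synthesize(c2, lpc)
            g1, g2 = optimum_gains(target, s1, s2)
            dist = sum((t - g1 * x - g2 * y) ** 2 for t, x, y in zip(target, s1, s2))
            if best is None or dist < best[0]:
                best = (dist, i, j, g1, g2)
    return best  # (distance, adaptive index, stochastic index, gain 1, gain 2)
```

The nested loops make the cost of the exhaustive search visible: every pair of samples requires filtering and a distance calculation for every sub-frame.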
Also, the linear prediction coefficient analysis performed in the LPC analyzing unit 103 is utilized in various coding methods. A conventional linear prediction coefficient analysis is described with reference to FIG. 2.
FIG. 2 is a block diagram of a conventional linear prediction coefficient analyzing apparatus.
As shown in FIG. 2, when a speech is input to an input speech receiving unit 112 of a conventional linear prediction coefficient analyzing apparatus 111, the speech is converted into a plurality of speech signals Xi respectively having a prescribed analyzing period, and each of the speech signals Xi output time-sequentially is multiplied by a window coefficient Wi in a window putting unit 113. For example, a coefficient in a Hamming window, a Hanning window, a Blackman-Harris window or the like is used as the window coefficient Wi. The window putting processing in the unit 113 is formulated as follows.

$$Y_i = W_i \cdot X_i$$
Here, i denotes the numbers of the speech signals (i=1 to L), L denotes the number of speech signals, and Yi denotes a plurality of window-processed speech signals.
Thereafter, an auto-correlation analysis is performed for the window-processed speech signals Yi in an auto-correlation analyzing unit 114 as follows.

$$V_j = \sum_{i=j+1}^{L} Y_i \cdot Y_{i-j}$$

Here, Vj denotes a plurality of auto-correlation functions, and j denotes the numbers of the auto-correlation functions.
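As a minimal sketch of the processing in the units 113 and 114, the following Python code multiplies a block of samples by a Hamming window and then calculates the auto-correlation functions in the usual way for the auto-correlation method. The function names and the 0-based indexing are assumptions made for illustration; the Hamming window is only one of the windows named above.

```python
import math


def hamming(length):
    """Hamming window coefficients W_i for a block of `length` samples."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def apply_window(x, w):
    """Window putting processing: Y_i = W_i * X_i."""
    return [wi * xi for wi, xi in zip(w, x)]


def autocorrelation(y, max_lag):
    """V_j = sum over i of Y_i * Y_(i-j), for j = 0 .. max_lag (0-based indices)."""
    n = len(y)
    return [sum(y[i] * y[i - j] for i in range(j, n)) for j in range(max_lag + 1)]
```

The window values near both ends of the block are small (0.08 for the Hamming window), so samples at the ends contribute little to Vj.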
Thereafter, a linear prediction analysis based on an auto-correlation method is performed in a linear prediction coefficient analyzing unit 115 to obtain a linear prediction coefficient for each of the speech signals. The linear prediction analysis is disclosed in various speech information processing documents, such as the section "The Autocorrelation Method" in L. R. Rabiner and R. W. Schafer, "Digital Processing of Speech Signals," pp. 401-403.
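The auto-correlation method is commonly solved with the Levinson-Durbin recursion. The text does not state which solver the unit 115 uses, so the following Python sketch is an assumption, as are the function name and the convention that a sample is predicted as the sum of a[k] times the sample k+1 positions earlier.

```python
def levinson_durbin(v, order):
    """Solve the auto-correlation normal equations by the Levinson-Durbin recursion.

    v     : auto-correlation values v[0] .. v[order]
    return: (a, err) where x[n] is predicted as sum_k a[k] * x[n-1-k]
            and err is the residual prediction error energy.
    """
    a = [0.0] * order
    err = v[0]
    for m in range(order):
        if err <= 0.0:  # silent or degenerate block: stop early
            break
        acc = v[m + 1] - sum(a[k] * v[m - k] for k in range(m))
        reflection = acc / err
        prev = a[:]
        a[m] = reflection
        for k in range(m):
            a[k] = prev[k] - reflection * prev[m - 1 - k]
        err *= 1.0 - reflection * reflection
    return a, err
```

For example, the auto-correlation values 1, 0.5 and 0.25 of a first-order process give a first coefficient of 0.5 and a second coefficient of 0, with a residual energy of 0.75.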
Also, because the speech information obtained from the input speech is coded according to one of the improved CELP coding methods, a plurality of speech signals indicating the speech information can be transmitted at a very low bit rate. However, because the speech information is compressed according to a speech vocalizing model, sound information including the speech information cannot be appropriately processed according to any of the improved CELP coding methods. That is, in cases where a background noise or a set noise exists with the speech signals, there is a drawback that the speech signals cannot be efficiently coded and allophone occurs in a reproduced speech. To solve this drawback, a method for reducing a noise existing with the input speech signals is proposed. For example, a noise existing with the speech signals is reduced by a noise canceler in the standardized PSI-CELP coding method before the speech signals are coded. The noise canceler is composed of a Kalman filter. That is, the existence of a speech is detected and the speech is adaptively controlled by the Kalman filter to reduce a noise existing with the speech. Therefore, the background noise can be reduced to some degree by the noise canceler. However, a noise having a high level or a noise included in a speech cannot be effectively reduced or subtracted.
As a more effective noise reduction method, a spectrum subtraction method is disclosed in a paper by S. F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Trans. ASSP, Vol. 27, No. 2, pp. 113-120, 1979. In the spectrum subtraction method, a discrete Fourier transformation is performed to convert a plurality of input speech signals into a plurality of spectra, and one or more noises are subtracted from the spectra. This method is mainly applied to a speech input unit of a speech recognition apparatus. A conventional noise subtraction apparatus in which the spectrum subtraction method is applied to subtract a noise included in a speech signal from the speech signal is described with reference to FIG. 4.
As shown in FIG. 4, a noise spectrum is assumed in a first procedure, and a noise of which the spectrum is assumed is subtracted from a speech signal in a second procedure. In the first procedure, a plurality of noise signals Sn indicating a noise are input in series to an analog-digital (A/D) converter 122 of a conventional noise subtraction apparatus 121, and the noise signals Sn are converted into a plurality of digital noise signals. In this case, no speech signal is included in the noise signals Sn. Thereafter, a discrete Fourier transformation is performed in a Fourier transforming unit 123 for each frame of digital noise signals, and a noise spectrum is obtained for each frame. Each frame is composed of a series of digital noise signals having a constant time length. Thereafter, an average noise spectrum is obtained in a noise analyzing unit 124 by averaging a plurality of noise spectra, and the average noise spectrum is stored in a noise spectrum storing unit 125 as a representative noise spectrum of the noise. The first procedure is performed for various noise signals indicating various types of noises, and a plurality of representative noise spectra indicating the various types of noises are stored in the storing unit 125. In the second procedure, a plurality of speech signals Ss which indicate a speech including a noise are input in series to an A/D converter 126, and a plurality of digital speech signals are obtained. Thereafter, a discrete Fourier transformation is performed in a Fourier transforming unit 127, and a speech spectrum including an actual noise spectrum is obtained. Thereafter, one representative noise spectrum matching the actual noise spectrum is read out from the storing unit 125, and the representative noise spectrum read out is subtracted from the speech spectrum in a noise subtracting unit 128 to cancel the actual noise spectrum. Thereafter, an inverse Fourier transformation is performed for the speech spectrum in an inverse Fourier transforming unit 129, and a speech output signal So is obtained.
To obtain each of the noise and speech spectra, an amplitude spectrum is calculated for each of the noise and the speech. That is, the real component and the imaginary component of each complex value obtained by the Fourier transformation are respectively squared, the squared real component and the squared imaginary component are added together to obtain a squared absolute value, and the square root of the squared absolute value is calculated as the amplitude spectrum. Also, in cases where the inverse Fourier transformation is performed for the amplitude spectrum from which a noise spectrum is subtracted, the phase component of each speech signal Ss is used as the phase component of the amplitude spectrum.
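The two procedures can be summarized in a minimal Python sketch. It uses a direct discrete Fourier transformation from the standard library so that it stays self-contained. The function names are hypothetical, and flooring a negative amplitude at zero after the subtraction is an assumption, because the text does not say how negative values are handled.

```python
import cmath


def dft(x):
    """Direct discrete Fourier transformation of one frame."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts) for n in range(n_pts))
            for k in range(n_pts)]


def idft(spectrum):
    """Inverse transformation; the real part is returned as the output frame."""
    n_pts = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * n / n_pts)
                for k in range(n_pts)).real / n_pts
            for n in range(n_pts)]


def average_noise_spectrum(noise_frames):
    """First procedure: average the amplitude spectra of speech-free noise frames."""
    mags = [[abs(c) for c in dft(frame)] for frame in noise_frames]
    return [sum(column) / len(mags) for column in zip(*mags)]


def spectral_subtract(frame, noise_mag):
    """Second procedure: subtract the noise amplitude and reuse the phase of the noisy input."""
    cleaned = []
    for c, nm in zip(dft(frame), noise_mag):
        mag = max(abs(c) - nm, 0.0)  # assumption: negative amplitudes are floored at zero
        cleaned.append(cmath.rect(mag, cmath.phase(c)))
    return idft(cleaned)
```

Here `abs(c)` is the square root of the sum of the squared real and imaginary components, that is, the amplitude spectrum described above.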
2.2. Problems to be Solved by the Invention
(1) To install a speech coding apparatus and a decoding apparatus in a small-sized apparatus such as a portable telephone, it is required to reduce the memory capacity of a read only memory (ROM) in which a plurality of first sound source samples of an adaptive code book and a plurality of second sound source samples of a probabilistic code book are stored. However, because a large number of code vectors are required to store a plurality of fixed sound sources representing the second sound source samples in the conventional speech coding apparatus 101, it is difficult to install the apparatus 101 in the small-sized apparatus. To reduce the number of code vectors stored in the ROM, for example, a long vector is shifted so that it can be used as a plurality of code vectors. However, similar code vectors are obtained by shifting the long vector, and there is a drawback that the quality of reproduced speech deteriorates as compared with that reproduced by using a large number of code vectors different from each other. Also, because it is required to calculate a code vector each time the code vector is generated, there is another drawback that a large volume of calculation is required.

(2) Also, because the VSELP coding method and the PSI-CELP coding method are obtained by improving the CELP coding method, the same processing is performed for any input voice or speech in the VSELP coding method and the PSI-CELP coding method. Therefore, the input voice or speech cannot be efficiently coded.

(3) Also, in cases where the window coefficients Wi are utilized in the conventional linear prediction coefficient analyzing apparatus 111, because the value of each window coefficient Wi is high at a central portion of an analyzing period and very low at both end portions of the analyzing period, there is a drawback that the information for each window-processed speech signal Yi represents only the information for the speech signal Xi at the central portion of the analyzing period. To prevent this drawback, as shown in FIG. 3, a rear part of a preceding speech signal Xi-1 at a rear portion of a preceding analyzing period, a current speech signal Xi at a current analyzing period and a front part of a succeeding speech signal Xi+1 at a front portion of a succeeding analyzing period, output from the input speech receiving unit 112 in that order, are multiplied by a window coefficient Wi for the current speech signal Xi in a normal CELP coding method. In this case, the information about the entire current speech signal Xi can be reflected in the information for the current window-processed speech signal Yi.

(4) Also, though the spectrum subtraction method performed in the conventional noise subtraction apparatus 121 is more effective in subtracting a noise from a speech, in cases where the method is applied to a real-time speech processing apparatus, there are many drawbacks relating to the noise assuming method and the manufacturing cost of the apparatus. A first drawback is that the assumption of a noise spectrum is difficult because the position of a speech signal existing in pieces of data cannot be specified. A second drawback is that the calculation volume in the apparatus is large. A third drawback is that the memory capacity required to store the noise spectra in a random access memory is large. A fourth drawback is that a speech spectrum from which a noise spectrum having a high intensity is subtracted is largely distorted and the quality of a reproduced speech deteriorates.
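The shifted long vector mentioned in problem (1) can be illustrated with a short Python sketch. It assumes that "shifting" means taking overlapping segments of one stored vector, which is one common reading; the function name and the parameters `dim` and `shift` are hypothetical.

```python
def shifted_code_vectors(long_vector, dim, shift=1):
    """Derive code vectors of length `dim` as overlapping segments of one stored long vector."""
    last_start = len(long_vector) - dim
    return [long_vector[s:s + dim] for s in range(0, last_start + 1, shift)]


stored = [1, 2, 3, 4, 5]                       # only 5 samples are kept in the ROM
code_vectors = shifted_code_vectors(stored, dim=3)
# Three code vectors (9 samples in total) are derived from the 5 stored samples,
# but neighbouring vectors share dim - shift samples and are therefore similar.
```

The sketch shows both sides of the trade-off described above: the stored volume is reduced, while adjacent code vectors overlap in all but one sample and each vector has to be derived when it is needed.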
Pieces of speech information recorded in a real circumstance greatly differ from each other from the viewpoint of local characteristics. Each piece of speech information is composed of one or more voice portions and one or more silent portions. The voice of a voice portion is composed of one or more consonants and one or more vowels. Each consonant is classified as either a voiceless consonant or a voiced consonant. Each vowel is divided into a vowel stationary portion and a vowel transitional portion. In the vowel stationary portion, the voice pitch and the movement of the mouth are stable. In the vowel transitional portion, the voice pitch and the mouth movement always change. Therefore, because the silent portion, the voiceless consonant, the voiced consonant, the vowel stationary portion and the vowel transitional portion have different characteristics, an optimum coding method exists for each of them.
How the voice information is coded in consideration of the local characteristics in cases where the CELP coding method is adopted is described. Because there is no voice in the period of the silent portion, only a noise in the real circumstance exists in the silent portion, and only the time length of the silent portion is required to be transmitted. Therefore, time information of the silent portion can be coded at a very low bit rate by omitting the sound source samples. The voiceless consonant is classified into an affricate such as a phoneme /p/, /t/ or the like and a fricative such as a phoneme /s/, /h/ or the like. Because the voice power of the affricate minutely changes and it is important to recognize the minute change, it is preferable that the affricate be coded in units of a short frame length. Therefore, the first sound source samples stored in the adaptive code book 104 are not required to code the affricate. Also, in the case of the fricative, a radiance characteristic and a time length are important. Therefore, the first sound source samples stored in the adaptive code book 104 are not required to code the fricative. In the case of the voiced consonant, a minute voice power change, vocal tract information and sound source information are important. Therefore, the largest volume of information is required to code the voiced consonant. In the vowel stationary portion, a plurality of waves having similarly shaped waveforms are formed in series. Therefore, the vowel stationary portion can be coded by using a small volume of information in cases where the first sound source samples stored in the adaptive code book 104 are used. In the vowel transitional portion, the change of the vocal tract information and the sound source information is larger than that in the vowel stationary portion, and the voice power in the vowel transitional portion is large. Therefore, the degradation of the tone quality can be easily noticed. Accordingly, a large volume of information is required to code the vowel transitional portion in the same manner as for the voiced consonant.
Therefore, in cases where the coding method is locally changed for each of the silent portion, the affricate, the fricative, the voiced consonant, the vowel stationary portion and the vowel transitional portion to adaptively distribute pieces of information, the input speech can be efficiently coded. That is, because local characteristics of the speech information recorded in the real circumstance greatly differ from each other, in cases where the speech information is adaptively coded while positively using the local characteristics, the coding efficiency can be improved, and synthesis speeches of good quality can be obtained at a lower average bit rate. Based on this idea, a coding method in which a plurality of coding modules are used is proposed. For example, a variable bit-rate speech coding method is disclosed in paper 2-Q-23 presented at the spring research convention of the Acoustical Society of Japan, and a QCELP method is proposed by the Qualcomm company. The QCELP method is adopted as a standard coding method (TIA IS-96) for digital cellular phones in North America.
However, one of a plurality of coding modules is selected according to a simple rule in the variable bit-rate speech coding method and the QCELP method. Therefore, there is a probability that a coding module not adapted for a piece of speech information is selected by mistake, and there is a drawback that a rasping allophone occurs. To solve this drawback in a speech coding apparatus operated according to the analysis by synthesis method, the speech information is coded by using each of all the coding modules, a plurality of coding distortions corresponding to the coding modules are compared with each other, and the coding module corresponding to the smallest coding distortion is adopted as the most adaptive coding module. However, in this case, the volume of calculation required to determine the adaptive coding module becomes extremely large, and it is difficult to arrange the speech coding apparatus operated according to the above selection method in a small-sized communication apparatus such as a portable telephone. Also, it is difficult to make a complicated rule for a correct selection of the adapted coding module for the purpose of avoiding the occurrence of the allophone.
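The selection by the smallest coding distortion can be sketched in Python as follows. The two "modules" are toy quantizers that stand in for real coding modules, and the squared-error distortion measure, the function name and the module names are assumptions made for illustration.

```python
def select_module(frame, modules):
    """Code the frame with every module and keep the one with the smallest distortion.

    modules: mapping from a module name to a function that returns the decoded frame.
    """
    best_name, best_distortion = None, float("inf")
    for name, code_and_decode in modules.items():
        decoded = code_and_decode(frame)  # every module is run for every frame
        distortion = sum((a - b) ** 2 for a, b in zip(frame, decoded))
        if distortion < best_distortion:
            best_name, best_distortion = name, distortion
    return best_name, best_distortion


# Hypothetical stand-ins for coding modules: a coarse and a fine uniform quantizer.
toy_modules = {
    "coarse": lambda f: [float(round(x)) for x in f],
    "fine": lambda f: [round(x * 4) / 4 for x in f],
}
chosen, distortion = select_module([0.3, 1.2], toy_modules)
```

Because each module codes and decodes the whole frame before the comparison, the calculation volume grows in proportion to the number of modules, which is the drawback pointed out above.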
However, because the multiplication cannot be performed until the front portion of the succeeding analyzing period has passed, the coding process performed in a codec is delayed by a period equivalent to the front portion of the succeeding analyzing period. To reduce this coding process delay, the front portion of the succeeding analyzing period is shortened to several milliseconds in a codec used for digital mobile communication such as a portable telephone. In this case, it is difficult for the information about the entire current speech signal Xi to be reflected in the information for the current window-processed speech signal Yi. Therefore, when a piece of speech such as a voiced consonant in which the speech spectrum largely changes is input to the input speech receiving unit 112, there is a drawback that the quality of reproduced speech locally deteriorates.