The present invention relates to a method for encoding speech at a low bit rate and, more particularly, to a method for encoding speech and a method for decoding speech wherein a speech signal including a background noise is encoded by compressing it efficiently in a state which is as close to the original speech as possible.
Further, the present invention relates to a method for encoding speech wherein a speech signal is compressed and encoded, and, more particularly, to speech encoding used for digital telephones and the like and a method for encoding speech for speech synthesis used for text read-out software and the like.
Conventional low-bit-rate speech coding is directed to efficient coding of a speech signal and is carried out according to speech coding methods which employ a model of a speech production process. Among such methods for speech coding, methods based on a CELP system have recently been spreading remarkably. When such a method for encoding speech on a CELP basis is used, a speech signal input in an environment having little background noise can be encoded efficiently because the signal matches the model for encoding, and this allows encoding with deterioration of speech quality at a relatively low level.
However, it is known that when a CELP-based method for encoding speech is applied to a speech signal input under a condition where the background noise is at a high level, the background noise included in the reproduced output signal sounds very different from the original, producing speech which is very unstable and uncomfortable. This tendency is especially significant at encoding bit rates of 8 kbps or less.
In order to mitigate this problem, a method has been proposed wherein, for a time window which has been determined to be background noise, the CELP encoding is performed using a noisier excitation signal so as to reduce deterioration of speech quality in that window. Although such a method provides some improvement of speech quality in the background noise window, the improvement is insufficient: the tendency to produce a noise that sounds different from the background noise in the original speech still remains, because the method still uses a model of a speech production process in which speech is synthesized by passing the excitation signal through a synthesis filter.
As described above, the conventional method for encoding speech has a problem in that when a speech signal input under a condition where the background noise is at a high level is encoded, the background noise included in the reproduced output signal sounds very different from the original, producing speech which is very unstable and uncomfortable.
It is an object of the present invention to provide a method for low-rate speech coding and decoding wherein speech including a background noise can be reproduced in a state as close to the original speech as possible.
It is another object of the invention to provide a method for low-rate speech coding and decoding wherein a background noise can be encoded with as few bits as possible to reproduce speech including a background noise in a state as close to the original speech as possible.
It is still another object of the invention to provide a method for encoding speech wherein encoding can be performed such that abrupt changes and fluctuations of pitch periods are reflected to obtain high quality decoded speech.
According to the present invention, there is provided a method for encoding speech comprising separating an input speech signal into a first component mainly constituted by speech and a second component mainly constituted by a background noise at each predetermined unit of time, selecting bit allocation for each of the first and second components from among a plurality of candidates for bit allocation based on the first and second components, encoding the first and second components under such bit allocation using predetermined different methods for encoding, and outputting data on the encoding of the first and second components and information on the bit allocation as encoded data to be transmitted.
With CELP encoding, as described above, when a speech signal input under a condition wherein the background noise is at a high level is encoded, the background noise included in the reproduced speech signal sounds very different from the original, producing speech which is very unstable and uncomfortable. This phenomenon is attributable to the fact that the background noise follows a model which is completely different from that of the speech signals for which CELP works well, and it is therefore desirable to encode the background noise using a method appropriate for it.
According to the present invention, an input speech signal is separated into a first component mainly constituted by speech and a second component mainly constituted by a background noise at each predetermined unit of time, and encoding is performed using methods for encoding based on different models which are respectively adapted to the characteristics of the speech and background noise to improve the efficiency of the encoding as a whole.
The first and second components are encoded using bit allocation selected from among a plurality of candidates for bit allocation based on the first and second components such that each component can be more efficiently encoded. This makes it possible to encode the input speech signal efficiently with the overall bit rate kept low.
In the method for encoding according to the invention, the first component is preferably encoded in the time domain and the second component is preferably encoded in the frequency domain or transform domain. Specifically, since speech is information which changes quickly at relatively short intervals on the order of 10 to 15 ms, the first component mainly constituted by speech can be encoded with high quality by using a method such as CELP type encoding, which suppresses distortion of the waveform in the time domain. On the other hand, since a background noise changes slowly at relatively long intervals in the range from several tens of ms to several hundred ms, the information of the second component mainly constituted by a background noise can be more easily extracted with fewer bits by encoding the component after converting it into parameters in the frequency domain or transform domain.
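As an illustration of what "parameters in the frequency domain" might look like for the second component, the following minimal sketch reduces one frame of the noise component to a few average band energies. The choice of band energies, the band count, and the name band_energies are assumptions made for this example only; the text above does not prescribe a particular parameter set.

```python
import cmath

def band_energies(frame, num_bands=4):
    """Summarize a frame as average power in num_bands equal-width bands.

    Hypothetical parameterization: a plain DFT followed by band averaging.
    """
    n = len(frame)
    half = n // 2
    # Power spectrum for bins 0 .. half-1 computed with a direct DFT.
    power = []
    for k in range(half):
        acc = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        power.append(abs(acc) ** 2)
    width = half // num_bands
    # One value per band; a slowly varying noise needs few such values.
    return [sum(power[b * width:(b + 1) * width]) / width
            for b in range(num_bands)]
```

A frame of N samples is thus described by num_bands values rather than N waveform samples, which is the kind of reduction that becomes possible when the component changes slowly.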
In the method for encoding speech according to the invention, the total number of bits for encoding that are allocated for the predetermined units of time is preferably fixed. Since this makes it possible to encode an input speech signal at a fixed bit rate, encoded data can be more easily processed.
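The selection of a bit allocation under a fixed total can be sketched as follows. The candidate table, the frame budget of 80 bits, and the energy-share selection rule are all hypothetical; the text states only that the allocation is selected from a plurality of candidates based on the two components.

```python
FRAME_BITS = 80  # assumed per-frame budget, excluding the allocation index

# Hypothetical candidates: (bits for speech component, bits for noise component).
# Every pair sums to FRAME_BITS, so the overall bit rate stays fixed.
CANDIDATES = [(80, 0), (70, 10), (50, 30), (10, 70)]

def energy(x):
    """Sum of squared samples."""
    return sum(s * s for s in x)

def select_allocation(speech, noise, candidates=CANDIDATES):
    """Pick the candidate whose speech share of bits is closest to the
    speech share of energy (an assumed rule, for illustration only)."""
    es, en = energy(speech), energy(noise)
    total = es + en
    share = 0.5 if total == 0 else es / total

    def distance(i):
        return abs(candidates[i][0] / FRAME_BITS - share)

    index = min(range(len(candidates)), key=distance)
    return index, candidates[index]
```

The returned index corresponds to the information on the bit allocation that is output together with the encoded data of the two components.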
Further, in the method for encoding speech according to the invention, it is preferable that a plurality of methods for encoding are provided for encoding the second component and that at least one of those methods encodes the spectral shape of the current background noise utilizing the spectral shape of a previous background noise which has already been encoded. Since this method for encoding allows the second component to be encoded with a very small number of bits, the resultant spare encoding bits can be allocated for the encoding of the first component to prevent deterioration of the quality of decoded speech.
When an input speech signal is encoded using methods for encoding based on models adapted respectively to the first component mainly constituted by speech and the second component mainly constituted by a background noise, the production of an uncomfortable sound can be avoided. However, if the background noise is superimposed on the speech signal, i.e., if both of the first and second components separated from the input speech signal have power which cannot be ignored, the absolute number of bits available for encoding the first component runs short and, as a result, the quality of the decoded speech is significantly reduced.
In such a case, with the above-described method for encoding the spectral shape of the current background noise utilizing the spectral shape of a previous background noise which has already been encoded, the second component mainly constituted by a background noise can be encoded with a very small number of bits, and the resultant spare encoding bits can be allocated for the encoding of the first component mainly constituted by speech to maintain the decoded speech at a high quality level.
According to the method for encoding the spectral shape of the current background noise utilizing the spectral shape of a previous background noise, for example, a power correction coefficient is calculated from the spectral shape of the previous background noise and the spectral shape of the current background noise, the power correction coefficient is quantized thereafter, the spectral shape of the previous background noise is multiplied by the quantized power correction coefficient to obtain the spectral shape of the current background noise, and an index obtained during the quantization of the power correction coefficient is used as encoded data.
The spectral shape of a background noise is constant for a relatively long period as one can easily assume from, for example, a noise in a traveling automobile or a noise from a machine in an office. One can consider that such a background noise is subjected to substantially no change in the spectral shape thereof but a change of the power thereof. Therefore, once the spectral shape of a background noise is encoded, the spectral shape of the background noise may be regarded fixed thereafter and encoding is required only for the amount of change in power. This makes it possible to represent the spectral shape of a background noise using a very small number of bits.
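The power correction scheme can be sketched as below. The text does not give a formula for the coefficient, so a least-squares gain between the two shapes is assumed here, and the 3-bit scalar codebook GAIN_CODEBOOK is likewise hypothetical.

```python
# Hypothetical 3-bit scalar codebook for the power correction coefficient.
GAIN_CODEBOOK = [0.25, 0.5, 0.71, 1.0, 1.41, 2.0, 4.0, 8.0]

def power_correction(prev_shape, cur_shape):
    """Least-squares gain mapping prev_shape onto cur_shape (assumed formula)."""
    num = sum(p * c for p, c in zip(prev_shape, cur_shape))
    den = sum(p * p for p in prev_shape)
    return num / den if den > 0 else 1.0

def quantize_gain(alpha, codebook=GAIN_CODEBOOK):
    """Index of the codebook entry nearest to alpha."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - alpha))

def encode_noise_shape(prev_shape, cur_shape):
    """Return the transmitted index and the reconstructed current shape."""
    index = quantize_gain(power_correction(prev_shape, cur_shape))
    predicted = [GAIN_CODEBOOK[index] * p for p in prev_shape]
    return index, predicted
```

Only the index is transmitted, so in this sketch the current spectral shape costs 3 bits per frame regardless of how many spectral values the shape contains.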
Further, according to the method for encoding the spectral shape of the current background noise utilizing the spectral shape of a previous background noise, the spectral shape of the current background noise may be predicted by multiplying the spectral shape of the previous background noise by the above-described quantized power correction coefficient, the spectrum of the background noise in a frequency band determined according to predefined rules may be encoded using the predicted spectral shape, and the index obtained during the quantization of the power correction coefficient and an index obtained during the encoding of the spectrum of the background noise in the frequency band determined by the predefined rules may be used as encoded data.
While the spectral shape of a background noise can be regarded as substantially constant for a relatively long period as described above, it is not likely that the same shape remains unchanged for several tens of seconds, and it is natural to assume that the spectral shape of the background noise gradually changes over such a long period. Thus, a frequency band is determined according to predefined rules, an error signal is obtained in that band between the spectral shape of the current background noise and a predicted spectral shape of the current background noise obtained by multiplying the spectral shape of a previous background noise by a coefficient, and the error signal is encoded. The above-described rules for determining the frequency band can be defined such that the selected band cycles through the entire frequency range of the background noise during a certain period of time. Thus, the shape of a background noise that gradually changes can be efficiently encoded.
According to the method for decoding speech of the present invention, in order to decode transmitted encoded data obtained by encoding as described above to reproduce the speech signal, the input transmitted encoded data is separated into encoded data of the first component mainly constituted by speech, encoded data of the second component mainly constituted by a background noise, and information on bit allocation for each of the encoded data for the first and second components; the information on bit allocation is decoded to obtain the bit allocation for the encoded data for the first and second components; the encoded data for the first and second components is decoded according to the bit allocation to reproduce the first and second components; and the reproduced first and second components are combined to produce a final output speech signal.
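One possible form of the predefined band rule is a round-robin over equal-width bands, sketched below. The band count, the uniform quantizer step, and the round-robin rule itself are assumptions for illustration; the text requires only that the rule be predefined and that it cover the whole frequency range over some period.

```python
NUM_BANDS = 4  # hypothetical number of bands
STEP = 0.5     # hypothetical uniform quantizer step for the error signal

def band_for_frame(frame_no, num_bands=NUM_BANDS):
    """Round-robin rule: every band is visited once per num_bands frames."""
    return frame_no % num_bands

def band_slice(band, length, num_bands=NUM_BANDS):
    """Start and end (exclusive) spectral bins of a band."""
    width = length // num_bands
    start = band * width
    end = length if band == num_bands - 1 else start + width
    return start, end

def refine_band(predicted, current, frame_no):
    """Quantize the prediction error in this frame's band only.

    Returns the transmitted error indices and the refined spectral shape
    that both encoder and decoder would keep as the new reference.
    """
    start, end = band_slice(band_for_frame(frame_no), len(predicted))
    indices = [round((current[k] - predicted[k]) / STEP)
               for k in range(start, end)]
    refined = list(predicted)
    for offset, q in enumerate(indices):
        refined[start + offset] += q * STEP
    return indices, refined
```

Because the encoder and decoder can both derive the band from the frame number, no bits are spent on signalling which band is being corrected in a given frame.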
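The separation step on the decoding side can be sketched as follows, with a frame represented as a string of '0' and '1' characters. The 2-bit allocation index, the candidate table, the field order, and the sample-wise addition in combine are assumptions for this example; the text says only that the two reproduced components are combined.

```python
MODE_BITS = 2  # assumed width of the bit allocation index

# Hypothetical table shared with the encoder: (speech bits, noise bits).
CANDIDATES = [(80, 0), (70, 10), (50, 30), (10, 70)]

def demultiplex(frame_bits):
    """Split one frame into the allocation index, speech data and noise data."""
    mode = int(frame_bits[:MODE_BITS], 2)
    speech_len, noise_len = CANDIDATES[mode]
    body = frame_bits[MODE_BITS:]
    if len(body) != speech_len + noise_len:
        raise ValueError("frame length does not match the bit allocation")
    return mode, body[:speech_len], body[speech_len:]

def combine(speech, noise):
    """Combine the two decoded components by sample-wise addition (assumed)."""
    return [s + n for s, n in zip(speech, noise)]
```

The decoders for the individual components are omitted here; each would receive its slice of bits from demultiplex and return a decoded signal for combine.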
Additional objects and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims.