1. Field of the Invention
The present invention relates to a method and system for encoding and decoding speech signals at a low bit rate with high efficiency, and, more particularly, to a speech recognition-synthesis based encoding method of encoding speech signals at a very low bit rate of 1 kbps or lower, and a speech encoding/decoding method and system which use the speech recognition-synthesis based encoding method.
2. Discussion of the Background
Techniques for encoding speech signals with high efficiency are now essential in mobile communication, which has a limited available radio band, and in storage media such as voice mail, which demand efficient memory usage, and they are being improved in pursuit of lower bit rates. CELP (Code Excited Linear Prediction) is one of the effective schemes for encoding telephone-band speech at a transfer rate of about 4 kbps to 8 kbps.
This CELP system is specifically discussed in "Code Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates" by M. R. Schroeder and B. S. Atal, Proc. ICASSP, pp. 937-940, 1985, and "Improved Speech Quality and Efficient Vector Quantization in SELP" by W. S. Kleijn, D. J. Krasinski et al., Proc. ICASSP, pp. 155-158, 1988 (Document 1).
Document 1 shows that this system is divided into a process of acquiring a speech synthesis filter, which models the vocal tract, from input speech divided frame by frame, and a process of obtaining excitation vectors, which are the input signals to this filter. The second process passes a plurality of excitation vectors, stored in a codebook, through the speech synthesis filter one by one, computes the distortion between the synthesized speech and the input speech, and finds the excitation vector which minimizes this distortion. This process is called closed-loop search, and it is very effective in reproducing good speech quality at a bit rate as low as 4 kbps to 8 kbps.
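The closed-loop search described above can be illustrated with a minimal sketch. It assumes an all-pole synthesis filter and a plain squared-error distortion measure; the function names, the filter sign convention and the toy codebook are hypothetical and are not taken from Document 1. Practical CELP coders also optimize a gain and apply perceptual weighting, both of which are omitted here.

```python
def synthesize(excitation, lpc_coeffs):
    """All-pole synthesis filter: s[n] = e[n] + sum_k a[k] * s[n-1-k].

    The sign convention for the coefficients is an assumption of this sketch.
    """
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc_coeffs):
            if n - 1 - k >= 0:
                acc += a * out[n - 1 - k]
        out.append(acc)
    return out


def closed_loop_search(target, codebook, lpc_coeffs):
    """Try every codebook vector and return (index, distortion) of the best one."""
    best_index, best_distortion = None, float("inf")
    for index, vector in enumerate(codebook):
        synth = synthesize(vector, lpc_coeffs)
        # Distortion: squared error between input speech and synthesized speech.
        distortion = sum((t - s) ** 2 for t, s in zip(target, synth))
        if distortion < best_distortion:
            best_index, best_distortion = index, distortion
    return best_index, best_distortion


# Toy frame: the target is built from codebook entry 2, so the search finds it.
coeffs = [0.5]
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0]]
target = synthesize(codebook[2], coeffs)
index, distortion = closed_loop_search(target, codebook, coeffs)
```

Only the index of the winning excitation vector needs to be transmitted for the frame, which is why the search cost is paid on the encoding side.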
An LPC vocoder is known as a scheme of encoding speech signals at a lower bit rate. The LPC vocoder models the vocal source signal with a pulse train and a white noise sequence, models the vocal tract characteristic with an LPC synthesis filter, and encodes the parameters of those models. This scheme can encode speech signals at a rate of approximately 2.4 kbps at the price of a lower speech quality. These encoding systems are designed to transfer, with as high a perceptual fidelity as possible, not only the linguistic information about what a speaker is saying but also the information carried by the original speech waveform, such as personality, vocal property and feeling, and are used mainly in telephone-based communications.
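The source-filter model of the LPC vocoder mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: a frame is treated as either fully voiced (pulse train) or fully unvoiced (white noise), and the function names, the gain parameter and the coefficient sign convention are hypothetical.

```python
import random


def make_excitation(num_samples, voiced, pitch_period=0, seed=0):
    """Pulse train for a voiced frame, white Gaussian noise for an unvoiced one."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)  # fixed seed only to keep the sketch repeatable
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]


def lpc_synthesize(excitation, lpc_coeffs, gain=1.0):
    """All-pole LPC synthesis filter: s[n] = gain * e[n] + sum_k a[k] * s[n-1-k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = gain * e
        for k, a in enumerate(lpc_coeffs):
            if n - 1 - k >= 0:
                acc += a * out[n - 1 - k]
        out.append(acc)
    return out


# A voiced frame: one pulse every 4 samples through a first-order filter.
voiced_frame = lpc_synthesize(make_excitation(8, voiced=True, pitch_period=4), [0.5])
# An unvoiced frame: white noise through the same filter.
unvoiced_frame = lpc_synthesize(make_excitation(8, voiced=False), [0.5])
```

In such a model only the voiced/unvoiced decision, the pitch period, the gain and the filter coefficients describe a frame, which is what makes the low bit rate possible and also what limits the speech quality.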
Due to the recent popularity of the Internet, the number of subscribers who use a service called net chatting is increasing. This service provides real-time one-to-one, one-to-many and many-to-many chatting on a network, and employs a system based on the aforementioned CELP system to transfer speech signals. The CELP system, whose bit rate is 1/8 to 1/16 of that of the PCM system, can ensure efficient transfer of speech signals. However, the number of Internet users is rapidly increasing, which often places a heavy load on the network. This delays the transfer of speech information, and thus interferes with smooth chatting.
A solution to such a situation requires a technique of encoding speech signals at a lower bit rate than that of the CELP system. One known, extreme way of encoding at a low bit rate is recognition-synthesis based encoding, which recognizes the linguistic information of a speech, transfers a string of characters which represents the linguistic information, and executes rule-based synthesis on the character string on the receiver side. This recognition-synthesis based encoding, which is briefly introduced in "Highly Efficient Speech Encoding" by Kazuo Nakada, Morikita Press (Document 2), is said to be able to transfer speech signals at a very low rate of about several dozen to 100 bps.
The recognition-synthesis based encoding, however, requires that speech be reproduced by performing rule-based synthesis on a character string obtained by a speech recognition scheme. If speech recognition is incomplete, therefore, intonation may become significantly unnatural, or the contents of the conversation may be conveyed incorrectly. In this respect, recognition-synthesis based encoding is premised on a perfect speech recognition technique. For this reason, no practical recognition-synthesis based encoding has been implemented yet, and it seems difficult to realize such an encoding system in the future as well.
Because such a method of carrying out communication after converting speech signals, which are physical information, into linguistic information, which is highly abstract information, is difficult to realize, an encoding scheme has been proposed which recognizes speech signals as information closer to the physical level and converts the former to the latter. One known example of this scheme is "Vocoder Method And Apparatus" described in Jpn. Pat. Appln. KOKOKU Publication No. Hei 5-76040 (Document 3).
Document 3 describes an analog speech input which is sent to a speech recognition apparatus and converted there to a phonetic segment stream. The phonetic segment stream is converted by a phonetic segment/allophone synthesizer to an approximating allophone stream, from which speech is reproduced. In the speech recognition apparatus, the analog speech input is sent to a formant tracker while its signal gain is kept at a given value by an AGC (Automatic Gain Controller), and the formants in the input signal are detected and stored in a RAM. The stored formants are sent to a phonetic segment boundary detector to be segmented into phonetic components. The phonetic segments are checked against phonetic segment templates for a match by a recognition algorithm, and the recognized phonetic segment is acquired.
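The template-matching step described above can be illustrated with a minimal sketch. The recognition algorithm of Document 3 is not specified here, so this sketch assumes a nearest-template rule using the Euclidean distance between formant frequencies; the template table, its values and the function names are hypothetical and purely illustrative.

```python
import math

# Hypothetical templates: phonetic symbol -> (F1, F2) formant frequencies in Hz.
# The values are illustrative only and are not taken from Document 3.
TEMPLATES = {
    "i": (270.0, 2290.0),
    "a": (730.0, 1090.0),
    "u": (300.0, 870.0),
}


def recognize_segment(formants, templates=TEMPLATES):
    """Return the symbol whose template lies closest to the measured formants."""
    return min(templates, key=lambda sym: math.dist(formants, templates[sym]))


def recognize_stream(segments, templates=TEMPLATES):
    """Map each segmented formant measurement to its best-matching symbol."""
    return [recognize_segment(formants, templates) for formants in segments]


# Measurements that do not coincide with any template still receive a symbol.
symbols = recognize_stream([(280.0, 2200.0), (700.0, 1200.0), (320.0, 900.0)])
```

Because the nearest template is always returned, an imperfect measurement still yields some phonetic symbol rather than a failure, at the cost of occasionally selecting the wrong one.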
In the phonetic segment/allophone synthesizer, an allophone stream corresponding to the input phonetic code is read from a ROM and then sent to a speech synthesizer. The speech synthesizer acquires the parameters necessary for speech synthesis, such as the parameters of a linear prediction filter, from the received allophone stream, and produces speech through synthesis using those parameters. An "allophone" is a phonetic segment affixed with an attribute determined in accordance with predetermined rules using the phonetic segments around it. (The attribute indicates whether the phonetic segment is an initial speech, an intermediate speech or an ending speech, or whether it is nasal-voiced or unvoiced.)
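The idea of attaching an attribute to a phonetic segment on the basis of its neighbours can be sketched as follows. The actual rules of Document 3 are not given here, so this sketch assumes one hypothetical rule only: the position attribute (initial, intermediate or ending) is decided by whether the neighbouring segment is a pause. The pause symbol and the function name are assumptions.

```python
SILENCE = "_"  # hypothetical symbol marking a pause between utterances


def to_allophones(segments):
    """Attach a position attribute to each phonetic segment from its neighbours."""
    allophones = []
    for i, seg in enumerate(segments):
        if seg == SILENCE:
            continue  # pauses only act as boundaries; they get no attribute
        prev_is_boundary = i == 0 or segments[i - 1] == SILENCE
        next_is_boundary = i == len(segments) - 1 or segments[i + 1] == SILENCE
        if prev_is_boundary:
            position = "initial"
        elif next_is_boundary:
            position = "ending"
        else:
            position = "intermediate"
        allophones.append((seg, position))
    return allophones


stream = to_allophones(["k", "a", "t", "_", "a"])
```

Each resulting pair could then serve as a key for looking up stored synthesis parameters, in the way the allophone stream is described as being read from a ROM.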
The key point of the scheme described in Document 3 is that a speech signal is simply converted to a phonetic symbol string, not to a character string as linguistic information, and the symbol string is associated with physical parameters for speech synthesis. This design has the advantage that even if a phonetic segment is erroneously recognized, only that segment is replaced by another phonetic segment, and the sentence as a whole does not change much.
Document 3 states that, because of the natural filtering by human ears and the error correction performed by a listener in the thought process, errors produced by the recognition algorithm are minimized by acquiring the best match, even if recognition is not complete.
Since the encoding method disclosed in Document 3 simply transfers a symbol string representing phonetic segments from the encoding side, the synthesized speech reproduced on the decoding side becomes unnatural, without intonation or rhythm. As a result, only the contents of the conversation are transmitted, while information on the speaker and on the speaker's feeling is not.
In short, the prior art has the following shortcomings. Because the conventional recognition-synthesis system, which recognizes the linguistic information of a speech, transfers a character string expressing that information and performs rule-based synthesis on the decoding side, is premised on a perfect speech recognition technique, it is practically difficult to realize.
Further, because the known encoding system, which can employ even an incomplete speech recognition scheme, simply transfers a symbol string representing phonetic segments from the encoding side, the synthesized speech reproduced on the decoding side becomes unnatural, without intonation or rhythm. As a result, only the contents of the conversation are transmitted, while information on the speaker and on the speaker's feeling is not.