1. Field of the Invention
The present invention relates to a speech reproducing system configured to decode a speech coded information which is outputted from a speech coder by coding an input speech signal and which includes a pitch information and a mode information representing short-time characteristics of the speech, obtained by analyzing the input speech signal, and furthermore to convert a speech-rate of a decoded speech signal, so as to generate an output speech signal. More specifically, the present invention relates to a speech reproducing system capable of reducing the amount of computation and of minimizing deterioration of the speech quality in reproducing a speech signal outputted after coding and decoding, as in an automatic answering telephone set having a solid state recording-reproducing device, by modifying only the speech-rate without changing the pitch (or frequency) of the speech or the timbre of the speech.
2. Description of Related Art
In the prior art, a technology of coding a speech signal to compress the amount of data is widely utilized in order to realize an efficient transmission and an efficient storage.
For example, as the speech coding system capable of obtaining a high compression ratio, a CELP (Code Excited Linear Prediction) system can be exemplified, which is disclosed in detail by, for example, Ozawa, "Speech Coding Technology" included in the Japanese language book "Mobile Communication Digitizing Technology", which is called a "Reference 1" in this specification and the content of which is incorporated by reference in its entirety into this application.
In brief, in this CELP scheme, an input speech signal is coded by obtaining information of a spectrum component of the input speech signal in accordance with a linear predictive analysis, and by vector-quantizing information of a sound source signal by use of an adaptive codebook and a sound source codebook. In decoding, an LPC (Linear Predictive Coding) filter obtained by the linear predictive analysis is excited in accordance with a quantized vector obtained from the adaptive codebook and the sound source codebook, so that a speech signal is obtained. In the vector-quantization based on the adaptive codebook, there is obtained a delay information which is a period of a repetitive component in the speech, and the quantized vector is described using the adaptive code vector, which is the repetitive component having the period given by the delay information. Thus, the quantizing efficiency is elevated.
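The role of the delay information can be illustrated with a minimal sketch. It assumes that the adaptive code vector is formed simply by repeating the most recent excitation samples with a period equal to the delay; the function name `adaptive_code_vector` and its parameters are hypothetical and are not taken from Reference 1.

```python
def adaptive_code_vector(past_excitation, delay, length):
    """Build an adaptive code vector by repeating past excitation.

    past_excitation: previously decoded excitation samples, most recent last.
    delay: period of the repetitive component, in samples (assumed unit).
    length: number of samples to generate for the current block.
    """
    if not 0 < delay <= len(past_excitation):
        raise ValueError("delay must be between 1 and len(past_excitation)")
    # The last `delay` samples form one period of the repetitive component.
    segment = past_excitation[-delay:]
    # Repeat that period until `length` samples have been produced.
    return [segment[n % delay] for n in range(length)]


history = [0.0, 1.0, 2.0, 3.0, 4.0]
vector = adaptive_code_vector(history, delay=2, length=5)
print(vector)
```

Because only the delay has to be transmitted to regenerate the periodic part of the excitation, the remaining sound source codebook only needs to describe what the repetition does not cover.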
In addition, an M-LCELP (Multimode-Learned CELP) system is disclosed by Ozawa et al., "4 kbps high quality M-LCELP speech coding", NEC Technical Disclosure Bulletin, Vol. 48, No. 6, which is called a "Reference 2" in this specification and the content of which is incorporated by reference in its entirety into this application. In this system, a mode information indicating no sound or a no-sound portion, a transient portion, a weak steady portion of a voiced sound, or a steady portion of the voiced sound, is determined by using a basic period of the speech or the like, and the adaptive codebook or the sound source codebook is switched over for each one of the modes.
Now, an example of the speech coder of the M-LCELP scheme will be described with reference to FIG. 1, which is a block diagram illustrating a fundamental principle of the speech coder of the M-LCELP scheme.
The speech coder generally designated with Reference Numeral 10, includes a linear predictive analyzer 11 receiving an input speech signal Vin to conduct a linear predictive analysis for the input speech signal Vin for each frame having a constant time length, so that a linear predictive coding LPC is obtained. The speech coder 10 also includes a mode discriminator 12 receiving the input speech signal Vin to determine, on the basis of the strength of a basic period of the speech in the frame, a speech mode information M indicative of no sound or a no-sound portion, a transient portion, a weak steady portion of a voiced sound or a steady portion of the voiced sound.
An adaptive codebook retrieval unit 13 receives the input speech signal Vin, the linear predictive coding LPC and the mode information M, and generates a delay information AC indicative of a repetitive component of the speech. A sound source codebook retrieval unit 14 receives the input speech signal Vin, the linear predictive coding LPC, the mode information M and the delay information AC, and refers to a sound source codebook 41, to output a sound source code EC which is a sound source information.
A signal output unit 15 receives the linear predictive coding LPC, the mode information M, the delay information AC, and the sound source code EC, and outputs a speech coded information IDX having a predetermined format including the linear predictive coding LPC, the mode information M, the delay information AC, and the sound source code EC.
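For illustration only, the four items carried by the speech coded information IDX can be pictured as one record per frame. The sketch below is an assumption: the "predetermined format" is not specified here, so the field types, the names `SpeechMode`, `SpeechCodedInfo`, `pack_frame` and `unpack_frame`, and the numeric mode values are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class SpeechMode(Enum):
    """The four modes named for the mode information M (values are assumed)."""
    NO_SOUND = 0
    TRANSIENT = 1
    WEAK_STEADY_VOICED = 2
    STEADY_VOICED = 3


@dataclass
class SpeechCodedInfo:
    """One frame of speech coded information IDX (hypothetical layout)."""
    lpc: List[float]      # linear predictive coding LPC
    mode: SpeechMode      # mode information M
    delay: int            # delay information AC
    source_code: int      # sound source code EC


def pack_frame(lpc, mode, delay, source_code):
    """Role of the signal output unit 15: bundle the four items into IDX."""
    return SpeechCodedInfo(list(lpc), mode, delay, source_code)


def unpack_frame(idx):
    """Role of a receiving unit: split IDX back into its four items."""
    return idx.lpc, idx.mode, idx.delay, idx.source_code


frame = pack_frame([0.9, -0.2], SpeechMode.STEADY_VOICED, 40, 17)
print(unpack_frame(frame))
```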
Now, an example of the speech decoder of the M-LCELP scheme will be described with reference to FIG. 2, which is a block diagram illustrating a fundamental principle of the speech decoder of the M-LCELP scheme.
In the speech decoder generally designated with Reference Numeral 20, a signal input unit 21 receives the speech coded information IDX and outputs the linear predictive coding LPC, the mode information M, the delay information AC, and the sound source code EC.
An adaptive codebook decoder 22 receives the mode information M and the delay information AC, to decode and reproduce an adaptive code vector. A sound source codebook decoder 23 receives the mode information M and the sound source code EC to decode and reproduce the sound source information with reference to a sound source codebook 42.
An adder 24 receives the adaptive code vector decoded by the adaptive codebook decoder 22 and the sound source information decoded by the sound source codebook decoder 23, and generates an added signal S, which is supplied to a synthesizing filter 25 which also receives the linear predictive coding LPC from the signal input unit 21. The synthesizing filter 25 generates a decoded speech signal VDEC.
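The adder 24 and the synthesizing filter 25 can be sketched as follows. This is a simplified illustration under stated assumptions: codebook gains are omitted, the filter is taken to be an all-pole filter 1/A(z) with A(z) = 1 + a1 z^-1 + ... + ap z^-p, and the function name `decode_frame` and its parameters are hypothetical.

```python
def decode_frame(adaptive_vector, source_vector, lpc, memory=None):
    """Add the two decoded vectors and pass the sum through an all-pole filter.

    lpc: coefficients a1..ap of the assumed A(z) = 1 + sum(a_k * z^-k).
    memory: the last p output samples, most recent first (zeros if omitted).
    Returns the decoded samples and the updated filter memory.
    """
    order = len(lpc)
    past = list(memory) if memory is not None else [0.0] * order
    # Adder 24: added signal S (gains omitted in this sketch).
    excitation = [a + e for a, e in zip(adaptive_vector, source_vector)]
    decoded = []
    for s in excitation:
        # Synthesizing filter 25: y[n] = s[n] - sum_k a_k * y[n - k].
        y = s - sum(a_k * y_k for a_k, y_k in zip(lpc, past))
        decoded.append(y)
        past = ([y] + past)[:order]
    return decoded, past


vdec, state = decode_frame([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], lpc=[-0.5])
print(vdec)
```

Returning the filter memory lets the next frame continue from the state left by the current one, so consecutive frames join without a discontinuity.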
On the other hand, a speech-rate converting technology for reproducing a speech as if the same speaker had spoken quickly or slowly, without changing the pitch (or frequency) of the speech or the timbre of the speech, is used in a video tape recorder, a hearing aid, or an automatic answering telephone set.
As regards this speech-rate converting technology, various applications were proposed by Kato, "Speech-rate Converting Technology entered into Actual Use Stage, to Fundamental Function of Speech Output Instruments", Nikkei Electronics, No. 622, November 1994 (which is called a "Reference 3" in this specification and the content of which is incorporated by reference in its entirety into this application).
Many speech-rate converting systems used in these applications are based on a TDHS (Time Domain Harmonic Scaling) scheme. This TDHS scheme is configured to slice the speech signal for each pitch and to make a window processing, and then to superpose the sliced signals, as shown by, for example, Furui, "Digital Speech Processing" published from Tokai University Publishing Company in 1985 (which is called a "Reference 4" in this specification and the content of which is incorporated by reference in its entirety into this application).
Now, the TDHS scheme will be described with reference to FIGS. 3A and 3B.
FIG. 3A illustrates the TDHS processing for multiplying the length of the input speech signal by 1/2. As shown in FIG. 3A, the input speech signal is sliced out in units of two pitches, a window function processing is conducted, and thereafter, the sliced two pitches of speech signal thus processed are superposed to generate an output speech signal. After this series of processing is completed, the next two pitches of speech signal are supplied, and the above mentioned TDHS processing is conducted again.
Thus, since each two pitches of the speech signal are outputted as one pitch of speech signal, the length of the signal is shortened to one half.
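The halving operation can be sketched in a few lines. The sketch assumes a constant pitch period in samples and a linear (triangular) window pair, with a falling weight on the first pitch and a rising weight on the second; the window shape is not specified above, and the name `tdhs_halve` is hypothetical.

```python
def tdhs_halve(signal, pitch):
    """Shorten a signal to about one half by superposing pairs of pitches.

    signal: list of samples; pitch: pitch period in samples (assumed constant).
    Samples after the last complete pair of pitches are dropped.
    """
    output = []
    # Slice the input in units of two pitches.
    for start in range(0, len(signal) - 2 * pitch + 1, 2 * pitch):
        first = signal[start:start + pitch]
        second = signal[start + pitch:start + 2 * pitch]
        for n in range(pitch):
            rise = n / pitch  # assumed linear window, 0 -> 1 over one pitch
            # Fade out the first pitch while fading in the second.
            output.append((1.0 - rise) * first[n] + rise * second[n])
    return output


periodic = [0.0, 1.0, 2.0, 3.0] * 4      # four pitches of period 4
shortened = tdhs_halve(periodic, pitch=4)
print(len(periodic), len(shortened))
```

With this weighting each output pitch starts like the first input pitch and ends like the second, so it joins the preceding and following output without a jump, and a perfectly periodic input keeps its waveform and period.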
FIG. 3B illustrates the TDHS processing for multiplying the length of the input speech signal by 2. As shown in FIG. 3B, the input speech signal is sliced out in units of two pitches, and the first one pitch of the two pitches of speech signal thus obtained is outputted as it is. On the other hand, a window function processing is conducted for the sliced two pitches of speech signal, and thereafter, the sliced two pitches of speech signal thus processed are superposed to generate an output speech signal, which is coupled to the first one pitch of speech signal. After this series of processing is completed, the next one pitch of speech signal is supplied, and the above mentioned TDHS processing is conducted again.
Thus, since each two pitches of the speech signal are outputted as four pitches of speech signal, the length of the signal is elongated to two times.
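The doubling operation can be sketched in the same manner. The sketch assumes a constant pitch period in samples and a linear window pair, here with a falling weight on the second pitch and a rising weight on the first, so that the superposed pitch joins smoothly to its neighbours; the window shape and the name `tdhs_double` are assumptions.

```python
def tdhs_double(signal, pitch):
    """Lengthen a signal to about two times, advancing one pitch per step.

    signal: list of samples; pitch: pitch period in samples (assumed constant).
    The final pitch of the input has no successor and is not outputted.
    """
    output = []
    # Slice two pitches, then advance by only one pitch for the next step.
    for start in range(0, len(signal) - 2 * pitch + 1, pitch):
        first = signal[start:start + pitch]
        second = signal[start + pitch:start + 2 * pitch]
        # The first pitch is outputted as it is.
        output.extend(first)
        for n in range(pitch):
            rise = n / pitch  # assumed linear window, 0 -> 1 over one pitch
            # Superposed pitch: starts like the second, ends like the first.
            output.append((1.0 - rise) * second[n] + rise * first[n])
    return output


periodic = [0.0, 1.0, 2.0, 3.0] * 4      # four pitches of period 4
lengthened = tdhs_double(periodic, pitch=4)
print(len(periodic), len(lengthened))
```

Each step consumes one pitch of input and produces two pitches of output, which gives the factor of two apart from the unprocessed tail.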
Next, a prior art speech-rate converter will be described with reference to FIG. 4, which is a block diagram of the speech-rate converter disclosed by Japanese Patent Application Pre-examination Publication No. JP-A-1-093795 (which is called a "Reference 5" in this specification and the content of which is incorporated by reference in its entirety into this application; an English abstract of JP-A-1-093795 is available from the Japanese Patent Office, and the content of the English abstract of JP-A-1-093795 is also incorporated by reference in its entirety into this application).
The speech-rate converter shown is generally designated by Reference Numeral 300, and includes a waveform editor 32, a pitch extractor 33 and a speech short-time characteristics discriminator 34.
The pitch extractor 33 receives an input speech signal VDEC and obtains a pitch information T by use of an autocorrelation method. The speech short-time characteristics discriminator 34 receives the input speech signal VDEC, executes at least one of a discrimination as to whether or not a speech power exists, a PARCOR (Partial Autocorrelation) analysis, and a zero-crossing analysis, and discriminates to which of a vowel period, a voiced consonant period, a voiceless consonant period, and a no-sound period the input speech signal VDEC belongs, so that the speech short-time characteristics information SP is outputted.
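To give a sense of the computation involved, the sketch below estimates a pitch period by a normalized autocorrelation search and computes two simple short-time features, a zero-crossing count and a mean power. It is an illustration under stated assumptions and not the method of Reference 5: the function names, the lag range, and the normalization are hypothetical, and the decision rule that maps such features to a vowel, consonant or no-sound period is not reproduced.

```python
import math


def autocorrelation_pitch(signal, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the highest normalized
    autocorrelation, or None if no lag can be scored."""
    best_lag, best_score = None, -math.inf
    for lag in range(min_lag, max_lag + 1):
        current = signal[lag:]
        delayed = signal[:len(signal) - lag]
        numerator = sum(c * d for c, d in zip(current, delayed))
        denominator = math.sqrt(sum(c * c for c in current) *
                                sum(d * d for d in delayed))
        if denominator == 0.0:
            continue  # silent or empty segment: no score for this lag
        score = numerator / denominator
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def zero_crossings(signal):
    """Count sign changes between neighbouring samples."""
    return sum(1 for a, b in zip(signal, signal[1:]) if (a < 0) != (b < 0))


def short_time_power(signal):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in signal) / len(signal)


tone = [math.sin(2.0 * math.pi * n / 8.0) for n in range(64)]
print(autocorrelation_pitch(tone, 4, 12), zero_crossings(tone))
```

Every candidate lag requires a pass over the whole frame, so the cost of the search grows with the product of the frame length and the number of lags examined.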
The waveform editor 32 receives the input speech signal VDEC, the pitch information T and the speech short-time characteristics information SP, and conducts the speech-rate converting processing as disclosed in "Reference 5" for the input speech signal VDEC, on the basis of the pitch information T and the speech short-time characteristics information SP. Namely, a thinning-out processing and a repeating processing of the waveform is conducted. Thus, an output speech signal VOUT is generated.
The prior art speech reproducing system is constructed to code the speech, to store the coded speech, to decode the stored coded speech, and thereafter to conduct the speech-rate conversion, for the purpose of reproducing the speech, as in the automatic answering telephone set having a solid state recording-reproducing device.
Now, the prior art speech reproducing system will be described with reference to FIGS. 1, 2 and 4 and also with reference to FIG. 5, which is a block diagram illustrating the speech reproducing system obtained by combining the speech coder 10, the speech decoder 20 and the speech-rate converter 300.
As described with reference to FIG. 1, the speech coder 10 codes and compresses the input speech signal Vin by use of the M-LCELP scheme, to output the speech coded information IDX, which can be stored in a memory (not shown) or the like. As described with reference to FIG. 2, the speech decoder 20 decodes the speech coded information IDX (which can be read out from the memory (not shown)) by use of the M-LCELP scheme, to output the decoded speech signal VDEC. As described with reference to FIG. 4, the speech-rate converter 300 conducts the speech-rate converting processing to the decoded speech signal VDEC, to generate the output speech signal VOUT.
The above mentioned prior art speech reproducing system includes the speech-rate converter which receives the decoded speech signal obtained by decoding the coded signal which is obtained by coding the speech signal by use of the M-LCELP scheme, and which executes the speech-rate converting processing to the received decoded speech signal in accordance with the TDHS scheme. In this speech-rate converter, as mentioned above, the pitch extractor 33 obtains the pitch information T by use of the autocorrelation method or another method. The speech short-time characteristics discriminator 34 executes the discrimination as to whether or not a speech power exists, the PARCOR analysis, and the zero-crossing analysis, to generate the speech short-time characteristics information SP.
In this arrangement, however, the amount of computation conducted in the pitch extractor for obtaining the pitch information and the amount of computation conducted in the speech short-time characteristics discriminator for obtaining the speech short-time characteristics information are generally large, and therefore, a large program size and a long processing time are required. This is disadvantageous.
In addition, there is a possibility that the speech based on the decoded speech signal processed by the M-LCELP scheme is deteriorated in comparison with the original speech. If it is deteriorated, the effective pitch information and the effective speech short-time characteristics information required for the speech-rate converting processing may not be obtained, resulting in a high possibility that the output speech signal has a sound quality deteriorated in comparison with the original speech.