FIG. 1 is a block diagram of a speech coding/decoding device of the prior art which encodes and decodes speech information by using a codebook.
In a speech coding device, speech input via a speech inputting unit 1 is analyzed by a speech analysis unit 2. Results of the analysis by the speech analysis unit 2 are then quantized based on a codebook 5 by a quantizing unit 3 to generate quantization parameters. The quantizing unit 3 further generates more than one type of quantization code (e.g., indexes of the codebook) indicating the quantized values (quantization parameters), and supplies these quantization codes to a coding unit 4. The coding unit 4 multiplexes the quantization codes to generate encoded codes. Here, the codebook 5 is stored in a ROM.
In a speech decoding device, received encoded codes are separated by a decoding unit 6 into more than one type of quantization code. The separated quantization codes are then subjected to an inverse quantizing process based on the codebook 5 by an inverse-quantizing unit 7 to generate quantization parameters. A speech synthesizing unit 8 synthesizes the speech by using the quantization parameters, so that a speech outputting unit 9 outputs the speech.
Parameters in the codebook 5 used for the quantization process may vary in types, and different processes may be carried out, depending on the types of the parameters, by the speech analysis unit 2, the quantizing unit 3, the inverse-quantizing unit 7, and the speech synthesizing unit 8. For different types of parameters, the speech coding/decoding device may have different configurations as shown in FIG. 2.
FIG. 2 is a table chart showing two different configurations of the speech coding/decoding device of the prior art.
In FIG. 2, the type-1 speech coding/decoding device uses speech waveforms as parameters of the codebook 5. In the speech coding unit, an input speech signal is divided (windowed) into speech signals of a predetermined time length. The quantizing unit then searches in the codebook for speech waveforms closest to the windowed speech signals, and obtains quantization codes of these speech waveforms. In the speech decoding unit, speech waveforms are successively extracted from the codebook by using received quantization codes. The speech waveforms are then interpolated and connected by the speech synthesizing unit to output a speech signal.
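The type-1 scheme described above can be illustrated by a minimal sketch, given here only as an example under stated assumptions. The function names, the squared-error distance used as the measure of "closest", and the non-overlapping frames are all assumptions of this sketch and are not taken from FIG. 2. The decoder here simply concatenates the extracted waveforms, whereas the device described above also interpolates between them.

```python
def split_frames(signal, frame_len):
    """Divide a signal into consecutive, non-overlapping frames.

    Any trailing samples that do not fill a whole frame are dropped.
    """
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


def nearest_index(frame, codebook):
    """Return the index of the codebook waveform closest to the frame.

    "Closest" is taken to mean smallest squared error (an assumption).
    """
    def squared_error(entry):
        return sum((a - b) ** 2 for a, b in zip(frame, entry))

    return min(range(len(codebook)), key=lambda i: squared_error(codebook[i]))


def encode(signal, codebook, frame_len):
    """Coding side: one quantization code (codebook index) per frame."""
    return [nearest_index(f, codebook) for f in split_frames(signal, frame_len)]


def decode(codes, codebook):
    """Decoding side: extract the waveforms and connect them end to end."""
    out = []
    for code in codes:
        out.extend(codebook[code])
    return out


# Hypothetical two-sample waveforms, for illustration only.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
codes = encode([0.9, 1.1, 0.1, 0.8, 0.0, 0.1], codebook, frame_len=2)
restored = decode(codes, codebook)
```

Only the list of indexes needs to be transmitted, because both sides hold the same codebook.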
The type-2 speech coding/decoding device is based on a CELP (code excited linear prediction) method, for example, and uses speech-source signals and LPC coefficients as parameters of the codebook. In the type-2 speech coding device, a speech signal is divided (windowed) into speech signals of a predetermined time length, and an LPC analysis is applied. The quantizing unit searches in the codebook for quantized LPC coefficients (quantization parameters) closest to the results of the analysis and for quantization codes indicating the quantization parameters, and also searches in the codebook for the most appropriate speech source. In the speech decoding unit, LPC coefficients and speech-source signals are extracted from the codebook by using received quantization codes. The synthesizing unit then synthesizes speech by using the LPC coefficients and the speech-source signals.
In the following, an example of a configuration when the CELP method is used in a speech coding/decoding device will be described.
FIG. 3 is a block diagram of an example of a speech coding device which employs the CELP method. In FIG. 3, the same reference numerals as those of FIG. 1 represent corresponding circuit blocks of FIG. 1.
The speech coding device of the CELP method emulates vocal-cord vibrations and vocal-tract-transmission characteristics of a human voicing mechanism. Namely, vocal-cord vibrations are emulated by a speech-source codebook, and the vocal-tract-transmission characteristics are emulated by a linear filter which uses LPC coefficients as filter coefficients. Differences between a synthesized speech signal and an input speech signal are minimized by adjusting indexes and gains which are used with respect to the speech-source codebook. Speech-source-code indexes and gain indexes of the gains which minimize the differences are output together with the indexes of the LPC coefficients.
In the speech coding device of FIG. 3, the speech analysis unit 2 includes an LPC analyzing unit 21 for analyzing LPC coefficients of input speech. The codebook 5 includes an LPC-coefficient codebook 51 containing LPC coefficients and representative vectors, and includes a stochastic codebook 52 and adaptive codebook 53 serving as a speech source. The quantizing unit 3 includes a code selecting unit 31, a synthesis filter 32, and an error minimizing unit 33.
The LPC analyzing unit 21 applies a window operation to an input speech signal so as to divide the input signal into a plurality of frames (frame 1, frame 2, frame 3, . . . ) having a predetermined time length. The LPC analyzing unit 21 further conducts the LPC analysis on each frame to obtain a plurality of LPC coefficients α_1 through α_n with respect to each frame.
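The text above does not state how the LPC analysis itself is computed. One common approach, shown here purely as an assumed example, is the autocorrelation method solved by the Levinson-Durbin recursion. The function names are hypothetical, and the sign convention (each sample predicted as a weighted sum of the preceding n samples) is an assumption of this sketch.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t - k] for t in range(k, n))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a[1..order] from autocorrelations r.

    Convention (assumed): x[t] is predicted as sum_k a[k] * x[t - k].
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]               # prediction error energy
    for i in range(1, order + 1):
        if err <= 0.0:       # silent or degenerate frame: stop early
            break
        k_i = (r[i] - sum(a[k] * r[i - k] for k in range(1, i))) / err
        prev = a[:]
        a[i] = k_i
        for k in range(1, i):
            a[k] = prev[k] - k_i * prev[i - k]
        err *= 1.0 - k_i * k_i
    return a[1:]


def lpc_analysis(frame, order):
    """Return `order` LPC coefficients for one frame."""
    return levinson_durbin(autocorrelation(frame, order), order)
```

Calling lpc_analysis once per frame yields one set of n coefficients per frame, matching the per-frame output described above.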
The code selecting unit 31 selects quantization codes (codebook indexes) of the LPC coefficients α_1 through α_n from the LPC-coefficient codebook 51 based on the LPC coefficients α_1 through α_n obtained through the analysis of the input speech. Upon the selection of the quantization codes, the code selecting unit 31 outputs the quantization codes and quantized LPC coefficients α_q1 through α_qn corresponding to the quantization codes (the subscript "q" denotes quantization). The quantized LPC coefficients α_q1 through α_qn differ from the LPC coefficients α_1 through α_n only in that less significant digits thereof are rounded.
The synthesis filter 32 uses the quantized LPC coefficients α_q1 through α_qn from the code selecting unit 31 as the filter coefficients. Based on these filter coefficients, the speech is synthesized by using excitation signals, which are generated based on representative vectors in the stochastic codebook 52 and the adaptive codebook 53. These excitation signals represent the vocal-cord vibrations, and the filter coefficients emulate the vocal-tract-transmission characteristics. The vocal-tract-transmission characteristics are reflection characteristics of a portion extending from the throat to the lips and the nose. Here, the adaptive codebook 53 keeps updating previous signals.
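The role of the synthesis filter can be illustrated by a minimal all-pole filter sketch. The function name and the sign convention (each output sample equals the excitation sample plus a weighted sum of previous output samples) are assumptions of this example; the text above does not give the filter equation.

```python
def synthesis_filter(excitation, lpc):
    """All-pole filter: s[t] = e[t] + sum_k lpc[k-1] * s[t-k], k = 1..n.

    `lpc` holds the quantized LPC coefficients used as filter coefficients.
    Output samples before the start of the frame are taken as zero.
    """
    out = []
    for t, e in enumerate(excitation):
        past = sum(a * out[t - k]
                   for k, a in enumerate(lpc, start=1) if t - k >= 0)
        out.append(e + past)
    return out


# A single pulse through a one-coefficient filter decays geometrically.
response = synthesis_filter([1.0, 0.0, 0.0], [0.5])
```

In this picture the excitation stands for the vocal-cord vibration, and the recursive term shapes it in the way the vocal tract would.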
The error minimizing unit 33 compares the input speech with the speech synthesized by the synthesis filter 32, and controls the codebook indexes and the gain indexes with respect to the stochastic codebook 52 and the adaptive codebook 53 so as to minimize differences between the input speech and the synthesized speech. Namely, the error minimizing unit 33 adjusts quality and magnitude of the vocal-cord vibrations such that the synthesized speech becomes equal to the input speech.
In the speech coding device, the code selecting unit 31 searches in the LPC-coefficient codebook 51 for the quantized LPC coefficients when the LPC coefficients are obtained through the analysis by the LPC analyzing unit 21.
FIG. 4 is an illustrative drawing showing an example of the LPC-coefficient codebook of FIG. 3. As shown in FIG. 4, representative coefficient values (quantized LPC coefficients) are provided for each of the LPC coefficients α_1 through α_n, and each of the representative coefficient values has an assigned index 01, 02, . . . , or so on. If the LPC coefficient α_1 obtained by the LPC analyzing unit 21 is 0.3984523, for example, the code selecting unit 31 searches in the LPC-coefficient codebook 51 to select a quantization code 02 and a quantized LPC coefficient of 0.398. The same operation is carried out for each of the LPC coefficients α_1 through α_n.
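The lookup in this example can be sketched as a nearest-value search over a small table. Only the pair of index 02 and value 0.398 comes from the text above; the other table entries, the names, and the use of absolute difference as the selection criterion are assumptions made for illustration.

```python
# Hypothetical table for the first LPC coefficient.
# Only "02" -> 0.398 is taken from the example in the text.
ALPHA1_TABLE = {"01": 0.397, "02": 0.398, "03": 0.399}


def quantize(value, table):
    """Return (quantization code, quantized coefficient) nearest to value."""
    index = min(table, key=lambda idx: abs(table[idx] - value))
    return index, table[index]


def quantize_all(coeffs, tables):
    """Quantize each coefficient against its own table."""
    return [quantize(c, t) for c, t in zip(coeffs, tables)]


code, quantized = quantize(0.3984523, ALPHA1_TABLE)
```

The code is what gets transmitted, while the quantized value is what the synthesis filter on the coding side uses.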
Based on the quantized LPC coefficients, the synthesis filter 32 synthesizes a speech signal, and the error minimizing unit 33 determines the indexes and gains b and g with respect to the adaptive codebook 53 and the stochastic codebook 52 such that differences between the synthesized speech and the input speech become a minimum. The adaptive codebook 53 is used for emulating cycles (pitch, speech height) of vowels with regard to the vocal-cord vibrations, and the stochastic codebook 52 is used for emulating random vocal-cord vibrations representing consonants.
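The search performed by the error minimizing unit can be sketched as an exhaustive trial of every combination of codebook index and gain index. This joint search is chosen here only for clarity; the text above does not specify the search order or the error measure, so the squared-error criterion, the function names, and the zero-based indexes are assumptions. The synthesis filter is passed in as a callable so that the sketch stays independent of any particular filter.

```python
from itertools import product


def squared_error(x, y):
    """Sum of squared sample differences between two signals."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def search_excitation(target, adaptive_cb, stochastic_cb,
                      b_table, g_table, synthesize):
    """Return (i, b_index, j, g_index) minimizing the synthesis error.

    i, j index the adaptive and stochastic codebooks; b_index and g_index
    index the corresponding gain tables.
    """
    best = None
    for i, j, bi, gi in product(range(len(adaptive_cb)),
                                range(len(stochastic_cb)),
                                range(len(b_table)),
                                range(len(g_table))):
        # Gain-scaled sum of the two codebook signals.
        excitation = [b_table[bi] * p + g_table[gi] * c
                      for p, c in zip(adaptive_cb[i], stochastic_cb[j])]
        err = squared_error(target, synthesize(excitation))
        if best is None or err < best[0]:
            best = (err, i, bi, j, gi)
    return best[1:]


# Hypothetical tiny codebooks; an identity "filter" keeps the example short.
result = search_excitation(
    target=[1.5, -0.5],
    adaptive_cb=[[1.0, 0.0], [0.0, 1.0]],
    stochastic_cb=[[1.0, 1.0], [1.0, -1.0]],
    b_table=[0.5, 1.0],
    g_table=[0.0, 0.5],
    synthesize=lambda e: e,
)
```

The cost of this exhaustive form grows with the product of all four table sizes.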
FIGS. 5A and 5B are table charts showing examples of the stochastic codebook 52 and the adaptive codebook 53 of FIG. 3. In the stochastic codebook 52 of FIG. 5A, for example, a series of figures (0.54, 0.78, 0.98, 0.65, . . . ) represents a temporal fluctuation of a signal. Namely, with an index being provided, a signal fluctuating over time is generated, as expressed by a series of figures corresponding to the provided index. The same applies in the adaptive codebook 53 of FIG. 5B. In this manner, signals having temporal fluctuations are extracted from the stochastic codebook 52 and the adaptive codebook 53, corresponding to codebook indexes provided from the error minimizing unit 33. These signals are subjected to changes in gain according to the gains g and b, and, then, are added together to be supplied to the synthesis filter 32. FIG. 6A is a table chart showing an example of the gain index g in the stochastic codebook 52 of FIG. 3, and FIG. 6B is a table chart showing an example of the gain index b in the adaptive codebook 53 of FIG. 3. As shown in FIGS. 6A and 6B, each gain has an assigned gain index 01, 02, . . . or so on.
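The way indexes select signals and gains, which are then scaled and added, can be sketched as follows. The series 0.54, 0.78, 0.98, 0.65 is taken from the example above; its assignment to index 01, all other table entries, and all names are hypothetical values chosen for illustration.

```python
# Codebook tables keyed by index. Only the first stochastic series
# comes from the text; everything else is hypothetical.
STOCHASTIC = {"01": [0.54, 0.78, 0.98, 0.65], "02": [0.10, -0.20, 0.30, -0.40]}
ADAPTIVE = {"01": [0.0, 1.0, 0.0, -1.0], "02": [1.0, 0.0, -1.0, 0.0]}
GAIN_G = {"01": 1.0, "02": 2.0}   # gains for the stochastic codebook
GAIN_B = {"01": 0.25, "02": 0.5}  # gains for the adaptive codebook


def build_excitation(i, b, j, g):
    """Scale the selected signals by their gains and add them sample by sample.

    i, b: codebook index and gain index for the adaptive codebook.
    j, g: codebook index and gain index for the stochastic codebook.
    """
    return [GAIN_B[b] * p + GAIN_G[g] * c
            for p, c in zip(ADAPTIVE[i], STOCHASTIC[j])]


excitation = build_excitation(i="01", b="02", j="01", g="01")
```

The resulting list is the signal that would be supplied to the synthesis filter.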
A coding unit 41 receives the indexes (quantization codes) of the LPC-coefficient codebook 51, the codebook indexes and gain indexes with respect to the stochastic codebook 52, and the codebook indexes and gain indexes with respect to the adaptive codebook 53, all of which are obtained by the above-described process. The coding unit 41 multiplexes these indexes to generate encoded codes, which are modulated by a modulator (not shown) and transferred to the receiver side.
FIG. 7 is an illustrative drawing showing an example of the transferred encoded codes. A plurality of LPC coefficients and each one of the other types of indexes are put together to be transferred as a frame. FIG. 7 shows an example in which each frame contains five LPC coefficients. In FIG. 7, indexes of the adaptive codebook 53 are denoted by "i", and gain indexes for the adaptive codebook 53 are denoted by "b". Further, indexes of the stochastic codebook 52 are indicated by "j", and gain indexes with regard to the stochastic codebook 52 are represented by "g".
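The multiplexing of one frame, and the matching separation on the receiver side, can be sketched as below. The field order, the one-byte width of each index, and the function names are assumptions of this example; the actual layout of FIG. 7 is not reproduced here.

```python
def pack_frame(lpc_indexes, i, b, j, g):
    """Multiplex five LPC indexes plus i, b, j, g into a 9-byte frame.

    Each index is assumed to fit in one byte (0..255); bytes() raises
    ValueError otherwise.
    """
    if len(lpc_indexes) != 5:
        raise ValueError("expected five LPC-coefficient indexes per frame")
    return bytes([*lpc_indexes, i, b, j, g])


def unpack_frame(frame):
    """Separate a 9-byte frame back into its individual indexes."""
    if len(frame) != 9:
        raise ValueError("expected a 9-byte frame")
    return list(frame[:5]), frame[5], frame[6], frame[7], frame[8]


frame = pack_frame([2, 7, 1, 4, 3], i=12, b=2, j=40, g=5)
fields = unpack_frame(frame)
```

A real coder would normally allot a different number of bits to each field rather than a whole byte.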
FIG. 8 is a block diagram of an example of a speech decoding device which employs the CELP method.
In the speech decoding device of FIG. 8, a decoding unit 61 is a circuit for separating a plurality of quantization codes multiplexed on the transmission side. The inverse-quantizing unit 7 includes a code selecting unit 71 for selecting representative vectors from the codebook 5 based on the separated quantization codes. The codebook 5 has the same structure as that on the transmission side, and the speech synthesizing unit 8 includes a synthesis filter 81 which is the same filter as that of the transmission side.
The speech decoding device carries out an inverse process of the process of the speech coding device. Namely, LPC coefficients, temporally fluctuating signals forming a basis of vocal-cord vibrations, and gains for the temporally fluctuating signals are searched for in the codebook 5 by using the quantization codes extracted by the decoding unit 61, and are used by the synthesis filter 81 to reproduce speech. The adaptive codebook 53 updates previous signals in the same manner as in the speech coding device.
In a digital mobile phone or the like, provision of a speech recognition function would enable speech dialing, for example, in which a name of a person is given as a speech input and recognized by the speech recognition function, and a corresponding phone number is then searched for so as to automatically phone this person. Thus, a convenient function is provided to replace conventional registered dialing.
In equipping a digital mobile phone or the like with a speech recognition function, a speech coding/decoding device used in the digital mobile phone may be utilized for implementing the speech recognition function. In doing so, a speech inputting unit as well as a speech analysis unit, if necessary, can be shared by both the speech recognition function and the speech coding/decoding device. However, a speech dictionary becomes necessary for the speech recognition in order to match speech inputs of names with phone numbers, thereby resulting in a memory-volume increase commensurate with the number of words and the length of items. Simply combining the speech recognition function with the speech coding/decoding function may create a situation in which the auxiliary speech recognition function ends up using a larger memory volume than the speech coding/decoding function. In consideration of this, it is desirable in practice to provide a speech recognition function for a speech coding/decoding device without significantly increasing a memory volume.
Accordingly, as a first point, there is a need to provide a speech processing device having both the speech coding/decoding function and the speech recognition function without using a large memory volume.
When the speech recognition function is provided for the phones or the like, this function may be used for recognizing speech signals transmitted from the other end of the line for various purposes. In digital mobile phones or the like employing the CELP method, speech is quantized, coded, and transmitted to a receiver side, and, on the receiver side, the speech is synthesized and reproduced based on the received information. In such a case, the reproduced speech synthesized from the received information (quantization codes) needs to be used for speech recognition in the same manner as when original speech of an operator is used in the case of speech dialing. Since there are many intervening processing steps prior to the speech recognition, however, a recognition rate may deteriorate.
Accordingly, as a second point, there is a need to increase a speech-recognition rate of a speech processing device which has both the speech coding/decoding function and the speech recognition function implemented through use of a small memory volume.
In the automatic dialing for automatically making a phone call in response to a speech input, when a name of a person to call is recognized, there is a need to reconfirm whether a recognition result is correct before actually making a call. If the speech-recognition result is only indicated on a liquid-crystal display, however, an operator needs to move his/her eyes for confirmation of the displayed result. In order to enhance the convenience of the digital mobile phones or the like equipped with the automatic dialing function, it would be better to have results of the speech recognition indicated by speech. If a speech synthesizing unit for this purpose is simply added to a device having the speech coding/decoding function and the speech recognition function, such an addition may lead to a cost increase.
Accordingly, as a third point, there is a need to incorporate a speech synthesizing function with respect to recognition results into a speech processing device which has both the speech coding/decoding function and the speech recognition function implemented through use of a small memory volume.
The present invention is directed to the problems described above, and is aimed at providing a speech processing device using a small memory volume while equipping this speech processing device with both the speech coding/decoding function and the speech recognition function.
Also, it is another object of the present invention to enhance a speech-recognition rate of the speech processing device which is equipped with both the speech coding/decoding function and the speech recognition function using a small memory volume.
Further, it is still another object of the present invention to efficiently incorporate a speech synthesizing function for recognition results into a speech processing device which has both the speech coding/decoding function and the speech recognition function using a small memory volume.