This invention relates to an apparatus for encoding voice spectrum envelope parameters, which forms a phoneme matrix by combining a certain number of phoneme vectors and performs matrix quantization using this phoneme matrix as a unit.
FIG. 1 is a block diagram of an example of a conventional voice spectrum envelope parameter encoder described on pages 1427-1439 of IEEE Transactions on Acoustics, Speech, and Signal Processing, volume ASSP-34, No. 6 (December 1986).
Referring to FIG. 1, phoneme vectors are input through an input terminal 1. These phoneme vectors are parameters representing information on the spectrum envelope of an input voice, and are obtained by analyzing the input voice signal over a certain period of time (e.g., 10 msec) for each analysis frame. A phoneme matrix formation means 2 serves to form a phoneme matrix by combining, in the time direction, L phoneme vectors input through the input terminal 1. A finite number M of representative phoneme matrix code words are stored in a code book 3. A changeover switch 4 serves to successively read out the M phoneme matrix code words stored in the code book 3.
A distance calculation means 5 serves to calculate the distance between the phoneme matrix supplied from the phoneme matrix formation means 2 and each of the phoneme matrix code words successively read from the code book 3 through the changeover switch 4. An optimum phoneme matrix code word selection means 6 serves to compare the distances calculated by the distance calculation means 5, to thereby select the phoneme matrix code word of the smallest distance value as an optimum phoneme matrix code word, and to output the number of the optimum phoneme matrix code word. The optimum phoneme matrix code word number is output through an output terminal 7.
The operation of this encoder will be described below. When phoneme vectors, i.e., parameters representing information on the spectrum envelope of an input voice, are input through the input terminal 1, the phoneme matrix formation means 2 accumulates the input phoneme vectors in groups of L frames, and outputs a phoneme matrix composed of L phoneme vectors for each group of L frames. This phoneme matrix is supplied from the phoneme matrix formation means 2 to the distance calculation means 5. Meanwhile, the M phoneme matrix code words stored in the code book 3 are successively read out through the changeover switch 4 and input into the distance calculation means 5.
The distance calculation means 5 successively calculates the distances between the phoneme matrix supplied from the phoneme matrix formation means 2 and the phoneme matrix code words successively supplied through the changeover switch 4. Euclidean distance, for example, is used as the measure for this distance calculation. The results of calculation are supplied to the optimum code word selection means 6 to be compared, and the phoneme matrix code word of the smallest distance value is selected as an optimum phoneme matrix code word. The code word number of this optimum phoneme matrix code word is output as an optimum phoneme matrix code word number through the output terminal 7 by the optimum code word selection means 6.
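The encoder operation described above can be illustrated with a minimal sketch. It assumes the Euclidean distance mentioned in the text, and it uses toy values for L, M, the code book, and the input frames; none of these values come from the cited reference. The function names are hypothetical, and dropping incomplete trailing groups is an assumption of this sketch.

```python
import math

def form_matrices(vectors, L):
    """Group consecutive phoneme vectors into matrices of L frames each.

    Assumption: trailing frames that do not fill a whole group are dropped.
    """
    return [vectors[i:i + L] for i in range(0, len(vectors) - L + 1, L)]

def matrix_distance(a, b):
    """Euclidean distance between two phoneme matrices of equal shape."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))

def encode(matrix, codebook):
    """Return the number (index) of the code word with the smallest distance."""
    distances = [matrix_distance(matrix, cw) for cw in codebook]
    return min(range(len(codebook)), key=distances.__getitem__)

# Toy data (assumed): L = 2 frames, 2 parameters per frame, M = 3 code words.
codebook = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
frames = [[0.9, 1.1], [1.0, 0.8], [0.1, 0.9], [0.8, 0.1]]

indices = [encode(m, codebook) for m in form_matrices(frames, 2)]
print(indices)  # [1, 2]
```

Only the code word numbers in `indices` would be transmitted, one per group of L frames.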
The decoder has the same code book as the above-described code book and has a reverse quantization means which receives the optimum phoneme matrix code word number, reads out a phoneme matrix code word thereby designated, decomposes the same into L output phoneme vectors, and outputs these vectors.
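The decoder side can be sketched in the same way. This is a minimal illustration that assumes the decoder holds an identical copy of a toy code book (the values are invented for the example) and receives a list of code word numbers; the function name `decode` is hypothetical.

```python
def decode(index, codebook):
    """Read out the designated code word and decompose it into L vectors."""
    code_word = codebook[index]
    return [list(frame) for frame in code_word]

# Toy code book (assumed), identical to the one used on the encoder side.
codebook = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

received = [1, 2]  # code word numbers received from the encoder
output_vectors = [v for idx in received for v in decode(idx, codebook)]
print(output_vectors)
# [[1.0, 1.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
```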
However, the optimum phoneme matrix code word having the smallest distance on the phoneme matrices does not always coincide with the phoneme matrix code word which is closest to the input voice in terms of phonemic characteristics. FIGS. 2(a) to (c) are diagrams of an example of such a case, which schematically show a phoneme matrix formed by combining phoneme vectors one-dimensionally for five frames. FIG. 2(a) shows a phoneme matrix to be encoded, FIG. 2(b) shows encoding of this matrix with a phoneme matrix code word A, and FIG. 2(c) shows encoding of this matrix with a different phoneme matrix code word B. The abscissa represents time while the ordinate represents the phoneme vector value.
As shown in these diagrams, in the case of coding with the phoneme matrix code word A, the synthesized voice does not maintain the phonemic characteristics of the input voice well. In contrast, in the case of coding with the phoneme matrix code word B, the synthesized voice maintains the phonemic characteristics of the input voice well, although a slight difference in the time direction is observed. However, with respect to the distance to the phoneme matrix which is the object of encoding, the distance dA from the phoneme matrix code word A is smaller than the distance dB from the phoneme matrix code word B. Accordingly, the phoneme matrix code word A is selected as the optimum phoneme matrix code word. In other words, the selection is greatly influenced by deformation in the time direction, and there is a considerable possibility that a phoneme matrix code word showing incorrect phonemic characteristics will be selected.
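The effect can be reproduced numerically. The sketch below uses invented one-dimensional, five-frame values chosen only to mimic the situation described; they are not taken from FIGS. 2(a) to (c). The input has a single peak, code word A is flat (the peak is lost), and code word B has the same peak shifted by one frame.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of frame values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical values, one scalar per frame, five frames.
target = [0.0, 4.0, 0.0, 0.0, 0.0]       # matrix to be encoded: peak at frame 1
code_word_a = [1.0, 1.0, 1.0, 1.0, 1.0]  # flat: the peak is not represented
code_word_b = [0.0, 0.0, 4.0, 0.0, 0.0]  # same peak, shifted by one frame

d_a = euclidean(target, code_word_a)  # sqrt(13), about 3.61
d_b = euclidean(target, code_word_b)  # sqrt(32), about 5.66

selected = "A" if d_a < d_b else "B"
print(selected)  # A
```

With these values the flat code word A wins on distance even though code word B preserves the shape of the input, which is the failure mode described above.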
To solve this problem, a type of system has been proposed in which the object phoneme matrix is encoded not with a fixed time length but with a variable time length, and in which information on the duration of each phoneme matrix is transmitted along with the optimum phoneme matrix code word number. An example of this system is reported in the voice study society materials of Nihon Onkyo Gakkai (the Acoustical Society of Japan) (data number S84-45, Nov. 22, 1985).
In this system, linear compression/expansion of the phoneme matrix code words in the code book is effected by dynamic programming so that an optimum envelope is obtained with respect to a series of input phoneme vectors, and the optimum phoneme matrix code word and its duration are selected to perform encoding. The distance at the time of encoding is thereby reduced, so that the phonemic characteristics are suitably maintained.
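The general idea of variable-length encoding can be sketched as follows. The text does not give the details of the cited system, so this is only a generic dynamic-programming segmentation under stated assumptions: frames are scalars for brevity, code words are stretched by linear interpolation, the cost is the squared error, and the duration limits `d_min` and `d_max` are hypothetical parameters.

```python
def stretch(code_word, d):
    """Linearly resample a code word (list of scalar frames) to d frames."""
    n = len(code_word)
    if d == 1:
        return [code_word[0]]
    out = []
    for i in range(d):
        pos = i * (n - 1) / (d - 1)  # position on the code word's time axis
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1 - frac) * code_word[lo] + frac * code_word[hi])
    return out

def segment_cost(segment, code_word):
    """Squared error between a segment and the code word stretched to fit it."""
    warped = stretch(code_word, len(segment))
    return sum((x - y) ** 2 for x, y in zip(segment, warped))

def encode_variable(frames, codebook, d_min, d_max):
    """Return a list of (code word number, duration) pairs of minimum total cost."""
    T = len(frames)
    INF = float("inf")
    best = [INF] * (T + 1)   # best[t]: minimum cost of encoding frames[:t]
    back = [None] * (T + 1)  # back[t]: (code word number, duration) ending at t
    best[0] = 0.0
    for t in range(1, T + 1):
        for d in range(d_min, min(d_max, t) + 1):
            if best[t - d] == INF:
                continue
            for k, cw in enumerate(codebook):
                cost = best[t - d] + segment_cost(frames[t - d:t], cw)
                if cost < best[t]:
                    best[t] = cost
                    back[t] = (k, d)
    if back[T] is None:
        raise ValueError("no segmentation fits the duration limits")
    result = []
    t = T
    while t > 0:  # trace the chosen segments back from the end
        k, d = back[t]
        result.append((k, d))
        t -= d
    result.reverse()
    return result

# Toy data (assumed): two 3-frame code words, a flat one and a peaked one.
codebook = [[0.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
frames = [0.0, 0.0, 0.0, 0.0, 2.0, 4.0, 2.0, 0.0]
print(encode_variable(frames, codebook, d_min=3, d_max=5))  # [(0, 3), (1, 5)]
```

Even in this simplified form, every end position is tested against every allowed duration and every code word, and the result is only known after the whole sequence has been examined. This illustrates why such schemes involve a large amount of processing and added delay.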
The conventional voice spectrum envelope parameter encoders are constructed as described above. In the case of the encoder shown in FIG. 1, there is a considerable possibility that a phoneme matrix code word showing incorrect phonemic characteristics will be selected because of the influence of deformation in the time direction. The system in which information on the duration of each phoneme matrix is transmitted along with the optimum phoneme matrix code word number enables phonemic characteristics to be suitably maintained, but it cannot be directly applied to a real-time communication system in which transmission is effected in fixed frame cycles. It also requires a very large amount of processing and, hence, increases the delay time.