In recent years, techniques for compressing speech signals have come into frequent use in speech communication over cellular phones and the like. The main application areas are CODECs (COder/DECoders), speech recognition and speech synthesis.
Methods for compressing speech signals are broadly classified into methods that use human auditory functions and methods that use characteristics of the vocal bands.
The methods using auditory functions include MP3 (MPEG1 audio layer 3), ATRAC (Adaptive TRansform Acoustic Coding) and AAC (Advanced Audio Coding). These methods are characterized in that sound quality is high although the compression ratio is low, and they are often used for compressing music signals.
On the other hand, the methods using characteristics of the vocal bands are used for compressing speech sounds, and are characterized in that the compression ratio is high although sound quality is low. They include methods using linear prediction coding, specifically CELP (Code Excited Linear Prediction) and ADPCM (Adaptive Differential Pulse Code Modulation).
When a speech sound is compressed by a method using linear prediction coding, the pitch of the speech sound (the inverse of the fundamental frequency) generally has to be extracted in order to perform the linear prediction coding. For this purpose, the pitch has conventionally been extracted by methods based on Fourier transformation, such as cepstrum analysis.
When the pitch is extracted by a method based on Fourier transformation, the fundamental frequency is selected from among the frequencies at which peaks of the spectrum occur, and the inverse of the fundamental frequency is identified as the pitch.
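The peak-picking step described above can be illustrated with a minimal sketch. It is not the method of any particular codec: the naive DFT stands in for an FFT, the test signal is synthetic, and the function names, the relative threshold and the rule "take the lowest sufficiently strong peak as the fundamental" are all assumptions made for illustration.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Naive O(N^2) DFT magnitudes for bins 0..N/2 (stand-in for an FFT)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(samples))
        spectrum.append(abs(acc))
    return spectrum


def estimate_pitch(samples, sample_rate, rel_threshold=0.3):
    """Return (fundamental_hz, pitch_seconds) from the lowest strong peak.

    Assumption: a peak counts as "strong" when it reaches rel_threshold
    times the largest non-DC magnitude.
    """
    mags = magnitude_spectrum(samples)
    floor = rel_threshold * max(mags[1:])
    n = len(samples)
    for k in range(1, len(mags) - 1):
        is_peak = mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]
        if is_peak and mags[k] >= floor:
            f0 = k * sample_rate / n
            return f0, 1.0 / f0  # the pitch is the inverse of f0
    raise ValueError("no spectral peak found")


# Synthetic voiced-like signal: 100 Hz fundamental plus two harmonics.
SAMPLE_RATE = 8000
N = 400  # 50 ms of signal, i.e. five periods of a 100 Hz tone
signal = [
    math.sin(2 * math.pi * 100 * t / SAMPLE_RATE)
    + 0.6 * math.sin(2 * math.pi * 200 * t / SAMPLE_RATE)
    + 0.3 * math.sin(2 * math.pi * 300 * t / SAMPLE_RATE)
    for t in range(N)
]
f0, pitch = estimate_pitch(signal, SAMPLE_RATE)
print(f"fundamental: {f0} Hz, pitch: {pitch} s")
```

Note that the analysis window in this sketch spans five pitch periods, so the frequency resolution is 8000 / 400 = 20 Hz per bin; a shorter window would give a coarser spectrum.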
The spectrum can be obtained by carrying out an FFT (Fast Fourier Transform) operation or the like. To obtain the spectrum by the FFT operation, the speech sound generally has to be sampled over a time period longer than that equivalent to one pitch of the speech sound.
The longer the time period over which the speech sound is sampled, the higher the possibility that an abrupt change in the waveform, caused by switching of the speech sound or the like, occurs while the sampling continues. If such an abrupt change occurs during the sampling, the pitch frequency identified in the processing subsequent to the sampling will contain a significant error.
In addition, the length of the pitch of a human voice fluctuates, and this fluctuation may also cause an error in the pitch frequency. That is, when a speech sound including fluctuations is sampled over a time period equivalent to several pitches, the fluctuations are averaged out, and the identified pitch frequency therefore differs from the actual pitch frequency, which includes the fluctuations.
If the speech signal is compressed based on a pitch value whose fluctuations have been averaged out, the speech sounds mechanical and its sound quality is reduced when the speech signal is expanded and played back.
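The averaging effect can be seen in a small numerical sketch. The pitch lengths below are hypothetical values chosen for illustration, and modeling the estimate over several pitches as a simple mean is an assumption, not a description of any specific analyzer.

```python
# Hypothetical lengths (in samples) of five consecutive pitch periods.
pitch_lengths = [78, 82, 80, 85, 75]

# An analysis window spanning all five periods yields a single value,
# modeled here (as an assumption) by the mean pitch length.
averaged_pitch = sum(pitch_lengths) / len(pitch_lengths)

# Deviation of each actual period from the single averaged value.
errors = [length - averaged_pitch for length in pitch_lengths]
max_abs_error = max(abs(e) for e in errors)

print(f"averaged pitch: {averaged_pitch} samples")
print(f"per-period error: {errors}")
print(f"largest error: {max_abs_error} samples")
```

In this sketch the single averaged value is 80 samples, while individual periods deviate from it by up to 5 samples; that per-period information is what the averaging discards.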
The present invention has been devised in view of the above situation, and has as its first object the provision of a pitch wave signal creating apparatus and a pitch wave signal creation method that function effectively as preliminary processing for efficiently coding a speech wave signal including pitch fluctuations.
Terminals for performing digital speech communication, such as cellular phones, have also come into wide use in recent years.
Such terminals sometimes communicate with the speech signal compressed by an LPC (Linear Prediction Coding) method such as CELP (Code Excited Linear Prediction).
When linear prediction coding is used, the speech sound is compressed by coding the vocal tract characteristic (the frequency characteristic of the vocal tract) of the human voice. To play back the speech sound, a table is searched using this code as a key.
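The idea of coding a vocal tract characteristic and searching a table with the code as a key can be sketched as follows. This is a simplified illustration only: the codebook contents, the four-value envelope representation, the squared-distance match and all names are hypothetical and are not taken from CELP or any other real codec.

```python
# Hypothetical table: each entry is a coarse vocal tract characteristic,
# represented here as a short list of spectral envelope values.
CODEBOOK = [
    [1.0, 0.5, 0.2, 0.1],
    [0.2, 1.0, 0.5, 0.1],
    [0.1, 0.3, 1.0, 0.4],
    [0.1, 0.2, 0.4, 1.0],
]


def squared_distance(a, b):
    """Sum of squared differences between two envelopes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(envelope):
    """Return the code: the index of the closest table entry."""
    return min(range(len(CODEBOOK)),
               key=lambda i: squared_distance(envelope, CODEBOOK[i]))


def decode(code):
    """Search the table using the code as the key."""
    return CODEBOOK[code]


measured = [0.15, 0.9, 0.6, 0.2]  # hypothetical measured envelope
code = encode(measured)           # only this small integer is transmitted
restored = decode(code)           # the receiver recovers the table entry
print(code, restored)
```

Only the code is transmitted, so the restored characteristic can be no more accurate than the nearest entry registered in the table.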
When this method is applied to cellular phones and the like, however, sound quality is often reduced if the number of codes is small, making it difficult to recognize the voice of the other party.
To improve sound quality in linear prediction coding, the number of elements of the vocal tract characteristic registered in the table may be increased. Increasing the number of elements, however, considerably increases both the amount of data to be transmitted and the amount of data in the table. The efficiency of compression is therefore compromised, and it becomes difficult to store the table in a terminal that can accommodate only small hardware.
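The trade-off can be made concrete with a rough calculation. The figures below are illustrative assumptions (10 values per table entry, 2 bytes per value, a fixed-length code) and do not describe any actual codec or terminal.

```python
import math


def codebook_cost(num_entries, values_per_entry, bytes_per_value=2):
    """Return (bits per transmitted code, table size in bytes).

    Assumes a fixed-length code and uniformly sized table entries.
    """
    bits_per_code = math.ceil(math.log2(num_entries))
    table_bytes = num_entries * values_per_entry * bytes_per_value
    return bits_per_code, table_bytes


for entries in (256, 4096, 65536):
    bits, size = codebook_cost(entries, values_per_entry=10)
    print(f"{entries:>6} entries: {bits} bits/code, {size} bytes of table")
```

Under these assumptions the code length grows with the logarithm of the number of entries, while the table that the terminal must store grows in direct proportion to it.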
In addition, the actual human vocal tract has a very complicated structure, and the frequency characteristic of the vocal tract fluctuates with time. The pitch of the speech sound therefore has fluctuations, and the characteristic of the vocal tract cannot be accurately determined simply by subjecting the human voice to Fourier transformation. Consequently, if linear prediction coding is carried out using a vocal tract characteristic determined from the result of simply subjecting the human voice to Fourier transformation, sound quality cannot be satisfactorily improved even if the number of elements in the table is increased.
This invention has been devised in view of the above situation, and has as its second object the provision of a speech signal compressing/expanding apparatus and a speech signal compression/expansion method for efficiently compressing data representing a speech sound, or for compressing data representing a speech sound having fluctuations with high sound quality.
In addition, methods for synthesizing a speech sound include the so-called rule synthesis method. The rule synthesis method is a method in which pitch information and spectrum envelope information (the vocal tract characteristic) are determined based on information obtained as a result of morphological analysis of a text and rhythm prediction coding, and a speech sound reading the text is synthesized based on the determination result.
Specifically, as shown in FIG. 8 for example, a text for which a speech sound is to be synthesized is first subjected to morphological analysis (step S101 in FIG. 8), a row of pronunciation symbols showing the pronunciation of the speech sound reading the text is created based on the result of the morphological analysis (step S102), and a row of rhythm symbols showing the rhythm of this speech sound is created (step S103).
Then, the envelope of the spectrum of the speech sound is determined based on the obtained row of pronunciation symbols (step S104), and the characteristic of a filter simulating the characteristic of the vocal tract is determined based on this envelope. Meanwhile, a sound source parameter showing the characteristic of the sound produced by the vocal band is created based on the obtained row of rhythm symbols (step S105), and a sound source signal showing the wave of the sound produced by the vocal band is created based on the sound source parameter (step S106).
Then, this sound source signal is filtered by the filter whose characteristic has been determined as described above (step S107), whereby the speech sound is synthesized.
To synthesize the speech sound, the sound source signal is simulated by switching between an impulse row generated by an impulse row source 1 and white noise generated by a white noise source 2, as shown in FIG. 9. This sound source signal is then filtered by a digital filter 3 simulating the characteristic of the vocal tract to create the speech sound.
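The arrangement of FIG. 9 can be sketched in code as follows. This is only an illustrative model: the text does not specify the form of digital filter 3, so the second-order all-pole filter, its coefficients, the segment format and all function names are assumptions.

```python
import random


def impulse_row(num_samples, pitch_period):
    """Voiced source: one unit pulse every `pitch_period` samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]


def white_noise(num_samples, rng):
    """Unvoiced source: uniformly distributed noise of amplitude up to 1."""
    return [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]


def all_pole_filter(source, coeffs):
    """Assumed filter form: y[n] = x[n] + sum_k coeffs[k] * y[n-1-k]."""
    out = []
    for n, x in enumerate(source):
        y = x
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                y += a * out[n - 1 - k]
        out.append(y)
    return out


def synthesize(segments, coeffs, seed=0):
    """Switch between the two sources per segment, then filter the result.

    Each segment is (voiced, num_samples, pitch_period); pitch_period is
    ignored for unvoiced segments.
    """
    rng = random.Random(seed)
    source = []
    for voiced, num_samples, pitch_period in segments:
        if voiced:
            source += impulse_row(num_samples, pitch_period)
        else:
            source += white_noise(num_samples, rng)
    return all_pole_filter(source, coeffs)


# 160 voiced samples (pitch period 80) followed by 80 unvoiced samples.
# The coefficients give a stable resonance (pole radius about 0.89).
speech = synthesize([(True, 160, 80), (False, 80, None)], coeffs=[1.3, -0.8])
print(len(speech), speech[:3])
```

Every voiced pitch period in this model is excited by an identical pulse, which is a simplification of the sound actually produced by the vocal band.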
However, the actual human vocal band has a complicated structure, which makes it difficult to represent the characteristic of the vocal band by an impulse row. Therefore, the speech sound synthesized by the above-described rule synthesis method tends to sound mechanical and dissimilar to actual human speech.
Also, because the structure of the vocal tract is complicated, it is difficult to accurately predict the spectrum envelope, and hence difficult to represent the characteristic of the vocal tract by a digital filter. This is another cause of the reduced sound quality of the speech sound synthesized by the rule synthesis method.
This invention has been devised in view of the above situation, and has as its third object the provision of a speech synthesizing apparatus, a speech dictionary creating apparatus, a speech synthesis method and a speech dictionary creation method for efficiently synthesizing natural speech sounds.