1. Field of the Invention
The invention relates to a speech duration processing method and apparatus for deciding the speech duration of synthesized speech to obtain good sound quality.
2. Description of the Related Art
Using Chinese as an example, the synthesizing units used in a Chinese speech synthesizing system are generally classified into two types: (1) monosyllables (408 kinds, not including the four tones); and (2) phonemes (including 21 Chinese phonetic consonants and 38 vowels). Regardless of whether monosyllables or phonemes are used as synthesizing units, factors such as the phonemes, tones, phrase construction, locations in phrases, locations in sentences, and the front and rear connected phonemes of the synthesizing units decide the appropriate speech duration of each synthesizing unit, and can have a large effect on the naturalness of the synthesized speech.
A conventional speech duration processing apparatus for a Chinese text-to-speech system has been disclosed in R.O.C. Patent Application No. 80100559, entitled "Speech Duration Processing Apparatus for Text-to-Speech System." FIG. 9 is a block diagram illustrating a speech duration processing apparatus for determining the speech duration according to the phonemes, tones and the locations in the sentence. As shown in FIG. 9, 110 denotes a memory portion for storing different data. 120 denotes a pinyin sentence input portion for inputting pinyin sentences of any length and formed from pinyin markers and tone markers. 130 denotes a syllable inspecting portion for inspecting syllables in the sentence inputted from the pinyin sentence input portion 120 with the use of the tone markers. 150 denotes a syllable-phoneme look-up memory portion for storing the phonemes that compose each of the syllables. 140 denotes a phoneme inspecting portion for inspecting the phonemes in the inputted pinyin sentence with the use of the syllable-phoneme look-up memory portion 150, and for inspecting the location of each phoneme in the sentence. 170 denotes a speech duration numerical data storage portion for storing speech duration numerical data defined according to class of the phoneme, tone of the phoneme, and location of the phoneme in the sentence. 160 denotes a speech duration inspecting portion for calculating a syllable speech duration by using the inspected phoneme designated number, tones of each of the phonemes and locations of each of the phonemes in the sentence as indexing keys to retrieve the speech duration numerical data of each of the phonemes from the speech duration numerical data storage portion 170.
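The table look-up performed by the conventional apparatus can be sketched as follows. This is only an illustration of indexing by phoneme designated number, tone, and location in the sentence; it is not taken from the cited application. The table name DURATION_TABLE, the key layout, the location labels, the phoneme numbers, and every millisecond value are hypothetical placeholders.

```python
# Hypothetical duration table keyed by
# (phoneme designated number, tone, location in sentence).
# All values are placeholders, not data from the cited application.
DURATION_TABLE = {
    (3, 1, "start"): 80,
    (27, 1, "start"): 190,
    (3, 1, "end"): 95,
    (27, 1, "end"): 230,
}

def syllable_duration(phonemes, tone, location):
    """Sum the stored durations (ms) of the phonemes that make up one syllable."""
    return sum(DURATION_TABLE[(p, tone, location)] for p in phonemes)

print(syllable_duration([3, 27], tone=1, location="start"))
```

Because the key contains no phrase information, two occurrences of the same syllable at comparable sentence locations receive the same duration regardless of the phrase they belong to.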
In the aforesaid conventional speech duration processing apparatus, only the phonemes, tones and locations of the phonemes in the sentence are considered. However, whether or not the synthesizing units form phrases, and the effect of their locations in the phrases on the speech duration, should be considered as well. For example, in a three-character phrase, the speech duration of the second character in the phrase is the shortest, followed by that of the first character, and the speech duration of the third character is the longest. In the example , , , ,  forms a three-character phrase. The speech duration generated by the conventional speech duration processing apparatus for the first  character and the second  character is about 339 ms. However, the speech durations for natural language pronunciation, as measured with a sound registering instrument, are 275 and 302 ms, respectively, which is a relatively large difference. Thus, speech durations obtained by considering only the phonemes, tones and the locations of the phonemes in the sentence are inaccurate and will lower the quality of the synthesized speech.
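The size of the discrepancy quoted above can be worked out directly. The short sketch below uses only the three figures given in the passage (339 ms predicted; 275 ms and 302 ms measured); the function name overestimate is a hypothetical helper introduced for illustration.

```python
# Figures quoted in the passage above (milliseconds).
predicted_ms = 339          # conventional apparatus, same value for both characters
measured_ms = [275, 302]    # natural pronunciation, first and second character

def overestimate(predicted, measured):
    """Return (absolute error in ms, error relative to the measured value)."""
    diff = predicted - measured
    return diff, diff / measured

for m in measured_ms:
    diff, rel = overestimate(predicted_ms, m)
    print(f"measured {m} ms: predicted value is {diff} ms longer ({rel:.1%})")
```

On these figures the conventional value is 64 ms and 37 ms too long, roughly 23% and 12% of the measured durations.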
Therefore, the main object of the present invention is to provide a speech duration processing method and apparatus for a Chinese text-to-speech system capable of overcoming the aforesaid drawback.
According to a first aspect of the invention, a speech duration processing method for a Chinese text-to-speech system using Chinese phonemes as a basic processing unit comprises:
constructing a dictionary for storing Chinese vocabulary and corresponding information, such as phonetic markers, parts of speech, expansion syntax, etc.;
constructing a syllable-phoneme look-up portion for storing information, such as phoneme designated numbers (including consonant designated numbers and vowel designated numbers) corresponding to each syllable for all of the Chinese syllables, etc.;
constructing a basic speech duration storage portion for storing basic speech duration information classified according to phonemes;
constructing a speech duration parameter storage portion for storing speech duration parameters according to tones of the syllables to which each of the phonemes belong, the phrase construction and the locations in the phrases, the locations in the sentence, and the class of the connected phonemes;
inspecting positions of the syllables of each vocabulary in an input sentence of any length by comparing with the vocabulary stored in the dictionary;
generating a phonetic representation of each syllable of each inspected vocabulary according to the phonetic markers stored in the dictionary;
inspecting the part of speech and the expansion syntax of each inspected vocabulary with reference to the dictionary;
combining the vocabulary in the sentence into phrases according to the expansion syntax and relationship of the parts of speech of adjacent ones of the vocabulary;
inspecting each syllable in the generated text phonetic markers with the use of tone markers;
inspecting the phoneme formation of each inspected syllable with reference to the information in the syllable-phoneme look-up portion;
retrieving the speech duration of each inspected phoneme from the basic speech duration storage portion; and
calculating the speech duration of each of the inspected phonemes that form each of the inspected syllables from the basic speech duration and the parameters associated with the tones, the phrase construction, the locations in the phrases, the locations in the sentence, and the class of the front and rear adjacent phonemes of the inspected phonemes, and tallying the speech duration of the inspected phonemes to obtain the speech duration of each of the inspected syllables.
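The final calculating step of the first aspect can be sketched as below. The text does not state how the basic speech duration and the parameters are combined, so the sketch assumes each parameter is a multiplicative scaling factor; that choice, all table names, all keys, and all numeric values are hypothetical. The phrase-position factors are merely chosen so that the second character of a three-character phrase is shortest and the third is longest, as described in the related-art discussion.

```python
from dataclasses import dataclass

# Placeholder tables; every key and value here is hypothetical.
BASIC_MS = {"b": 60.0, "a": 200.0}                       # basic duration per phoneme
TONE = {1: 1.0, 4: 0.9}                                  # tone of the owning syllable
PHRASE_POS = {(3, 1): 0.95, (3, 2): 0.85, (3, 3): 1.1}   # (phrase length, position)
SENTENCE_POS = {"start": 1.0, "middle": 0.95, "end": 1.15}
NEIGHBOR = {("none", "vowel"): 1.0, ("consonant", "none"): 1.05}  # (front, rear) class

@dataclass
class Phoneme:
    name: str
    front_class: str   # class of the preceding phoneme
    rear_class: str    # class of the following phoneme

def phoneme_duration(p, tone, phrase_len, phrase_pos, sentence_pos):
    """Scale the basic duration by each parameter (multiplication is an assumption)."""
    return (BASIC_MS[p.name]
            * TONE[tone]
            * PHRASE_POS[(phrase_len, phrase_pos)]
            * SENTENCE_POS[sentence_pos]
            * NEIGHBOR[(p.front_class, p.rear_class)])

def syllable_duration(phonemes, tone, phrase_len, phrase_pos, sentence_pos):
    """Tally the phoneme durations to obtain the syllable duration."""
    return sum(phoneme_duration(p, tone, phrase_len, phrase_pos, sentence_pos)
               for p in phonemes)

# A syllable made of one consonant followed by one vowel.
ba = [Phoneme("b", "none", "vowel"), Phoneme("a", "consonant", "none")]
print(syllable_duration(ba, tone=1, phrase_len=3, phrase_pos=2, sentence_pos="middle"))
```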
According to a second aspect of the invention, a speech duration processing method for a Chinese text-to-speech system using Chinese syllables as a basic processing unit comprises:
constructing a dictionary for storing Chinese vocabulary and corresponding information, such as phonetic markers, parts of speech, expansion syntax, etc.;
constructing a basic speech duration storage portion for storing basic speech duration information classified according to the syllables;
constructing a speech duration parameter storage portion for storing speech duration parameters according to tones of each of the syllables, the phrase construction and the locations in the phrases, the locations in the sentence, and the class of the connected syllables;
inspecting positions of the syllables of each vocabulary in an input sentence of any length by comparing with the vocabulary stored in the dictionary;
generating a phonetic representation of each syllable of each inspected vocabulary according to the phonetic markers stored in the dictionary;
inspecting the part of speech and the expansion syntax of each inspected vocabulary with reference to the dictionary;
combining the vocabulary in the sentence into phrases according to the expansion syntax and relationship of the parts of speech of adjacent ones of the vocabulary;
inspecting each syllable in the generated text phonetic markers with the use of tone markers;
retrieving the speech duration of each inspected syllable from the basic speech duration storage portion; and
calculating the speech duration of each of the inspected syllables from the basic speech duration and the parameters associated with the tones, the phrase construction, the locations in the phrases, the locations in the sentence, and the class of the front and rear adjacent syllables of the inspected syllables.
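When syllables are the basic processing unit, no phoneme tally is needed and the calculation applies to each syllable of a phrase directly. The sketch below processes one phrase at a time. As before, multiplicative factors are an assumption, the effect of adjacent syllables is reduced to a single optional factor for brevity, and all names and values are hypothetical.

```python
# Placeholder tables; every key and value here is hypothetical.
BASIC_MS = {"zhong": 300.0, "hua": 290.0}                 # basic duration per syllable
TONE = {1: 1.0, 2: 1.02, 4: 0.92}
PHRASE_POS = {(1, 1): 1.0, (2, 1): 0.95, (2, 2): 1.05}    # (phrase length, position)
SENTENCE_POS = {"start": 1.0, "middle": 0.97, "end": 1.12}

def syllable_duration(syllable, tone, phrase_len, phrase_pos, sentence_pos,
                      neighbor_factor=1.0):
    """Scale the basic syllable duration by each parameter (an assumed combination)."""
    return (BASIC_MS[syllable]
            * TONE[tone]
            * PHRASE_POS[(phrase_len, phrase_pos)]
            * SENTENCE_POS[sentence_pos]
            * neighbor_factor)

def phrase_durations(phrase, sentence_pos):
    """phrase is a list of (syllable, tone); positions in the phrase are 1-based."""
    n = len(phrase)
    return [syllable_duration(s, t, n, i, sentence_pos)
            for i, (s, t) in enumerate(phrase, start=1)]

print(phrase_durations([("zhong", 1), ("hua", 2)], "start"))
```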
According to a third aspect of the invention, a speech duration processing apparatus for a Chinese text-to-speech system using Chinese phonemes as a basic processing unit comprises:
a dictionary for storing Chinese vocabulary and corresponding information, such as phonetic markers, parts of speech, expansion syntax, etc.;
a syllable-phoneme look-up portion for storing information, such as phoneme designated numbers (including consonant designated numbers and vowel designated numbers) corresponding to each syllable for all of the Chinese syllables, etc.;
a basic speech duration storage portion for storing basic speech duration information classified according to the phonemes;
a speech duration parameter storage portion for storing speech duration parameters according to tones of the syllables to which each of the phonemes belong, the phrase construction and the locations in the phrases, the locations in the sentence, and the class of the connected phonemes;
a vocabulary inspecting portion for inspecting positions of the syllables of each vocabulary in an input sentence of any length by comparing with the vocabulary stored in the dictionary;
a phonetic marker generating portion for generating a phonetic representation of each syllable of each inspected vocabulary according to the phonetic markers stored in the dictionary;
a part of speech/expansion syntax inspecting portion for inspecting the part of speech and the expansion syntax of each inspected vocabulary with reference to the dictionary;
a phrase expansion portion for combining the vocabulary in the sentence into phrases according to the expansion syntax and relationship of the parts of speech of adjacent ones of the vocabulary;
a tone/syllable inspecting portion for inspecting each syllable in the generated text phonetic markers with the use of tone markers;
a phoneme inspecting portion for inspecting the phoneme formation of each of the inspected syllables with reference to the information in the syllable-phoneme look-up portion;
a basic speech duration deciding portion for retrieving the speech duration of each of the inspected phonemes from the basic speech duration storage portion; and
a syllable speech duration calculating portion for calculating the speech duration of each of the inspected phonemes that form each of the inspected syllables from the basic speech duration and the parameters associated with the tones, the phrase constructions, the locations in the phrases, the locations in the sentence, and the class of the front and rear adjacent phonemes of the inspected phonemes, and for tallying the speech duration of the inspected phonemes to obtain the speech duration of each of the inspected syllables.
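One way to picture the syllable-phoneme look-up portion used by the phoneme inspecting portion is as a mapping from each syllable to a consonant designated number and a vowel designated number. The sketch below is illustrative only: the numbering is hypothetical, and storing no consonant number for a syllable without an initial consonant is an assumption.

```python
from typing import NamedTuple, Optional

class PhonemeNumbers(NamedTuple):
    consonant: Optional[int]   # None when the syllable has no initial consonant (assumed)
    vowel: int

# Hypothetical numbering; a real table would cover all Chinese syllables.
SYLLABLE_PHONEMES = {
    "ba": PhonemeNumbers(consonant=1, vowel=1),
    "ma": PhonemeNumbers(consonant=3, vowel=1),
    "a": PhonemeNumbers(consonant=None, vowel=1),
}

def phonemes_of(syllable):
    """Return the phoneme formation of a syllable, consonant first when present."""
    entry = SYLLABLE_PHONEMES[syllable]
    formation = []
    if entry.consonant is not None:
        formation.append(("consonant", entry.consonant))
    formation.append(("vowel", entry.vowel))
    return formation

print(phonemes_of("ma"))
```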
According to a fourth aspect of the invention, a speech duration processing apparatus for a Chinese text-to-speech system using Chinese syllables as a basic processing unit comprises:
a dictionary for storing Chinese vocabulary and corresponding information, such as phonetic markers, parts of speech, expansion syntax, etc.;
a basic speech duration storage portion for storing basic speech duration information classified according to the syllables;
a speech duration parameter storage portion for storing speech duration parameters according to tones of each of the syllables, the phrase construction and the locations in the phrases, the locations in the sentence, and the class of the connected syllables;
a vocabulary inspecting portion for inspecting positions of the syllables of each vocabulary in an input sentence of any length by comparing with the vocabulary stored in the dictionary;
a phonetic marker generating portion for generating a phonetic representation of each syllable of each inspected vocabulary according to the phonetic markers stored in the dictionary;
a part of speech/expansion syntax inspecting portion for inspecting the part of speech and the expansion syntax of each inspected vocabulary with reference to the dictionary;
a phrase expansion portion for combining the vocabulary in the sentence into phrases according to the expansion syntax and relationship of the parts of speech of adjacent ones of the vocabulary;
a tone/syllable inspecting portion for inspecting each syllable in the generated text phonetic markers with the use of tone markers;
a basic speech duration deciding portion for retrieving the speech duration of each inspected syllable from the basic speech duration storage portion; and
a syllable speech duration calculating portion for calculating the speech duration of each of the inspected syllables from the basic speech duration and the parameters associated with the tones, the phrase constructions, the locations in the phrases, the locations in the sentence, and the class of the front and rear adjacent syllables of the inspected syllables.
According to the data construction and processing steps of the speech duration processing method of the first aspect of the invention, a Chinese sentence of any length waiting to be speech synthesized initially undergoes a vocabulary inspecting step, where the positions of the syllables of each vocabulary in the sentence are inspected by comparing with the vocabulary stored in a previously constructed dictionary. Then, each inspected vocabulary undergoes a phonetic marker generating step to generate a phonetic representation of each syllable according to the phonetic markers stored in the dictionary. Subsequently, via a part of speech/expansion syntax inspecting step, the part of speech and the expansion syntax of each vocabulary are inspected with reference to the dictionary. Further, in a phrase expansion step, adjacent ones of the vocabulary in the sentence are combined into phrases according to the expansion syntax and relationship of the parts of speech. Thereafter, via a tone/syllable inspecting step, each syllable in the generated phonetic markers of the sentence is inspected with the use of tone markers. Then, in a phoneme inspecting step, the phoneme formation of each syllable is inspected with reference to a previously constructed syllable-phoneme look-up portion. Subsequently, via a basic speech duration deciding step, the basic speech duration of each phoneme is retrieved from a previously constructed basic speech duration storage portion. Finally, in a syllable speech duration calculating step, the speech duration of each of the phonemes that form each of the syllables in the sentence is calculated from the basic speech duration and the parameters associated with the tones, the phrase constructions, the locations in the phrases, the locations in the sentence, and the class of the front and rear adjacent phonemes of the phoneme formation, and the speech durations of the phonemes that comprise each syllable are tallied to obtain the syllable speech duration. As a result, a syllable speech duration that complies with natural speech can be obtained for the Chinese sentence waiting to be speech synthesized.
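The vocabulary inspecting step compares the input sentence with the dictionary, but the text does not say which matching strategy is used. The sketch below assumes forward longest matching, a common choice for dictionary-based segmentation, and falls back to a single character when nothing matches. The dictionary entries and the function name inspect_vocabulary are hypothetical.

```python
# Hypothetical dictionary entries; a real dictionary would also hold phonetic
# markers, parts of speech and expansion syntax for each entry.
DICTIONARY = {"我们", "喜欢", "音乐", "我"}

def inspect_vocabulary(sentence, dictionary=DICTIONARY):
    """Return (start index, word) pairs found by forward longest matching."""
    max_len = max(len(word) for word in dictionary)
    found, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                found.append((i, candidate))
                i += length
                break
    return found

print(inspect_vocabulary("我们喜欢音乐"))
```

The start index of each word gives the position of its syllables in the sentence, since each Chinese character corresponds to one syllable.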
According to the data construction and processing steps of the speech duration processing method of the second aspect of the invention, a Chinese sentence of any length waiting to be speech synthesized initially undergoes a vocabulary inspecting step, where the positions of the syllables of each vocabulary in the sentence are inspected by comparing with the vocabulary stored in a previously constructed dictionary. Then, each inspected vocabulary undergoes a phonetic marker generating step to generate a phonetic representation of each syllable according to the phonetic markers stored in the dictionary. Subsequently, via a part of speech/expansion syntax inspecting step, the part of speech and the expansion syntax of each vocabulary are inspected with reference to the dictionary. Further, in a phrase expansion step, adjacent ones of the vocabulary in the sentence are combined into phrases according to the expansion syntax and relationship of the parts of speech. Thereafter, via a tone/syllable inspecting step, each syllable in the generated phonetic markers of the sentence is inspected with the use of tone markers. Then, in a basic speech duration deciding step, the basic speech duration of each syllable is retrieved from a previously constructed basic speech duration storage portion. Finally, in a syllable speech duration calculating step, the syllable speech duration of each of the syllables in the sentence is calculated from the basic speech duration and the parameters associated with the tones, the phrase constructions, the locations in the phrases, the locations in the sentence, and the class of the front and rear adjacent syllables. As a result, a syllable speech duration that complies with natural speech can be obtained.
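The phrase expansion step can be pictured as merging adjacent vocabulary whose parts of speech are related. The text does not define the form of the expansion syntax, so the sketch below stands in for it with a small rule table mapping a pair of adjacent parts of speech to the part of speech of the merged phrase. The rule table, the part-of-speech labels, and the function name expand_phrases are all hypothetical.

```python
# Hypothetical rule table:
# (POS of the unit on the left, POS of the word on the right) -> POS of the merged phrase.
EXPANSION_RULES = {
    ("ADV", "ADJ"): "ADJ",
    ("ADJ", "NOUN"): "NOUN",
}

def expand_phrases(words):
    """words is a list of (text, pos); merge neighbours that match a rule, left to right."""
    phrases = []
    for text, pos in words:
        if phrases and (phrases[-1][1], pos) in EXPANSION_RULES:
            prev_text, prev_pos = phrases.pop()
            phrases.append((prev_text + text, EXPANSION_RULES[(prev_pos, pos)]))
        else:
            phrases.append((text, pos))
    return phrases

print(expand_phrases([("我", "PRON"), ("很", "ADV"), ("高兴", "ADJ")]))
```

Once phrases are formed, each syllable's phrase length and position within the phrase are known and can serve as keys into the speech duration parameters.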
According to the construction of the speech duration processing apparatus of the third aspect of the invention, after a Chinese sentence of any length is inputted into the apparatus, a vocabulary inspecting portion inspects the positions of the syllables of each vocabulary in the sentence by comparing with the vocabulary stored in a previously constructed dictionary. Then, a phonetic marker generating portion generates a phonetic representation of each syllable of each inspected vocabulary according to the phonetic markers stored in the dictionary. Subsequently, via a part of speech/expansion syntax inspecting portion, the part of speech and the expansion syntax of each vocabulary are inspected with reference to the dictionary. Further, via a phrase expansion portion, adjacent ones of the vocabulary in the sentence are combined into phrases according to the expansion syntax and relationship of the parts of speech. Thereafter, via a tone/syllable inspecting portion, each syllable in the generated phonetic markers of the sentence is inspected with the use of tone markers. Then, via a phoneme inspecting portion, the phoneme formation of each syllable is inspected with reference to a previously constructed syllable-phoneme look-up portion. Subsequently, via a basic speech duration deciding portion, the basic speech duration of each phoneme is retrieved from a previously constructed basic speech duration storage portion. Finally, via a syllable speech duration calculating portion, the speech duration of each of the phonemes that form each of the syllables in the sentence is calculated from the basic speech duration and the parameters associated with the tones, the phrase constructions, the locations in the phrases, the locations in the sentence, and the class of the front and rear adjacent phonemes of the phoneme formation, and the speech durations of the phonemes that comprise each syllable are tallied to obtain the syllable speech duration. The syllable speech duration is outputted for use.
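The tone/syllable inspecting operation relies on tone markers to delimit syllables in the generated phonetic markers. The sketch below assumes a romanized representation in which each syllable is written in lowercase ASCII letters and ends with a tone digit from 1 to 5; the text does not specify this format, so it is an assumption, as is the function name inspect_syllables.

```python
import re

# Assumed format: lowercase letters followed by one tone digit (1-5) per syllable.
SYLLABLE = re.compile(r"([a-z]+)([1-5])")

def inspect_syllables(marked):
    """Split a tone-marked string into (syllable, tone) pairs."""
    return [(syllable, int(tone)) for syllable, tone in SYLLABLE.findall(marked)]

print(inspect_syllables("zhong1hua2"))
```

Each tone digit closes a syllable, so no separator between syllables is required, and the extracted tone can be used directly as a key into the speech duration parameters.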
According to the construction of the speech duration processing apparatus of the fourth aspect of the invention, after a Chinese sentence of any length is inputted into the apparatus, a vocabulary inspecting portion inspects the positions of the syllables of each vocabulary in the sentence by comparing with the vocabulary stored in a previously constructed dictionary. Then, a phonetic marker generating portion generates a phonetic representation of each syllable of each inspected vocabulary according to the phonetic markers stored in the dictionary. Subsequently, via a part of speech/expansion syntax inspecting portion, the part of speech and the expansion syntax of each vocabulary are inspected with reference to the dictionary. Further, via a phrase expansion portion, adjacent ones of the vocabulary in the sentence are combined into phrases according to the expansion syntax and relationship of the parts of speech. Thereafter, via a tone/syllable inspecting portion, each syllable in the generated phonetic markers of the sentence is inspected with the use of tone markers. Then, via a basic speech duration deciding portion, the basic speech duration of each syllable is retrieved from a previously constructed basic speech duration storage portion. Finally, via a syllable speech duration calculating portion, the syllable speech duration of each of the syllables in the sentence is calculated from the basic speech duration and the parameters associated with the tones, the phrase constructions, the locations in the phrases, the locations in the sentence, and the class of the front and rear adjacent syllables. The syllable speech duration is outputted for use.