The present invention relates to a speech synthesis system in which arbitrary input texts, input phonetic characters, or the like are converted into synthesized speech to be output therefrom.
In recent years, synthesized speech has been widely used in electric home appliances and various electronic appliances such as vehicle navigation systems and mobile phones, in which various speech messages such as conditions of the appliances, instructions for operation, and response messages, are voiced by synthesized speeches. In addition, synthesized speeches have begun to be employed in personal computers or the like for such purposes as operating the apparatuses by way of a voice interface and confirming the result of text recognition by optical character recognition (OCR).
One technique for performing such speech synthesis is to store speech data in a system in advance and play back the stored data when required. This technique is widely used in cases where a limited number of messages are to be vocalized. However, when a system according to this technique is applied to generate arbitrary speech, it requires a large-capacity storage system, which inevitably makes the system costly and thus limits its application.
Another technique, used in systems less expensive than the above, generates speech data from input texts or phonetic character strings using a predetermined speech data generating rule. With this rule-based technique, however, it is difficult to generate natural sounding speech with various kinds of expression.
In view of these problems, Japanese Unexamined Patent Publication No. 8-87297, for example, discloses a speech synthesis system that employs both the speech synthesis by retrieving speech data from a database and the speech synthesis by using a speech sound generating rule. More specifically, this type of apparatus has, as shown in FIG. 13, a text input section 910, a speech information database 920 storing speech parameters and corresponding speech content data, the speech parameters being obtained by analyzing actual speech and extracting data therefrom, a speech data retrieving section 930 retrieving data from the speech information database 920, a speech sound generating section 940 generating a speech waveform, a speech sound generating rule 950 including a rule for generating a speech parameter from the input text or the input phonetic character string, and an electroacoustic transducer 960. This speech synthesis system operates in the following manner. If a text or a phonetic character string is inputted into the text input section 910, the speech data retrieving section 930 retrieves from the speech information database 920 speech data having speech content that matches the input text or the input phonetic character string. If a matching speech content is present in the database, corresponding speech data is transmitted to the speech sound generating section 940. If the matching speech content is absent, the speech data retrieving section 930 transmits the input text or the input phonetic character string as it is to the speech sound generating section 940. When the speech sound generating section 940 receives the retrieved speech data, the speech sound generating section 940 generates a synthesized speech based on the retrieved speech data. 
Alternatively, when the speech sound generating section 940 receives the input text or the input phonetic character string, the speech sound generating section 940 generates speech parameters based on the input text or input phonetic character string and the speech sound generating rule 950, and thereafter generates a synthesized speech.
By using the speech data retrieval and the speech sound generating rule as described above, an arbitrary input text can be converted into a synthesized speech to be outputted, and for a limited portion of the speech (where the retrieval can find a successful match), a natural sounding speech can be obtained.
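The prior art control flow described above can be illustrated with a minimal Python sketch. It is not taken from the cited publication: the function names, the dictionary standing in for the speech information database 920, and the flat 120 Hz rule standing in for the speech sound generating rule 950 are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def prior_art_synthesize(
    phonetic: str,
    database: Dict[str, List[float]],
    rule: Callable[[str], List[float]],
) -> Tuple[str, List[float]]:
    """Return (source, speech parameters) for one phonetic character string."""
    if phonetic in database:           # only an exact match counts as a hit
        return ("database", database[phonetic])
    return ("rule", rule(phonetic))    # otherwise fall back to the generating rule


def flat_rule(phonetic: str) -> List[float]:
    # Hypothetical generating rule: one flat 120 Hz pitch value per character.
    return [120.0] * len(phonetic)
```

Because the lookup is all-or-nothing, a string that differs from a stored entry by a single segment receives entirely rule-generated parameters, which is the source of the quality gap discussed next.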
One of the drawbacks of the above-described prior art speech synthesis system is that there is a large difference in sound quality between a synthesized speech for which the search has found a successful match and one for which it has not, that is, between a case where speech content data corresponding to the input text or the like is present in the speech information database and a case where the corresponding speech content data is absent. In addition, when speeches having such different sound qualities are concatenated, the resulting synthesized speech becomes even more unnatural. Further, the retrieval from the speech information database 920 is performed by simply detecting the presence or absence of matching between the input phonetic character string and the stored speech content data. Therefore, when matching speech content data is present in the database, the speech synthesis is performed based on the retrieved data regardless of other factors such as the construction of the sentence, which also leads to unnatural synthesized speech.
Specifically, assume that the system is required to synthesize a sentence in Japanese which is transcribed in the Roman alphabet as 'Osaka ni sunde iru watashi wa Matsushita desu' and which means 'I, who live in Osaka, am Matsushita.' In this case, if the proper noun "Matsushita" is absent in the database, the corresponding portion of the speech tends to become a mechanical sounding synthesized speech. Also, when the speech content data corresponding to the clause "Osaka ni sunde iru", which is stored as speech data of the end of a sentence, is used to construct the required sentence, the resulting speech tends to become an unnatural sounding synthesized speech in which two separate sentences, 'Osaka ni sunde iru' (meaning 'I live in Osaka') and 'watashi wa Matsushita desu' (meaning 'I am Matsushita'), are unnaturally concatenated.
In view of the foregoing and other drawbacks of prior art, it is an object of the present invention to provide a speech synthesis system capable of generating natural sounding synthesized speeches from arbitrary input texts, particularly a speech synthesis system capable of generating natural sounding synthesized speech having a good sound quality regardless of whether or not the speech information (prosodic information) database contains speech content data that matches the input text.
This and other objects are accomplished, in a first aspect of the present invention, by the provision of a speech synthesis system for generating a synthesized speech based on input data representing a speech to be synthesized, the system comprising:
a database storing prosodic data for use in synthesizing speech, the prosodic data corresponding to key data being used as a retrieval key;
means for retrieving the prosodic data according to a degree of matching between the input data and the key data;
means for modifying the prosodic data retrieved by the means for retrieving based on the input data, the degree of matching between the input data and the key data, and a predetermined modifying rule; and
means for synthesizing a synthesized speech based on the input data and the prosodic data modified by the means for modifying.
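The retrieving and modifying means of the first aspect may be pictured with the following Python sketch. It rests on assumptions that the specification does not state: the key data are reduced to a single phonetic string, the degree of matching is approximated by the standard library's `difflib.SequenceMatcher` ratio, and the modifying rule simply pulls a retrieved pitch contour toward a neutral value as the match weakens. The names `Entry`, `retrieve`, and `modify` are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Tuple


@dataclass
class Entry:
    key: str          # key data (here: a phonetic character string)
    f0: List[float]   # prosodic data (here: a pitch contour in Hz)


def degree_of_matching(a: str, b: str) -> float:
    # Assumption: plain string similarity in [0, 1] stands in for the degree of matching.
    return SequenceMatcher(None, a, b).ratio()


def retrieve(input_key: str, db: List[Entry]) -> Tuple[Entry, float]:
    """Return the best-matching entry and its degree of matching (not only exact hits)."""
    best = max(db, key=lambda e: degree_of_matching(input_key, e.key))
    return best, degree_of_matching(input_key, best.key)


def modify(f0: List[float], degree: float, neutral: float = 120.0) -> List[float]:
    # Hypothetical modifying rule: the weaker the match, the closer to a neutral pitch.
    return [degree * v + (1.0 - degree) * neutral for v in f0]
```

With an exact match the degree is 1.0 and the retrieved contour passes through unchanged; with a partial match the contour is still used, but attenuated, rather than being discarded in favor of rule-generated data.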
Second to sixth aspects of the invention are as follows. The input data and the key data may include a phonetic character string representing a phonetic attribute of the speech to be synthesized, and may further include linguistic data representing a linguistic attribute of the speech to be synthesized. The phonetic character string may include data substantially indicating at least one of a phonological segment string of the speech to be synthesized, an accent position in the speech to be synthesized, and either the presence or absence of a pause or the length of a pause in the speech to be synthesized. Further, the linguistic data may include at least one of syntactic data and semantic data of the speech to be synthesized.
In addition, the speech synthesis system may further comprise a language processing means parsing a text data inputted in the speech synthesis system and producing the phonetic character string and the linguistic data.
By employing the above configurations of the invention, even where the database does not contain prosodic data whose key data exactly match the input data, a speech synthesis system can perform speech synthesis by using similar prosodic data, achieving a reasonably appropriate, smooth, and natural sounding speech based on arbitrary input data. Alternatively, the system can reduce a required storage capacity of the database without causing degradation in naturalness of the synthesized speech. Furthermore, where similar prosodic data are used as mentioned above, the prosodic data are modified according to a degree of similarity thereof, and therefore, more appropriate synthesized speech can be produced.
Seventh to 15th aspects of the invention are as follows. In accordance with a seventh aspect of the invention, there is provided a speech synthesis system according to the first aspect of the invention, wherein each of the input data and the key data substantially includes a phonological segment category string representing a phonological segment category to which a phonological segment in the speech to be synthesized belongs.
Further, a speech synthesis system according to the invention may further comprise means for converting data into the phonological segment category string, the data being at least one of data corresponding to the input data inputted to the speech synthesis system and data corresponding to the key data stored in the database.
The phonological segment category may be such that phonological segments are categorized by using at least one of a manner of articulation thereof, a place of articulation thereof, and a duration thereof.
The phonological segment category may also be such that prosodic patterns are grouped by using a statistical method such as a multivariate analysis or the like, and that the phonological segments are grouped so as to best reflect the grouped prosodic patterns.
The phonological segment category may also be such that the phonological segments are grouped according to a distance between the phonological segments, the distance being determined based on a confusion matrix by using a statistical method such as a multivariate analysis.
The phonological segment category may also be such that the phonological segments are grouped according to a similarity of a physical characteristic between the phonological segments, such as a fundamental frequency thereof, an intensity thereof, a duration thereof, and a spectrum thereof.
By employing the above-described configurations of the invention, when the phonemes do not match but the phonological segment categories match each other in the retrieval of prosodic data, an appropriate and natural sounding speech can be produced in most cases by utilizing the prosodic data of non-matching phonemes.
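As an illustration of matching on phonological segment categories rather than on phonemes, the following Python sketch converts romanized phoneme strings into category strings and compares them at both levels. The category table (grouping roughly by manner of articulation and voicing) and the one-letter-per-phoneme encoding are assumptions chosen for brevity, not the categorization used in the embodiments.

```python
# Hypothetical category table: one letter per phoneme, grouped by manner of articulation.
CATEGORY = {
    **dict.fromkeys("ptk", "P"),    # voiceless plosives
    **dict.fromkeys("bdg", "B"),    # voiced plosives
    **dict.fromkeys("mn", "N"),     # nasals
    **dict.fromkeys("szh", "F"),    # fricatives
    **dict.fromkeys("aiueo", "V"),  # vowels
}


def to_category_string(phonemes: str) -> str:
    """Map each phoneme to its category; unknown symbols become '?'."""
    return "".join(CATEGORY.get(p, "?") for p in phonemes)


def match_level(input_phonemes: str, key_phonemes: str) -> str:
    """Report the finest level at which the two strings match."""
    if input_phonemes == key_phonemes:
        return "phoneme"
    if to_category_string(input_phonemes) == to_category_string(key_phonemes):
        return "category"
    return "none"
```

Under this table, "kata" and "tapa" share the category string "PVPV", so prosodic data stored for one could be reused for the other even though the phonemes differ.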
In accordance with a 16th aspect of the invention, there is provided a speech synthesis system according to the first aspect of the invention, wherein the prosodic data stored in the database includes prosodic feature data extracted from an identical actual human voice.
In accordance with a 17th aspect of the invention, there is provided a speech synthesis system according to the 16th aspect of the invention, wherein the prosodic feature data include at least one of:
a fundamental frequency pattern representing a variation of a fundamental frequency with respect to time;
a voice intensity pattern representing a variation of a voice intensity with respect to time;
a phonological segment duration pattern representing a duration of a phonological segment; and
a pause data representing one of the presence or absence of a pause and the length of a pause.
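One possible in-memory layout for the prosodic feature data listed above is sketched below in Python. The class name, field names, and units (Hz, dB, milliseconds) are assumptions for illustration; the specification does not prescribe a storage format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProsodicFeatures:
    f0_hz: List[float] = field(default_factory=list)         # fundamental frequency over time
    intensity_db: List[float] = field(default_factory=list)  # voice intensity over time
    durations_ms: List[float] = field(default_factory=list)  # one duration per phonological segment
    pause_ms: Optional[float] = None                          # None means no pause; otherwise its length

    def has_pause(self) -> bool:
        return self.pause_ms is not None
```

Keeping the four patterns in one record reflects the idea that they are extracted from the same actual utterance, while the "at least one of" wording is accommodated by letting any field stay empty.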
In accordance with an 18th aspect of the invention, there is provided a speech synthesis system according to the first aspect of the invention, wherein the prosodic data are stored in the database such that each prosodic data forms a prosody controlling unit.
In accordance with a 19th aspect of the invention, there is provided a speech synthesis system according to the 18th aspect of the invention, wherein the prosody controlling unit comprises one of:
an accent phrase;
a phrase comprising one or more accent phrases;
a bunsetsu;
a phrase comprising one or more bunsetsus;
a word;
a phrase comprising one or more words;
a stress phrase; and
a phrase comprising one or more stress phrases.
By employing the above-described configuration of the invention, a system according to the invention can easily achieve an appropriate and natural sounding synthesized speech.
In accordance with a 20th aspect of the invention, there is provided a speech synthesis system according to the first aspect of the invention, wherein:
each of the input data and the key data comprises a plurality of types of speech indices each being a factor in determining a speech to be synthesized; and
the degree of matching between the input data and the key data is such that in each type of the speech indices, a degree of matching between the input data and the key data is weighted, and the weighted data are combined together.
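The weighted combination of the 20th aspect can be written, under one simple reading, as a normalized weighted sum of per-index degrees of matching. The Python sketch below assumes that each per-index degree already lies in [0, 1] and that the combination is linear; the index names and weight values in the usage are hypothetical.

```python
from typing import Dict


def weighted_degree(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-index degrees of matching into one value in [0, 1].

    scores[k]  : degree of matching for speech index k (assumed to be in [0, 1])
    weights[k] : non-negative importance of speech index k
    """
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return sum(w * scores[k] for k, w in weights.items()) / total
```

For example, a candidate whose phonological segment string matches fully but whose accent position does not, with weights 3 and 1 respectively, scores 0.75.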
In accordance with a 21st aspect of the invention, there is provided a speech synthesis system according to the 20th aspect of the invention, wherein the speech indices include data substantially indicating at least one of a phonological segment string of the speech to be synthesized, an accent position in the speech to be synthesized, linguistic data representing a linguistic attribute of the speech to be synthesized, and one of the length of a pause and the presence or absence of a pause in the speech to be synthesized.
In accordance with a 22nd aspect of the invention, there is provided a speech synthesis system according to the 21st aspect of the invention, wherein:
the speech indices include a data substantially indicating a phonological segment string of the speech to be synthesized; and
the degree of matching between the speech indices in the input data and the speech indices in the key data includes a degree of similarity of acoustic feature data between phonological segments.
In accordance with a 23rd aspect of the invention, there is provided a speech synthesis system according to the 20th aspect of the invention, wherein the speech indices substantially include a phonological segment category string representing a phonological segment category to which a phonological segment in the speech to be synthesized belongs.
In accordance with a 24th aspect of the invention, there is provided a speech synthesis system according to the 23rd aspect of the invention, wherein the degree of matching between the speech indices in the input data and the speech indices in the key data includes a degree of similarity of the phonological segment category between the phonological segments.
By employing the above configurations of the invention, the retrieving and modifying of prosodic data can be easily performed in an appropriate manner.
In accordance with a 25th aspect of the invention, there is provided a speech synthesis system according to the 20th aspect of the invention, wherein the prosodic data includes a plurality of types of prosodic feature data characterizing the speech to be synthesized.
In accordance with a 26th aspect of the invention, there is provided a speech synthesis system according to the 25th aspect of the invention, wherein the database stores the plurality of types of prosodic feature data in such a manner that the plurality of types of prosodic feature data constitute a set of prosodic feature data.
In accordance with a 27th aspect of the invention, there is provided a speech synthesis system according to the 26th aspect of the invention, wherein the plurality of types of prosodic feature data are extracted from an identical actual human voice.
In accordance with a 28th aspect of the invention, there is provided a speech synthesis system according to the 25th aspect of the invention, wherein the prosodic feature data includes at least one of:
a fundamental frequency pattern representing a variation of a fundamental frequency with respect to time;
a voice intensity pattern representing a variation of a voice intensity with respect to time;
a phonological segment duration pattern representing a duration of a phonological segment; and
a pause data representing one of the presence or absence of a pause and the length of a pause.
In accordance with a 29th aspect of the invention, there is provided a speech synthesis system according to the 28th aspect of the invention, wherein the phonological segment duration pattern includes at least one of a phoneme duration pattern, a mora duration pattern, and a syllable duration pattern.
In accordance with a 30th aspect of the invention, there is provided a speech synthesis system according to the 25th aspect of the invention, wherein each of the plurality of types of prosodic feature data is retrieved and modified according to the weighted degrees of matching between the input data and the key data, the weighted degrees of matching being different from each other.
In accordance with a 31st aspect of the invention, there is provided a speech synthesis system according to the 20th aspect of the invention, wherein the retrieving the prosodic data and the modifying the prosodic data are performed each using a different weighted degree of matching between the input data and the key data.
In accordance with a 32nd aspect of the invention, there is provided a speech synthesis system according to the 20th aspect of the invention, wherein the retrieving the prosodic data and the modifying the prosodic data are performed using an identical weighted degree of matching between the input data and the key data.
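The use of different weighted degrees of matching for different types of prosodic feature data (the 30th aspect) can be sketched in Python as follows. The weight values, the two index names, and the idea of selecting a separate database entry per feature type are assumptions made for illustration only.

```python
from typing import Dict

Scores = Dict[str, float]  # per-index degree of matching, assumed to be in [0, 1]

# Hypothetical weights: pitch leans on accent position, duration leans on segments.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "f0": {"segments": 1.0, "accent": 3.0},
    "duration": {"segments": 3.0, "accent": 1.0},
}


def combined(scores: Scores, weights: Dict[str, float]) -> float:
    """Normalized weighted sum of the per-index degrees of matching."""
    return sum(w * scores[k] for k, w in weights.items()) / sum(weights.values())


def pick_per_feature(candidates: Dict[str, Scores]) -> Dict[str, str]:
    """For each feature type, pick the candidate entry with the best weighted match."""
    return {
        feature: max(candidates, key=lambda name: combined(candidates[name], w))
        for feature, w in WEIGHTS.items()
    }
```

With these weights, an entry that matches only in accent position is preferred for the fundamental frequency pattern, while an entry that matches only in segments is preferred for the duration pattern. Using a single weight table for every feature type would correspond to the identical-weight alternative.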
In accordance with a 33rd aspect of the invention, there is provided a speech synthesis system according to the first aspect of the invention, wherein the means for modifying modifies the prosodic data retrieved by the means for retrieving based on a degree of matching between one of:
each phoneme;
each mora;
each syllable;
each unit of generating a speech waveform in the means for synthesizing; and
each phonological segment.
By employing the above-described configuration of the invention, modifying the prosodic data is easily performed in an appropriate manner.
In accordance with a 34th aspect of the invention, there is provided a speech synthesis system according to the 33rd aspect of the invention, wherein the degree of matching is determined based on at least one of:
a distance based on an acoustic characteristic;
a distance obtained from one of a manner of articulation, a place of articulation, and a duration; and
a distance based on a confusion matrix obtained by an auditory experiment.
In accordance with a 35th aspect of the invention, there is provided a speech synthesis system according to the 34th aspect of the invention, wherein the acoustic characteristic is at least one characteristic of the phonological segments selected from a fundamental frequency thereof, an intensity thereof, a duration thereof, and a spectrum thereof.
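A distance derived from a confusion matrix, one of the options named in the 34th aspect, might be computed as in the following Python sketch. The formula (one minus the mean of the two directional confusion rates) is an assumption, and any counts passed to it in practice would come from an auditory experiment; the numbers used to exercise it here are invented.

```python
from typing import Dict

Confusions = Dict[str, Dict[str, int]]  # confusions[x][y]: times stimulus x was heard as y


def confusion_distance(confusions: Confusions, a: str, b: str) -> float:
    """Distance in [0, 1]: segments that are often confused with each other are close."""
    if a == b:
        return 0.0
    p_ab = confusions[a].get(b, 0) / sum(confusions[a].values())  # rate of a heard as b
    p_ba = confusions[b].get(a, 0) / sum(confusions[b].values())  # rate of b heard as a
    return 1.0 - (p_ab + p_ba) / 2.0
```

Averaging the two directions makes the distance symmetric, which is convenient when the same value is used both for retrieval and for deciding how strongly to modify the retrieved prosodic data.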
In accordance with a 36th aspect of the invention, there is provided a speech synthesis system according to the first aspect of the invention, wherein the database stores key data and prosodic data of a plurality of types of languages.
By employing the above configuration of the invention, a synthesized speech containing a plurality of languages can be easily produced.
In accordance with a 37th aspect of the invention, there is provided a method of synthesizing a speech based on input data representing a speech to be synthesized, the method comprising:
retrieving prosodic data from a database in which prosodic data for use in synthesizing a speech are stored in correspondence with key data for use in retrieval, the prosodic data being retrieved according to a degree of matching between the input data and the key data;
modifying the retrieved prosodic data based on the degree of matching between the input data and the key data and a predetermined modifying rule; and
outputting a synthesized speech based on the input data and the modified prosodic data.
In accordance with a 38th aspect of the invention, there is provided a method of synthesizing a speech according to the 37th aspect of the invention, wherein each of the input data and the key data includes a plurality of types of speech indices each being a factor in determining a speech to be synthesized; and
the degree of matching between the input data and the key data is such that in each type of the speech indices, a degree of matching between the input data and the key data is weighted, and the weighted data are combined together.
In accordance with a 39th aspect of the invention, there is provided a method of synthesizing a speech according to the 38th aspect of the invention, wherein the prosodic data includes a plurality of types of prosodic feature data characterizing the input data.
In accordance with a 40th aspect of the invention, there is provided a method of synthesizing a speech according to the 39th aspect of the invention, wherein each of the plurality of types of prosodic feature data is retrieved and modified according to the weighted degrees of matching between the input data and the key data, the weighted degrees of matching being different from each other.
In accordance with a 41st aspect of the invention, there is provided a method of synthesizing a speech according to the 38th aspect of the invention, wherein the retrieving the prosodic data and the modifying the prosodic data are performed each using a different weighted degree of matching between the input data and the key data.
In accordance with a 42nd aspect of the invention, there is provided a method of synthesizing a speech according to the 38th aspect of the invention, wherein the retrieving the prosodic data and the modifying the prosodic data are performed using an identical weighted degree of matching between the input data and the key data.
By employing the above-described methods according to the invention, even where the database does not contain such prosodic data that the input data and the key data exactly match, the speech synthesis system can perform speech synthesis by using similar prosodic data, achieving a reasonably appropriate, smooth, and natural sounding speech based on arbitrary input data. Alternatively, the system can reduce a required storage capacity of the database without causing degradation in naturalness of the synthesized speech. Furthermore, where similar prosodic data are used as mentioned above, the prosodic data are modified according to a degree of similarity thereof, and therefore, more appropriate synthesized speech can be produced.
In accordance with a 43rd aspect of the invention, there is provided a speech synthesis system wherein an input text is converted into a synthesized speech to be outputted, the system comprising:
a language processing means wherein the input text is parsed so as to output a phonetic character string and linguistic data;
a prosodic information database storing prosodic feature data, linguistic data, and a phonetic character string so that the prosodic feature data correspond to the linguistic data and the phonetic character string, the prosodic feature data being extracted from actual human speech, and the phonetic character string and the linguistic data corresponding to a speech to be synthesized;
a retrieving means for retrieving a prosodic feature data from the prosodic feature data stored in the prosodic information database, the retrieved prosodic feature data corresponding to at least a portion of retrieval items composed of the phonetic character string and the linguistic data outputted from the language processing means;
a prosody modifying means for modifying the prosodic feature data according to a predetermined rule in response to a degree of matching between the retrieval item and the data stored in the prosodic information database, the prosodic feature data being retrieved and selected from the prosodic information database; and
a waveform generating means for generating a speech waveform based on the prosodic feature data received from the prosody modifying means and the phonetic character string received from the language processing means.
The system according to this configuration of the invention also achieves a reasonably appropriate, smooth, and natural sounding speech based on an arbitrary input text.