The present invention relates to a speech synthesis method utilizing auxiliary information, a recording medium on which the steps of the method are recorded, and an apparatus utilizing the method and, more particularly, to a speech synthesis method and apparatus that create natural-sounding synthesized speech by additionally using, as auxiliary information, actual human speech information as well as text information.
With a text speech synthesis scheme that synthesizes speech from text, speech messages can be created with comparative ease and at low cost. However, speech synthesized by this scheme is not of sufficient quality and is far from speech actually uttered by human beings. That is, the parameters necessary for text speech synthesis in the prior art are all estimated by rules of speech synthesis based on the results of text analysis. On this account, unnatural speech may sometimes be synthesized due to an error in the text analysis or an imperfection in the rules of speech synthesis. Furthermore, human speech fluctuates so much in the course of utterance that it is said human beings cannot read the same sentence twice with exactly the same speech sounds. In contrast to this, speech synthesis by rule has the defect that its speech messages are monotonous, because the rules merely model average features of human speech. It is mainly for these two reasons that the intonation of speech produced by speech synthesis by rule is at present criticized as unnatural. If these problems can be solved, text speech synthesis will become an effective method for creating speech messages.
On the other hand, in the case of generating speech messages by direct utterance of a human being, it is necessary to hire an expert narrator and to prepare a studio or similar favorable environment for recording. During recording, even an expert narrator often makes wrong or indistinct utterances and must try again and again; hence, recording consumes an enormous amount of time. Moreover, the speed of utterance must be kept constant, and care must be taken over the speech quality, which varies with the physical condition of the narrator. Thus, the creation of speech messages costs a lot of money and requires much time.
There is a strong demand in a variety of fields for services that repeatedly offer the same speech messages recorded by an expert narrator, in association with an image or picture if any, just like the audio guide messages that are commonly provided in an exhibition hall or room. Needless to say, the recorded speech messages must be clear and standard in this instance. Moreover, when a display screen is used, it is necessary to establish synchronization between the speech messages and the pictures or images provided on the display screen. To meet such requirements, it is customary in the art to record the speech of an expert narrator reading a text. The recording is repeated until clear, accurate speech of the required quality is obtained; hence, it is time-consuming and costly.
Incidentally, when the speech data thus obtained needs to be partly changed after several months or years, it is desirable that the part of the existing speech messages that is to be changed have the same features (tone quality, pitch, intonation, speed, etc.) as those of the other parts. Hence, it is preferable to have the same narrator record the changed or re-edited speech messages. However, it is not always possible to get cooperation from the original narrator, and even if he or she cooperates, it is difficult for him or her to narrate with the same features as in the previous recording. Therefore, it would be very advantageous if it were possible to extract the speech features of the narrator and use them to synthesize, at any time and with reproducible features, speech that follows a desired text or the speech sounds of some other person.
Similarly, the recording of speech for an animation requires speech with a different feature for each character, and as many animation actors or actresses as there are characters must record their voice parts in a studio over a long time. If it were possible to synthesize speech from a text through utilization of speech feature information extracted from the speech of ordinary people having characteristic voices, animation production costs could be cut.