TTS (text-to-speech) is a speech synthesis technique that converts arbitrary text into standard, fluent speech. It draws on several advanced technologies, including natural language processing, prosody, speech signal processing, and auditory perception; it spans disciplines such as acoustics, linguistics, and digital signal processing; and it is an advanced technique in the field of text information processing.
A traditional TTS system speaks with only one standard male or female voice. The voice is monotonous and cannot reflect the varied speaking habits of different people in everyday life; for example, if the voice lacks any sense of amusement, the listener or audience may not find it amiable or may miss the intended humor.
For instance, U.S. Pat. No. 7,277,855 provides a personalized TTS solution. In that solution, a specific speaker reads a fixed text in advance, speech feature data of that speaker is acquired by analyzing the resulting speech, and a standard TTS system then performs synthesis based on the speech feature data, so as to realize personalized TTS. The main problem with this solution is that the speaker's speech feature data must be acquired through a special “study” process. That process takes considerable time and energy and offers no enjoyment; moreover, the validity of the “study” result is clearly influenced by the selected material.
With the popularization of devices that provide both text transfer and speech communication, a technology is needed that can easily acquire the personalized speech features of either or both parties when a subscriber conducts a speech communication through such a device, and that can then render text as speech synthesized from the acquired personalized speech features during subsequent text communication.
In addition, there is a need for a technology that can easily and accurately recognize a subscriber's speech features from a random speech segment of that subscriber, for further utilization.