Text-to-speech (TTS) synthesis is used in a variety of environments in which text is input or received at a device and the content of the text is output as audible speech. For example, some instant messaging (IM) systems use TTS synthesis to convert text chat to speech. This is very useful for blind people, for adults or young children who have difficulty reading, and for anyone who does not want to shift their focus to the IM window while doing another task.
In another example, some mobile telephones and other handheld devices have TTS synthesis capabilities for converting text received in short message service (SMS) messages into speech. The speech can be delivered as a voice message left on the device, or can be played immediately, for example, if an SMS message is received while the recipient is driving. In a further example, TTS synthesis is used to convert received email messages to speech.
A problem with TTS synthesis is that the synthesized speech does not carry the identity of the person who wrote the text. In an IM application where multiple users may be contributing during a session, all IM participants whose text is converted using TTS may sound the same. In addition, the emotions and vocal expressiveness that can be conveyed using emotion icons and other text-based hints are lost.
US 2006/0074672 discloses an apparatus for synthesis of speech using personalized speech segments. Means are provided for processing natural speech to provide personalized speech segments, and means are provided for synthesizing speech based on the personalized speech segments. A voice recording module is provided, and speech input is made by repeating words displayed on a user interface. This has the drawback that personalized speech can be synthesized only for a voice that has been input into the device by a user repeating the displayed words. Therefore, the speech cannot be synthesized to sound like a person who has not purposefully input their voice into the device.
In relation to the expressiveness of synthesized voice, it is known to put specific commands inside a multimedia message or in a script in order to force a particular emotion in the output speech in TTS synthesis. In addition, IM systems with expressive animations are known from “A chat system based on Emotion Estimation from text and Embodied Conversational Messengers”, Chunling Ma, et al. (ISBN: 3 540 29034 6), in which an avatar associated with a chat partner acts out assessed emotions of messages in association with synthesized speech.