The rapid increase in the number of mobile phone users has encouraged the implementation of various new features on mobile phones to enhance the user experience. One such desirable feature is speech synthesis, which converts text to speech and spares the user from reading text on the small screen of a mobile phone. Speech synthesis enables a mobile phone user to listen to text messages, such as emails and SMS (short message service) messages, while engaged in other tasks (e.g., preparing a meal, sorting through postal mail, or driving an automobile).
The synthesized speech is typically rendered in an artificial voice that mimics voice characteristics such as gender, age, dialect, or accent, or other voice-related data or metadata, of an intended speaker who is not related to or associated with the text. Such an artificial voice provides a monotonous and unrealistic listening experience to the user. Further, a concatenative speech synthesis system relies on audio recordings collected from a specific talker. Generally, time is reserved in a sound recording booth and the target talker is asked to read text into a microphone. The collection of speech data therefore depends on the speaker's availability, which complicates collecting speech data across multiple speakers.
To solve these problems, it is desirable to have a speech synthesis solution that simplifies the collection of speech data while improving the realism of the synthesized speech for a better user experience.