In text-to-speech (TTS) systems, a portion of text (or a text file) is converted into audio speech (or an audio speech file). Such systems are used in a wide variety of applications such as electronic games, e-book readers, e-mail readers, satellite navigation, automated telephone systems, and automated warning systems. For example, some instant messaging (IM) systems use TTS synthesis to convert text chat to speech. This can be very useful for people who have difficulty reading, people who are driving, or people who simply do not want to take their eyes off whatever they are doing to change focus to the IM window.
A problem with TTS synthesis is that the synthesized speech can lose attributes such as emotion, vocal expressiveness, and the speaker's identity. Often, all synthesized voices sound the same. There is a continuing need to make synthesized speech sound more like a natural human voice.
U.S. Pat. No. 8,135,591 issued on Mar. 13, 2012 describes a method and system for training a text-to-speech synthesis system for use in speech synthesis. The method includes generating a speech database of audio files comprising domain-specific voices having various prosodies, and training a text-to-speech synthesis system using the speech database by selecting audio segments having a prosody based on at least one dialog state. The system includes a processor, a speech database of audio files, and modules for implementing the method.
U.S. Patent Application Publication No. 2013/0262119 published on Oct. 3, 2013 teaches a text-to-speech method configured to output speech having a selected speaker voice and a selected speaker attribute. The method includes inputting text; dividing the inputted text into a sequence of acoustic units; selecting a speaker for the inputted text; selecting a speaker attribute for the inputted text; converting the sequence of acoustic units to a sequence of speech vectors using an acoustic model; and outputting the sequence of speech vectors as audio with the selected speaker voice and the selected speaker attribute. The acoustic model includes a first set of parameters relating to speaker voice and a second set of parameters relating to speaker attributes, and the two sets do not overlap. Selecting a speaker voice includes selecting parameters from the first set of parameters, and selecting the speaker attribute includes selecting parameters from the second set of parameters. The acoustic model is trained using a cluster adaptive training (CAT) method in which the speakers and speaker attributes are accommodated by applying weights to model parameters that have been arranged into clusters, a decision tree being constructed for each cluster. Embodiments where the acoustic model is a Hidden Markov Model (HMM) are described.
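The idea of applying weights to clustered model parameters, with separate parameter sets for speaker voice and speaker attribute, can be illustrated with a minimal Python sketch. It is not the method of the cited publication: the cluster names, the example vectors, the weight tables, and the functions `select_weights` and `cat_mean` are all hypothetical, the always-on "bias" cluster is an assumption, and the per-cluster decision trees and the HMM itself are omitted. The sketch only shows a mean vector formed as a weighted sum of cluster means, where speaker weights and attribute weights address disjoint clusters.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical cluster mean vectors (two-dimensional for readability).
# The "bias" cluster is an assumption of this sketch, not taken from the source.
CLUSTER_MEANS: Dict[str, Vector] = {
    "bias": [1.0, 0.0],
    "speaker_1": [0.5, 0.5],
    "speaker_2": [-0.5, 1.0],
    "attr_1": [0.0, 2.0],
    "attr_2": [1.0, -1.0],
}

# First parameter set: weights that depend only on the selected speaker.
SPEAKER_WEIGHTS: Dict[str, Dict[str, float]] = {
    "alice": {"speaker_1": 1.0, "speaker_2": 0.0},
    "bob": {"speaker_1": 0.0, "speaker_2": 1.0},
}

# Second parameter set: weights that depend only on the selected attribute.
# Its cluster names do not overlap with those of the speaker set.
ATTRIBUTE_WEIGHTS: Dict[str, Dict[str, float]] = {
    "neutral": {"attr_1": 0.0, "attr_2": 0.0},
    "happy": {"attr_1": 0.5, "attr_2": 0.5},
}


def select_weights(speaker: str, attribute: str) -> Dict[str, float]:
    """Combine the independently selected speaker and attribute weights."""
    weights = {"bias": 1.0}  # assumed fixed weight for the bias cluster
    weights.update(SPEAKER_WEIGHTS[speaker])
    weights.update(ATTRIBUTE_WEIGHTS[attribute])
    return weights


def cat_mean(cluster_means: Dict[str, Vector], weights: Dict[str, float]) -> Vector:
    """Return the weighted sum of the cluster mean vectors."""
    dim = len(next(iter(cluster_means.values())))
    mean = [0.0] * dim
    for name, weight in weights.items():
        for i, value in enumerate(cluster_means[name]):
            mean[i] += weight * value
    return mean


if __name__ == "__main__":
    for speaker in ("alice", "bob"):
        for attribute in ("neutral", "happy"):
            mean = cat_mean(CLUSTER_MEANS, select_weights(speaker, attribute))
            print(speaker, attribute, mean)
```

Because the two weight sets address different clusters, changing the attribute shifts the resulting mean by the same amount for every speaker in this sketch, which is one way to read the statement that the parameter sets do not overlap.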
U.S. Pat. No. 8,886,537 issued on Nov. 11, 2014 describes a method and system for text-to-speech synthesis with personalized voice. The method includes receiving an incidental audio input of speech in the form of an audio communication from an input speaker and generating a voice dataset for the input speaker. A text input is received at the same device as the audio input, and the text input is synthesized into speech using the voice dataset to personalize the synthesized speech to sound like the input speaker. In addition, the method includes analyzing the text for expression and adding the expression to the synthesized speech. The audio communication may be part of a video communication, and the audio input may have an associated visual input of an image of the input speaker. The synthesis from text may include providing a synthesized image personalized to look like the image of the input speaker, with expressions added from the visual input.