The present invention, in some embodiments thereof, relates to a system for speech synthesis and, more specifically, but not exclusively, to a system for speech synthesis from text.
Prosody refers to elements of speech that are not individual phonetic segments (vowels and consonants) but are properties of syllables, of larger units of speech, or of smaller (sub-phonemic) units of speech. These elements contribute to linguistic functions such as intonation, tone, stress, and rhythm. Prosody may reflect various features of a speaker or an utterance: an emotional state of the speaker; a form of the utterance (statement, question, or command); presence of irony or sarcasm; emphasis, contrast, and focus; or other elements of language that may not be encoded by grammar or by choice of vocabulary. Prosody may be described in terms of auditory measures, which are subjective impressions produced in the mind of a listener. Examples of auditory measures are a pitch of a voice, a length of a sound, a loudness of a sound, and a timbre of a sound. Prosody may alternatively be described in terms of acoustic measures, which are physical properties of a sound wave that may be measured objectively. Examples of acoustic measures are a fundamental frequency, a duration, an intensity level, and spectral characteristics of the sound wave.
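By way of a non-limiting illustration, the acoustic measures named above may be estimated from a sampled sound wave as in the following minimal sketch. The function name acoustic_measures, its parameters, the pitch search range of 50 Hz to 500 Hz, and the use of a plain autocorrelation peak for the fundamental frequency are assumptions made for illustration only and are not taken from any described embodiment. Spectral characteristics are omitted for brevity.

```python
import math


def acoustic_measures(samples, sample_rate, f0_min=50.0, f0_max=500.0):
    """Estimate (duration_s, intensity_db, f0_hz) for a mono signal.

    `samples` is a sequence of floats in [-1.0, 1.0]. All names and
    defaults here are hypothetical, chosen only for this illustration.
    """
    n = len(samples)

    # Duration: number of samples divided by the sampling rate.
    duration_s = n / sample_rate

    # Intensity level: root-mean-square amplitude expressed in decibels
    # relative to full scale (an amplitude of 1.0).
    rms = math.sqrt(sum(x * x for x in samples) / n)
    intensity_db = 20.0 * math.log10(max(rms, 1e-12))

    # Fundamental frequency: the lag with the highest (unnormalized)
    # autocorrelation inside the assumed pitch range gives the period.
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), n - 1)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    f0_hz = sample_rate / best_lag

    return duration_s, intensity_db, f0_hz
```

For example, the function may be applied to a synthetic tone such as [0.5 * math.sin(2 * math.pi * 200 * i / 8000) for i in range(800)] with a sampling rate of 8000, for which the estimated fundamental frequency is expected to be approximately 200 Hz. A practical system would typically analyze short overlapping frames rather than a whole recording and would handle unvoiced segments, which this sketch does not.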
Speech synthesis refers to artificial production of human speech. One of the challenges faced by a system for synthesizing speech, for example from text, is generation of natural sounding prosody. There are applications, for example Concept To Speech (CTS) applications, where it is desirable to convey non-linguistic cues, for example speaking styles, emotions, and word emphasis. An example of a CTS application is a dialog generation application such as an automatic personal assistant. In some CTS applications the input is machine generated text or a machine generated message. A text to speech (TTS) system, for synthesizing speech from text, may receive a textual input and produce a phonetic and semantic representation of the textual input comprising a plurality of textual feature vectors. The plurality of textual feature vectors may be delivered to a TTS backend comprising a waveform generator, which converts the plurality of textual feature vectors into sound, producing a waveform of speech. In some TTS systems, target prosody is imposed on the waveform of speech before delivering the waveform to an audio device or to an audio file. Given a text and a set of labels marking one or more non-linguistic cues, the TTS system needs a way to render a prosodic contour of the synthesized speech in order to convey the one or more non-linguistic cues, for example emotional content.
Some systems apply machine learning to create a model for predicting expressive prosody from textual feature vectors. One possible method for creating such a model is by learning a difference between a plurality of expressive recordings of a plurality of utterances and a plurality of equivalent parallel neutral (non-expressive) recordings of the plurality of utterances, where the difference is dependent on the textual features.
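By way of a non-limiting illustration, learning a difference between parallel expressive and neutral recordings, dependent on textual features, may be sketched as follows. This sketch deliberately substitutes a very simple model, a table of mean fundamental-frequency offsets keyed by a discrete textual feature tuple, for whatever machine learning model such a system may actually use. The function names, the choice of fundamental frequency as the only prosodic parameter, and the example feature values are all hypothetical.

```python
from collections import defaultdict


def learn_prosody_offsets(parallel_units):
    """Learn a mean expressive-minus-neutral F0 offset per textual feature.

    `parallel_units` is an iterable of tuples
    (textual_features, expressive_f0_hz, neutral_f0_hz), one per aligned
    unit (for example a syllable) of a parallel recording pair.
    `textual_features` must be hashable, for example a tuple of labels.
    Hypothetical simplification: a lookup table, not a trained network.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for features, expressive_f0, neutral_f0 in parallel_units:
        sums[features] += expressive_f0 - neutral_f0
        counts[features] += 1
    return {features: sums[features] / counts[features] for features in sums}


def predict_expressive_f0(offsets, features, neutral_f0, default_offset=0.0):
    """Predict expressive F0 by adding the learned offset to neutral F0.

    Unseen feature tuples fall back to `default_offset` (an assumption).
    """
    return neutral_f0 + offsets.get(features, default_offset)


# Made-up training data for illustration only.
training = [
    (("stressed", "final"), 220.0, 200.0),
    (("stressed", "final"), 230.0, 200.0),
    (("unstressed", "medial"), 180.0, 185.0),
]
offsets = learn_prosody_offsets(training)
predicted = predict_expressive_f0(offsets, ("stressed", "final"), 210.0)
```

Modeling the difference rather than the expressive prosody directly means that the neutral prosody supplies a baseline and the model only needs to account for the expressive deviation. A lookup table cannot generalize to unseen feature combinations, so a regression model over continuous textual feature vectors would be a more realistic choice.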