Known speech technology sometimes has limitations when processing prosodic speech. This may be true for expressive speech synthesis, when speech is synthesized directly from the words and phrases of the input text, and also for speaker-independent automated speech recognition, which should accurately recognize the emotional state of the speaker as he or she frames the words and phrases of the pronounced utterances. Expressive speech can convey both the linguistic meaning of the words as text and the expressive emotional meaning of the text when pronounced with a particular prosody, style, or dialect.
The disclosures of U.S. Pat. No. 8,175,879 to Nitisaroj et al. (“Nitisaroj et al.” herein) and U.S. Pat. No. 8,219,398 to Chandra et al. (“Chandra et al.” herein), each of which is incorporated herein by reference, are illustrative of the art. These patents describe and claim methods for improving concatenated speech synthesis and for improving automated speech recognition technologies.
Nitisaroj et al. describes a method of, and system for, automatically annotating text corpora for relationships of uttered speech for a particular speaking style and for acoustic units in terms of context and content of the text to the utterances. Some speech synthesis embodiments are described that employ text annotations to specify how the text is to be expressively pronounced as synthesized speech. In some described speech recognition embodiments, each text annotation can be uniquely identified from corresponding acoustic features of a unit of uttered speech to correctly identify the corresponding text. Using a method of rules-based text annotation, expressiveness can be altered to reflect syntactic, semantic, and/or discourse circumstances found in text to be synthesized, or in an uttered message.
Chandra et al. describes a computer-implemented method for automatically analyzing, predicting, and/or modifying acoustic units of prosodic human speech utterances for use in speech synthesis or speech recognition.
Notwithstanding these advances in the art, there is a need, in some cases, for improved systems and methods for synthesizing or recognizing speech that comprises a sequence of expressive speech utterances.