1. Technical Field
This invention relates to the field of text-to-speech synthesis and more particularly to a method for guiding text-to-speech output timing using speech recognition markers.
2. Description of the Related Art
The present invention relates to a text-to-speech [TTS] system for converting input text into an output acoustic signal imitating natural speech. TTS systems create artificial speech sounds directly from text input. Conventional TTS systems generally operate in a sequential manner, dividing the input text into relatively large segments such as sentences using an external process. Subsequently, each segment is sequentially processed until the required acoustic output can be created.
Initially, input text can be submitted to the TTS system. Subsequently, the TTS system can convert the input text to an acoustic waveform recognizable as speech corresponding to the input text. A typical TTS system can include two main components: a linguistic processor and an acoustic processor. The linguistic processor can generate lists of speech segments derived from the text input, together with control information, for example phonemes, plus duration and pitch values. Subsequently, during the conversion process the input text can pass across an interface from the linguistic processor to the acoustic processor. The acoustic processor produces the sounds corresponding to the specified segments. Moreover, the acoustic processor handles the boundaries between each speech segment to produce natural sounding speech.
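The two-stage pipeline described above can be sketched as follows. The function names, the tiny segment lexicon, and the duration and pitch values are illustrative assumptions only, not part of any actual TTS system.

```python
# Sketch of the two-stage TTS pipeline: a linguistic processor that emits
# speech segments with control information, and an acoustic processor that
# converts those segments into output. All values are hypothetical.

def linguistic_processor(text):
    """Map input text to a list of speech segments with control information:
    (phoneme, duration in ms, pitch in Hz)."""
    # A real linguistic processor performs text normalization, letter-to-sound
    # conversion, and prosody assignment; here a tiny lookup table stands in.
    lexicon = {
        "hello": [("HH", 60, 120), ("AH", 80, 130), ("L", 70, 125), ("OW", 110, 115)],
    }
    segments = []
    for word in text.lower().split():
        segments.extend(lexicon.get(word, []))
    return segments

def acoustic_processor(segments):
    """Turn each (phoneme, duration, pitch) triple into output, smoothing
    the boundaries between adjacent segments in a real implementation."""
    waveform = []
    for phoneme, duration_ms, pitch_hz in segments:
        # Placeholder: a real acoustic processor generates audio samples here.
        waveform.append((phoneme, duration_ms, pitch_hz))
    return waveform

output = acoustic_processor(linguistic_processor("hello"))
print(len(output))  # 4 segments for "hello"
```

The interface between the two stages is simply the segment list, mirroring the control-information handoff described above.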
Unfortunately, to date most commercial systems for automated synthesis remain too unnatural and machine-like for all but the simplest and shortest texts. Those systems have been described as sounding monotonous, boring, mechanical, harsh, disdainful, peremptory, fuzzy, muffled, choppy, and unclear. Synthesized isolated words presented in context are relatively easy to recognize, but when strung together into longer passages of connected speech, for instance phrases or sentences, then it becomes much more difficult to follow the meaning. Notably, studies have shown that the task is unpleasant and the effort is fatiguing. In consequence, more widespread adoption of TTS technology has been prevented by the perceived robotic quality of some voices and poor intelligibility of intonation-related cues.
In general, the robotic feel of the TTS system arises from inaccurate or inappropriate modeling of speech segments defined in TTS production rules. To overcome such deficiencies, considerable attention has been paid to improving the production rules by modeling grammatical information derived from a series of connected words. In the prior art, typical TTS production rules are designed to cope with “unrestricted text”. Synthesis algorithms for unrestricted text typically assign prosodic features (prosody) on the basis of syntax, lexical properties, and word classes. Prosody primarily involves pitch, duration, loudness, voice quality, tempo and rhythm. In addition, prosody modulates every known aspect of articulation. Specifically, prosodic features can be derived from the organization imposed onto a string of words when they are uttered as connected speech.
TTS system developers have struggled with the problem of prosodic phrasing, or the "chunking" of a long sentence into several sub-phrases, each of which can be said to stand alone as an intonational unit. If punctuation is used liberally so that there are relatively few words between the commas, semicolons or periods, then TTS production rules can propose a reasonable guess at an appropriate phrasing by subdividing the sentence at each punctuation mark. Notwithstanding, a problem remains where there exist long stretches of words having no punctuation. In that case, the TTS production rules must strategically place appropriate pauses in the playback sequence.
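The punctuation-driven chunking strategy described above can be sketched as follows; the eight-word threshold for breaking punctuation-free stretches is an illustrative assumption.

```python
import re

def prosodic_chunks(sentence, max_words=8):
    """Split a sentence into intonational sub-phrases: first at punctuation
    marks, then by forcing a break into any punctuation-free stretch longer
    than max_words (the threshold is an illustrative assumption)."""
    # Subdivide at commas, semicolons, colons, and periods.
    pieces = [p.strip() for p in re.split(r"[,;:.]", sentence) if p.strip()]
    chunks = []
    for piece in pieces:
        # Fallback for long stretches with no punctuation: break every
        # max_words words, a crude stand-in for rule-based pause placement.
        words = piece.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks

print(prosodic_chunks("When the rain stopped, the children ran outside"))
# ['When the rain stopped', 'the children ran outside']
```

A punctuation-free stretch longer than the threshold is still broken, but at an arbitrary position, which illustrates why rule-based pause placement alone can sound unnatural.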
One prior art approach includes the generation and storage of a list of words, typically function words, that are likely indicators of good break positions. Yet, in some cases a particular function word may coincide with a plausible phrase break whereas in other cases that same function word may coincide with a particularly poor phrase break position. As such, a known improvement includes the incorporation of an accurate syntactic parser for generating syntactic groupings and the subsequent derivation of the prosodic phrasing from the syntactic groupings. Still, prosodic phrases usually do not coincide exactly with major syntactic phrases.
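The stored function-word list approach can be sketched as follows; the word list is an illustrative subset, not an exhaustive inventory.

```python
# Illustrative subset of function words that are likely indicators of good
# break positions; a real system would store a much larger list.
FUNCTION_WORDS = {"and", "but", "because", "that", "which", "when", "while"}

def candidate_breaks(words):
    """Return the indices of words before which a phrase break could be
    placed, based on the stored function-word list."""
    return [i for i, w in enumerate(words)
            if w.lower() in FUNCTION_WORDS and i > 0]

words = "the dog barked because the mailman arrived".split()
print(candidate_breaks(words))  # [3] -> break before "because"
```

The same function word can mark a good break in one sentence and a poor one in another, which is precisely the weakness noted above: the list alone carries no syntactic context.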
Alternatively, the TTS system developer can train a decision tree on transcribed speech data. Specifically, the transcribed speech data can include a dependent variable linked to the human prosodic phrase boundary decision. Moreover, the transcribed speech data can include independent variables linked to the text directly, including the part of speech sequence around the boundary, the location of the edges of long noun phrases, and the distance of the boundary from the edges of the sentence. Nevertheless, TTS output generated by production rules alone cannot produce proper pausing behavior. Present methods of TTS generation wholly lack naturalized timing in consequence of the TTS system's dependence on production rules. Present TTS systems do not incorporate the use of timing data embedded in the dictated text with standard production rules in order to generate more naturalized playback timing. Thus, a need exists for an algorithm which can produce a more natural playback through the use of speech-recognition markers embedded in the dictated text.
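The decision-tree approach described above can be sketched as follows. The feature names, the part-of-speech tags, and the hand-built decision rule standing in for one path of a trained tree are all illustrative assumptions.

```python
def boundary_features(words, pos_tags, i):
    """Independent variables for a candidate break after word i, mirroring
    the features named above (feature names are illustrative)."""
    return {
        "pos_before": pos_tags[i],
        "pos_after": pos_tags[i + 1] if i + 1 < len(pos_tags) else "END",
        "dist_from_start": i + 1,                 # words from sentence start
        "dist_from_end": len(words) - (i + 1),    # words to sentence end
    }

def predict_break(feats):
    """Hand-built stand-in for one path of a trained decision tree: break
    after a noun when far enough from both sentence edges."""
    return (feats["pos_before"] == "NOUN"
            and feats["dist_from_start"] >= 2
            and feats["dist_from_end"] >= 2)

words = ["the", "storm", "passed", "over", "the", "city"]
tags = ["DET", "NOUN", "VERB", "PREP", "DET", "NOUN"]
breaks = [i for i in range(len(words) - 1)
          if predict_break(boundary_features(words, tags, i))]
print(breaks)  # [1] -> break after "storm"
```

Because the dependent variable is a human phrasing decision, such a tree can capture tendencies that rule lists miss, yet it still draws only on the text itself, with no access to the timing data embedded in dictated speech.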