Speech synthesis is the production of speech from text by artificial means. For example, text-to-speech (TTS) systems synthesize speech from text to provide an alternative to conventional computer-to-human visual output devices such as computer monitors or displays. There are many varieties of TTS synthesis, including formant TTS synthesis and concatenative TTS synthesis. Formant TTS synthesis does not output recorded human speech; instead, it outputs computer-generated audio that tends to sound artificial and robotic. In concatenative TTS synthesis, segments of stored human speech are concatenated and output to produce smoother, more natural-sounding speech.
A TTS system may include the following basic elements. A source of raw text includes words, numbers, symbols, abbreviations, and/or punctuation to be synthesized into speech. A speech database includes pre-recorded speech from one or more people. A pre-processor converts the raw text into an output that is the equivalent of written words. A synthesis engine phonetically transcribes the pre-processor output and converts the pre-processor output into appropriate language units like sentences, clauses, and/or phrases. A unit selector selects units of speech from the speech database that best correspond to the language units from the synthesis engine. An acoustic interface converts the selected units of speech into audio signals, and a loudspeaker converts the audio signals to audible speech.
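The flow through these elements can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not an implementation of any particular TTS system: the abbreviation table, the `SpeechUnit` type, the function names, the sample values, and the greedy longest-match selection rule are all hypothetical, phonetic transcription is omitted, and language units are simply taken to be comma-separated phrases.

```python
from dataclasses import dataclass

# Hypothetical abbreviation table and digit names used by the pre-processor.
ABBREVIATIONS = {"n.": "north", "rd.": "road"}
DIGIT_WORDS = "zero one two three four five six seven eight nine".split()


@dataclass
class SpeechUnit:
    """One pre-recorded unit in the speech database (illustrative type)."""
    text: str
    speaker: str
    samples: list  # placeholder audio samples


def preprocess(raw_text):
    """Pre-processor: turn raw text into the equivalent written words."""
    out = []
    for token in raw_text.lower().split():
        trailing_comma = token.endswith(",")
        core = token.rstrip(",")
        if core in ABBREVIATIONS:
            core = ABBREVIATIONS[core]
        elif core.isdecimal():
            # Read numbers digit by digit (a simplification).
            core = " ".join(DIGIT_WORDS[int(d)] for d in core)
        else:
            core = core.strip(".!?")
        if core:
            out.append(core + ("," if trailing_comma else ""))
    return " ".join(out)


def to_language_units(text):
    """Synthesis engine (simplified): split the text into phrases at commas."""
    return [p.strip() for p in text.split(",") if p.strip()]


def select_units(phrase, speech_db):
    """Unit selector: greedily pick the longest recorded unit at each position."""
    words = phrase.split()
    selected = []
    i = 0
    while i < len(words):
        for j in range(len(words), i, -1):
            key = " ".join(words[i:j])
            if key in speech_db:
                selected.append(speech_db[key])
                i = j
                break
        else:
            raise KeyError(f"no recorded unit for {words[i]!r}")
    return selected


def render(units):
    """Acoustic interface (simplified): concatenate the selected units' samples."""
    samples = []
    for unit in units:
        samples.extend(unit.samples)
    return samples


# Hypothetical speech database holding recordings from two people.
speech_db = {
    u.text: u
    for u in [
        SpeechUnit("turn left", "speaker_a", [0.1, 0.2]),
        SpeechUnit("north", "speaker_b", [0.3]),
        SpeechUnit("telegraph road", "speaker_b", [0.4, 0.5]),
    ]
}

words = preprocess("Turn left, N. Telegraph Rd.")
units = [u for phrase in to_language_units(words)
         for u in select_units(phrase, speech_db)]
audio = render(units)
```

In this sketch the selected units come from two different speakers, which is permitted by a speech database that holds recordings from more than one person. A real unit selector would typically score candidate units against acoustic and prosodic targets rather than match on text alone.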
One problem encountered with TTS synthesis is that some applications may use speech recorded from different people having significantly different voices. For example, TTS-enabled vehicle navigation systems use voice guidance having a multiple-part syntax that may include a directional maneuver utterance (e.g., “Perform legal U-turn onto . . . ”) and a street name utterance (e.g., “ . . . North Telegraph Road”). The maneuver utterance may be generated from a first speaker of a navigation service provider, and the street name utterance may be generated from a second speaker of a map data provider. When the utterances are played together during voice guidance, the combined utterance may sound unpleasant to a user. For instance, the user may perceive the transition from the maneuver utterance to the street name utterance because of the difference in prosody between the speakers.