Speech is an important mechanism for improving access and interaction with digital information via computerized systems. Voice-recognition technology has been in existence for some time and is improving in quality. A type of technology similar to voice-recognition systems is speech-synthesis technology, including “text-to-speech” translation. While there has been much attention and development in the voice-recognition area, mechanical production of speech having characteristics of normal speech from text is not well developed.
In text-to-speech (TTS) engines, samples of a voice are recorded, and then used to interpret text with sounds in the recorded voice sample. However, in speech produced by conventional TTS engines, attributes of normal speech patterns, such as speed, pauses, pitch, and emphasis, are generally not present or consistent with a human voice, and in particular not with a specific voice. As a result, voice synthesis in conventional text-to-speech conversions is typically machine-like. Such mechanical-sounding speech is usually distracting and often of such low quality as to be inefficient and undesirable, if not unusable.
Effective speech production algorithms capable of matching text with normal speech patterns of individuals and producing high fidelity human voice translations consistent with those individual patterns are not conventionally available. Even the best voice-synthesis systems allow little variation in the characteristics of the synthetic voices available for speaking textual content. Moreover, conventional voice-synthesis systems do not allow effective customizing of text-to-speech conversions based on voices of actual, known, recognizable speakers.
Thus, there is a need to provide systems and methods for producing high-quality sound, true-to-life translations of text to speech, and translations having speech characteristics of individual speakers. There is also a need to provide systems and methods for customizing text-to-speech translations based on the voices of actual, known speakers.
Voice synthesis systems often use phonetic units, such as phonemes, phones, or some variation of these units, as a basis to synthesize voices. Phonetics is the branch of linguistics that deals with the sounds of speech and their production, combination, description, and representation by written symbols. In phonetics, the sounds of speech are represented with a set of distinct symbols, each symbol designating a single sound. A phoneme is the smallest phonetic unit in a language that is capable of conveying a distinction in meaning, as the “m” in “mat” and the “b” in “bat” in English. A linguistic phone is a speech sound considered without reference to its status as a phoneme or an allophone (a predictable variant of a phoneme) in a language. (The American Heritage Dictionary of the English Language, Third Edition.)
Text-to-speech translations typically use pronouncing dictionaries to identify phonetic units, such as phonemes. As an example, for the text “How is it going?”, a pronouncing dictionary indicates that the phonetic sound for the “H” in “How” is “huh.” The “huh” sound is a phoneme. One difficulty with text-to-speech translation is that there are a number of ways to say “How is it going?” with variations in speech attributes such as speed, pauses, pitch, and emphasis, for example.
One of the disadvantages of conventional text-to-speech conversion systems is that such technology does not effectively integrate phonetic elements of a voice with other speech characteristics. Thus, currently available text-to-speech products do not produce true-to-life translations based on phonetic, as well as other speech characteristics, of a known voice. For example, the IBM voice-synthesis engine “DirectTalk” is capable of “speaking” content from the Internet using stock, mechanically-synthesized voices of one male or one female, depending on content tags the engine encounters in the markup language, for example HTML. The IBM engine does not allow a user to select from among known voices. The AT&T “Natural Voices” TTS product provides an improved quality of speech converted from text, but allows choosing only between two male voices and one female voice. In addition, the AT&T “Natural Voices” product is very expensive. Thus, there is a need to provide systems and methods for customizing text-to-speech translations based on speech samples including, for example, phonetic, and other speech characteristics such as speed, pauses, pitch, and emphasis, of a selected known voice.
Although conventional TTS systems do not allow users to customize translations with known voices, other communication formats use customizable means of expression. For example, print fonts store characters, glyphs, and other linguistic communication tools in a standardized machine-readable matrix format that allow changing styles for printed characters. As another example, music systems based on a Musical Instrument Digital Interface (MIDI) format allow collections of sounds for specific instruments to be stored by numbers based on the standard piano keyboard. MIDI-type systems allow music to be played with the sounds of different musical instruments by applying files for selected instruments. Both print fonts and MIDI files can be distributed from one device to another for use in multiple devices.
However, conventional TTS systems do not provide for records, or files, of multiple voices to be distributed for use in different devices. Thus, there is a need to provide systems and methods that allow voice files to be easily created, stored, and used for customizing translation of text to speech based on the voices of actual, known speakers. There is also a need for such systems and methods based on phonetic or other methods of dividing speech, that include other speech characteristics of individual speakers, and that can be readily distributed.