This application relates to speech synthesis and speech recognition. More specifically, this application relates to improved recognition of speech and synthesis of artificial speech. Some implementations of the techniques described in this application relate even more specifically to improving recognition and synthesis of artificial speech relating to words that may be pronounced differently across multiple vernaculars.
In their current form, speech synthesis applications may not accurately synthesize speech that is comprehensible for users having various accents. This is particularly apparent when producing artificial speech sounds for words that may be pronounced differently across multiple vernaculars, such as the names of streets, monuments, people, and so forth. Typically, synthesized speech is the same for all users who speak a particular language and is not personalized for a particular user's accent. For example, a typical navigation application may have a voice engine that produces an English voice, a French voice, a German voice, and so forth, depending on which language the user has selected in the voice engine settings, but the typical navigation application does not personalize the English voice for a user from the Midwestern region of the United States or the French voice for a user from the Provence region of France.
In its current form, speech synthesis uses direct translation to produce speech sounds in a selected language. Current methods convert text to a phonetic form that includes a set of phonemes (i.e., units of sound) and send the set of phonemes making up the phonetic form to a speech engine, which produces the voice output.
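By way of illustration only, the direct-translation approach described above may be sketched as follows. The lexicon contents, the phoneme symbols, the letter-by-letter fallback, and the names text_to_phonemes, speech_engine, and synthesize are hypothetical and are not taken from any particular voice engine. The stand-in speech engine returns a string rather than audio so that the sketch is self-contained.

```python
from typing import Dict, List

# Hypothetical single-language lexicon mapping words to phonemes.
# The phoneme symbols are illustrative only.
LEXICON: Dict[str, List[str]] = {
    "main": ["M", "EY", "N"],
    "street": ["S", "T", "R", "IY", "T"],
}


def text_to_phonemes(text: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Convert text to a flat list of phonemes using one fixed lexicon."""
    phonemes: List[str] = []
    for word in text.lower().split():
        # Assumption: unknown words are spelled out letter by letter.
        phonemes.extend(lexicon.get(word, list(word.upper())))
    return phonemes


def speech_engine(phonemes: List[str]) -> str:
    """Stand-in for a speech engine; joins phonemes instead of producing audio."""
    return "-".join(phonemes)


def synthesize(text: str) -> str:
    """Direct translation: text, then phonemes, then the speech engine."""
    return speech_engine(text_to_phonemes(text, LEXICON))
```

Because the lexicon in this sketch is keyed only by the word, every user of the selected language receives the same phoneme sequence for a given word, regardless of accent.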
Similarly, some current methods of computer-based speech recognition convert speech to text by comparing recorded speech to an audio database to search for a matching text word. However, these speech recognition methods do not customize the recognition to the particular accent of a user. For example, current methods might compare a word spoken by someone with a particular accent to audio corresponding to a different accent.
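As a simplified, hypothetical illustration of the database comparison described above, recognition may be pictured as a nearest-match search over stored audio features. The two-dimensional feature vectors, the AUDIO_DB contents, and the recognize function below are invented for this sketch; actual systems use far richer acoustic representations.

```python
import math
from typing import Dict, List

# Hypothetical audio database: one reference feature vector per word,
# all assumed to be recorded in a single accent. Values are toy data.
AUDIO_DB: Dict[str, List[float]] = {
    "route": [1.0, 0.0],
    "root": [0.0, 1.0],
}


def recognize(features: List[float], db: Dict[str, List[float]]) -> str:
    """Return the word whose stored features are closest to the input."""
    return min(db, key=lambda word: math.dist(features, db[word]))


# An utterance close to the stored accent matches as expected.
same_accent = recognize([0.9, 0.1], AUDIO_DB)

# Toy features for "route" spoken in a different accent land nearer
# to the stored entry for another word.
other_accent = recognize([0.2, 0.8], AUDIO_DB)
```

In this toy data, same_accent is "route" while other_accent is "root", which illustrates how comparing accented speech to audio corresponding to a different accent can return the wrong text word.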
The pronunciation of a word may vary in different languages and dialects, even when the word is the same or similar across languages and dialects, such as for a regional name or proper noun. Thus, using the typical methods of text-to-phonetic translation for words that may be pronounced differently across multiple vernaculars will not produce understandable pronunciation for many individuals.
In its current form, speech synthesis produces sounds that are not very understandable to a user with an accent, particularly for words that may be pronounced differently across multiple vernaculars. Similarly, in its current form, computer-based speech recognition may have difficulty processing a user's speech when the user has an accent.