1. Field of the Invention
This invention relates generally to the field of speech applications, and in particular, to a method and apparatus for automatically storing, tracking and distributing new word pronunciations to speech application clients on a network.
2. Description of Related Art
Use of spoken language with computers, typically associated with speech recognition and speech synthesis, involves storing and retrieving not only word spellings but other data associated with words, for example phonemes, alternate pronunciations, associations with other words and parts of speech, such as noun, verb, adjective and the like.
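By way of illustration only, the kinds of per-word data listed above might be held in a record such as the following. This is a minimal sketch, not a description of any particular speech product: the names `WordEntry`, `Lexicon`, `pronunciations`, `parts_of_speech`, and `related_words`, and the phoneme symbols used in the usage example, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class WordEntry:
    """Hypothetical record for one vocabulary word (illustrative only)."""
    spelling: str
    # Each pronunciation is a list of phoneme symbols; additional
    # entries in the outer list are alternate pronunciations.
    pronunciations: List[List[str]] = field(default_factory=list)
    # Parts of speech such as "noun", "verb", "adjective".
    parts_of_speech: List[str] = field(default_factory=list)
    # Spellings of other words associated with this one.
    related_words: List[str] = field(default_factory=list)


class Lexicon:
    """Hypothetical in-memory store keyed by lower-cased spelling."""

    def __init__(self) -> None:
        self._entries: Dict[str, WordEntry] = {}

    def add(self, entry: WordEntry) -> None:
        # Adding a word with an existing spelling replaces the old record.
        self._entries[entry.spelling.lower()] = entry

    def lookup(self, spelling: str) -> Optional[WordEntry]:
        # Returns None when the word is not in the vocabulary.
        return self._entries.get(spelling.lower())


# Usage: one word with two alternate pronunciations.
lexicon = Lexicon()
lexicon.add(WordEntry(
    spelling="read",
    pronunciations=[["R", "IY", "D"], ["R", "EH", "D"]],
    parts_of_speech=["verb"],
))
```

Keying the store by spelling keeps lookup simple, at the cost of treating differently capitalized spellings as the same word; a real system may need a finer-grained key.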
Computer systems at first were developed to deal exclusively with written language. Written language is useful for many things, and is much simpler to interpret, represent within, and reproduce from a computer system than is spoken language. Today, computer systems are taking on spoken language in the form of two technologies, namely, speech recognition and speech synthesis. Speech synthesis is also referred to as text-to-speech (TTS).
Defining the work to be done by computers to support spoken language is aided by comparing spoken language to written language. In the intersection of these two forms of communication there are words. Outside of the intersection words are represented differently, as spoken sounds or written letters. Written language is also augmented outside of the intersection by punctuation, or even font variations such as bold for emphasis. Spoken language is augmented differently outside of the intersection, for example by volume, pitch, prosody (rhythm and speed), and inflection.
As computers take on support for spoken language, speech is typically converted to text form by way of speech recognition and converted back to spoken form by way of speech synthesis. This takes advantage of the significantly lower system resources required to store or transmit a written representation as compared to an audible representation. However, the differences between written and spoken words outside of the intersection create a number of problems for speech applications.
End-users are greatly inconvenienced by the need to add pronunciations for words which are not included in the starter set of vocabulary words which can be recognized. By design, the user encounters this problem as a special case of a word which was recognized incorrectly. Recognition can be viewed as a best guess by the recognition engine as to which word was spoken by the user. When the user speaks a word which is not known to the recognition engine, the engine simply guesses wrong. The user must then initiate correction of the word and choose a new word from a short list of appropriate alternatives. If the spoken word is not listed as an alternate choice, the user typically is required to type the word and perhaps pronounce it again. This inconvenience can encourage users to bypass the proper correction procedures and simply type in the corrected text. Unfortunately, although typing is quicker in the short run, the speech recognition system learns about a correction only through the proper procedures; this information is the only way to add and correct words and thereby improve future recognition performance.
Speech recognition engines as supplied in speech applications are often not accurate enough, and throughput is sometimes slow because of misrecognitions and the time needed to correct them.
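The recognize-then-correct cycle described above can be sketched with a toy model. This is an assumption-laden illustration, not how any actual recognition engine works: real engines score acoustic and language models, whereas here a generic sequence-similarity measure from Python's standard `difflib` module stands in for that scoring. The class `ToyRecognizer`, its methods, and the phoneme symbols are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Phonemes = Tuple[str, ...]


class ToyRecognizer:
    """Hypothetical recognizer that can only guess words it already knows."""

    def __init__(self, vocabulary: Dict[str, Phonemes]) -> None:
        # Maps spelling -> pronunciation (sequence of phoneme symbols).
        self.vocabulary = dict(vocabulary)

    def alternatives(self, heard: Phonemes, n: int = 3) -> List[str]:
        # Rank known words by similarity to what was heard. This is a
        # stand-in for real acoustic scoring, assumed for illustration.
        ranked = sorted(
            self.vocabulary.items(),
            key=lambda item: SequenceMatcher(None, heard, item[1]).ratio(),
            reverse=True,
        )
        return [spelling for spelling, _ in ranked[:n]]

    def recognize(self, heard: Phonemes) -> str:
        # The best guess is always drawn from the known vocabulary, so an
        # out-of-vocabulary word is necessarily recognized incorrectly.
        return self.alternatives(heard, n=1)[0]

    def correct(self, heard: Phonemes, typed_spelling: str) -> None:
        # The proper correction procedure: the user types the word, and
        # the engine stores it together with the pronunciation it heard.
        self.vocabulary[typed_spelling] = tuple(heard)


recognizer = ToyRecognizer({
    "cat": ("K", "AE", "T"),
    "cap": ("K", "AE", "P"),
    "dog": ("D", "AO", "G"),
})
heard = ("K", "AE", "B")                      # the user says "cab"
first_guess = recognizer.recognize(heard)     # a known word, not "cab"
short_list = recognizer.alternatives(heard)   # "cab" is not offered
recognizer.correct(heard, "cab")              # user types the word
second_guess = recognizer.recognize(heard)    # now an exact match
```

If the user instead typed "cab" directly into the document, `correct` would never be called and the same misrecognition would recur the next time the word is spoken.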
Correcting misrecognized words is a major factor in measures of speech recognition effectiveness including words-per-minute and usability. Large vocabularies are provided to limit the number of corrections resulting from out-of-vocabulary words.
In a stand-alone system, user-added words need to be backed up and migrated from system to system as a user moves around or switches between systems (for example, between home and office computers and between portable and desktop computers) or upgrades a computer or a speech recognition application, or both. This task is time-consuming, tedious, and not obvious to users, and consequently is not generally done.
Typically, along with speech recognition or synthesis software, a starter set of words, including pronunciations, is installed on a computer. Pronunciations are stored as base forms, which represent instructions as to the way words are pronounced, or sound. Many factors must be considered in order to produce an ideal starter set for a given application. The number of words in a starter set is usually determined by balancing considerations such as amount of storage space required, frequency of word use, degree of common use, and recognition accuracy against all other words in the starter set. Developers of speech recognition systems typically install a large vocabulary of word-pronunciation data as required for recognition. Producing such a vocabulary is a rather tedious and time-consuming task.