The present invention relates in general to improved speech systems. More particularly, the present invention relates to a method and apparatus for adding new words with as yet unseen spellings and pronunciations to the vocabulary of a speech system.
Today's speech recognition systems, such as “command and control” or “dictation” systems, all typically contain predefined vocabularies, consisting of words, their pronunciations and some model of the usage of these words described by a language model. State-of-the-art systems may contain several tens of thousands of such entries which are used at runtime to determine what is being said.
Regardless of the size of the vocabulary, all systems suffer from the fact that they offer only a limited, fixed vocabulary to the user. The fact that commercially available systems typically contain only full-form vocabularies (i.e., they do not separately model the morphology of the language) further limits the effective scope of today's vocabularies. This is especially limiting for highly inflective languages such as French, German or the Slavic languages. Consequently, almost every user will need to add their own special terms, names or expressions to this vocabulary to fit their individual needs. Being able to extend the base vocabulary with specific terms thus becomes an important issue and a frequent activity when using speech recognition systems. In principle, language vocabularies have to be viewed as “open or living systems” which can never comprise all possible words of a given language; in addition, technical limitations (storage requirements and processing load) put this goal even further out of reach. Thus the methodology and quality of the process of extending a given vocabulary with new words is an important success factor for speech systems.
The pronunciations of words in a vocabulary are typically stored as phonetic transcriptions (be it phonemes, sub-phonemes or combinations of phonemes). Adding new words to the vocabulary requires the generation of such phonetic transcriptions (pronunciations) to allow for the subsequent recognition of these words. It is imperative that a speech recognition system build adequate acoustic models for these new words, as recognition accuracy is strongly dependent on the quality of these models. Generating inadequate models is likely to result in degraded overall performance and lower recognition accuracy of the system. Therefore, any improvement of the methodology and quality of this extension process is of great importance.
According to the current state of the art, a word is typically added to the system by having the user type in the new word and constructing, from the spelling (and most often a sound sample, i.e., the user pronouncing the new word), a new acoustic pattern to be used in future recognition. An algorithmic or statistical system, broadly called a “Letter-to-Sound System” (LTS), is used to derive the most likely pronunciation(s) of the sequence of letters composing the orthographic representation of the word. In general, a Letter-to-Sound System maps individual letters or combinations of letters to a sequence of phonemes which match their pronunciation. Frequently, a statistical approach is used to generate such systems. An important example of the statistical approach is the CART (classification and regression tree). The results generated by an LTS are then combined with the acoustics provided by the user to generate the actual pronunciation(s). A detailed description of one example of how a statistical system may be employed for this task is taught by J. M. Lucassen and R. L. Mercer, “An Information Theoretic Approach to the Automatic Determination of Phonemic Baseforms,” Proc. of ICASSP-84, 42.5.1-42.5.4, 1982, the disclosure of which is incorporated by reference herein.
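The mapping from letters or letter combinations to phonemes described above can be illustrated with a minimal sketch. It assumes a small, hand-written rule table and a greedy longest-match strategy; the table `RULES`, the function `letter_to_sound`, and the phoneme symbols are hypothetical and chosen only for illustration. A practical Letter-to-Sound System would more likely be trained statistically, for instance with CARTs, rather than rely on fixed rules of this kind.

```python
# Hypothetical rule table: letter groups mapped to illustrative phoneme symbols.
RULES = {
    "ch": ["CH"],
    "ee": ["IY"],
    "sh": ["SH"],
    "s": ["S"],
    "p": ["P"],
    "e": ["EH"],
    "c": ["K"],
    "h": ["HH"],
}


def letter_to_sound(word, rules=RULES):
    """Map a spelling to a phoneme sequence by greedy longest-match lookup."""
    letters = word.lower()
    longest = max(len(key) for key in rules)
    phonemes = []
    i = 0
    while i < len(letters):
        # Try the longest letter group first, then progressively shorter ones.
        for n in range(min(longest, len(letters) - i), 0, -1):
            chunk = letters[i:i + n]
            if chunk in rules:
                phonemes.extend(rules[chunk])
                i += n
                break
        else:
            # No rule covers this letter: the spelling is outside the rule set.
            raise KeyError(f"no letter rule for {letters[i]!r} in {word!r}")
    return phonemes
```

With this table, `letter_to_sound("speech")` consumes "s", "p", "ee" and "ch" in turn. A spelling containing a letter with no matching rule raises an error, which loosely corresponds to the "failed letter rules" situation that a rule-based system can encounter.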
Frequently, however, the words added are words derived from a foreign language, customers' names, acronyms, or technical terms that generally do not obey the pronunciation rules of the language per se. This is likely to result in inferior pronunciations being generated, which will cause frequent misrecognitions when running the system, thus degrading the overall performance and quality of the speech system. Sophisticated systems may detect that the acoustics provided (for instance, by the user pronouncing the word) do not match the generated candidate pronunciations and prompt the user for some additional input. However, since users of these systems usually are not phoneticians or even versed in phonetics, it is important, from both a usability and an efficacy point of view, to limit their involvement in the generation of these pronunciations to a minimum.
Some systems allow the user to specify a “sounds-like-spelling” (SLS) pattern (a pseudo-spelling of the word that corresponds to the way the word is pronounced in the given language, such as “eye-triple-ee” for the word “IEEE” in English) to support this process. This approach puts the onus on the user to determine whether or not the word to be added follows the standard pronunciation rules and, if it does not, to provide an alternative spelling that does. These rules are not clearly defined and may even vary within subdomains of a language. This approach tends to break down with users who are not very careful, not very familiar with the language and/or domain, or not very well versed in phonetics.
Letter-to-Sound Systems are also used in various other applications of speech systems, such as speech synthesis of words that are not in the basic lexicon. Like speech recognition systems, these “text-to-speech” synthesis systems (TTS) are faced with a similar difficulty when trying to generate the pronunciation of a word that is not in their basic lexicon.
To demonstrate the urgency of improvements in this area, reference is made for instance to the description of the “Angie” framework (an example of a Letter-to-Sound System) in Aarati D. Parmar, Master's Thesis, MIT 97, A Semi-Automatic System for the Syllabification and Stress Assignment of Large Lexicons, available at: http://www.sls.lcs.mit.edu/sls/publications/index.html. In this experiment, on the TIMIT database, 10 words out of 2500 failed to generate a correct pronunciation because of “irregular spelling” or “failed letter rules.” Moreover, this test set does not even include acronyms or similar items, which are likely to be encountered in everyday business environments.
The present invention provides an improved method and apparatus for adding new words with yet unseen spellings and pronunciations to a vocabulary of a speech system.
In one aspect of the invention, a computerized method is provided for adding a new word to a vocabulary of a speech system, the vocabulary comprising words and corresponding acoustic patterns for a language or language domain. In a determination step, a regularity value is determined for the new word, which measures the conformity of the new word with respect to the pronunciation in the language or language domain. In a comparison step, the regularity value is compared to a threshold value to decide whether the conformity is insufficient. Only if the conformity is found to be insufficient is a prompting step performed, prompting for additional information on the pronunciation of the new word. Finally, in an extension step, the new word and an acoustic pattern of the new word are added to the vocabulary.
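The sequence of determination, comparison, prompting and extension steps can be sketched as follows. This is only an illustration under stated assumptions: the summary above does not say how the regularity value is computed, so the sketch substitutes a hypothetical measure, namely the average smoothed log-probability of the word's letter bigrams relative to known words. All names (`train_bigram_model`, `regularity_value`, `add_word`, and the `prompt` and `make_pattern` callbacks) are assumptions introduced for the example.

```python
import math
from collections import Counter


def _bigrams(word):
    """Letter bigrams of a word, with '^' and '$' marking its boundaries."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train_bigram_model(known_words):
    """Count letter bigrams over words already known to be regular."""
    counts = Counter()
    for word in known_words:
        counts.update(_bigrams(word))
    return counts


def regularity_value(word, counts):
    """Hypothetical regularity measure: mean add-one-smoothed bigram log-probability."""
    total = sum(counts.values())
    types = len(counts) + 1  # one extra slot for unseen bigrams
    grams = _bigrams(word)
    return sum(math.log((counts[g] + 1) / (total + types)) for g in grams) / len(grams)


def add_word(word, vocabulary, counts, threshold, prompt, make_pattern):
    """Add a word, prompting for pronunciation help only when regularity is low."""
    score = regularity_value(word, counts)        # determination step
    extra = None
    if score < threshold:                         # comparison step
        extra = prompt(word)                      # prompting step (only if needed)
    vocabulary[word] = make_pattern(word, extra)  # extension step
    return score
```

Here `prompt` stands in for whatever user interaction collects the additional pronunciation information (for example a sounds-like spelling), and `make_pattern` stands in for the construction of the acoustic pattern; both are placeholders. The point of the design is that a word scoring at or above the threshold is added without further user involvement.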
The present invention provides an automatic determination of the regularity of a proposed word with respect to the standard pronunciation of the language. This lowers the requirement for attention and skills on the user's part in the process of extending a vocabulary. It is neither left up to the user to decide when additional information concerning the pronunciation of a new word is to be provided to the speech system, nor is this additional information omitted when it is needed. In either of those cases, inferior pronunciation models would be constructed. As the recognition accuracy is strongly dependent on the quality of these models, the inventive teachings result in an improved overall performance and higher recognition accuracy of the speech system. The quality of the generated pronunciations in speech systems is improved.
Furthermore, as the user's involvement, in the form of prompts for additional information on the pronunciation, is reduced to a minimum, the user interface can be kept simpler and the user is not exposed to unneeded complexity. As words likely to be pronounced in a standard way do not require further action, valuable time is saved. This is a major selling point for typical clients using speech recognition systems, such as lawyers and medical doctors.
The present invention is inherently language and domain independent and may thus be applied to a variety of languages and domains without further extension. This property is of specific advantage in view of the large number of different languages and language domains which all can be supported with a single solution approach.
Finally, the reduced number of failures when adding new words to a vocabulary leads to reduced user frustration and an improved perception of system usability.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.