This invention relates to the field of speech recognition and speech synthesis. This invention is particularly applicable to the generation of speech recognition dictionaries including transcriptions for use in speech recognition systems as may be used in a telephone directory assistance system, voice activated dialing (VAD) system, personal voice dialing system and other speech recognition enabled services. This invention is also applicable to text to speech synthesizers for generating suitable pronunciations of vocabulary items.
Speech recognition enabled services are increasingly popular. Such services may include stock quotes, directory assistance, reservations and many others.
In typical speech recognition systems, the user enters a request using isolated word, connected word or continuous speech via a microphone or telephone set. If valid speech is detected, the speech recognition layer of the system is invoked in an attempt to recognize the unknown utterance. Typically, entries in a speech recognition dictionary are scored in order to determine the most likely match to the utterance. The recognition of speech involves aligning an input audio signal with the most appropriate target speech model.
Speech recognition dictionaries used in such speech recognition systems typically comprise a group of transcriptions associated with a given vocabulary item. A transcription is a representation of the pronunciation of the associated vocabulary item when uttered by a human. Typically, a transcription is the acoustic representation of a vocabulary item as a sequence of sub-transcription units. A number of acoustic sub-transcription units can be used in a transcription, such as phonemes, allophones, triphones, syllables and dyads (demi-syllables). Commonly, the phoneme is used as the sub-transcription unit, and the representation in such a case is designated as a “phonemic transcription”.
In most cases, multiple transcriptions are provided for each vocabulary item thereby allowing for different pronunciations of the vocabulary item. Typically, a limit on the total number of transcriptions in a speech recognition dictionary is imposed due to the inherent computational limits of the speech recognizer as well as due to the memory requirements for storing the transcriptions. Commonly, the limit on the total number of transcriptions is put into practice by limiting the maximum number of transcriptions stored for each vocabulary item.
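The per-item limit described above can be sketched as follows. This is a minimal illustration only: the function name `cap_transcriptions`, the placeholder transcription strings, and the assumption that each item's transcriptions are already ordered from most to least preferred are all hypothetical and are not taken from the original text.

```python
def cap_transcriptions(dictionary, max_per_item):
    """Keep at most max_per_item transcriptions for each vocabulary item.

    Assumes (hypothetically) that each list is ordered from most to least
    preferred, so truncation keeps the preferred transcriptions.
    """
    return {
        item: transcriptions[:max_per_item]
        for item, transcriptions in dictionary.items()
    }


# Placeholder data: "t1", "t2", ... stand in for real transcriptions.
dictionary = {
    "item_a": ["t1", "t2", "t3", "t4"],
    "item_b": ["t1"],
}

capped = cap_transcriptions(dictionary, max_per_item=3)

# The per-item cap bounds the total size of the dictionary.
total = sum(len(transcriptions) for transcriptions in capped.values())
```

With a cap of `max_per_item`, the total number of stored transcriptions cannot exceed `max_per_item` times the number of vocabulary items, which is how a per-item limit enforces the overall limit.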
Of particular interest here are multi-lingual pronunciations. A common method is to provide, for each vocabulary item, one transcription for each language that the dictionary is intended to support, in order to account for the different possible pronunciations of the vocabulary item in the different languages. A specific example will better illustrate this method. Suppose the vocabulary item is “Robert” and the languages that the dictionary is intended to support are French, English, German, Russian and Spanish. The dictionary will comprise five transcriptions for each vocabulary item, one transcription for each language.
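The one-transcription-per-language method can be sketched as follows. The function `transcribe` is a hypothetical placeholder that returns a labelled string instead of a real phonemic transcription; an actual system would use a language-specific transcription tool in its place.

```python
LANGUAGES = ["French", "English", "German", "Russian", "Spanish"]


def transcribe(item, language):
    # Hypothetical placeholder: returns a label, not a real transcription.
    return f"<{language} transcription of {item}>"


def build_per_language_dictionary(vocabulary, languages):
    """Store one transcription per supported language for every item."""
    return {
        item: {language: transcribe(item, language) for language in languages}
        for item in vocabulary
    }


per_language = build_per_language_dictionary(["Robert"], LANGUAGES)
```

Every vocabulary item receives one entry per supported language, regardless of how likely the item is to be pronounced in that language.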
A deficiency of the above-described method is that it does not provide any mechanism for including language probability information in the selection of the transcriptions. Consequently, a large number of transcriptions having a low likelihood of being used by a speech processing device are stored. These transcriptions take up memory space and increase the computational load of speech processing devices making use of them, since more transcriptions have to be scored. Continuing the specific example of the vocabulary item “Robert”, it is unlikely for this vocabulary item to be pronounced on the basis of a Russian pronunciation since “Robert” is an uncommon name in that language.
Thus, there exists a need in the industry to refine the process of generating a group of transcriptions capable of being used by a speech processing device such as a speech recognition dictionary or a text to speech synthesizer.
The invention provides a method and apparatus for generating transcriptions suitable for use in a speech-processing device. The invention provides processing a vocabulary item to derive a characteristic from the vocabulary item that allows a pool of available languages to be divided into a first sub-group and a second sub-group. The vocabulary item has a higher probability of belonging to any one of the languages in the first sub-group than of belonging to any language in the second sub-group. The invention further provides processing the vocabulary item to generate a group of transcriptions, the group of transcriptions being characterized by the absence of at least one transcription belonging to a language in the second sub-group of languages.
The advantage of this data structure over prior art data structures resides in the reduction of unnecessary transcriptions.
In a specific example of implementation, the vocabulary items in the sub-set are further associated with transcriptions belonging to a common default language.
Preferably but not essentially, a characteristic allowing the pool of available languages to be divided into the first sub-group and the second sub-group is the etymology of the vocabulary item.
In accordance with another broad aspect, the invention further provides a method for generating a group of transcriptions suitable for use in a speech processing device. The method comprises providing a vocabulary item and processing it to derive a characteristic that allows a pool of available languages to be divided into a first sub-group and a second sub-group. The vocabulary item manifests a higher probability of belonging to any language in the first sub-group than of belonging to a language in the second sub-group. The method further comprises processing the vocabulary item to generate a group of transcriptions, the group of transcriptions being characterized by the absence of at least one transcription belonging to a language in the second sub-group of languages established for the vocabulary item. Optionally, the method further comprises storing the group of transcriptions on a computer readable storage medium in a format suitable for use by a speech-processing device.
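The steps of this method can be sketched as follows, under several stated assumptions. The text does not specify how the characteristic is turned into language probabilities, so the sketch takes the probabilities as an input. The probability values, the threshold rule used to split the pool, the placeholder `transcribe` function and the JSON storage format are all hypothetical choices made for illustration.

```python
import json


def transcribe(item, language):
    # Hypothetical placeholder: returns a label, not a real transcription.
    return f"<{language} transcription of {item}>"


def split_languages(language_probabilities, threshold):
    """Divide the pool of languages into two sub-groups.

    Every language in the first sub-group has a probability at or above the
    threshold, and every language in the second falls below it, so each
    language in the first is more probable than each one in the second.
    """
    first = sorted(
        lang for lang, p in language_probabilities.items() if p >= threshold
    )
    second = sorted(
        lang for lang, p in language_probabilities.items() if p < threshold
    )
    return first, second


def generate_transcriptions(item, language_probabilities, threshold=0.05):
    """Generate transcriptions only for languages in the first sub-group."""
    first, _second = split_languages(language_probabilities, threshold)
    return {language: transcribe(item, language) for language in first}


# Invented illustrative values, not measured data.
robert_probabilities = {
    "French": 0.35,
    "English": 0.35,
    "German": 0.20,
    "Spanish": 0.08,
    "Russian": 0.02,
}

group = generate_transcriptions("Robert", robert_probabilities)

# Optional storage step, using JSON as one possible format.
serialized = json.dumps({"Robert": group}, indent=2)
```

In this sketch the Russian transcription is absent from the resulting group because Russian falls into the second sub-group. Generating a transcription for every language in the first sub-group is only one option; the method as stated requires just that at least one second sub-group transcription be left out.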
Preferably but not essentially, the method provides processing the vocabulary item to generate transcriptions corresponding to each language belonging to the first sub-group.
Preferably but not essentially, the characteristic allowing the pool of available languages to be divided into the first sub-group and the second sub-group is the etymology of the vocabulary item.
In accordance with another broad aspect, the invention further provides an apparatus for implementing the above-described method.
In accordance with another broad aspect, the invention provides a computer readable medium comprising a program element suitable for execution by a computing apparatus for implementing the above-described method.
In accordance with another broad aspect, the invention further provides a computer readable medium containing a speech recognition dictionary comprising transcriptions generated by the above-described method.
Other aspects and features of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying figures.