Speech recognition (SR) is used to automatically convert speech to text (speech-to-text conversion). In more detail, sound (speech) is first converted into SR phoneme sequences. Normally, this is a statistical process, i.e. a set of possible phoneme sequences with varying probabilities is produced for any given utterance. Then, the SR phoneme sequences are looked up in an SR lexicon that provides a mapping of SR phoneme sequences to words. Finally, additional algorithms (e.g. based on a language model or on grammars) are applied to generate a final textual transcription of the utterance.
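The lexicon lookup step described above can be sketched as follows. This is a minimal illustration only: the phoneme symbols are SAMPA-like, and the lexicon entries, the probabilities and the function name `recognize` are assumptions made for the example, not part of any particular speech recognition system. Language-model or grammar-based post-processing is omitted.

```python
# Toy SR lexicon: maps phoneme sequences (tuples of SAMPA-like symbols) to words.
# All entries are illustrative assumptions, not taken from a real system.
SR_LEXICON = {
    ("n", "eI", "t", "@U"): "NATO",
    ("r", "u:", "t"): "route",
}


def recognize(hypotheses, lexicon):
    """Return the word for the most probable hypothesis found in the lexicon.

    hypotheses: iterable of (phoneme_sequence, probability) pairs.
    Returns a (word, probability) tuple, or None if no hypothesis matches.
    """
    best = None
    for phonemes, prob in hypotheses:
        word = lexicon.get(tuple(phonemes))
        if word is not None and (best is None or prob > best[1]):
            best = (word, prob)
    return best


# Several candidate phoneme sequences for one utterance, with probabilities.
hypotheses = [
    (("n", "eI", "d", "@U"), 0.5),  # most probable, but not in the lexicon
    (("n", "eI", "t", "@U"), 0.4),
    (("r", "u:", "t"), 0.1),
]
result = recognize(hypotheses, SR_LEXICON)
```

In this sketch the most probable hypothesis has no lexicon entry, so the second hypothesis determines the recognized word.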
The inverse process to speech recognition is speech synthesis (text-to-speech conversion). Here, a given text is converted into an enhanced phoneme sequence, i.e. a phoneme sequence enhanced with prosody (pitch, loudness, tempo, rhythm, etc.), which is then converted into sound (synthetic speech).
The SR lexicon is a means for mapping SR phoneme sequences to words. When words are added to an SR lexicon, one or more SR phoneme sequences representing the pronunciation(s) of the word are given. These sequences can be generated either automatically (by well-known methods) or manually, by the author (the user of the speech recognition system) or by somebody in an administrative/maintenance role on behalf of the author.
If there is a mismatch between the SR phoneme sequences stored in the SR lexicon of a speech recognition system (i.e. the “expected pronunciations”) and the actual pronunciations used by an author, misrecognitions will occur, and the performance of the system will be poor. Therefore, the quality of the phonetic transcriptions is important.
It is also well known that prior art methods for automatically generating phonetic transcriptions do not produce “correct” results (i.e. SR phoneme sequences representing actual pronunciations) for “special” words such as acronyms, because their pronunciation does not follow regular rules: e.g. “NATO” is pronounced as one word, whereas for “USA” the letters U-S-A are pronounced separately.
Furthermore, authors are normally untrained in phonetic transcription and cannot be expected to produce correct transcriptions in a phonetic alphabet such as SAMPA or IPA.
Therefore, a known technique for allowing authors to guide the automatic phonetic transcription process is to let them use a “spoken like” text: instead of passing the special word to the system, at least one ordinary word that is pronounced similarly to the special word may be entered. In the example above, the “spoken like” text for “NATO” would be “nato”. The automatic phonetic transcription would then generate an SR phoneme sequence for the whole word. On the other hand, the “spoken like” text for “USA” would be “you ess a” (resulting in an SR phoneme sequence similar to spelling the separate letters). However, it is often not easy for authors to find “ordinary” words that closely represent the pronunciation of the “special” word. Another known method is to have authors speak the word and try to derive the SR phoneme sequence from the author's utterance using so-called “phoneme recognition”. This method is error-prone and sensitive to noise, unclear pronunciation, etc.
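The “spoken like” technique can be sketched as follows. The pronunciation dictionary, its SAMPA-like phoneme symbols and the helper names `transcribe_spoken_like` and `add_special_word` are hypothetical and serve only to illustrate the idea; a real system would use a full automatic phonetic transcription component instead of a small lookup table.

```python
# Toy pronunciation dictionary for ordinary words (SAMPA-like symbols).
# The entries are illustrative assumptions, not authoritative transcriptions.
ORDINARY_WORDS = {
    "nato": ["n", "eI", "t", "@U"],
    "you": ["j", "u:"],
    "ess": ["e", "s"],
    "a": ["eI"],
}


def transcribe_spoken_like(spoken_like, dictionary=ORDINARY_WORDS):
    """Concatenate the phonemes of each ordinary word in the "spoken like" text."""
    phonemes = []
    for word in spoken_like.lower().split():
        if word not in dictionary:
            raise KeyError(f"no pronunciation known for {word!r}")
        phonemes.extend(dictionary[word])
    return phonemes


def add_special_word(lexicon, word, spoken_like):
    """Store the special word under the phoneme sequence of its "spoken like" text."""
    lexicon[tuple(transcribe_spoken_like(spoken_like))] = word


sr_lexicon = {}
add_special_word(sr_lexicon, "NATO", "nato")       # pronounced as one word
add_special_word(sr_lexicon, "USA", "you ess a")   # pronounced letter by letter
```

The sketch also shows the limitation noted above: the author must find ordinary words the dictionary knows and that approximate the intended pronunciation, otherwise no usable transcription results.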
In US 2005/0203738 A1, a learning technique is disclosed that addresses the above problem. As a solution, it is suggested to employ a speech-to-phoneme module that converts speech into a phonetic sequence; furthermore, a text-to-phoneme component is provided to convert an inputted reference text into one or more text-based phonetic sequences. The text-based phonetic sequences are aligned in a table with the speech-based phonetic sequence, and a phonetic sequence for representing the speech input is determined. However, it is not possible for the author to judge the quality of the transcription. As an example, it is assumed that the word “route” is given; when this word is typed in, the system might generate a phonetic transcription corresponding to a pronunciation similar to “root”, whereas the user e.g. pronounces the word similarly to “rowt”. When the generated pronunciation is used later on by the speech-to-text system, the system would not recognize the word “route” when the user says “rowt”. Furthermore, when the user says “root”, the system would recognize the word “route”, which was not what the user meant to say.
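The “route” example can be made concrete with a small sketch. It does not reproduce the technique of US 2005/0203738 A1; it only illustrates the consequence of a transcription that the author cannot check. The SAMPA-like phoneme symbols, the single-entry lexicon and the function name `lookup` are assumptions made for the illustration.

```python
# Phoneme symbols are SAMPA-like and purely illustrative.
ROOT_LIKE = ("r", "u:", "t")   # pronunciation similar to "root"
ROWT_LIKE = ("r", "aU", "t")   # pronunciation similar to "rowt"

# Lexicon after the system has automatically transcribed the typed word "route"
# with the "root"-like pronunciation, unknown to the author.
sr_lexicon = {ROOT_LIKE: "route"}


def lookup(phonemes, lexicon):
    """Return the word stored for this phoneme sequence, or None if absent."""
    return lexicon.get(tuple(phonemes))


said_rowt = lookup(ROWT_LIKE, sr_lexicon)  # user says "rowt", meaning "route"
said_root = lookup(ROOT_LIKE, sr_lexicon)  # user says "root", meaning "root"
```

Under these assumptions, the first lookup finds no word although the user meant “route”, and the second lookup yields “route” although the user meant “root”.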