This invention relates generally to generating phonetic spellings of words, and more specifically to a method and apparatus for generating phonetic spellings of words that are collected in a pronunciation dictionary, such that the phonetic spellings are generated by a pronunciation-learning module configured to accept as input a list of words and transcribed acoustic data that includes acoustic samples of words spoken by a set of speakers and the transcribed words therefor.
Automatic speech recognition systems and speech synthesis systems are being deployed in a broad variety of government, business, and personal applications. Such systems provide simplified, automated communication between people and computers. Constructing automatic speech recognition (ASR) systems and speech synthesis systems is a laborious process performed by experts in the fields of linguistic modeling and acoustic modeling. The creation of given aspects of ASR systems and speech synthesis systems has been automated to some extent, such as by automatic generation of pronunciation dictionaries. Pronunciation dictionaries typically include phonetic spellings (or “pronunciations”) of words spelled with the phones of a phonetic alphabet. Pronunciation dictionaries and their pronunciations can be used by both ASR systems and speech synthesis systems to facilitate communication between people and computers. For example, ASR systems can be configured to compare an acoustic waveform of a spoken word against a set of pronunciations in a pronunciation dictionary to determine whether the spoken word matches one or more of the pronunciations. In matching spoken words to pronunciations, meanings can be extracted from the spoken words and can be used to direct a computer or machine to perform a requested task, such as dialing a telephone extension, making a bank deposit or other task. Speech synthesis systems can be configured to use a pronunciation dictionary by electronically articulating words according to their pronunciations in the pronunciation dictionary. For example, in an automated telephone dialing system, a speech synthesis system can be configured to articulate names or other words as they are phonetically spelled in a pronunciation dictionary.
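By way of illustration only, a pronunciation dictionary of the kind described above can be sketched as a mapping from each word to one or more phone sequences. The sketch below assumes the spoken word has already been decoded into phones, which simplifies the waveform comparison an ASR system actually performs. The names `PRONUNCIATION_DICTIONARY` and `words_matching` and the "Boffey" entry are hypothetical; the "Beaufays" entries are the surname pronunciations discussed in this background.

```python
from typing import Dict, List, Sequence, Tuple

Pronunciation = Tuple[str, ...]

# Toy dictionary: each word maps to one or more phonetic spellings.
# The "Beaufays" entries come from this background section; the
# "Boffey" entry is a made-up placeholder for illustration only.
PRONUNCIATION_DICTIONARY: Dict[str, List[Pronunciation]] = {
    "Beaufays": [
        ("b", "u", "f", "e"),
        ("b", "o", "f", "e"),
        ("b", "o", "f", "A", "i"),
    ],
    "Boffey": [("b", "o", "f", "i")],  # hypothetical entry
}


def words_matching(
    phones: Sequence[str],
    dictionary: Dict[str, List[Pronunciation]],
) -> List[str]:
    """Return, in sorted order, the words having a pronunciation equal to `phones`."""
    target = tuple(phones)
    return sorted(word for word, prons in dictionary.items() if target in prons)


if __name__ == "__main__":
    print(words_matching(["b", "o", "f", "e"], PRONUNCIATION_DICTIONARY))
```

A speech synthesis system would read the same mapping in the other direction, looking up a word and articulating one of its stored phone sequences.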
One automated method of generating pronunciation dictionaries includes the use of letter-to-phone engines configured to match sequences of phones to sets of alphabetic letters of a spelled word. While letter-to-phone engines have been used with some success to generate pronunciations of simple words, more complicated words, such as given names and surnames, do not lend themselves as easily to letter-to-phone matching to generate valid pronunciations. For example, an American speaker is likely to pronounce the first inventor's surname, Beaufays, as [b u f e] (Computer Phonetic Alphabet spelling), a French speaker is likely to say [b o f e], and a French-speaking Belgian will likely say [b o f A i]. A letter-to-phone engine is likely to generate a pronunciation not matching any of the above pronunciations due, for example, to the silence of given letters in the spoken name and varied pronunciations of letter groups.
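The limitation described above can be illustrated with a minimal sketch of a rule-based letter-to-phone engine. The rule table `LETTER_RULES`, the function `letter_to_phone`, and the greedy longest-match strategy are assumptions made for this example only; they do not represent any particular engine, and the rules are a hypothetical hand-written subset.

```python
from typing import Dict, List, Tuple

# Hypothetical rule table mapping letter groups to phones.
# A real engine would use far more rules, or a trained model.
LETTER_RULES: Dict[str, Tuple[str, ...]] = {
    "eau": ("o",),
    "ay": ("e",),
    "a": ("A",),
    "b": ("b",),
    "e": ("e",),
    "f": ("f",),
    "s": ("s",),
    "u": ("u",),
    "y": ("i",),
}

MAX_RULE_LENGTH = max(len(letters) for letters in LETTER_RULES)


def letter_to_phone(word: str) -> List[str]:
    """Spell `word` phonetically by greedy longest-match over LETTER_RULES."""
    letters = word.lower()
    phones: List[str] = []
    position = 0
    while position < len(letters):
        # Try the longest letter group first, then shorter ones.
        for size in range(MAX_RULE_LENGTH, 0, -1):
            group = letters[position:position + size]
            if group in LETTER_RULES:
                phones.extend(LETTER_RULES[group])
                position += len(group)
                break
        else:
            raise ValueError(f"no rule for letter {letters[position]!r}")
    return phones


if __name__ == "__main__":
    # Pronunciations attributed to actual speakers in the text above.
    attested = [
        ["b", "u", "f", "e"],
        ["b", "o", "f", "e"],
        ["b", "o", "f", "A", "i"],
    ]
    generated = letter_to_phone("Beaufays")
    print(generated, generated in attested)
```

Under these assumed rules the engine voices the final "s", which is silent in all three spoken forms, so the generated spelling matches none of them. The rules have no knowledge of how speakers actually say the name.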
Linguists are often employed to verify and adjust pronunciations generated by letter-to-phone engines. However, the use of trained linguists to correct pronunciations is relatively costly and relatively slow. For example, a well-trained linguist may be able to generate and/or correct the pronunciations of about 65 to 85 words per hour. If, however, a linguist does not have access to acoustic samples of the words for which corrected pronunciations are desired, the linguist may be unable to correct those pronunciations. Moreover, if a linguist is not trained in a given foreign language or a given dialect of a foreign language, the linguist may be unable to verify and correct pronunciations, especially those of given names and surnames. As the demand for larger and relatively more accurate ASR systems and speech synthesis systems increases, so too does the demand for larger and relatively more accurate pronunciation dictionaries. Correspondingly, demand also increases for automated systems and techniques that produce pronunciation dictionaries relatively quickly, at relatively low cost, and with relatively accurate pronunciations.
What is needed, specifically, are automated development methods and systems that generate pronunciations that relatively accurately match acoustic samples of words spoken by a set of speakers.