1. Field
The following description relates to an apparatus and method for extending a pronunciation dictionary in order to correct the pronunciation transcriptions of a speech database used in acoustic model training for speech recognition.
2. Description of the Related Art
In general, a high-capacity speech recognition system may use an acoustic model, a language model, and a pronunciation dictionary. An acoustic model may be used to recognize a characteristic of a speech signal.
Speech recognition systems may use two types of files to recognize speech: an acoustic model and a language model. The acoustic model is typically created by taking audio recordings and compiling them into statistical representations of the sounds that make up each word. This compilation step is often referred to as training. The language model is generally a file containing the probabilities of sequences of words.
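As an illustration of a language model that stores the probabilities of word sequences, the following is a minimal sketch of a bigram model estimated by counting. It is not taken from any particular recognition system; the function names `train_bigram_model` and `sequence_probability`, the sentence boundary markers, and the three-sentence corpus are all assumptions made for the example.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Estimate P(word | previous word) by relative frequency (hypothetical helper)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # "<s>" and "</s>" are assumed sentence boundary markers.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in bigram_counts.items()
    }


def sequence_probability(model, sentence):
    """Multiply bigram probabilities; an unseen word pair yields 0.0."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        probability *= model.get((prev, word), 0.0)
    return probability


# Illustrative corpus, invented for this sketch.
corpus = ["turn on the light", "turn off the light", "turn on the radio"]
model = train_bigram_model(corpus)
likely = sequence_probability(model, "turn on the light")
unlikely = sequence_probability(model, "light on turn")
```

In this sketch a word order that never occurs in the corpus receives a probability of zero. A practical language model would typically apply smoothing so that unseen sequences are penalized rather than ruled out.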
A large-capacity speech database is used to train the acoustic model. A process of extracting characteristics from the speech database and training the acoustic model on them may also be needed.
A speech database used for an acoustic model may include sound data, for example, recorded voice, and text data transcribing that voice. The sounds and texts are to be matched with each other for accurate acoustic modeling. Otherwise, an optimized acoustic model may not be obtained, and the performance of the speech recognition system may be degraded.
A speech database may be established by having a plurality of speakers read previously selected utterances. Often, an utterance is not read as written due to linguistic phenomena such as fortis and lenis and/or allophones. Accordingly, a pronunciation dictionary that lists the pronunciations of each word may be used for acoustic model training.
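The role of a pronunciation dictionary can be sketched as a mapping from each written word to one or more phone sequences, from which candidate transcriptions of an utterance are derived. The example below is a hypothetical illustration only: the English entries, the ARPAbet-style phone symbols, and the names `PRONUNCIATIONS` and `candidate_transcriptions` are assumptions and do not come from the described apparatus.

```python
from itertools import product

# Hypothetical dictionary: each word maps to its listed pronunciation variants.
PRONUNCIATIONS = {
    "the": ["DH AH", "DH IY"],
    "data": ["D EY T AH", "D AE T AH"],
    "is": ["IH Z"],
}


def candidate_transcriptions(utterance, lexicon):
    """Return every phone-level transcription the dictionary allows."""
    variants = []
    for word in utterance.lower().split():
        if word not in lexicon:
            # The dictionary has no entry for this word at all.
            raise KeyError(f"no pronunciation listed for {word!r}")
        variants.append(lexicon[word])
    # Combine one variant per word, in every possible way.
    return [" ".join(choice) for choice in product(*variants)]


transcriptions = candidate_transcriptions("the data is", PRONUNCIATIONS)
```

Under these assumptions, training can only match a recording against the variants the dictionary lists. A pronunciation that a speaker actually produced but that is absent from the dictionary cannot be represented.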
However, although the pronunciation dictionary is built based on linguistic phenomena, not all pronunciation variations may be considered, and an utterance may be pronounced differently by different speakers. For example, non-linguistic variation may occur based upon the education level, upbringing, and/or age of a speaker. Further, a speaker may not accurately pronounce an utterance when recording.