Closed captioning has been widely implemented in television broadcast systems for terrestrial and satellite broadcast. The purpose of closed captioning is to provide visible textual data in place of auditory data, so that a hearing-impaired audience can read the text instead of listening to the audio. Current closed captioning systems embed the textual data prior to the transmission of the audio and video data. The textual data is then processed by a display device and displayed in a desired format on a video screen.
Thus, prior to transmission or viewing, captioning data is presently embedded into the broadcast transmission stream at the broadcast source. Not all programs, however, are readily adaptable to this technique of embedding closed caption information. For example, it is difficult to add closed caption data to live events or to programs filmed prior to the advent of closed captioning technology. As such, a hearing-impaired viewer may not have text available to aid in understanding such programs.
General-purpose, speaker-dependent (SD) speech recognition products are increasingly used to perform tasks such as telephone-based menu systems and controls. These systems typically employ a Dynamic Time Warping (DTW) model. However, because the DTW model is designed to recognize entire words, as opposed to sub-components of words, its usefulness is limited to systems with small vocabularies. Alternatively, Hidden Markov Model (HMM) based speech recognition systems may be employed where larger vocabularies are needed, as HMM systems examine word sub-components or “phonemes.”
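As an illustration of the whole-word comparison that a DTW model performs, the following is a minimal sketch rather than the implementation of any particular product. It assumes each word has already been reduced to a one-dimensional sequence of feature values and uses the absolute difference as the local cost; the function name `dtw_distance` and the sample sequences are hypothetical.

```python
def dtw_distance(a, b):
    """Accumulated DTW cost between two 1-D feature sequences.

    Assumption: each element is a single numeric feature per frame;
    the local cost is the absolute difference between two frames.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # A frame may match, or either sequence may be stretched in time.
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # one-to-one match
            )
    return cost[n][m]


# The same "word" spoken more slowly still aligns at zero cost,
# while a different word does not.
template = [1, 2, 3]
slow_utterance = [1, 2, 2, 3]
other_word = [3, 1, 1]
same_cost = dtw_distance(template, slow_utterance)
other_cost = dtw_distance(template, other_word)
```

Because every vocabulary word needs its own stored sequence and each input is compared against all of them, the cost of this approach grows with vocabulary size, which is consistent with the small-vocabulary limitation noted above.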
Both the DTW and HMM systems work best when the speech recognition system is “trained” to identify the unique traits of each speaker. This training includes the creation of templates or data sets that identify unique speech characteristics of the speaker utilizing the system, to aid in the recognition of that speaker's speech. Typically, a speaker provides a set of known spoken words to the system for use in training the system. The spoken words are converted into digital data, and a template or model of the speech is then generated; the template or model includes information about various characteristics of the speech. The templates or models generated are stored in a database for use during speech recognition. During recognition, input audio speech signals are processed in the same manner as the audio speech signals that were used to create the templates or models. The signal characteristics or data generated by this processing are then compared to the templates or models. The best match between the input audio speech signals and the template or model is determined in an attempt to identify words of the audio speech signal.
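The training and matching steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any actual recognizer: the feature extractor (mean absolute amplitude over four frames), the in-memory dictionary standing in for the database, and the names `extract_features`, `train`, and `recognize` are all hypothetical.

```python
import math


def extract_features(samples, frames=4):
    """Hypothetical front end: mean absolute amplitude in each of `frames` segments."""
    size = max(1, len(samples) // frames)
    return [
        sum(abs(s) for s in samples[i * size:(i + 1) * size]) / size
        for i in range(frames)
    ]


def train(utterances):
    """Build one template per known word by averaging the features of its recordings.

    `utterances` maps each word to a list of recordings (lists of samples).
    The returned dict plays the role of the template database.
    """
    database = {}
    for word, recordings in utterances.items():
        feats = [extract_features(r) for r in recordings]
        database[word] = [sum(col) / len(col) for col in zip(*feats)]
    return database


def recognize(samples, database):
    """Process the input like the training data, then return the closest template's word."""
    feats = extract_features(samples)
    return min(database, key=lambda word: math.dist(feats, database[word]))


# Made-up sample data: "yes" has its energy early, "no" has it late.
training_set = {
    "yes": [[0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
            [1.0, 0.9, 0.2, 0.1, 0.1, 0.1, 0.1, 0.0]],
    "no": [[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.8]],
}
db = train(training_set)
result = recognize([0.8, 0.9, 0.1, 0.2, 0.1, 0.1, 0.2, 0.1], db)
```

The key point the sketch preserves is that the input passes through the same feature extraction as the training recordings, so the comparison against the stored templates is like for like.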
As can be appreciated, purely knowledge-based or “speaker-independent” (SI) speech recognition systems, which do not require such training, have increasingly become the basis for modern speech recognition applications and systems. Speaker-independent systems may operate in many ways. Some SI systems employ HMMs to directly recognize whole words. These systems, however, tend to have limited vocabularies. Other types of SI systems employ robust HMMs that are trained on a number of different speakers. These systems are similar to the SD systems in that they parse the audio signals into phonemes.