In the field of computer technology, the past decade has witnessed massive advancements in various applications, such as automatic speech recognition (ASR) of multimedia content (e.g., a video lecture). In a traditional environment, an ASR system may extract the audio content of the multimedia content and thereafter generate a text transcript of the audio content, along with estimates of the corresponding time stamps.
In certain scenarios, the performance of the ASR system may deteriorate if the audio content is on a niche topic, for example, the mathematical formulation of the laws of gravity. Instructional videos usually fall in this category of niche topics. Such deterioration in the performance of the ASR system is largely due to the lack of domain-specific esoteric words in the vocabulary of the ASR system. Further, there may not be adequate domain-specific data to train the ASR system. Another factor that deteriorates the performance of the ASR system in the case of niche topics is drift among topics within the multimedia content, which may require a different ASR system that is trained to cover such topics.
Further, the recognition results of the ASR system may contain a large number of phoneme-based errors, which arise when the ASR system misinterprets words spoken in the multimedia content. Such errors may degrade the quality of the transcript of the multimedia content and, furthermore, may confuse the user. Thus, there is a need for an improved, efficient, and automated mechanism for obtaining error-free transcripts of the multimedia content.
Further limitations and disadvantages of conventional and traditional approaches will become apparent to a person having ordinary skill in the art, through a comparison of described systems with some aspects of the present disclosure, as set forth in the remainder of the present application and with reference to the drawings.