The invention relates to audio/video/imagery processing, and more particularly to audio/video/imagery metadata extraction and analytics.
Extraction and analysis of non-transcribed media has typically been a labor-intensive, human-driven process, which does not allow for extensive and consistent metadata extraction in rapid fashion. One or more persons must view and listen to the source media, e.g., audio and/or video content, and manually transcribe the corresponding audio to generate an index of what took place and when, or to generate closed captioning text that is synchronized to the video. Manually locating and recording a timestamp for even a small fraction of the speech and script elements often requires several hours of work, and doing so for the entire source media may require several days or more.
Currently available systems and methods deal with the extraction and analysis of transcribed media. They time-match a written script text to a raw speech transcript produced from an analysis of recorded dialog to ensure accuracy of the transcript. That is, transcribed source media is processed and the resulting speech-recognized transcript is compared to the written script to ensure accuracy. Such transcripts are used in the movie industry and in video production environments to search or index video/audio content based on the text provided in the written script. An aligned transcript can also be used to generate closed caption text that is synchronized to the actual spoken dialog in the source media.
These automated techniques for time-synchronizing a pre-existing written script with the corresponding video typically utilize a word alignment matrix (e.g., script words vs. transcript words). However, they are traditionally slow and error-prone. These techniques often require a great deal of processing, and their output may contain a large number of errors, rendering it inaccurate. For example, due to noise or other non-dialogue artifacts in speech-to-text transcripts, wrong time values, off by several minutes or more, are often assigned to script text. As a result, the transcript may not be reliable, thereby requiring additional time to identify and correct the errors, or causing users to shy away from its use altogether.
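By way of illustration only, the general idea of a word alignment matrix can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any particular prior system: the function name `align_script_to_transcript`, the exact-match comparison of lowercased words, and the sample data are all hypothetical. The matrix holds longest-common-subsequence lengths between script words and transcript words, and each matched script word inherits the timestamp of its transcript word.

```python
def align_script_to_transcript(script_words, transcript):
    """Assign transcript timestamps to script words (illustrative only).

    script_words: list of str from the written script.
    transcript:   list of (word, start_seconds) from speech recognition.
    Returns a list with one entry per script word: a timestamp, or None
    when no matching transcript word was found.
    """
    n, m = len(script_words), len(transcript)

    # Alignment matrix: lcs[i][j] is the length of the longest common
    # subsequence of script_words[i:] and the words of transcript[j:].
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if script_words[i].lower() == transcript[j][0].lower():
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the matrix from the top-left corner, copying a timestamp
    # whenever a script word matches a transcript word.
    times = [None] * n
    i = j = 0
    while i < n and j < m:
        if script_words[i].lower() == transcript[j][0].lower():
            times[i] = transcript[j][1]
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1  # script word has no counterpart in the transcript
        else:
            j += 1  # transcript word (e.g., noise) is not in the script
    return times


# Hypothetical sample: "quick" is misrecognized and "um" is a filler word.
script = ["the", "quick", "brown", "fox"]
recognized = [("the", 0.0), ("quik", 0.4), ("brown", 0.9),
              ("um", 1.2), ("fox", 1.5)]
print(align_script_to_transcript(script, recognized))
# prints [0.0, None, 0.9, 1.5]
```

In this sketch the matrix has one cell per pair of script and transcript words, so its size grows with the product of the two lengths, and a single misrecognized word leaves a script word without any time value, which is consistent with the processing cost and error sensitivity described above.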
These problems are exacerbated when one must extract and analyze non-transcribed media, because there is no written script against which to compare the speech transcript for accuracy.
Accordingly, it is desirable to provide a technique for producing efficient and accurate time-aligned machine-transcribed media that is normalized to a single universal amplitude scale. That is, the claimed invention proceeds upon the desirability of providing a method and system for storing and applying automated machine speech and facial/entity recognition to large volumes of non-transcribed video and/or audio media streams to provide searchable transcribed content that is normalized to a single universal amplitude scale. The searchable transcribed content can be searched and analyzed for metadata to provide a unique perspective on the data via server-based queries.