In a video production environment, a script serves as a roadmap to when and how elements of a movie/video will be produced. In addition to specifying dialogue to be recorded, scripts are a rich source of additional metadata and include numerous references to characters, people, places, and things. During the production process, directors, editors, sound engineers, set designers, marketing, advertisers, and other production personnel are interested in knowing which people, places, and things occurred or will occur in certain scenes. This information is often present in the script but is not typically directly correlated to the corresponding video content (e.g., video and audio) because timing information is missing from the script. That is, elements of the script are not correlated with the time at which they appear in the corresponding video content. Thus, it may be difficult to link script elements (e.g., spoken dialogue) with the time when they actually occur within the corresponding video. For example, although production personnel may know from the script that a character speaks a certain line of dialogue in a scene, they may not be able to readily determine the precise time in the working or final video when that line was spoken. A full script can include several thousand script elements or entities. To find the actual point in time when a particular event (e.g., a line being spoken) occurs in a corresponding movie/video, a viewer may have to manually search the video content to locate the event so that the corresponding timecode can be manually recorded. Thus, production personnel may not be able to easily search or index their scripts and video content.
When a known, written script text is time-matched to a raw speech transcript produced from an analysis of recorded dialogue, the script text is said to be “aligned” with the recorded dialogue, and the resulting script may be referred to as an “aligned script.” Aligned scripts may be useful because production personnel often desire to search or index video/audio content based on the text provided in the script. Moreover, production personnel may desire to generate closed caption text that is synchronized to actual spoken dialogue in video content. However, due to variations in spoken dialogue versus the corresponding written text, as well as gaps, pauses, sound effects, music, etc. in the recorded dialogue, time aligning is a difficult task to automate. Typically, time-aligning textual scripts and metadata to actual video content is a tedious manual process that can be expensive and time-consuming. For example, a person may have to view and listen to video content and manually transcribe the corresponding audio to generate an index of what took place and when, or to generate closed captioning text that is synchronized to the video. Manually locating and recording a timecode for even a small fraction of the dialogue words and script elements within a full-length movie often requires several hours of work, and doing this for the entire script might require several days or more. Similar difficulties may be encountered while creating video descriptions for the visually impaired. For example, a movie may be manually searched to identify gaps in dialogue for the insertion of video description narrations that describe visual elements (e.g., actions, settings) and provide a more complete description of what is taking place on screen.
Although some automated techniques for time-synchronizing scripts and corresponding video have been implemented, such as using a word alignment matrix (e.g., script words vs. transcript words), they are traditionally slow and error-prone. These techniques often require a great deal of processing and may produce a large number of errors, rendering the output inaccurate. For example, due to noise or other non-dialogue artifacts in speech-to-text transcripts, wrong time values, off by several minutes or more, are often assigned to script text. As a result, the output may not be reliable, thereby requiring additional time to identify and correct the errors, or causing users to shy away from its use altogether.
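To make the word alignment matrix idea concrete, the following is a minimal sketch of one common way such a matrix can be built: a word-level edit-distance table with script words on one axis and timestamped transcript words on the other. It is an illustration only, not the technique of any particular prior system; the function name `align_script_to_transcript`, the `(word, seconds)` transcript format, and the unit edit costs are all assumptions made for this example.

```python
from typing import List, Optional, Tuple


def align_script_to_transcript(
    script_words: List[str],
    transcript: List[Tuple[str, float]],  # hypothetical format: (word, time in seconds)
) -> List[Optional[float]]:
    """Return one time per script word, or None where no transcript word matched."""
    n, m = len(script_words), len(transcript)

    # cost[i][j] = word-level edit distance between the first i script words
    # and the first j transcript words (unit costs are an assumption).
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = script_words[i - 1].lower() == transcript[j - 1][0].lower()
            cost[i][j] = min(
                cost[i - 1][j - 1] + (0 if same else 1),  # match or substitution
                cost[i - 1][j] + 1,                       # script word not spoken
                cost[i][j - 1] + 1,                       # extra transcript word
            )

    # Walk back through the matrix and copy times onto exactly matching words.
    times: List[Optional[float]] = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        same = script_words[i - 1].lower() == transcript[j - 1][0].lower()
        if cost[i][j] == cost[i - 1][j - 1] + (0 if same else 1):
            if same:
                times[i - 1] = transcript[j - 1][1]
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return times


script = ["to", "be", "or", "not", "to", "be"]
spoken = [("to", 1.0), ("bee", 1.4), ("or", 1.9), ("um", 2.2),
          ("not", 2.5), ("to", 2.9), ("be", 3.2)]
print(align_script_to_transcript(script, spoken))
```

Because the table holds one cell for every pairing of a script word with a transcript word, the work grows with the product of the two lengths, which gives a sense of why matrix-based approaches can be slow on a full-length script. A misrecognized word (here "bee") simply leaves the corresponding script word without a time.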
Accordingly, it is desirable to provide a technique for efficient and accurate time-alignment of a script document and corresponding video content.