The amount of speech-based audio and audio-visual content created for human consumption has substantially increased. Such content has led to the extensive use of narration track recordings in the form of speeches, podcasts, advertisements, films, tutorial videos, etc. Additionally, the duration of these recordings often exhibits large variability, ranging from a few seconds to multiple hours. Audio editing tools have enabled users to manipulate audio signals using a multitude of operations to create high-quality narration audio. Content-based editing tools and tools that provide immediate feedback about the speech have been proposed for efficiently recording narration tracks.
However, the creation of narration tracks is an error-prone process. Unintentional mispronunciations, pauses, non-lexical utterances (e.g., “huh,” “um,” “like”), and other speech disfluencies are commonly encountered in narration tracks. Sudden transient events in uncontrolled environments (e.g., a sneeze in a lecture) can also obscure one or more words of the narration audio. A process known as “redubbing” enables such errors to be corrected without having to re-record an entire narration sequence.
An illustrative example of redubbing is the replacement of an incorrect word in an audio recording. A user may re-record only the sentence that contains the wrong word, manually determine the position of the error in the original recording, and replace the erroneous sentence with the newly recorded, correct sentence. But the position of the error may be difficult to determine. For instance, manually redubbing certain speech signals is prohibitively difficult if a user is unfamiliar with sophisticated audio editing tools, even if the audio recording is relatively short (e.g., less than one minute). In instances where the audio recording including the error is longer (e.g., a few hours or more), locating the error may be difficult even for the most knowledgeable users.
Further, once the error is identified, replacing the error with the corrected audio in a seamless manner may be difficult. For example, differences in acoustics or background noise between the recordings (e.g., the original and replacement audio being recorded in different rooms having different acoustic properties) may result in the corrected portion being easily detectable in the redubbed audio recording.