Recorded audio narration plays a crucial role in many scenarios including animation, computer games, demonstration videos, documentaries, and podcasts. After narration is recorded, most of these applications require editing. Typical audio editing interfaces present a visualization of the audio waveform and provide the user with standard select, cut, copy and paste operations (in addition to low-level operations like time stretching, pitch bending, or envelope adjustment), which are applied to the waveform itself.
Such interfaces can be cumbersome, especially for non-experts, because editing audio narration with conventional software typically involves many painstaking low-level manipulations. Researchers have addressed this problem by aligning the waveform with a transcript of the narration. Some state-of-the-art systems allow the editor to work in the text transcript and perform select, cut, copy and paste operations directly on the text; these operations are then automatically applied to the corresponding regions of the waveform.
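By way of illustration only, the following minimal sketch shows how a cut made in the transcript might be applied to the waveform. It assumes a word-level alignment is already available as a list of (word, start time, end time) entries; the function name `cut_words` and the alignment format are hypothetical and are not taken from any particular system.

```python
import numpy as np

def cut_words(samples, sample_rate, alignment, first, last):
    """Remove words first..last (inclusive) from the transcript and the waveform.

    alignment is a list of (word, start_sec, end_sec) tuples (assumed format).
    Returns the edited samples and the updated alignment.
    """
    # Convert the time span of the removed words into sample indices.
    start = int(round(alignment[first][1] * sample_rate))
    end = int(round(alignment[last][2] * sample_rate))
    # Splice the waveform around the removed span.
    edited = np.concatenate([samples[:start], samples[end:]])
    # Shift the timestamps of the words that follow the cut.
    removed = (end - start) / sample_rate
    shifted = [(w, s - removed, e - removed) for w, s, e in alignment[last + 1:]]
    return edited, alignment[:first] + shifted

# Toy example: three one-second words at 100 samples per second.
samples = np.arange(300)
alignment = [("hello", 0.0, 1.0), ("big", 1.0, 2.0), ("world", 2.0, 3.0)]
edited, new_alignment = cut_words(samples, 100, alignment, 1, 1)
```

In this toy example, deleting the word "big" removes the middle second of audio and moves "world" one second earlier. A practical editor would typically also smooth the splice point, which this sketch omits.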
While cut-copy-paste operations are supported, one aspect remains conspicuously missing from text-based audio editors: insertion and replacement. In many circumstances, inserting a new word or phrase, or replacing an existing one, would be useful during editing, for example to replace a misspoken word or to insert an adjective for emphasis. While it is easy for a person to type a new word that does not appear in the transcript, it is not obvious how to synthesize the corresponding audio. The challenge is to synthesize the new word in a voice that matches the rest of the narration.
It is possible to record new audio of just the missing word, but to do so requires access to the original voice talent. Moreover, even when the original narrator, microphone and acoustic environment are available for a new recording, it remains difficult to match the audio quality of an inserted word or phrase to the context around it. Thus, an insertion or replacement is often evident in the edited audio. Regardless, just as it is easier to type than to edit audio waveforms for cut and paste operations, it is also easier to type for insertion or replacement rather than record new audio.
Voice conversion (“VC”) refers to any algorithm for making an utterance of one person sound as if it were made by another. Approaches to VC typically rely on a training set of parallel utterances spoken by both the source speaker and the target speaker. State-of-the-art parametric methods then explicitly model a conversion function mapping from the source to the target in some feature space such as MFCC (“Mel Frequency Cepstral Coefficients”) or STRAIGHT. A new source utterance (the query) may be transformed into the feature space and then mapped through the conversion function to match the target. The output of such parametric methods must be re-synthesized from these features, and artifacts are inevitable since these feature spaces do not perfectly model the human voice. Thus, the converted speech usually has a muffled effect as a result of re-synthesis.
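As a hedged illustration of the parametric approach, the sketch below fits a simple affine conversion function by least squares and applies it to a query. It assumes that source and target feature frames have already been extracted and time-aligned. The affine model, the function names `fit_conversion` and `convert`, and the random toy features are illustrative assumptions; practical systems generally use richer models and real acoustic features.

```python
import numpy as np

def fit_conversion(source_feats, target_feats):
    """Fit an affine map from aligned source frames to target frames."""
    # Append a constant column so the map includes a bias term.
    X = np.hstack([source_feats, np.ones((len(source_feats), 1))])
    W, *_ = np.linalg.lstsq(X, target_feats, rcond=None)
    return W

def convert(query_feats, W):
    """Map query frames through the fitted conversion function."""
    X = np.hstack([query_feats, np.ones((len(query_feats), 1))])
    return X @ W

# Toy parallel data: the "target" is an exact affine function of the "source".
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 4))
true_A = rng.normal(size=(4, 4))
target = source @ true_A + 0.5

W = fit_conversion(source, target)
query = rng.normal(size=(10, 4))
converted = convert(query, W)
```

The converted frames live in feature space only. Producing audible speech would still require the re-synthesis step described above, which is where the artifacts arise.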
In order to avoid artifacts due to re-synthesis, an alternative to the parametric approach relies on a technique referred to as “unit selection”. The basic idea of unit selection is to choose segments of the target speaker's training samples whose corresponding source samples sound like the query, while also seeking smooth transitions between neighboring segments. Modern text-to-speech synthesis systems demonstrate that unit selection can generate high quality speech with high individuality, which is crucial for VC. However, these systems require very large training sets (from many hours up to days of speech) as well as substantial human annotation. In typical VC applications, by contrast, only a limited training set is available (e.g., 1 hour) and no manual effort is possible.
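The trade-off between matching the query and keeping transitions smooth can be sketched as a dynamic programming search. The example below is a simplified illustration, not the method of any particular system: it treats each feature frame as a unit, uses Euclidean distance for both costs, and treats units that were adjacent in the training recording as joining at zero cost. The function name `select_units` and the parameter `concat_weight` are hypothetical.

```python
import numpy as np

def select_units(query, source_units, target_units, concat_weight=1.0):
    """Choose one training unit per query frame by dynamic programming.

    query:        (T, D) features of the query utterance
    source_units: (N, D) source-speaker features of the training units
    target_units: (N, D) target-speaker features of the same units
    Returns a list of T indices into the training units.
    """
    T, N = len(query), len(source_units)
    # Matching cost: distance from each query frame to each unit's source side.
    match = np.linalg.norm(query[:, None, :] - source_units[None, :, :], axis=2)
    # Concatenation cost for unit i followed by unit j: distance between the
    # unit that naturally followed i in the recording and the candidate j.
    nxt = np.minimum(np.arange(N) + 1, N - 1)
    concat = concat_weight * np.linalg.norm(
        target_units[nxt][:, None, :] - target_units[None, :, :], axis=2)

    best = match[0].copy()                # best cost of a path ending at each unit
    back = np.zeros((T, N), dtype=int)    # back-pointers for path recovery
    for t in range(1, T):
        total = best[:, None] + concat    # rows: previous unit, columns: current
        back[t] = np.argmin(total, axis=0)
        best = total[back[t], np.arange(N)] + match[t]

    # Trace the cheapest path backwards from the best final unit.
    path = [int(np.argmin(best))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy data: six distinct units; the query repeats source units 2, 3 and 4.
source_units = np.arange(6)[:, None] * np.array([[1.0, 2.0]])
target_units = source_units + 10.0
query = source_units[2:5]
path = select_units(query, source_units, target_units)
print(path)  # [2, 3, 4]
```

The selected indices point into the target speaker's recordings, so the output is assembled from actual recorded samples rather than re-synthesized from features. The search as written costs on the order of T times N squared operations, which gives some indication of why optimization matters for interactive use.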
Thus, although VC systems are known, providing practical, interactive text-based insertion and replacement in an audio narration using a technique such as unit selection requires several key improvements. First, the VC algorithm must be highly optimized so that it is fast enough to allow for an interactive experience. Second, the VC algorithm must provide high quality converted speech.