As home video collections grow in size, video management technologies are becoming increasingly important. One such technology is video summarization, which can provide a summary of each video in the user's collection or library. The summary may comprise a collection of clips from the video that include scenes of interest, particular acoustic events, and/or persons of interest in a scene. Video summarization can be based, at least in part, on speaker recognition to identify the voices of persons of interest in any given scene. Speaker recognition technologies, however, generally rely on speaker models to achieve satisfactory results. These models are created through lengthy training/enrollment sessions that can involve the collection of five to ten minutes' worth of speech from each speaker. In real-world applications, such as video summarization, it is unrealistic to expect all of the persons or characters in the video to be available for speaker model training.
Although the following Detailed Description will proceed with reference being made to illustrative embodiments, many alternatives, modifications, and variations thereof will be apparent in light of this disclosure.