As camcorders become widely used to capture memorable experiences and document daily lives, the quantity of home video data increases dramatically. But most video recordings are left in storage and seldom viewed because raw homemade videos, despite their personalized subject matter, have relatively low content quality. It is difficult to turn the raw video data into a useful, well-organized, and easy-to-access collection or database. After a long period of time, camcorder users may even forget why they captured the video clips in the first place.
Conventional systems for home video content analysis and organization are designed from the perspective of a viewer. Generally, there are three widely accepted approaches for such applications: video structuring, highlight detection, and authoring.
Video structuring discovers the structure of a home video and provides users with a compact summary of the content. For example, structure can be derived by clustering date-stamp information, and the importance of a structure unit can be derived from an audio feature.
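The date-stamp clustering mentioned above can be illustrated with a minimal sketch. It assumes that each shot carries a capture timestamp and that a new structure unit begins whenever the gap between consecutive shots exceeds a fixed threshold. The function name `cluster_by_time_gap` and the two-hour `max_gap` are hypothetical choices for illustration, not the method of any particular conventional system.

```python
from datetime import datetime, timedelta


def cluster_by_time_gap(timestamps, max_gap=timedelta(hours=2)):
    """Group timestamps into clusters (hypothetical helper).

    A new cluster starts whenever the gap to the previous
    timestamp exceeds max_gap (an assumed threshold).
    """
    clusters = []
    for ts in sorted(timestamps):
        if clusters and ts - clusters[-1][-1] <= max_gap:
            # Close enough to the previous shot: same structure unit.
            clusters[-1].append(ts)
        else:
            # Large gap (or first shot): start a new structure unit.
            clusters.append([ts])
    return clusters


# Illustrative capture times, not real data.
shots = [
    datetime(2005, 7, 4, 10, 0),
    datetime(2005, 7, 4, 10, 20),
    datetime(2005, 7, 4, 18, 5),
    datetime(2005, 7, 9, 9, 30),
]
events = cluster_by_time_gap(shots)
```

With these sample values, the two morning shots fall into one unit, while the evening shot and the shot taken five days later each start a new unit.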
In contrast, highlight detection aims at mining specific patterns in home videos for dynamic summarization. For example, the visual significance of a zoom-and-hold camera operation can be used to find interesting segments. Both static and moving patterns can be detected in an elementary structure unit called a “snippet” for pattern indexing in home videos. Because automatic highlight identification remains a challenging problem, such systems typically rely on a user interface that supports only semi-automatic highlight finding.
Recently, many systems have offered home video authoring, which focuses on creating a new video clip from many old ones, with additional effects added. Suitable clips can be assigned numerical suitability scores, organized by the users into a storyboard, and then concatenated automatically as the final video. The created video can be regarded as a dynamic summary based on the user's interest level. Another system provides dynamic highlights of home video content by selecting desirable high-quality clips and linking them with transition effects and incidental music. The linking process can even correct lighting and remove shaking by stabilizing the clips.
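As a rough illustration of score-based authoring, the sketch below keeps the highest-scoring clips that fit within a target length and then orders them for concatenation. Everything here is an assumption for illustration: the `Clip` fields, the `build_summary` name, the `min_score` and `max_duration` parameters, and the use of capture order as a stand-in for the user-arranged storyboard.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    name: str
    start: float     # capture time in seconds (hypothetical field)
    duration: float  # clip length in seconds
    score: float     # numerical suitability score, assumed in [0, 1]


def build_summary(clips, min_score=0.5, max_duration=60.0):
    """Pick high-scoring clips greedily, then return them in capture order."""
    chosen = []
    total = 0.0
    for clip in sorted(clips, key=lambda c: c.score, reverse=True):
        if clip.score < min_score:
            break  # all remaining clips score lower still
        if total + clip.duration <= max_duration:
            chosen.append(clip)
            total += clip.duration
    # Capture order stands in for the storyboard a user would arrange.
    return sorted(chosen, key=lambda c: c.start)


# Illustrative clips, not real data.
clips = [
    Clip("A", start=0.0, duration=30.0, score=0.9),
    Clip("B", start=40.0, duration=40.0, score=0.8),
    Clip("C", start=100.0, duration=20.0, score=0.7),
    Clip("D", start=130.0, duration=10.0, score=0.2),
]
summary = build_summary(clips)
```

In this example clip B is skipped because adding it would exceed the 60-second budget, and clip D falls below the score threshold, so the summary consists of clips A and C.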
It is evident that existing algorithms and conventional systems for home video content analysis are all designed from the viewer's perspective. But the viewer's perspective is not as effective for classifying video content as the mental state of the original camcorder operator would be. Moreover, advances in psychology and computer vision, especially studies of the visual attention model, have narrowed the semantic gap between low-level visual stimuli and high-level intention concepts. This has made it practical to estimate the capture intention.