Personal digital video photography is increasing in popularity while the cost of digital storage media continues to decrease. As a result, the number of libraries of stored digital media continues to increase. For example, many consumers hold large libraries of digital media and, thus, the need to search and retrieve vast amounts of stored video data has increased significantly.
Due to the large amounts of data associated with each video element (e.g., segment, clip, etc.), current video search and retrieval methods are typically not well suited to accurately locating video elements stored in a library. Rather than relying on keyword annotation for indexing and search, such methods typically use visual image features to search and retrieve video elements. However, known video search and retrieval methods that use a single frame are typically not very accurate because video has a temporal aspect or dimension that such single-frame techniques do not consider. Other known video search and retrieval methods may use explicit object tracking across multiple frames. However, with these object tracking techniques, it is difficult to track a selected object when other objects enter or leave a scene (e.g., a video element, clip, etc.).
In yet another known method for extracting representative data from a video element, the entire video element is analyzed and a set of locally distinctive points is extracted from the pixels of the complete video element. The extracted points are then processed using a hierarchical mean shift analysis. In particular, the mean shift analysis is applied in a fine-to-coarse evaluation in which each iteration expands the radius of interest, so that the calculations eventually encompass a single region corresponding to the whole set of pixels. Thus, all of the pixels from the entire video element are processed at one time, resulting in a complete but computationally intensive and memory-intensive representation for video indexing and retrieval.
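The fine-to-coarse evaluation described above can be illustrated with a minimal sketch. It assumes that each distinctive point is a small feature vector (for example, x, y, and frame index) and that a flat kernel is used. The function names, the radius-doubling schedule, and the merge threshold are hypothetical choices made for illustration; they are not details of the known method.

```python
import numpy as np

def mean_shift_modes(points, radius, max_iter=50, tol=1e-3):
    """Move each point to the mean of the input points within `radius` until it settles."""
    shifted = points.astype(float).copy()
    for _ in range(max_iter):
        # Distances from every current position to every original input point.
        dists = np.linalg.norm(shifted[:, None, :] - points[None, :, :], axis=2)
        weights = (dists <= radius).astype(float)  # flat kernel (assumption)
        counts = weights.sum(axis=1, keepdims=True)
        means = weights @ points / np.maximum(counts, 1.0)
        # Keep a position unchanged if no input point falls inside its window.
        new = np.where(counts > 0, means, shifted)
        moved = np.linalg.norm(new - shifted, axis=1).max()
        shifted = new
        if moved < tol:
            break
    return shifted

def merge_modes(positions, merge_dist):
    """Greedily collapse converged positions that lie within `merge_dist` of a kept mode."""
    modes = []
    for p in positions:
        if not any(np.linalg.norm(p - m) <= merge_dist for m in modes):
            modes.append(p)
    return np.array(modes)

def hierarchical_mean_shift(points, start_radius, growth=2.0, max_levels=20):
    """Return a list of (radius, modes) pairs, from the finest level to the coarsest."""
    levels = []
    current = np.asarray(points, dtype=float)
    radius = start_radius
    for _ in range(max_levels):
        converged = mean_shift_modes(current, radius)
        current = merge_modes(converged, radius / 2)  # merge threshold is an assumption
        levels.append((radius, current))
        if len(current) == 1:  # a single region now covers all points
            break
        radius *= growth  # expand the radius of interest for the next, coarser level
    return levels

if __name__ == "__main__":
    # Two well-separated groups of (x, y, t) points, made up for illustration.
    pts = np.array([[0, 0, 0], [0.2, 0, 0], [0, 0.2, 0], [0, 0, 0.2],
                    [10, 10, 10], [10.2, 10, 10], [10, 10.2, 10]], dtype=float)
    for r, modes in hierarchical_mean_shift(pts, start_radius=1.0):
        print(f"radius={r:g}: {len(modes)} mode(s)")
```

In this sketch, each pass of `mean_shift_modes` builds a full pairwise distance matrix. If every pixel-derived point of a video element is processed at once, that matrix grows quadratically with the number of points, which is consistent with the computational and memory cost noted above.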