Described embodiments relate generally to product annotation in a video and specifically to in-video product annotation using web information mining.
With the rapid advances in storage devices, networks and compression techniques, video data from different domains are growing at an explosive rate. Video annotation (also widely known as video concept detection or high-level feature extraction), which aims to automatically assign descriptive concepts to video content, has received intensive research interest over the past few years. However, most existing work on automatic video annotation focuses on high-level concepts, such as events (e.g., airplane crash and running), scenes (e.g., sundown and beach) and object categories (e.g., car and screen), and there is little research on annotating specific product concepts, such as an iPhone in an iPhone video advertisement.
Annotation of product concepts is of great importance to many applications such as video browsing, searching and advertising. Research on the query logs of web video search shows that users use specific queries more frequently than general concepts. Further, product annotation can significantly improve the relevance of video advertising. However, automated annotation of products is challenging because of the insufficiency of training data and the difficulty in generating appropriate visual representations.
The first challenge of automated product annotation lies in the training data for annotation. Existing learning-based video annotation approaches rely heavily on the quality of training data, but manually collecting training samples is time-consuming and labor-intensive. In particular, there is a multi-view problem for product images. A specific product usually has different views, such as frontal, side and back views, and these views can be quite different visually. Therefore, there is a need to collect training data that are descriptive of the different views of a product.
The second challenge is effective visual representation. The Bag of Visual Words (BoVW) feature is a popular approach and has demonstrated its effectiveness in many applications, such as image classification, clustering, and retrieval. To generate a BoVW representation of an image, Scale Invariant Feature Transform (SIFT) descriptors are extracted, either at multiple detected keypoints or from densely sampled patches of a product image, and quantized into visual words. A BoVW histogram is then generated to describe the product image. However, the descriptors describe the whole image rather than the product parts contained in the image, and therefore contain a lot of noise for product annotation.
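The quantization and histogram steps described above can be illustrated with a minimal sketch. It is not the implementation of any described embodiment. It assumes that local descriptors have already been extracted and that a codebook of visual words is available (in practice a codebook is commonly learned by clustering, which is omitted here). The function names nearest_word and bovw_histogram, the toy two-dimensional descriptors, and the three-word codebook are hypothetical and chosen only for readability; real SIFT descriptors are 128-dimensional.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest_word(descriptor: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to the descriptor.

    Closeness is measured by squared Euclidean distance.
    """
    best_index, best_dist = 0, math.inf
    for index, word in enumerate(codebook):
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bovw_histogram(descriptors: Sequence[Vector],
                   codebook: Sequence[Vector]) -> List[float]:
    """Quantize each descriptor to a visual word and return a
    normalized histogram of visual word frequencies."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, codebook)] += 1
    total = sum(counts)
    if total == 0:
        # An image with no descriptors yields an all-zero histogram.
        return [0.0] * len(codebook)
    return [count / total for count in counts]


# Toy data (assumption): 2-D stand-ins for 128-D SIFT descriptors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (0.8, -0.1), (0.1, 0.9)]
histogram = bovw_histogram(descriptors, codebook)
```

In this sketch every descriptor contributes to the histogram regardless of where it falls in the image, which reflects the noise problem noted above: descriptors from the background are counted together with those from the product itself.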