1. Field of the Invention
The present invention relates generally to capture of multimedia in mobile devices and, more particularly, to systems and methods for multimedia annotation with sensor data.
2. Description of the Related Art
User generated video content is experiencing significant growth, which is expected to continue and accelerate. As an example, users are currently uploading twenty hours of video per minute to YouTube. Making such video archives effectively searchable is one of the most critical challenges of multimedia management. Current search techniques that rely on signal-level content extraction from video struggle to scale.
Camera sensors have become a ubiquitous feature in the environment, and more and more video clips are being collected and stored for purposes such as surveillance, monitoring, reporting, or entertainment. Because of the affordability of video cameras, the general public is now generating and sharing its own videos, which attract significant interest from users and have resulted in an extensive user generated online video market catered to by such sites as YouTube. AccuStream iMedia Research has released a report forecasting that the user generated video market size will expand 50%, from 22 million in 2007 to 34 million in 2008. The report was based on data from popular video content providers including AOLUncut, Broadcaster.com, Crackle.com, Ebaumsworld, LiveDigital, Metacafe, MySpace TV, Revver.com, Vsocial.com, VEOH.com, Yahoo Video and YouTube. By 2010, more than half (55%) of all the video content consumed online in the US is expected to be user generated, representing 44 billion video streams. Companies are developing various business models in this emerging market, with one of the more obvious ones being advertising. In 2008, Forrester Research and eMarketer reported that the global online video advertising market would reach more than 7.2 billion by 2012.
Many of the end-user cameras are mobile, such as the ones embedded in smartphones. The collected video clips contain a tremendous amount of visual and contextual information that makes them unlike any other media type. However, it remains very challenging to index and search video data at the high semantic level preferred by humans. Effective video search is becoming a critical problem in the user generated video market. The scope of this issue is illustrated by the fact that video searches on YouTube accounted for 25% of all Google search queries in the U.S. in November of 2007. Better video search has the potential to significantly improve the quality and usability of many services and applications that rely on large repositories of video clips.
A significant body of research exists, going back as early as the 1970s, on techniques that extract features based on the visual signals of a video. While progress has been very significant in this area of content based video retrieval, high accuracy with these approaches is often limited to specific domains (e.g., sports, news), and applying them to large-scale video repositories creates significant scalability problems. As an alternative, text annotations of video can be used for search, but high-level concepts must often be added manually, and hence their use is cumbersome for large video collections. Furthermore, text tags can be ambiguous and subjective.
Recent technological trends have opened another avenue to associate more contextual information with videos: the automatic collection of sensor metadata. A variety of sensors are now cost-effectively available, and their data can be recorded together with a video stream. For example, current smartphones embed GPS, compass, and accelerometer sensors into a small, portable and energy-efficient package. The metadata generated by such sensors represents a rich source of information that can be mined for relevant search results. A significant benefit is that sensor metadata can be added automatically and represents objective information (e.g., the position).
Some types of video data are naturally tied to geographical locations. For example, video data from traffic monitoring may not have much meaning without its associated location information. Thus, in such applications, one needs a specific location to retrieve the traffic video at that point or in that region. Unfortunately, current devices and methods are not adequate.