With the development of the Internet and video delivery technologies, video has become a significant part of Internet traffic. People have also become accustomed to multi-screen experiences when watching videos. For example, nearly half of the US population uses smartphones while watching TV, and it has been observed that 47% of users use mobile phones to engage in TV-related activities.
However, human interaction with video is still quite limited due to the complexity of video structure. In particular, the interactions between mobile phones and TVs remain limited. With existing technology, a mobile phone can be used to identify products, CDs, books, and other print media, but such mobile phone applications are difficult or impractical in the video area, which deals with high-volume data.
For example, consider a TV service with 200 channels whose throughput is 6000 frames per second, which corresponds to 30 frames per second per channel. If a 10-minute lagging period is allowed, the system needs to be able to recognize 3.6 million pictures (= 6000 × 10 × 60) with very low latency. Hence, the difficulty is twofold: a high throughput to handle, and a large image database for content recognition. Content recognition is typically computationally expensive, and the amount of computation generally grows with the number of frames involved in classification/recognition. The more frames involved, the higher the classification/recognition accuracy, but also the higher the amount of computation.
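The arithmetic in the example above can be checked with a short sketch. The function name and parameters are hypothetical and introduced only for illustration; the 30 frames-per-second per-channel rate is an assumption inferred from the stated figures (6000 frames per second across 200 channels), not a value given explicitly.

```python
def recognition_load(channels, fps_per_channel, lag_minutes):
    """Return (frames per second, frames within the lagging window).

    Hypothetical helper for illustration only; it reproduces the
    back-of-the-envelope estimate, not any disclosed method.
    """
    # Aggregate throughput across all channels.
    throughput = channels * fps_per_channel
    # Number of frames that accumulate during the allowed lagging period.
    backlog = throughput * lag_minutes * 60
    return throughput, backlog


# Assumed inputs: 200 channels, 30 fps per channel (inferred), 10-minute lag.
throughput, backlog = recognition_load(200, 30, 10)
print(throughput, backlog)
```

Under these assumptions the sketch yields the two figures quoted in the text: 6000 frames per second and 3.6 million frames within the 10-minute window.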
The disclosed methods and systems are directed to solving one or more of the problems set forth above, as well as other problems.