Most currently available video search engines follow a “query by keyword” scenario: they are built on text search engines and rely mainly on the associated textual information, such as surrounding text from the web page, speech transcripts, closed captions, and so on. However, the performance of text-based video search remains unsatisfactory, owing to the mismatch between the surrounding text and the associated video, as well as the low performance of automatic speech recognition (ASR), video text recognition, and machine translation (MT) techniques.
FIG. 1 shows a typical process of video search reranking, in which a list of baseline search results is returned using textual information only, and visual information is then applied to reorder the initial results so as to refine the text-based search results. As illustrated in FIG. 1, after a query, e.g., “Soccer Match”, is submitted, an initial ranking list of video segments (i.e., shots in general) is obtained by a text search engine based on the relevance between the associated textual information and the query keywords. It is observed that text-based search often returns “inconsistent” results, meaning that some visually similar results (which in most cases are also semantically close to each other) are scattered throughout the ranking list, frequently with irrelevant results interspersed between them. For instance, as shown in FIG. 1, four of the top five results for the query “Soccer Match” are relevant and visually similar to one another, while the remaining one, an anchor person, is not similar to them. It is reasonable to assume that visually similar samples should be ranked together. Such a visual consistency pattern within the relevant samples can be utilized to reorder the initial ranking list, e.g., to assign the anchor person a lower ranking score. Such a process, which reorders the initial ranking list based on some visual pattern, is called content-based video search reranking, or video search reranking for short.
Video search reranking can be regarded as recovering the “true” ranking list from the initial noisy one by using visual information, i.e., refining the initial ranking list by incorporating both the text cue and the visual cue. The text cue refers to the fact that the initial text-based search result provides a baseline for the “true” ranking list. Though noisy, the initial list still reflects partial facts of the “true” list and thus needs to be preserved to some extent, i.e., the correct information in the initial list should be kept. The visual cue is introduced by taking visual consistency as a constraint, e.g., visually similar video shots should have close ranking scores and vice versa. Reranking is in effect a trade-off between the two cues. It is worth emphasizing that this trade-off is the basic underlying assumption of most existing video search reranking approaches, even where it is not stated explicitly.
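As an illustration only, the trade-off between the two cues can be sketched in Python. The sketch is not taken from any particular reranking approach; it assumes one common way of writing the trade-off, namely minimizing the sum over shot pairs of w_ij (r_i - r_j)^2 plus c times the sum over shots of (r_i - r0_i)^2, where r0 holds the initial text-based scores, w_ij is a visual similarity, and c weights the text cue. The function names, the Gaussian similarity, the two-dimensional feature vectors, and the values of c and sigma are all hypothetical choices made for this example.

```python
import math


def gaussian_similarity(x, y, sigma):
    """Visual similarity of two feature vectors (assumed Gaussian kernel)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def rerank(initial_scores, features, c=1.0, sigma=0.3, iterations=200):
    """Trade off the text cue (stay near initial_scores) against the
    visual cue (similar shots get close scores).

    Setting the gradient of the assumed objective to zero gives, per shot i:
        r_i = (c * r0_i + sum_j w_ij * r_j) / (c + sum_j w_ij)
    which is iterated here (Jacobi iteration) until it settles.
    """
    n = len(initial_scores)
    weights = [
        [0.0 if i == j else gaussian_similarity(features[i], features[j], sigma)
         for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in weights]
    scores = list(initial_scores)
    for _ in range(iterations):
        scores = [
            (c * initial_scores[i]
             + sum(weights[i][j] * scores[j] for j in range(n)))
            / (c + degree[i])
            for i in range(n)
        ]
    return scores


# Hypothetical data: four visually similar shots and one dissimilar shot
# (index 2), listed in the order of the initial text-based ranking.
text_scores = [1.0, 0.9, 0.8, 0.7, 0.6]
features = [
    (0.90, 0.10),
    (0.88, 0.12),
    (0.10, 0.90),  # visually dissimilar shot
    (0.92, 0.08),
    (0.90, 0.11),
]
reranked = rerank(text_scores, features)
for index, (before, after) in enumerate(zip(text_scores, reranked)):
    print(f"shot {index}: text score {before:.2f}, reranked score {after:.3f}")
```

Under these assumptions, the scores of the four visually similar shots are pulled toward one another while keeping their relative order, and the dissimilar shot, which has almost no visual neighbors, stays close to its text score. A larger c keeps the result closer to the initial list; a smaller c lets visual consistency dominate. Note that this simple smoothing only groups similar shots; it does not by itself push the dissimilar shot to the bottom of the list.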