Content-based Image Retrieval (CBIR) has been extensively studied in recent years, driven by the explosive growth of online and offline image databases. Researchers in several different fields have approached CBIR from different angles.
Researchers in the computer vision and machine learning communities tend to focus on fully automatic approaches that aim to train computers to understand image content without human intervention. Typical approaches include region-based image retrieval, image attention detection, and multi-instance learning. However, due to the extreme diversity of general image content, the computational cost, and the low-level nature of most vision-based image understanding algorithms, fully automatic CBIR is far from ready for real-world application.
Researchers in the multimedia processing community have taken a less ambitious approach by involving human interaction in the image search process. One notable example is the relevance feedback algorithm, which lets users label positive and negative samples in order to iteratively refine the search results. Because of this human involvement, the approach can indeed improve search performance in some cases.
Unfortunately, the improvement is often limited and outweighed by the added burden of manually labeling many samples. As with computer vision-based approaches, research on improving relevance feedback has concentrated on better feature extraction and automatic learning algorithms over the feedback samples. Inevitably, these approaches hit bottlenecks similar to those of the vision-based approaches, such as high computational cost and the difficulty of describing high-level semantic content with low-level features.
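To make the relevance feedback idea concrete, the following is a minimal illustrative sketch, not the specific algorithm discussed above: a Rocchio-style update that moves a query feature vector toward user-labeled positive images and away from negatives, then re-ranks the database by cosine similarity. The function names, the weights `alpha`/`beta`/`gamma`, and the use of global feature vectors are assumptions for illustration only.

```python
import numpy as np

def rocchio_update(query, positives, negatives,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style relevance feedback: shift the query feature vector
    toward the mean of positive samples and away from negatives.
    (Illustrative sketch; weights are conventional defaults, not tuned.)"""
    query = np.asarray(query, dtype=float)
    new_q = alpha * query
    if len(positives):
        new_q += beta * np.mean(np.asarray(positives, dtype=float), axis=0)
    if len(negatives):
        new_q -= gamma * np.mean(np.asarray(negatives, dtype=float), axis=0)
    return new_q

def rank_by_similarity(query, database):
    """Return database row indices sorted by descending cosine similarity
    to the (possibly updated) query vector."""
    query = np.asarray(query, dtype=float)
    db = np.asarray(database, dtype=float)
    q = query / (np.linalg.norm(query) + 1e-12)
    sims = db @ q / (np.linalg.norm(db, axis=1) + 1e-12)
    return np.argsort(-sims)
```

In each feedback round, the user's labels feed `rocchio_update`, and the refined query re-ranks the collection; the bottleneck noted above remains, since the feature vectors themselves are still low-level.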
The difficulties with CBIR and the intense demand for image search applications, especially on the Internet, have led commercial companies to take a different route: text-based image search. Most current image search engines exploit the cognitive ability of human beings by having users label images with tags, and then conducting a text-based search over those tags. This is a practical approach that yields immediate results, but it has serious limitations. The acquisition of image tags, though it can be assisted by image metadata such as surrounding text and search annotations, can hardly produce satisfactory results without brute-force human labeling. Moreover, large existing stock image collections and personal desktop photos have no surrounding text to assist the search. More importantly, images naturally contain much richer information than text and thus can hardly be well represented by text alone; there is a wide gap between textual description and image content. The cliché "an image is worth a thousand words" is unfortunately true in most image search situations, and current text-based search results remain far from satisfactory.
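The tag-based search described above is, at its core, an inverted index from tags to images. The sketch below illustrates that mechanism under simplifying assumptions (exact tag match, AND semantics over query terms); the function names and data layout are hypothetical, not those of any particular search engine.

```python
from collections import defaultdict

def build_inverted_index(image_tags):
    """Map each lowercased tag to the set of image ids labeled with it.
    `image_tags` is {image_id: [tag, ...]} from human or metadata labeling."""
    index = defaultdict(set)
    for image_id, tags in image_tags.items():
        for tag in tags:
            index[tag.lower()].add(image_id)
    return index

def search(index, query):
    """Return image ids whose tags contain every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results
```

The sketch also makes the limitation visible: an image with no tags, or tags that miss its semantic content, is simply unreachable, which is exactly the gap between text description and image content noted above.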