In document searching, a document may be represented by the frequencies of words selected from a specific vocabulary. The similarity between documents is measured by comparing these frequencies.
Similarly, in image searching, an image is represented by the frequencies of visual words selected from a specific visual vocabulary. The similarity between images is measured by comparing these frequencies.
By way of example, assume that each of images 1, 2 and 3 has three features, such that the features of image 1 correspond to the words “a”, “b” and “c”, respectively, the features of image 2 correspond to the words “a”, “c” and “d”, respectively, and the features of image 3 correspond to the words “a”, “d” and “e”, respectively. In this case, the similarity frequency between image 1 and image 2 is 2 (the shared words “a” and “c”), and the similarity frequency between image 1 and image 3 is 1 (the shared word “a”).
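The counting in this example may be sketched in Python as follows. This is only an illustrative sketch: the function name similarity_frequency and the use of a multiset intersection are assumptions made for illustration, not a measure prescribed by the text.

```python
from collections import Counter

def similarity_frequency(words_a, words_b):
    """Count the visual words two images have in common.

    Assumption for illustration: words are compared as multisets, so a
    word occurring twice in both images would be counted twice.
    """
    shared = Counter(words_a) & Counter(words_b)  # per-word minimum count
    return sum(shared.values())

# Visual words of the three example images.
image_1 = ["a", "b", "c"]
image_2 = ["a", "c", "d"]
image_3 = ["a", "d", "e"]

sim_12 = similarity_frequency(image_1, image_2)  # shared words: "a", "c"
sim_13 = similarity_frequency(image_1, image_3)  # shared word: "a"
print(sim_12, sim_13)
```

With the word lists above, image 2 is ranked as more similar to image 1 than image 3 is.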
However, unlike the words of a text document, the visual words of an image frequently vary subtly due to noise, the photographing angle or the like. That is, a feature that should be represented as the word “a”, for example, may instead be represented as the word “e”, which is adjacent to the word “a”. In this case, images having an identical or similar feature are searched for only among the images in the database that are associated with the identifier of the word “e”. Thus, there is a problem in that the accuracy of the search decreases.
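The effect of such a misassigned word may be illustrated with a minimal sketch. It assumes, purely for illustration, that the database is organized as an inverted index mapping each visual word to the images containing it; the names build_inverted_index, intended, actual and missed are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(images):
    """Map each visual word to the set of image ids that contain it."""
    index = defaultdict(set)
    for image_id, words in images.items():
        for word in words:
            index[word].add(image_id)
    return index

# Hypothetical database holding the three example images.
images = {
    1: ["a", "b", "c"],
    2: ["a", "c", "d"],
    3: ["a", "d", "e"],
}
index = build_inverted_index(images)

# The query feature should be represented as "a", but noise causes it
# to be represented as the adjacent word "e".
intended = index.get("a", set())  # images that should be candidates
actual = index.get("e", set())    # images actually searched
missed = intended - actual        # relevant images never examined
print(sorted(actual), sorted(missed))
```

In this sketch only image 3 is searched, while images 1 and 2, which contain the intended word “a”, are never examined.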
For example, the technique of searching for an image using visual words that is disclosed in the paper entitled “Video Google: A Text Retrieval Approach to Object Matching in Videos” by Josef Sivic and Andrew Zisserman, published in “IEEE International Conference on Computer Vision” in 2003, has the problem described above.