Content-based information retrieval systems are known in which a query item such as an image or text document is presented and the system automatically analyses the content of the query item. This content analysis is then used by the information retrieval system to find other items from a database which have similar content. This type of search is sometimes referred to as “similar item” search because an example item is presented as the query. In contrast, keyword search, for example, involves keywords being presented as the query to find items such as documents. There is an ongoing need to improve the performance of such content-based information retrieval systems: for example, to improve the relevance of retrieved items, to improve the speed of operation, and to provide generality in the results, that is, to retrieve items that are generally similar to the query item rather than being almost identical to it or having some identical features.
As information is to be retrieved from ever larger databases of items, for example, for web-scale retrieval, the need for fast, efficient and good quality information retrieval systems grows.
A typical example of similar-item search is in the field of content-based image retrieval. This type of search has traditionally been approached as a text-retrieval problem by mapping image features into integer numbers (known as visual words) representing clusters in feature space. The mapping is defined by a dictionary specifying the feature clusters. Each image is then represented as a histogram of visual words. A pre-filtering process is used to find a small set of images having histograms likely to be similar to the histogram of the query image. Existing approaches take the list of visual words in the query image and run a search on a database to retrieve images containing any of the visual words from the query image. The retrieved images form a filter set which is then provided to a ranking system to further refine the search results. However, typical previous pre-filtering methods have retrieved over 40% of the images in the database and thus yield filter sets which are too large for web-scale retrieval. Furthermore, these previous approaches have typically used very large dictionaries of visual words, which generalize poorly for measuring similarity of general object classes as opposed to specific object instances. Also, where large dictionaries are used, the resulting filter sets are often unsuitable for many types of ranking functions.
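By way of illustration only, the conventional visual-word pipeline described above may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from any particular prior system: the function names (`quantize`, `histogram`, `build_inverted_index`, `prefilter`), the nearest-centre assignment rule, the three-entry dictionary and the toy two-dimensional features are all hypothetical and chosen only to keep the example small.

```python
from collections import Counter, defaultdict


def quantize(feature, dictionary):
    """Return the visual word (an integer index) of the nearest cluster centre.

    Assumption: the dictionary is a list of cluster centres and assignment
    uses squared Euclidean distance.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(dictionary)), key=lambda k: sq_dist(feature, dictionary[k]))


def histogram(features, dictionary):
    """Represent an image as a histogram (word -> count) of its visual words."""
    return Counter(quantize(f, dictionary) for f in features)


def build_inverted_index(histograms):
    """Map each visual word to the set of image ids that contain it."""
    index = defaultdict(set)
    for image_id, hist in histograms.items():
        for word in hist:
            index[word].add(image_id)
    return index


def prefilter(query_hist, index):
    """Retrieve every image sharing ANY visual word with the query image."""
    filter_set = set()
    for word in query_hist:
        filter_set |= index.get(word, set())
    return filter_set


# Hypothetical toy data: a 3-word dictionary and three database images,
# each given as a list of 2-D feature vectors.
dictionary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
database = {
    "img_a": [(0.1, 0.0), (0.9, 1.1)],
    "img_b": [(5.2, 4.9)],
    "img_c": [(1.0, 0.9), (4.8, 5.1)],
}

histograms = {image_id: histogram(feats, dictionary) for image_id, feats in database.items()}
index = build_inverted_index(histograms)

query_hist = histogram([(0.2, -0.1), (1.1, 1.0)], dictionary)
filter_set = prefilter(query_hist, index)
print(sorted(filter_set))  # ['img_a', 'img_c']
```

In this toy example the query shares a single visual word with `img_c`, yet that image still enters the filter set, so two of the three database images are passed to the ranking stage. This illustrates, on a small scale, why matching on any shared visual word tends to produce large filter sets.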
The embodiments described herein are not limited to implementations which solve any or all of the disadvantages of known content-based information retrieval systems.