1. Field of the Invention
The present invention relates to content-based search and retrieval of visual images from digital image databases. More particularly, this invention relates to a content-based image retrieval system that learns a desired visual concept based on user feedback information on selected training images.
2. Description of the Related Art
With advances in computer technologies and the World Wide Web (WWW), content-based image retrieval has gained considerable attention recently. The following documents provide background information relating to this field; each of these documents is incorporated by reference herein in its entirety.
[1] J. Smith and S. Chang, "VisualSEEK: A Fully Automated Content-based Image Query System," Proc. ACM International Conference on Multimedia, pages 87-98, November 1996.
[2] W. Niblack, R. Barber et al., "The QBIC Project: Querying Images by Content Using Color, Texture, and Shape," in Proc. SPIE Storage and Retrieval for Image and Video Databases, vol. 1908, pages 173-187, February 1993.
[3] J. R. Bach, C. Fuller et al., "The Virage Image Search Engine: An Open Framework for Image Management," in Proc. SPIE Storage and Retrieval for Image and Video Databases.
[4] Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra, "Relevance Feedback: A Power Tool for Interactive Content-based Image Retrieval," IEEE Trans. Circuits and Systems for Video Technology, vol. 8, no. 5, pages 644-655, September 1998.
[5] P. Lipson, E. Grimson, and P. Sinha, "Context and Configuration Based Scene Classification," Proceedings of IEEE Int. Conf. on Computer Vision and Pattern Recognition, 1997.
[6] O. Maron, "Learning from Ambiguity," Doctoral Thesis, Dept. of Electrical Engineering and Computer Science, M.I.T., June 1998.
[7] O. Maron and A. L. Ratan, "Multiple Instance Learning from Natural Scene Classification," Proceedings of the 14th International Conference on Machine Learning, 1997.
[8] U.S. Pat. No. 5,793,888 to Delanoy, "Machine Learning Apparatus and Method for Image Searching," 1998.
[9] U.S. Pat. No. 5,696,964 to Ingemar, et al., "Multimedia Database Retrieval System . . . ," 1997.
[10] U.S. Pat. No. 5,586,316 to Tanaka et al., "System and Method for Information Retrieval with Scaled Down Image," 1996.
[11] U.S. Pat. No. 5,588,149 to Hirose, "Document Classification and Retrieval with Title-Based On-The-Fly Class Merge," 1996.
[12] U.S. Pat. No. 5,623,690 to Palmer et al., "Audio/Video Storage and Retrieval for Multimedia Workstations by Interleaving Audio and Video Data in Data File," 1997.
[13] U.S. Pat. No. 5,644,765 to Shimura et al., "Image Retrieving Method and Apparatus That Calculates Characteristic Amounts of Data Correlated With and Identifying an Image," 1997.
[14] U.S. Pat. No. 5,659,742 to Beattie et al., "Method for Storing Multi-Media Information in an Information Retrieval System," 1997.
Some of the terminology used herein will now be defined. "Low-level features" are properties or attributes such as color, texture and shape. U.S. Pat. No. 5,696,964 discusses specific examples of such properties. "High-level concepts" are, for example, discernible images such as apples, trees, automobiles, etc. "Spatial properties" are size, location, and the relative relationship to other regions. A "target concept" is the concept learned by the image retrieval system based on the results of images queried by a user. A "target image" is defined as an image retrieved from a database by the image retrieval system as meeting the criteria of the target concept.
For image-based retrieval systems, the most commonly used approach as represented by reference [1] is to exploit low-level features and spatial properties as keys to retrieve images. Although this approach has the benefit of being simple, it leaves a large gap between the retrieval of low-level features and high-level concepts embedded in the image contents.
To address this challenge, references [5], [6] and [7] have investigated a method for learning visual concepts from a set of examples for some specific scene classification. This research has illustrated that visual concepts can be learned from a set of positive and negative examples to successfully classify natural scenes with impressive results. The framework used in this research is called "Multiple Instance Learning."
The approach used in references [6] and [7] considers a data image as comprising a number of subimages or feature instances which represent different high-level visual concepts such as cars, people, or trees. In the terminology of these references, a collection of subimages or feature instances is called a bag. Each bag corresponds to one of the data images. A bag is labeled negative if all of its feature instances are negative, and positive if at least one of its feature instances is positive. Each feature instance in the bag is a possible description of a visual high-level concept. A basic assumption is that the learned or target concept cannot contain any feature instances from negative bags. Given a number of bags (positive or negative), a method called Diverse Density (DD) is employed to learn the visual concept from examples or images containing multiple feature instances. The algorithm attempts to locate the optimal position of the target concept in feature space by a gradient optimization method. This optimization method iteratively updates the best position by maximizing the defined diverse density measure from multiple starting points.
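The bag-labeling rule and the measure that the DD method maximizes can be illustrated with a minimal Python sketch. It assumes the noisy-or model and the exponential closeness term commonly associated with the Diverse Density formulation of reference [6]; neither is spelled out in this description, and the function names are hypothetical. The gradient search from multiple starting points is omitted; only the measure evaluated at a candidate point t is shown.

```python
import math

def label_bag(instance_labels):
    """A bag is positive if at least one instance is positive, else negative."""
    return "positive" if any(instance_labels) else "negative"

def _closeness(instance, t):
    # Assumed closeness term: exp(-squared Euclidean distance), in (0, 1].
    d2 = sum((a - b) ** 2 for a, b in zip(instance, t))
    return math.exp(-d2)

def diverse_density(t, positive_bags, negative_bags):
    """Noisy-or diverse density of candidate concept point t (assumed model)."""
    dd = 1.0
    for bag in positive_bags:
        # Probability that no instance in this positive bag is close to t.
        miss = 1.0
        for inst in bag:
            miss *= 1.0 - _closeness(inst, t)
        dd *= 1.0 - miss
    for bag in negative_bags:
        # Every instance in a negative bag must be far from t.
        for inst in bag:
            dd *= 1.0 - _closeness(inst, t)
    return dd

# Toy feature vectors: two positive bags share an instance near (0, 0).
positive_bags = [[(0.0, 0.0), (5.0, 5.0)], [(0.1, 0.0), (9.0, 9.0)]]
negative_bags = [[(5.0, 5.0), (9.0, 1.0)]]
dd_shared = diverse_density((0.0, 0.0), positive_bags, negative_bags)
dd_negative = diverse_density((5.0, 5.0), positive_bags, negative_bags)
```

In this toy data the measure is high at the point shared by both positive bags and drops to zero at a point that coincides with a negative instance, which reflects the assumption that the target concept cannot contain instances from negative bags.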
However, there are several problems associated with the DD method. A major problem in the DD algorithm is the selection of starting points. In addition, iterative nonlinear optimization can be time consuming and can also prohibit on-line learning of visual concepts. Another common problem incurred in the learning process is the under-training problem, which occurs when an insufficient number of training instances, either positive or negative, are provided. Intuitively, when too few training instances are provided, the high-level concept embedded in the retrieved image is learned imprecisely.
The present invention provides a novel technique that overcomes the under-training problem frequently suffered in prior art techniques. Since no time-consuming optimization process is involved, the system learns the visual concepts extremely fast. Therefore, the target concept can be learned on-line and is user-adaptable for effective retrieval of image contents.
More particularly, this invention provides a novel method to let users retrieve interesting images by learning the embedded target visual concept from a set of given examples. Through the user's relevance feedback, the visual concept can be effectively learned to classify images that contain common visual entities. The learning process is started by the user developing a query. In response to the user's query, the system retrieves a set of query results from which a set of training examples (which may be positive or negative) can be selected to interactively adjust the query concept according to the user's feedback information indicating whether or not the retrieval examples match the user's query.
According to one embodiment, a method is provided to find the commonality of instances for learning a visual concept through a user's feedback information. Under the assumption that high-level concepts can be captured by a set of feature instances, the method establishes a linkage between high-level concepts and low-level features. Before the training and retrieving process, each training example is first analyzed to extract a number of instances, which serve as keys that form the basic representation of different visual concepts. During the retrieving phase, through the user's assignment of feedback information, the training images, subdivided into subimages (or feature instances), are labeled by the user as either positive or negative. Based on this assignment, the learning process finds the commonality of feature instances from the training examples without involving any optimization process. Then, the retrieving phase fetches desired images from databases according to the learned concept.
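One way to picture a non-iterative commonality search is the following minimal Python sketch. It is not the method of this embodiment, whose details are not given in this section; it assumes, purely for illustration, that "commonality" means a feature instance that has a near neighbor in every positive bag and no near neighbor in any negative bag. The distance threshold `radius` and all function names are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def has_neighbor(instance, bag, radius):
    """True if any instance in the bag lies within radius of the given one."""
    return any(distance(instance, other) <= radius for other in bag)

def common_instances(positive_bags, negative_bags, radius=1.0):
    """Single-pass search: keep instances of the first positive bag that recur
    in every other positive bag and match nothing in any negative bag."""
    if not positive_bags:
        return []
    concept = []
    for inst in positive_bags[0]:
        in_all_pos = all(has_neighbor(inst, bag, radius) for bag in positive_bags[1:])
        in_any_neg = any(has_neighbor(inst, bag, radius) for bag in negative_bags)
        if in_all_pos and not in_any_neg:
            concept.append(inst)
    return concept

def retrieve(database, concept, radius=1.0):
    """Return names of database images whose bag matches a concept instance."""
    return [name for name, bag in database.items()
            if any(has_neighbor(c, bag, radius) for c in concept)]

# Toy example: feedback labels two images positive and one negative.
positive_bags = [[(0.0, 0.0), (5.0, 5.0)], [(0.2, 0.1), (9.0, 9.0)]]
negative_bags = [[(5.0, 5.1)]]
concept = common_instances(positive_bags, negative_bags)
database = {"a": [(0.1, 0.1)], "b": [(7.0, 7.0)]}
matches = retrieve(database, concept)
```

Because the sketch makes one pass over the labeled bags and involves no gradient search, its cost is bounded by the number of instance comparisons, which is the property that makes on-line learning from feedback plausible.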
One major difference between the method of the present invention and the DD technique is that the present invention analyzes not only the explicit positive and negative training examples generated in response to the user's query but also the overall retrieved output, some of which is not seen by the user. The consideration of the unseen (implicit) retrieved output enlarges the number of training examples and thus obviates the under-training problem. In other words, prior art methods utilize the user's feedback information to consider only the explicit training examples in the output list. In contrast, the present method automatically references additional implicit training images from the unseen retrieved results, which do not need further analysis and assignment by the user to label the results as positive or negative. Therefore, the high-level concept embedded in an image can be learned more completely and precisely. Furthermore, during the learning process, since no time-consuming optimization process is applied by this invention, real-time and on-line learning can be achieved.