1. Field of the Invention
The present invention relates generally to the field of assigning keywords to media objects located in files stored in a database.
2. Description of the Related Art
With the explosive growth of information that is available through the World-Wide Web ("WWW"), it is becoming increasingly difficult for a user to find information that is of interest to him/her. Therefore, various search mechanisms that allow a user to retrieve documents of interest are becoming very popular. However, most of the popular search engines today are textual. Given one or more keywords, such search engines can retrieve WWW documents that have those keywords. Although most WWW pages have images, the current image search engines on the WWW are primitive.
There are two major ways to search for an image. First, a user can specify an image and the search engine can retrieve images similar to the specified image. Second, the user can specify keywords and all images relevant to the user specified keywords can be retrieved. The present inventor has been involved in the development of an image search engine called the Advanced Multimedia Oriented Retrieval Engine (AMORE). See S. Mukherjea et al., "Towards a Multimedia World-Wide Web Information Retrieval Engine," Proceedings of the Sixth International World-Wide Web Conference, pages 177-188, Santa Clara, Calif., April 1997; and http://www.ccrl.com/amore. AMORE allows the retrieval of WWW images using both techniques. In AMORE the user can specify keywords to retrieve relevant images or can specify an image to retrieve similar images.
The similarity of two images can be determined in two ways: visually and semantically. Visual similarity can be determined by image characteristics like shape, color and texture using image processing techniques. In AMORE, Content-Oriented Image Retrieval (COIR) is used for this purpose. See K. Hirata et al., "Media-based Navigation for Hypermedia Systems," Proceedings of ACM Hypertext '93 Conference, pages 159-173, Seattle, Wash., November 1993. When a user wants to find images similar to a red car, COIR can retrieve pictures of other red cars. However, it may also be possible that the user is not interested in pictures of red cars, but pictures of other cars of similar manufacturer and model. For example, if the specified image is an Acura NSX, the user may be interested in other Acura NSX images. Finding semantically similar images (i.e. other images having the same or similar associated semantics) is useful in this example. Considering another example, a picture of a figure skater may be visually similar to the picture of an ice hockey player (because of the white background and similar shape), but it may not be meaningful for a user searching for images of ice hockey players. Finding semantically similar images will be useful in this example as well.
In order to find images which are semantically similar to a given image, the meaning of the image must be determined. This is not easy. The best approach would be to assign several keywords to an image to specify its meaning. Manually assigning keywords to images would give the best result, but is not feasible for a large collection of images. Alternatively, the text associated with images can be used as their keywords. Unfortunately, unlike written material, most HyperText Markup Language (HTML) documents do not have an explicit caption. Therefore, the HTML source file must be parsed and only keywords "near" an image should be assigned to it. However, because the HTML page can be structured in various ways, the "nearness" is not easy to determine. For example, if the images are in a table, the keywords relevant to an image may not be physically near the image in the HTML source file. Thus, several criteria are needed to determine the keywords relevant to an image.
There are many popular WWW search engines, such as Excite (http://www.excite.com) and Infoseek (http://www.infoseek.com). These engines gather textual information about resources on the WWW and build up index databases. The indices allow the retrieval of documents containing user specified keywords. Another method of searching for information on the WWW is manually generated subject-based directories which provide a useful browsable organization of information. The most popular one is Yahoo (http://www.yahoo.com). However, none of these systems allow for image searching.
Image search engines for the WWW are also being developed. Excalibur's Image Surfer (http://isurf.yahoo.com) and WebSEEk (see S. Chang et al., "Visual Information Retrieval From Large Distributed Online Repositories," Communications of the ACM, 40(12):63-71, December 1997) have built a collection of images that are available on the WWW. The collection is divided into categories (like automotive, sports, etc.), allowing a user to browse through the categories for relevant images. Keyword searching and searching for images visually similar to a specified image are also possible. Alta Vista's Photo Finder (http://image.altavista.com) also allows keyword and visually similar image searches. However, semantically similar searching is not possible in any of these systems.
WebSeer is a crawler that combines visual routines with textual heuristics to identify and index images on the WWW. See C. Frankel et al., "WebSeer: An Image Search Engine for the World-Wide Web," Technical Report 96-14, University of Chicago, Computer Science Department, August 1996. The resulting database is then accessed using a text-based search engine that allows users to describe the image that they want using keywords. The user can also specify whether the desired image is a photograph, animation, etc. However, the user cannot specify an image and find similar images.
Finding visually similar images using image processing techniques is a developed research area. Virage (see J. R. Bach et al., "The Virage Image Search Engine: An Open Framework for Image Management," Proceedings of the SPIE--The International Society for Optical Engineering: Storage and Retrieval for Still Image and Video Databases IV, San Jose, Calif., February 1996) and QBIC (see M. Flickner et al., "Query by Image and Video Content: The QBIC System," IEEE Computer, 28(9):23-48, September 1995) are systems for image retrieval based on visual features, which consist of image primitives, such as color, shape, or texture and other domain specific features. Although they also allow keyword search, the keywords need to be manually specified and there is no concept of semantically similar images.
Systems for retrieving similar images by semantic content are also being developed. See A. Smeaton et al., "Experiments on using Semantic Distances between Words in Image Caption Retrieval," Proceedings of the ACM SIGIR '96 Conference on Research and Development in Information Retrieval, pages 174-180, Zurich, Switzerland, August 1996 and Y. Aslandogan et al., "Using Semantic Contents and WordNet in Image Retrieval," Proceedings of the ACM SIGIR '97 Conference on Research and Development in Information Retrieval, pages 286-295, Philadelphia, Pa., July 1997. However, these systems also require that the semantic content be manually associated with each image. For these techniques to be practical for the WWW, automatic assignment of keywords to the images is essential.
Research looking into the general problem of the relationship between images and captions in a large photographic library like a newspaper archive has been undertaken. See R. Srihari, "Automatic Indexing and Content-based Retrieval of Captioned Images," IEEE Computer, 28(9):49-56, September 1995 and N. Rowe, "Using Local Optimality Criteria for Efficient Information Retrieval with Redundant Information Filters," ACM Transactions on Information Systems, 14(2):138-174, March 1996. These systems assume that captions have already been extracted from the pictures, an assumption not easily applicable to the WWW.
Various techniques have been developed for assigning keywords to images on the WWW. However, none of these techniques can perform reasonably well on all types of HTML pages. Also, problems exist because different people put captions for images in different locations. Thus, it is difficult to establish a single procedure for assigning keywords to images. Further, it is possible that in the source file for a document, a caption will be located between two images or distant from the single relevant image. In such a case, it is difficult to determine how the caption will be assigned.
WebSEEk uses WWW Universal Resource Locator (URL) addresses and HTML tags associated with images to extract the keywords. See S. Chang et al., "Visual Information Retrieval From Large Distributed Online Repositories," Communications of the ACM, 40(12):63-71, December 1997. This will result in low recall since not all of the text surrounding an image is considered.
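The kind of keyword extraction described above can be illustrated with a minimal sketch. It is not WebSEEk's actual implementation; the function name `keywords_from_url_and_alt`, the example URL, and the choice of the ALT attribute as the only HTML tag text considered are assumptions made for illustration. The sketch shows why recall may be low: only terms present in the URL path or the tag attribute can ever become keywords.

```python
import re
from urllib.parse import urlparse, unquote


def keywords_from_url_and_alt(image_url, alt=""):
    """Return lowercase terms taken from the URL path and the ALT text.

    Hypothetical helper: text surrounding the image is never consulted.
    """
    path = unquote(urlparse(image_url).path)
    # Drop the file extension of the last path component, e.g. ".jpg".
    path = re.sub(r"\.[A-Za-z0-9]+$", "", path)
    terms = re.split(r"[^A-Za-z0-9]+", path) + re.split(r"[^A-Za-z0-9]+", alt)
    seen, keywords = set(), []
    for term in terms:
        term = term.lower()
        if term and term not in seen:  # skip empties, keep first occurrence
            seen.add(term)
            keywords.append(term)
    return keywords
```

For a hypothetical image at `http://www.example.com/cars/acura_nsx-red.jpg` with ALT text "Acura NSX", the helper yields the terms cars, acura, nsx and red, while any descriptive sentence elsewhere on the page contributes nothing.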
Harmandas et al. use the text after an image URL until the end of a paragraph or until a link to another image is encountered as the caption of the image. See V. Harmandas et al., "Image Retrieval by Hypertext Links," Proceedings of the ACM SIGIR '97 Conference on Research and Development in Information Retrieval, pages 296-303, Philadelphia, Pa., July 1997. Harmandas et al. evaluated the effectiveness of retrieval of images based on (a) the caption text, (b) caption text of other images of the page, (c) the non-caption text of the page and (d) the full-text of all pages linked to the image page. However, this method of defining captions will not work for the situation where a collection of images in a WWW page is described by a single caption at the top or bottom of the page. An example of this situation is shown in FIG. 1. Moreover, indexing an image by the full-text of all pages linked to the image page may result in many irrelevant images being retrieved.
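A caption rule of this kind can be sketched as follows. This is a simplified illustration, not the cited authors' code: the class and function names are hypothetical, and the sketch treats the next `<img>` tag (rather than a link to another image) as the point where a caption ends.

```python
from html.parser import HTMLParser


class CaptionAfterImage(HTMLParser):
    """Collect, for each <img>, the text that follows it up to the end of
    the paragraph or the next <img>, whichever comes first."""

    def __init__(self):
        super().__init__()
        self.captions = {}    # image src -> caption text
        self._current = None  # src of the image whose caption is open

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # A new image closes the previous caption and opens its own.
            self._current = dict(attrs).get("src")
            if self._current is not None:
                self.captions.setdefault(self._current, "")
        elif tag == "p":
            self._current = None  # a new paragraph ends any open caption

    def handle_endtag(self, tag):
        if tag == "p":
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.captions[self._current] += data


def captions_after_images(html):
    """Hypothetical entry point: map each image src to its caption."""
    parser = CaptionAfterImage()
    parser.feed(html)
    parser.close()
    return {src: " ".join(text.split()) for src, text in parser.captions.items()}
```

When each image is followed by its own text, the sketch assigns that text as expected. When a page places a single caption in a paragraph above a group of images, every image receives an empty caption, which is the limitation noted above.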
The Marie-3 system uses text "near" an image to identify a caption. See N. Rowe et al., "Automatic Caption Localization for Photographs on World-Wide Web Pages," Information Processing and Management, 34(1):95-107, 1998. "Nearness" is defined as the caption and image being within a fixed number of lines in the parse of the source HTML file. There is an exception if an image occurs within these lines. In this case the caption-scope nonintersection principle applies. This principle states that the scope for a caption of one image cannot intersect the scope for a caption of another image. Although Rowe et al. found this principle to be true in all of their examples, they considered a small section of the WWW. In some cases the same caption is used for a collection of images, as shown in FIG. 1. This figure also shows that defining nearness to be a fixed number of lines in the source file will not work because a caption at the top of a page can describe a group of images on the page.
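The combination of a fixed line window with non-intersecting caption scopes can be sketched as follows. This is an illustrative simplification and not the Marie-3 implementation: it operates on raw source lines rather than a parse, the function name `assign_lines` and the default window of two lines are assumptions, and ties between two equally distant images are resolved in favor of the earlier image.

```python
import re

IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def assign_lines(lines, window=2):
    """Give each text line to the nearest image line within `window` lines.

    Because every line has at most one owner, the scopes of two images
    never intersect. Lines farther than `window` from any image are dropped.
    """
    image_lines = [i for i, line in enumerate(lines) if IMG_RE.search(line)]
    scopes = {i: [] for i in image_lines}
    for j, line in enumerate(lines):
        text = " ".join(TAG_RE.sub(" ", line).split())  # strip tags
        if not text:
            continue
        near = [i for i in image_lines if abs(i - j) <= window]
        if near:
            # Nearest image wins; the earlier image wins a tie (assumption).
            owner = min(near, key=lambda i: (abs(i - j), i))
            scopes[owner].append(text)
    return {i: " ".join(parts) for i, parts in scopes.items()}
```

In this sketch, a heading placed several lines above a group of images falls outside every window and is assigned to no image, which corresponds to the shortcoming described above for a caption at the top of a page.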
WebSeer, discussed briefly above, considers various features as criteria to index the images. For example, image name, ALT tags, HTML tags, text in hyperlinks and image captions can be used. In one particular example, the caption of an image may be the text in the same center tag (used to place the image within the HTML document as displayed) as the image, within the same cell of a table as the image or the same paragraph as the image. See C. Frankel et al., "WebSeer: An Image Search Engine for the World-Wide Web," Technical Report 96-14, University of Chicago, Computer Science Department, August 1996. However, it appears that this system will not assign all the relevant text of an image if the image and text are arranged in a table, since the system only assigns the text in the same cell as the image to the image. For example, for the table shown in FIG. 2, the image and the text relevant to it are in different cells.
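The same-cell criterion can be illustrated with a minimal sketch. It is not WebSeer's code: the class and function names are hypothetical, only the table-cell rule is modeled, and nested tables are not handled.

```python
from html.parser import HTMLParser


class SameCellText(HTMLParser):
    """Record, for every table cell, the images and the text it contains."""

    def __init__(self):
        super().__init__()
        self.cells = []    # one {"images": [...], "text": [...]} per cell
        self._cell = None  # cell currently being read, if any

    def _close_cell(self):
        if self._cell is not None:
            self.cells.append(self._cell)
            self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._close_cell()  # tolerate a missing </td>
            self._cell = {"images": [], "text": []}
        elif tag == "img" and self._cell is not None:
            src = dict(attrs).get("src")
            if src:
                self._cell["images"].append(src)

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._close_cell()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"].append(data)


def same_cell_keywords(html):
    """Hypothetical entry point: map each image src to its cell's text."""
    parser = SameCellText()
    parser.feed(html)
    parser.close()
    result = {}
    for cell in parser.cells:
        text = " ".join(" ".join(cell["text"]).split())
        for src in cell["images"]:
            result[src] = text
    return result
```

For a table row whose first cell holds only an image and whose second cell holds the descriptive text, the sketch returns an empty string for that image, matching the situation described above in which the image and its relevant text are in different cells.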