1. Field of the Invention
The present invention is directed generally to digital image categorization, and more particularly to automatic classification of digital images and enabling semantic image searching on digital image collections.
2. Description of the Related Art
Digital images may include raster graphics, vector graphics, or a combination thereof. Raster graphics data (also referred to herein as bitmaps) may be stored and manipulated as a grid of individual picture elements called pixels. A bitmap may be characterized by its width and height in pixels and also by the number of bits per pixel. Commonly, a color bitmap defined in the RGB (red, green, blue) color space may comprise between one and eight bits per pixel for each of the red, green, and blue channels. An alpha channel may be used to store additional data such as per-pixel transparency values. Vector graphics data may be stored and manipulated as one or more geometric objects built with geometric primitives. The geometric primitives (e.g., points, lines, polygons, Bézier curves, and text characters) may be based upon mathematical equations to represent parts of digital images.
Image Retrieval, Annotation, and Semantic Search
Image retrieval is the task of locating a specific image in a collection of digital images (e.g., digital photographs, digital art, etc.). A basic image retrieval method is to display a tiling of thumbnails on the screen, which a user can scroll through, visually examining each thumbnail until the target image is located.
Image annotation is the task of assigning keywords, captions, location, and/or other metadata to a photograph in order to associate semantic content with the image.
Semantic image search is a specific type of image retrieval in which a user searches for an image through semantically meaningful queries or interfaces. Two common semantic image search methods are search by keywords and faceted search. If images in a photo collection have been annotated with keywords, a user can retrieve an image by specifying any of these keywords, or related keywords (such as synonyms or other semantically related words), and the system then retrieves all images containing any or all of the keywords. A system can also provide a faceted search interface that allows the user to filter the set of images along meaningful dimensions, such as location (e.g. “taken in Seattle, Wash.”), time (e.g. “taken between years 2006 and 2008”), people (e.g. “contains at least three people”), etc.
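By way of illustration only, the two search methods described above can be sketched as follows. The `Photo` record, its field names, and the sample collection are hypothetical and are not taken from any particular system; expansion of a query to synonyms or other related keywords is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Photo:
    # Hypothetical annotation record; field names are illustrative only.
    name: str
    keywords: set = field(default_factory=set)
    location: str = ""
    year: int = 0
    num_people: int = 0


def search_by_keywords(photos, query, match_all=False):
    """Return photos annotated with any (or, if match_all, all) query terms."""
    terms = {t.lower() for t in query}
    combine = all if match_all else any
    return [p for p in photos if combine(t in p.keywords for t in terms)]


def faceted_filter(photos, location=None, year_range=None, min_people=None):
    """Keep only photos that satisfy every facet that was specified."""
    result = []
    for p in photos:
        if location is not None and p.location != location:
            continue
        if year_range is not None and not (year_range[0] <= p.year <= year_range[1]):
            continue
        if min_people is not None and p.num_people < min_people:
            continue
        result.append(p)
    return result


# Small made-up collection used only to exercise the two functions.
PHOTOS = [
    Photo("a.jpg", {"beach", "sunset"}, "Seattle", 2007, 3),
    Photo("b.jpg", {"beach", "family"}, "Portland", 2005, 4),
    Photo("c.jpg", {"urban"}, "Seattle", 2008, 0),
]

any_match = search_by_keywords(PHOTOS, ["beach", "sunset"])
all_match = search_by_keywords(PHOTOS, ["beach", "sunset"], match_all=True)
seattle = faceted_filter(PHOTOS, location="Seattle",
                         year_range=(2006, 2008), min_people=3)
```

In this sketch each facet is an independent predicate, so facets may be combined or omitted freely; an actual system would likely use an index rather than a linear scan.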
In image retrieval, a desirable quality is the ability to locate a target image as quickly and easily as possible. Suppose a user wants to locate an image from a collection of n photos. By visually inspecting the photos one at a time, the user reduces the number of candidates from n to n−1 after one inspection and takes on average approximately n/2 inspections to locate the image. By contrast, using a faceted search interface, a user may be able to filter out half of the candidates with each successive query, reducing the number of candidates from n to n/2 each time, and locate the image after approximately log₂(n) queries. The latter approach is generally considered more desirable as it allows a user to retrieve the target image faster and with less effort.
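The comparison above can be checked numerically with a short sketch. It assumes an idealized case in which the target is equally likely to be at any position in the collection and each faceted query removes exactly half of the remaining candidates; the function names are illustrative only. Under these assumptions the exact average for one-at-a-time inspection is (n+1)/2, which is approximately n/2 for large n.

```python
import math


def expected_linear_inspections(n):
    """Average number of inspections when the target is equally likely
    to be at any of the positions 1..n, which equals (n + 1) / 2."""
    return sum(range(1, n + 1)) / n


def halving_queries(n):
    """Number of queries needed if each query keeps half of the
    remaining candidates (rounded up) until one candidate is left."""
    queries = 0
    while n > 1:
        n = math.ceil(n / 2)
        queries += 1
    return queries


for size in (16, 1000, 1024):
    print(size, expected_linear_inspections(size), halving_queries(size))
```

For a collection of 1000 photos, the sketch gives an average of 500.5 inspections for sequential viewing against 10 halving queries.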
Automatic Semantic Classifiers
As used herein, automatic semantic classifiers are defined as machine learning algorithms or programs that take an image as input and produce a score indicating how well the image matches a predefined scene, such as “waterscape”, “landscape”, “urban”, “beach”, etc. Almost all current, i.e., prior art, semantic classifiers preprocess the content of the image and produce visual features instead of learning on raw pixel values. Two common low-level visual features are colors and textures. More recent approaches construct hierarchical visual features or combine colors and textures into feature themes, and classify on these secondary features instead of the original low-level features. Some approaches use the metadata (e.g., camera EXIF) as input to the classifier as well.
Almost all automatic semantic classifiers learn by example. For each semantic category, a set of photographs matching the scene is manually selected and used as positive training examples. A second set of photographs (e.g., randomly or manually selected) is used as negative examples. A supervised learning algorithm is trained to separate the positive and negative examples based on the differences in the input visual features and/or input metadata.
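As a minimal sketch of the learn-by-example approach described above, the following trains a logistic regression model by stochastic gradient descent on positive and negative examples for a single category. Logistic regression is used here only as one simple instance of a supervised learning algorithm and is not asserted to be the method of any particular prior art classifier. Each feature vector is assumed to be a hypothetical mean (R, G, B) color in the range 0 to 1, and all sample values are made up for illustration.

```python
import math


def train_logistic(positives, negatives, epochs=500, lr=0.5):
    """Fit weights and a bias that separate positive from negative examples."""
    samples = [(x, 1.0) for x in positives] + [(x, 0.0) for x in negatives]
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in samples:
            z = bias + sum(w * v for w, v in zip(weights, x))
            pred = 1.0 / (1.0 + math.exp(-z))
            err = pred - y  # gradient of the log loss with respect to z
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def score(model, features):
    """Return a score in (0, 1) for how well the features match the category."""
    weights, bias = model
    z = bias + sum(w * v for w, v in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Made-up mean-color features: bluish images stand in for "waterscape"
# positives, warmer images for negatives.
WATERSCAPE_POS = [(0.2, 0.4, 0.8), (0.1, 0.5, 0.9), (0.3, 0.4, 0.7)]
WATERSCAPE_NEG = [(0.7, 0.5, 0.2), (0.5, 0.6, 0.3), (0.8, 0.3, 0.3)]

MODEL = train_logistic(WATERSCAPE_POS, WATERSCAPE_NEG)
bluish_score = score(MODEL, (0.2, 0.45, 0.85))
warm_score = score(MODEL, (0.75, 0.4, 0.25))
```

One such model would be trained per semantic category, and richer features (textures, hierarchical features, or metadata) could replace the mean color without changing the training loop.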