In the processing and use of images, it is often desirable to create a caption describing an image. A caption is a phrase that describes the content of an image. For example, the caption “business people sitting around a large conference table” may describe an image with that content. Captions can identify the objects in an image, describe relationships among the objects, and provide other details about the image. Captions can also draw attention to certain image features which otherwise may be overlooked, and can be used to categorize the image for filing and subsequent retrieval. Manually captioning a large number of digital images is very time-consuming. Manual captioning is also prone to human error, which results in inaccurate captions.
Computerized techniques have been used to caption images. However, conventional computerized techniques often produce captions that are not sufficiently accurate (e.g., a group of bullets lying side by side is described as a pack of cigarettes), that are too long, that are unnaturally composed (e.g., “a teddy bear sitting on a chair with a stuffed animal,” “a street sign with a street sign on it”), or that exhibit a combination of these problems. Conventional computerized techniques also often fail to describe objects which would interest a human viewing the image, and mistakenly associate attributes with the wrong object. For example, for an image in which a tennis player's shorts are white, the automatically generated caption may indicate that the shorts are black.