Automatically generating natural language descriptions of images has attracted increasing interest due to practical applications in image searching, accessibility for visually impaired people, and management of image collections. Conventional image processing techniques do not support high-precision natural language captioning and image searching due to limitations of conventional image tagging and search algorithms. This is because conventional techniques merely associate tags with images, but do not define relationships among the tags or between the tags and the image itself. Moreover, conventional techniques may use a top-down approach in which an overall “gist” of an image is first derived and then refined into appropriate descriptive words and captions through language modeling and sentence generation. This top-down approach, however, often fails to capture fine details of an image, such as local objects, attributes, and regions, that contribute to a precise description of the image. As such, it may be difficult to use conventional techniques to generate precise and complex image captions, such as “a man feeding a baby in a high chair with the baby holding a toy.” Consequently, captions generated using conventional techniques may omit important image details, which makes it difficult for users to search for specific images and to fully understand the content of an image based on its associated caption.