The proliferation of digital cameras, both stand-alone (e.g., digital single-lens reflex cameras) and integrated into computing devices (e.g., included in a smart phone), has resulted in a large number of images for many users. Oftentimes, images are less than perfect. For instance, an image may be taken by a novice, and thus be poorly composed, or circumstances, such as weather, may influence a background (e.g., color of the sky) in an image. Hence, a user may desire to edit an image, such as with an image editing application. Accomplishing editing tasks with an image editing application and producing a naturally-appearing image (e.g., an image in which an observer cannot distinguish where in the image the editing occurred) requires a significant skill level and effort, because of the complexity of the editing process. For instance, a trained professional (e.g., one skilled in use of the image editing application) may take hours to produce a single image according to a request to merely replace an object in the image with another object, such as an object from another image.
Because of the complexity of the image editing process, and the virtually unlimited variety of words a user can speak in various languages and with various dialects, most image editing applications either do not include voice interfaces, or include voice interfaces that can fulfill only a limited set of spoken commands. For instance, Adobe's PixelTone application can receive a spoken editing query from a user for an image to be edited, such as “Make the man brighter”, but the PixelTone application has no semantic knowledge of the image, and does not participate in a user conversation. Consequently, in this example, the user must first manually select “the man” in the image, such as by painting over the man with a paintbrush tool, before requesting to “Make the man brighter”, which significantly limits the usefulness of the voice interface. Hence, such image editing applications do not direct a user conversation, but rather merely receive limited spoken commands.
Moreover, image editing applications do not receive multi-modal user input, including a complementary user input that is received in addition to speech input during a user conversation. Consequently, image editing applications with voice interfaces are limited by how effectively the image editing application processes spoken input, and do not gain the benefit of other forms of user input during a user conversation.