Applications for enhancing, annotating, redacting, or otherwise editing images are now widespread. Many such applications include functionality for segmenting an image into multiple regions. For instance, a user may wish to identify one portion of the image that is associated with an object in the foreground and another portion that is associated with the background. Some applications enable a user to draw a bounding box around such regions. However, such manual functionality often requires significant user interaction and provides only coarse segmentation of image features.
Other previously available systems enable a user to provide a natural language phrase to segment an image. Such systems identify latent features of the image and latent features of the entire phrase, and combine the two to segment the image. More specifically, these previously available systems process the entirety of the phrase to detect latent features of the phrase. Only after the entirety of the phrase has been processed are the latent features of the image and the phrase combined to segment the image. In this regard, the latent features of the image are combined with the latent features of the phrase only once, at the end of the phrase processing. Thus, the segmentation of the image is based on only a single interaction between the image and phrase latent features, occurring after the entirety of the phrase has been processed. Segmenting an image based on an analysis of an entire expression, however, can result in an inaccurate segmentation (e.g., an incorrect spatial arrangement). By way of example only, given the expression “the dog on the right” and no reference to the image while the expression is processed, existing technologies may not recognize whether to focus on “dog” or “on the right” when segmenting the image.
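By way of illustration only, the single, late combination described above can be sketched in a few lines of Python. This is a toy sketch under stated assumptions, not a representation of any particular previously available system: the two-dimensional "latent features", the embedding table `TOY_EMBEDDINGS`, the functions `encode_phrase` and `segment_late_fusion`, and the threshold are all hypothetical names and values chosen for the example.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical toy embeddings: dimension 0 ~ "dog-ness", dimension 1 ~ "right-ness".
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "the": [0.0, 0.0],
    "dog": [1.0, 0.0],
    "on": [0.0, 0.0],
    "right": [0.0, 1.0],
}


def encode_phrase(words: List[str]) -> Vector:
    """Consume every word and return one summary vector (here, a simple sum).

    The image is never consulted while the phrase is being processed.
    """
    state = [0.0, 0.0]
    for word in words:
        embedding = TOY_EMBEDDINGS.get(word, [0.0, 0.0])
        state = [s + e for s, e in zip(state, embedding)]
    return state


def segment_late_fusion(
    image_features: List[List[Vector]],
    words: List[str],
    threshold: float,
) -> List[List[int]]:
    """Combine image and phrase features exactly once, after the whole phrase."""
    phrase_vector = encode_phrase(words)
    mask = []
    for row in image_features:
        mask_row = []
        for pixel in row:
            # Single interaction: dot product of pixel and phrase features.
            score = sum(p * q for p, q in zip(pixel, phrase_vector))
            mask_row.append(1 if score >= threshold else 0)
        mask.append(mask_row)
    return mask


# A 1x3 "image": a dog on the left, background in the middle, a dog on the right.
image = [[[1.0, 0.0], [0.0, 0.5], [1.0, 1.0]]]
phrase = ["the", "dog", "on", "the", "right"]
mask = segment_late_fusion(image, phrase, threshold=1.0)
print(mask)  # [[1, 0, 1]]
```

In this toy case the dog on the left is selected along with the dog on the right, because the one-time combination cannot distinguish the contribution of “dog” from that of “on the right”.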