Image processing can be utilized to resolve attribute(s) for an object in an image. For example, some image processing techniques utilize image processing engine(s) to resolve classification(s) for object(s) captured in the image. For instance, for an image that captures a sailboat, image processing can be performed to resolve classification value(s) of “boat” and/or “sailboat” for the image. Additional or alternative attributes can be resolved utilizing image processing. For example, optical character recognition (OCR) can be utilized to resolve text in an image. Also, for example, some image processing techniques can be utilized to determine more particular classifications of an object in an image (e.g., a particular make and/or model of a sailboat).
Some image processing engines utilize one or more machine learning models, such as a deep neural network model that accepts an image as input and that utilizes learned parameters to generate, as output based on the image, measure(s) that indicate which of a plurality of corresponding attributes are present in the image. If a measure indicates that a particular attribute is present in the image (e.g., if the measure satisfies a threshold), that attribute can be considered “resolved” for the image (i.e., that attribute can be considered to be present in the image). However, image processing of an image may often be unable to resolve one or more (e.g., any) attributes. Moreover, the resolved attributes for an image may not enable definition of an object in the image with a desired degree of specificity. For example, resolved attributes of an image may enable determination that a “shirt” is present in the image and that the shirt is “red”, but may not enable determination of a manufacturer of the shirt, whether the shirt is “short sleeve” or “long sleeve”, etc.
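As a minimal illustration of the thresholding described above, the following Python sketch treats the output of an image processing engine as a mapping from each attribute to its measure, and considers an attribute “resolved” when its measure satisfies a threshold. The function name `resolve_attributes`, the example measures, and the threshold value of 0.5 are hypothetical and chosen only for illustration; no actual machine learning model is invoked.

```python
from typing import Dict, List


def resolve_attributes(measures: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the attributes whose measure satisfies the threshold.

    `measures` stands in for the output of a machine learning model
    (hypothetical values; a real engine would generate them from an image).
    Resolved attributes are returned in order of decreasing measure.
    """
    resolved = [attr for attr, measure in measures.items() if measure >= threshold]
    return sorted(resolved, key=lambda attr: measures[attr], reverse=True)


# Hypothetical measures for an image that captures a red shirt.
measures = {
    "shirt": 0.94,
    "red": 0.88,
    "short sleeve": 0.41,
    "long sleeve": 0.37,
}

resolved = resolve_attributes(measures, threshold=0.5)

# Attributes that the engine could not resolve for this image.
unresolved = [attr for attr in measures if attr not in resolved]
```

With these assumed measures, “shirt” and “red” are resolved, while neither “short sleeve” nor “long sleeve” satisfies the threshold, mirroring the situation in which resolved attributes do not define the object with the desired degree of specificity.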
Separately, humans may engage in human-to-computer dialogs with interactive software applications referred to herein as “automated assistants” (also referred to as “interactive personal assistants,” “intelligent personal assistants,” “personal voice assistants,” “conversational agents,” etc.). Automated assistants often receive natural language input (utterances) from users. The natural language input can in some cases be received as audio input (e.g., streaming audio) and converted into text, and/or received as textual (e.g., typed) natural language input. Automated assistants respond to natural language input with responsive content (e.g., visual and/or audible natural language output). However, automated assistants often do not accept and/or respond to requests that are based on sensor data (e.g., image(s)) that captures one or more properties of an environmental object.