Image searching applications, such as Adobe® Stock, rank images against queries, which may be natural language queries provided by users. Neural networks have been used to determine the similarities between queries and images. However, these approaches classify an image as a whole and therefore often overlook specific regions of the image that may be relevant to a query. Further, these approaches fail to account for the potentially rich metadata available for images, such as captions (e.g., titles), tags, keywords, descriptions, and the like. Neural networks have also been used in web searches to map queries to the text of web documents at the semantic level. However, these web search approaches do not account for images, or the metadata of images, that might be contained in those web documents.
Using conventional approaches to determining similarities between queries and images, a computer may be unable to accurately determine the relevance of certain images to a query. For example, where a query is “The golden gate bridge in San Francisco,” highly ranked images may include clear and sharp photos of San Francisco. However, other relevant images may be ranked low despite depicting the subject matter of the query. For example, conventional neural networks may have difficulty recognizing images that contain visual distortion, such as blur, or that are artistic renditions of the subject matter. Additionally, where the subject matter occupies only a small portion of the image, conventional approaches may overlook it.