Image classification is a fundamental problem in computer vision. Broadly speaking, image classification attempts to extract semantic information from an image so that the image can be labeled to describe its content. Semantic information can include, for instance, objects depicted in an image (and the locations in the image at which the objects are depicted), scenes depicted in an image (e.g., whether the image depicts a beach or a sunset), moods associated with human faces or facial expressions depicted in an image, image aesthetics (e.g., good composition, poor composition, adherence to the rule of thirds, and so on), image sentiment (e.g., fear, anger, and the like), and so forth.
Some conventional image classification techniques categorize images into fixed sets of classes representative of semantic information by training a multi-class classifier. However, because semantic relationships between classes can be complex (e.g., hierarchical, disjoint, etc.), it is difficult to define a classifier that encodes many of the semantic relationships. To address this shortcoming, visual-semantic embedding techniques have been developed. Conventional visual-semantic embedding techniques leverage semantic information from unannotated text data to learn semantic relationships between text labels, and explicitly map images into a rich semantic embedding space. However, these conventional visual-semantic embedding techniques are limited to annotating an image with a single text label. Accordingly, conventional techniques for automatically associating text labels with images to describe their content are inadequate for some image labeling tasks.
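By way of illustration only, the single-label limitation can be sketched as a nearest-label lookup in a shared embedding space. The sketch below is a simplified assumption rather than any particular conventional system: the three-dimensional vectors, the label set, and the function names `cosine_similarity` and `annotate_single_label` are all hypothetical, and a real system would learn the label embeddings from text data and the image embedding from a trained model.

```python
import math

# Hypothetical toy label embeddings (assumed values, not learned from data).
LABEL_EMBEDDINGS = {
    "beach": [0.9, 0.1, 0.0],
    "sunset": [0.1, 0.9, 0.1],
    "dog": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def annotate_single_label(image_embedding, label_embeddings):
    """Return only the one label whose embedding is most similar to the image."""
    return max(
        label_embeddings,
        key=lambda label: cosine_similarity(image_embedding, label_embeddings[label]),
    )


# Assumed embedding of an image that depicts both a beach and a sunset.
image_embedding = [0.6, 0.5, 0.05]
predicted = annotate_single_label(image_embedding, LABEL_EMBEDDINGS)
print(predicted)
```

In this sketch the image lies close to both the "beach" and "sunset" label embeddings, yet the lookup returns only the single closest label, so the second concept depicted in the image goes unlabeled.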