Machine learning, of which deep learning is a subset, can be used for a variety of useful applications related to data understanding, detection, and/or classification, including image classification, optical character recognition (OCR), object recognition, action recognition, speech recognition, and emotion recognition.
A particular application is generating captions that describe images, including the subjects and objects in the images and what they are doing. Indeed, scene understanding is an important goal of present-day computer vision. Human beings can comprehend a visual scene completely in a short time. The goal of scene understanding is to enable a machine to see and understand visual scenes as human beings do. Image captioning requires the machine to automatically understand a given image and generate a natural language description of it. The description can then be presented visually or aurally to aid people, both those who may have perception problems and those who do not.
Image captioning has been a challenging problem because, to generate a reasonable description of a given image, a machine must capture the key visual aspects of the image, which contains a set of unstructured objects, and express the scene in human-understandable natural language. Gaming image captioning in particular is challenging because there are no available image caption datasets for games.