Machine learning is an application of artificial intelligence. In machine learning, a computer or computing device is configured to learn from examples, in a manner loosely modeled on human learning, so that the computer may be taught to learn on its own. The development of neural networks has been key to training computers, via deep learning models, to interpret the world in a way that resembles human understanding.
In recent years, with the advancement of digital cameras, smart cell phones and digital video recording devices, the advancement of image processing and streaming techniques, the wide and inexpensive availability of digital storage space and the widespread availability of the internet, an enormous number of digitally generated images has become readily available for online marketing, social media, educational, and medical purposes. Applications developed for these areas often need the story behind an image when the image is presented, in order to enhance the presentation. For a relatively small number of images, it is possible for human beings to carry out the task of extracting stories from the available images, although the task is prone to human error and the story varies from person to person. However, when the number of images is massive, as in online marketing, education, medicine or social media, it is much more economical and efficient to use artificial intelligence devices to carry out the task of extracting stories from images.
Image captioning has many real-world applications. Prior approaches usually involve deep learning models that process the image and the text separately. First, the input image is fed into a deep learning model (generally based on a convolutional neural network) to extract features. Next, the extracted feature vector is concatenated with the word embeddings of the partial sentence predicted so far. Finally, the concatenated vector is fed into another deep learning model (generally based on a recurrent neural network) to predict the next token, i.e., the next word or character. The final output is the concatenation of the predicted tokens into a sentence, which serves as the image caption.
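The three-stage flow described above can be illustrated with a minimal sketch. It is not an implementation of any particular prior system: the convolutional encoder and the recurrent decoder are replaced by toy stand-ins (band averaging and a randomly initialized linear scorer), and every name in it (VOCAB, extract_features, embed_partial_sentence, predict_next_token, generate_caption) is hypothetical. Only the data flow follows the description: extract features, concatenate them with the embedding of the partial sentence, predict the next token, and repeat.

```python
import random

# Hypothetical toy vocabulary; a real system would learn one from caption data.
VOCAB = ["<start>", "a", "dog", "on", "grass", "<end>"]
FEATURE_DIM = 4  # length of the image feature vector (assumed)
EMBED_DIM = 3    # length of each word embedding (assumed)

# Randomly initialized parameters stand in for trained weights.
_rng = random.Random(0)
EMBEDDINGS = {tok: [_rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for tok in VOCAB}
OUTPUT_WEIGHTS = {
    tok: [_rng.uniform(-1, 1) for _ in range(FEATURE_DIM + EMBED_DIM)]
    for tok in VOCAB
    if tok != "<start>"  # the start marker is never predicted
}


def extract_features(image):
    """Stand-in for the CNN encoder: average the pixels in FEATURE_DIM bands.

    Assumes the image (a list of rows of floats) has at least FEATURE_DIM pixels.
    """
    pixels = [p for row in image for p in row]
    size = len(pixels) // FEATURE_DIM
    return [sum(pixels[i * size:(i + 1) * size]) / size for i in range(FEATURE_DIM)]


def embed_partial_sentence(tokens):
    """Summarize the partial sentence as the mean of its word embeddings."""
    vectors = [EMBEDDINGS[t] for t in tokens]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def predict_next_token(features, tokens):
    """Stand-in for the RNN decoder: score each token on the concatenated input."""
    x = features + embed_partial_sentence(tokens)  # list concatenation
    scores = {
        tok: sum(w * v for w, v in zip(weights, x))
        for tok, weights in OUTPUT_WEIGHTS.items()
    }
    return max(scores, key=scores.get)


def generate_caption(image, max_len=8):
    """Greedy decoding: append one predicted token at a time."""
    features = extract_features(image)  # computed once per image
    tokens = ["<start>"]
    while len(tokens) - 1 < max_len:
        nxt = predict_next_token(features, tokens)
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return " ".join(tokens[1:])  # drop the start marker


if __name__ == "__main__":
    toy_image = [[0.1, 0.5, 0.9, 0.3], [0.7, 0.2, 0.4, 0.8]]
    print(generate_caption(toy_image))
```

Because the weights are random rather than trained, the generated text is meaningless. The sketch is meant only to show that the image and the text are encoded by separate components whose outputs meet at the concatenation step.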