The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.
Image captioning is drawing increasing interest in computer vision and machine learning. It requires a machine to automatically describe the content of an image using a natural language sentence. While this task seems trivial for human beings, it is complicated for machines because it requires the language model to capture various semantic features within an image, such as objects' motions and actions. Another challenge for image captioning, especially for generative models, is that the generated output should read as a natural, human-like sentence.
Recent successes of deep neural networks in machine translation have catalyzed the adoption of neural networks in solving image captioning problems. The idea originates from the encoder-decoder architecture in neural machine translation: for image captioning, a convolutional neural network (CNN) is adopted to encode the input image into feature vectors, and a sequence modeling approach (e.g., long short-term memory (LSTM)) decodes the feature vectors into a sequence of words.
Most recent work in image captioning relies on this structure, and leverages image guidance, attributes, region attention, or text attention as the attention guide. FIG. 2A shows an attention leading decoder that uses previous hidden state information to guide attention and generate an image caption (prior art).
Therefore, an opportunity arises to improve the performance of attention-based image captioning models.
Automatically generating captions for images has emerged as a prominent interdisciplinary research problem in both academia and industry. It can aid visually impaired users, and make it easy for users to organize and navigate through large amounts of typically unstructured visual data. In order to generate high quality captions, an image captioning model needs to incorporate fine-grained visual clues from the image. Recently, visual attention-based neural encoder-decoder models have been explored, where the attention mechanism typically produces a spatial map highlighting image regions relevant to each generated word.
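By way of illustration only, the spatial map produced by a visual attention mechanism can be sketched as a set of normalized weights over image regions, from which a weighted context vector is formed. The following is a minimal sketch under stated assumptions: the function names `softmax` and `attend`, the dot-product scoring, and the toy region features are hypothetical and are not taken from this disclosure or from any particular prior model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_features, hidden_state):
    """Return (weights, context) for one decoding timestep.

    region_features: list of feature vectors, one per image region.
    hidden_state: decoder hidden state with the same length as each feature.
    Dot-product scoring is an assumption made for this sketch.
    """
    scores = [sum(f * h for f, h in zip(feat, hidden_state))
              for feat in region_features]
    weights = softmax(scores)  # the spatial map: one weight per region
    dim = len(hidden_state)
    # Context vector: attention-weighted sum of the region features.
    context = [sum(w * feat[d] for w, feat in zip(weights, region_features))
               for d in range(dim)]
    return weights, context

# Three hypothetical regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
hidden = [2.0, 0.0]
weights, context = attend(regions, hidden)
```

In this toy example the first region aligns best with the hidden state and therefore receives the largest weight. A model that attends at every timestep would repeat this computation for each generated word.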
Most attention models for image captioning and visual question answering attend to the image at every timestep, irrespective of which word is going to be emitted next. However, not all words in the caption have corresponding visual signals. Consider the example in FIG. 16 that shows an image and its generated caption “a white bird perched on top of a red stop sign”. The words “a” and “of” do not have corresponding canonical visual signals. Moreover, linguistic correlations make the visual signal unnecessary when generating words like “on” and “top” following “perched”, and “sign” following “a red stop”. Furthermore, training with non-visual words can lead to worse performance in generating captions because gradients from non-visual words could mislead and diminish the overall effectiveness of the visual signal in guiding the caption generation process.
Therefore, an opportunity arises to determine the importance that should be given to the target image during caption generation by an attention-based visual neural encoder-decoder model.
Deep neural networks (DNNs) have been successfully applied to many areas, including speech and vision. On natural language processing tasks, recurrent neural networks (RNNs) are widely used because of their ability to capture long-term dependencies. A problem in training deep networks, including RNNs, is vanishing and exploding gradients. This problem is especially apparent when training an RNN. A long short-term memory (LSTM) neural network is an extension of an RNN that addresses this problem. In an LSTM, the current activity of a memory cell depends linearly on its past activity. A forget gate is used to modulate the information flow between the past and the current activities. LSTMs also have input and output gates to modulate their input and output.
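By way of illustration only, one commonly used LSTM formulation consistent with the description above can be written as follows. The symbols are conventional notation assumed here and are not taken from this disclosure: x_t is the input at timestep t, h_t the hidden state, c_t the memory cell, i_t, f_t, and o_t the input, forget, and output gates, σ the logistic sigmoid, and ⊙ the element-wise product.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The update for c_t shows the linear dependence of the current cell activity on the past activity c_{t-1}, with the forget gate f_t modulating how much of the past activity is carried forward.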
The generation of an output word in an LSTM depends on the input at the current timestep and the previous hidden state. However, LSTMs have been configured to condition their output on auxiliary inputs, in addition to the current input and the previous hidden state. For example, in image captioning models, LSTMs incorporate external visual information provided by image features to influence linguistic choices at different stages. As image caption generators, LSTMs take as input not only the most recently emitted caption word and the previous hidden state, but also regional features of the image being captioned (usually derived from the activation values of a hidden layer in a convolutional neural network (CNN)). The LSTMs are then trained to vectorize the image-caption mixture in such a way that this vector can be used to predict the next caption word.
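By way of illustration only, one simple way to condition an LSTM on an auxiliary input is to concatenate the auxiliary vector with the current input and the previous hidden state before computing the gates. The following is a minimal sketch under stated assumptions: the helper names (`affine`, `lstm_step`, `init_params`), the concatenation scheme, the dimensions, and the random initialization are all hypothetical and do not represent any particular captioning model.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(weights, bias, vec):
    """Compute weights @ vec + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def lstm_step(params, word_vec, aux_vec, h_prev, c_prev):
    """One LSTM step whose gate input is [word_vec; aux_vec; h_prev]."""
    z = word_vec + aux_vec + h_prev  # list concatenation
    i = [sigmoid(v) for v in affine(*params["input"], z)]
    f = [sigmoid(v) for v in affine(*params["forget"], z)]
    o = [sigmoid(v) for v in affine(*params["output"], z)]
    g = [math.tanh(v) for v in affine(*params["candidate"], z)]
    # Memory cell: gated past activity plus gated candidate activity.
    c = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c

def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases for the four gate blocks."""
    rng = random.Random(seed)
    def gate():
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(input_size)]
                   for _ in range(hidden_size)]
        return weights, [0.0] * hidden_size
    return {name: gate()
            for name in ("input", "forget", "output", "candidate")}

word_dim, aux_dim, hidden_dim = 3, 2, 4
params = init_params(word_dim + aux_dim + hidden_dim, hidden_dim)
h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
word = [0.2, -0.1, 0.4]  # hypothetical embedding of the last emitted word
image = [0.7, 0.3]       # hypothetical image feature (auxiliary input)
h, c = lstm_step(params, word, image, h, c)
```

In a captioning model, the resulting hidden state `h` would typically be passed to a classifier over the vocabulary to predict the next caption word; that step is omitted here.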
Other image captioning models use external semantic information extracted from the image as an auxiliary input to each LSTM gate. Yet other text summarization and question answering models exist in which a textual encoding of a document or a question produced by a first LSTM is provided as an auxiliary input to a second LSTM.
The auxiliary input carries auxiliary information, which can be visual or textual. It can be generated externally by another LSTM, or derived externally from a hidden state of another LSTM. It can also be provided by an external source such as a CNN, a multilayer perceptron, an attention network, or another LSTM. The auxiliary information can be fed to the LSTM just once at the initial timestep or fed successively at each timestep.
However, feeding uncontrolled auxiliary information to the LSTM can yield inferior results because the LSTM can exploit noise from the auxiliary information and overfit more easily. To address this problem, we introduce an additional control gate into the LSTM that gates and guides the use of auxiliary information for next output generation.
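By way of illustration only, gating of auxiliary information can be sketched as an element-wise scaling of the auxiliary vector by values between 0 and 1. The following is a minimal sketch under stated assumptions and is not the gate formulation of the disclosed technology: the function name `gate_auxiliary`, the choice of a sigmoid gate, the control vector, and the toy weights are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_auxiliary(gate_weights, gate_bias, control_vec, aux_vec):
    """Scale each auxiliary feature by a gate value in (0, 1).

    control_vec: what the gate conditions on (hypothetically, the current
    input concatenated with the previous hidden state).
    aux_vec: auxiliary information, one value per gate unit.
    """
    gate = [sigmoid(sum(w * v for w, v in zip(row, control_vec)) + b)
            for row, b in zip(gate_weights, gate_bias)]
    gated = [g * a for g, a in zip(gate, aux_vec)]
    return gate, gated

# Toy values chosen so the first gate unit opens and the second closes.
control = [1.0, -1.0]
aux = [0.8, -0.4]
weights = [[4.0, 0.0], [0.0, 4.0]]
bias = [0.0, 0.0]
gate, gated = gate_auxiliary(weights, bias, control, aux)
```

In this toy example the first auxiliary feature passes through almost unchanged while the second is suppressed, which illustrates how a gate can limit the influence of auxiliary information that is not useful for the next output.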
Therefore, an opportunity arises to extend the LSTM architecture to include an auxiliary sentinel gate that determines the importance that should be given to auxiliary information stored in the LSTM for next output generation.