Field
Certain aspects of the present disclosure generally relate to machine learning and, more particularly, to improving systems and methods of considering the spatial relationship of an input when processing a multi-dimension hidden state and a multi-dimension attention map.
Background
An artificial neural network, which may comprise an interconnected group of artificial neurons (e.g., neuron models), may be a computational device or may represent a method to be performed by a computational device.
A convolutional neural network (CNN) refers to a type of feed-forward artificial neural network. Convolutional neural networks may include collections of neurons, each neuron having a receptive field and also collectively tiling an input space. Convolutional neural networks may be used for pattern recognition and/or input classification.
Recurrent neural networks (RNNs) refer to a class of neural networks that include cyclical connections between nodes or units of the network. The cyclical connections create an internal state that may serve as a memory, which enables recurrent neural networks to model dynamical systems. That is, cyclical connections offer recurrent neural networks the ability to encode memory. Thus, if successfully trained, recurrent neural networks may be used for sequence learning applications.
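By way of illustration only, the manner in which a cyclical connection may produce an internal state can be sketched as follows. The sketch assumes a single scalar recurrent unit with a tanh nonlinearity; the function names `rnn_step` and `run_rnn` and the weight values are hypothetical and are not taken from the present disclosure.

```python
import math


def rnn_step(x, h_prev, w_x=1.0, w_h=0.5, b=0.0):
    """One step of a scalar recurrent unit.

    The term w_h * h_prev is the cyclical connection: the previous state
    is fed back into the unit. All weights here are illustrative assumptions.
    """
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(inputs, h0=0.0):
    """Apply rnn_step over a sequence and return the state after each step."""
    states = []
    h = h0
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states


# A single non-zero input followed by zeros: the later states remain
# non-zero because the first input persists through the feedback term.
states = run_rnn([1.0, 0.0, 0.0])
```

In this sketch, the states at the second and third steps depend only on the fed-back state, which may be understood as the memory encoded by the cyclical connection.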
A recurrent neural network may be used to implement a long short-term memory (LSTM). For example, the long short-term memory may be implemented in a microcircuit including multiple units to store values in memory using gating functions and multipliers. A long short-term memory may hold a value in memory for an arbitrary length of time. As such, long short-term memory may be useful for learning, classification systems (e.g., handwriting and speech recognition systems), and/or other applications.
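As a non-limiting sketch, the gating functions and multipliers of a long short-term memory unit may be written in the commonly used form with input, forget, and output gates. The scalar formulation, the function name `lstm_step`, the parameter layout, and the numeric values below are assumptions made for illustration and do not describe any particular microcircuit.

```python
import math


def sigmoid(z):
    """Logistic gating function, with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM unit.

    p maps a gate name to a (w_x, w_h, b) tuple (hypothetical layout).
    Returns the new hidden state h and the new cell (memory) value c.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))        # how much new content to write
    f = sigmoid(pre("forget"))       # how much stored content to keep
    o = sigmoid(pre("output"))       # how much of the memory to expose
    g = math.tanh(pre("candidate"))  # candidate content to write

    c = f * c_prev + i * g           # multipliers applied to the memory
    h = o * math.tanh(c)
    return h, c


# Illustrative parameters: forget gate near 1 and input gate near 0,
# so the stored value is held regardless of the inputs.
hold_params = {
    "input": (0.0, 0.0, -20.0),
    "forget": (0.0, 0.0, 20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 0.7
for x in [0.3, -0.9, 0.5, 0.1]:
    h, c = lstm_step(x, h, c, hold_params)
```

With these assumed parameters, the cell value remains approximately 0.7 across all four steps, which illustrates how the gates may allow a value to be held in memory over time.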
In conventional systems, a recurrent network, such as a recurrent neural network, is used to model sequential data. Gated recurrent neural networks, such as long short-term memory networks, may mitigate the vanishing gradient problem that arises when training over long sequences. Consequently, recurrent neural networks may improve the modeling of the temporal structure of sequential data, such as videos.
Still, in conventional recurrent neural networks (e.g., standard RNNs), all input dimensions are treated equally. That is, every dimension contributes equally to the internal state of the recurrent neural network unit. For sequential temporal data, such as videos, some dimensions are more important than others. An important area may refer to an area with action, an object, or an event. Moreover, the dimensions that are most important may differ from one time step to another. For example, in a video with action, the locations with the action may be specified to have a greater weight and an increased contribution to the internal state of the recurrent neural network in comparison to locations without action. Therefore, conventional systems have proposed an attention recurrent neural network model that predicts an attention saliency vector for the input, which weighs different dimensions according to their importance. Although an attention recurrent neural network weighs different dimensions according to their importance, there is also a need to consider the spatial dimensions of sequential data.
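For illustration, the weighting of input dimensions by an attention saliency vector may be sketched as follows. The sketch assumes that per-dimension importance scores are already available and that a softmax normalizes them into a saliency vector; the names `softmax` and `apply_attention` and the numeric values are hypothetical, and the manner in which a given model predicts the scores is not shown.

```python
import math


def softmax(scores):
    """Normalize scores into non-negative weights that sum to one."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_attention(x, scores):
    """Weigh each input dimension by its attention value.

    Returns the saliency vector and the element-wise weighted input.
    """
    a = softmax(scores)
    weighted = [a_k * x_k for a_k, x_k in zip(a, x)]
    return a, weighted


# Assumed example: the second dimension (e.g., a location with action)
# receives a higher score and therefore contributes more.
x = [0.2, 0.9, 0.4]
scores = [0.0, 2.0, 0.0]
attention, weighted = apply_attention(x, scores)
```

Note that this sketch treats the input as a flat vector, so the saliency vector carries no information about which dimensions are spatial neighbors of one another.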