Image recognition techniques are often used to locate, identify, and/or verify one or more subjects appearing in an image or in a video. Some image recognition techniques involve extracting a set of landmarks or features from an image, and comparing the extracted set of landmarks or features with corresponding features extracted from one or more other images in order to identify or verify the image. For example, in face recognition, one or more traits may be extracted from an image of a face, such as the position, size, and/or shape of the eyes, nose, cheekbones, etc. in the face, and these extracted traits may be compared with corresponding traits extracted from one or more other images to verify or to identify the face.
As compared to subject recognition based on a single image such as a photograph, video recognition typically involves analyzing more information that may be available for the subject in multiple frames of a video. For example, a face in a video may appear in various poses and illumination conditions across different frames of the video. In some video subject recognition systems, information across multiple frames of a video is integrated into a visual representation of a subject in the video, and the visual representation is then analyzed to verify or identify the subject in the video. For example, a face in a video may be represented by sets of features extracted from respective frames of the video. Such a visual representation may comprehensively maintain information across multiple frames of the video. However, subject recognition in such systems is generally computationally intensive because multiple pairs of frames of respective videos must be compared, and multiple matching results must be analyzed. Thus, for example, a comparison of two videos each having n frames has a computational complexity of O(n²), which is not desirable in many situations. Moreover, maintaining respective sets of features extracted from multiple frames of a video generally requires high degrees of storage and indexing complexity as well.
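The quadratic cost of frame-by-frame comparison can be illustrated with a minimal sketch. It assumes each frame has already been reduced to a numeric feature vector, and it uses cosine similarity and a plain mean over all frame pairs as stand-ins for whatever matching and fusion steps a real system would use. The function names and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_video_similarity(frames_a, frames_b):
    """Compare every frame of one video with every frame of the other.

    Returns (mean similarity, number of frame-pair comparisons).
    For two videos of n frames each, the count is n * n.
    """
    if not frames_a or not frames_b:
        raise ValueError("both videos need at least one frame")
    scores = [cosine_similarity(fa, fb) for fa in frames_a for fb in frames_b]
    return sum(scores) / len(scores), len(scores)


# Hypothetical per-frame feature vectors for two three-frame videos.
video_a = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
video_b = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

mean_score, comparisons = pairwise_video_similarity(video_a, video_b)
print(comparisons)  # 3 frames x 3 frames = 9 comparisons
```

Doubling the number of frames in both videos quadruples the number of comparisons, and every per-frame feature vector must be kept in storage for the comparison to be possible.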
To reduce computational and storage complexity, some systems aggregate information corresponding to multiple frames of a video, such as respective sets of features extracted from the multiple frames of the video, to generate an aggregated representation of the video, and perform recognition analysis based on the aggregated representation of the video. Various pooling techniques have been employed to aggregate respective sets of features extracted from multiple frames of a video. For example, average pooling or max pooling has been used to combine multiple sets of features extracted from frames of a video. As another example, a more general feature encoding scheme, such as Fisher Vector coding, has also been employed. Such aggregation techniques, however, result in a less accurate representation of the subject in the video, and lead to less accurate or incorrect identification and/or verification of the subject in the video.
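Average pooling and max pooling can be sketched as follows, assuming that each frame contributes one feature vector and that all vectors have the same length. The helper names and the sample values are hypothetical, and Fisher Vector coding is not shown.

```python
def average_pool(frame_features):
    """Element-wise mean of the per-frame feature vectors."""
    n = len(frame_features)
    return [sum(column) / n for column in zip(*frame_features)]


def max_pool(frame_features):
    """Element-wise maximum of the per-frame feature vectors."""
    return [max(column) for column in zip(*frame_features)]


# Hypothetical features for a two-frame video, three features per frame.
frames = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]

avg_vector = average_pool(frames)  # [2.0, 2.0, 1.0]
max_vector = max_pool(frames)      # [3.0, 4.0, 2.0]
print(avg_vector, max_vector)
```

Either pooled vector has a fixed size regardless of the number of frames, so two videos can be compared with a single vector comparison. The pooled vector no longer records which frame contributed which value, and every frame is treated alike regardless of pose or illumination.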