The present invention relates to image processing, and more specifically, to object recognition in a video stream. Artificial neural networks, such as Convolutional Neural Networks (CNNs), are often used when performing image classification. A common setup for CNNs trained for image classification is to have the last layer of the CNN contain as many outputs as there are classes. The output of this layer is referred to herein as a classification vector.
Typically, a multinomial logistic regression function, such as a Softmax function, is applied to the classification vector, normalizing the values so that they sum to one, thereby creating a probability distribution where each probability (confidence) corresponds to a certain class. The i-th confidence value is then interpreted as the probability that the image depicts an object of class i.
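By way of illustration only, the normalization described above may be sketched as follows. The function name `softmax`, the variable names, and the three example scores are hypothetical and chosen for this sketch; subtracting the maximum score before exponentiation is a common numerical safeguard and is an assumption of the sketch rather than something the description requires.

```python
import math


def softmax(classification_vector):
    """Map raw class scores to confidences that sum to one."""
    # Subtracting the largest score keeps exp() from overflowing and does
    # not change the result, since Softmax is invariant to a constant shift.
    peak = max(classification_vector)
    exps = [math.exp(value - peak) for value in classification_vector]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical classification vector for a three-class classifier.
scores = [2.0, 1.0, 0.1]
confidences = softmax(scores)

# The class with the highest confidence is taken as the prediction.
predicted_class = max(range(len(confidences)), key=confidences.__getitem__)
```

In this sketch, `confidences[i]` plays the role of the i-th confidence value, and `predicted_class` is the index of the class with the largest confidence.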
To increase the accuracy of CNN classifiers, it is common to feed several crops of an object from the same image to the classifier and to use the mean output as the final confidence vector for the object. While such methods may be suitable for individual images, the improvement in classification accuracy is limited. Thus, there is a need to improve classification accuracy, and more precisely to provide techniques that improve classification accuracy in a video stream in computationally efficient ways.
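The multi-crop averaging mentioned above may, purely as an example, be sketched as follows. The sketch assumes that the classifier has already produced one classification vector per crop and that the mean is taken over the Softmax outputs; the helper names `softmax` and `mean_confidence` and the example scores are hypothetical.

```python
import math


def softmax(classification_vector):
    """Map raw class scores to confidences that sum to one."""
    peak = max(classification_vector)
    exps = [math.exp(value - peak) for value in classification_vector]
    total = sum(exps)
    return [e / total for e in exps]


def mean_confidence(crop_vectors):
    """Average the per-crop confidence vectors element by element."""
    per_crop = [softmax(vector) for vector in crop_vectors]
    count = len(per_crop)
    # zip(*per_crop) groups the confidences of each class across all crops.
    return [sum(class_values) / count for class_values in zip(*per_crop)]


# Hypothetical classification vectors for three crops of the same object.
crop_vectors = [
    [2.0, 1.0, 0.1],
    [1.5, 1.2, 0.3],
    [0.2, 1.9, 0.4],
]
final_confidence = mean_confidence(crop_vectors)
```

Because each per-crop vector is a probability distribution, their element-wise mean also sums to one and can be read as a confidence vector in the same way.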