Visual question answering (VQA) is a benchmark for testing context-specific reasoning about complex images. One aspect of VQA is answering counting questions (also known as “How Many” questions), which requires identifying the distinct scene elements or objects that meet the criteria embodied in the question and counting those objects.
Accordingly, it would be advantageous to have systems and methods for counting the objects in images that satisfy specified criteria.
In the figures, elements having the same designations have the same or similar functions.