Despite advances in computer vision, image processing, and machine learning, recognizing visual objects remains a task where computers fail in comparison with the capabilities of humans. Recognizing an object from an image not only requires recognizing the object in a scene but also recognizing objects in various positions, in different settings, and with slight variations. For example, to recognize a chair, the innate properties that make a chair a chair must be understood. This is a simple task for a human. Computers struggle to deal with the vast variety of types of chairs and the situations in which a chair may be present. Models capable of performing visual object recognition must be trained to provide explanations for visual datasets in order to recognize objects present in those visual datasets. Unfortunately, most methods for training such models either fall short in performance and/or require large training sets.
This issue is not confined solely to visual object recognition, but more generally applies to pattern recognition, which may be used in speech recognition, natural language processing, and other fields. Thus, there is a need in the artificial intelligence field to create new and useful systems and methods for teaching compositionality to convolutional neural networks.