Perceptual computing and image understanding seek to analyze a scene captured by one or more cameras, identify objects in the scene, and then understand the actions and interactions of the objects. These technologies are being used in robotics, virtual reality, augmented reality, and for many other applications. Inexpensive, small image sensors have allowed for devices and machines to be outfitted with more cameras in more places and for depth cameras to be deployed to determine distances.
Using all of these cameras, computer vision systems seek to reason about natural images. Given an input image, a computer vision system performs a function that outputs some high-level information about the image. For example the system may distinguish objects and backgrounds, identify particular objects in the image, locate the objects in the scene, etc. Convolutional neural networks (CNN) form the basis for a number of successful systems that solve problems like image classification and object localization in images. A CNN is a complex function that works from a space of images to an output that is specific to the task. For example a function from an image to the type of scene that is seen in the image, such as a kitchen, living room, outdoor scene, etc.
Many techniques for perceptual computing and image understanding use a convolutional neural network (CNN) to find patterns, identify objects, track objects as they move, and for other purposes. CNNs for two-dimensional images have been in use for years. When using a three-dimensional image, that is an image that includes distances from the observer or distances between objects, the CNN techniques become more complex.
One approach to a 3D CNN is to analyze renderings of an object from multiple views. In this case, at runtime the object is rendered from multiple views and then a CNN is applied to all of these views. This method requires far more images to be analyzed and still may not be able to combine all of the necessary information that is required in order to make a useful inference about the object. The inferences are limited because the multiple views see the same part of an object but from different points of view and therefore the network must learn to recognize and understand that it is the same object and how to relate the different views.
Another approach is to represent the 3D object in a regular 3D volumetric grid and apply standard 3D convolution and pooling. These techniques are identical to 2D convolution and pooling, but with the additional dimension of depth. This solution uses significant memory and computational resources. For a typical object represented in a grid of N3 cells, there are only N2 non-empty cells.