In many environments, there is a need to identify regions of interest within an image. For example, in an airport, an image generated from a scan of a bag may need to be analyzed to determine whether the bag contains any prohibited objects (i.e., regions of interest). As another example, in a medical environment, an image generated from a scan of a patient may need to be analyzed to determine whether the patient has a tumor. The scanning technology may be computed tomography (“CT”), and the images may be three-dimensional (“3D”) images.
Convolutional neural networks (“CNNs”) are a type of neural network that has been developed specifically to process images. A CNN may be used to input an entire image and output a classification of the image. For example, a CNN can be used to automatically determine whether a scan of a patient indicates the presence of a tumor. A CNN has multiple layers such as a convolution layer, a rectified linear unit (“ReLU”) layer, a pooling layer, a fully connected (“FC”) layer, and so on. Some more complex CNNs may have multiple convolution layers, ReLU layers, pooling layers, and FC layers.
A convolution layer may include multiple filters (also referred to as kernels or activation functions). A filter inputs a convolution window of an image, applies weights to each pixel of the convolution window, and outputs an activation value for that convolution window. For example, if the image is 256 by 256 pixels, the convolution window may be 8 by 8 pixels. The filter may apply a different weight to each of the 64 pixels in a convolution window to generate the activation value also referred to as a feature value. The convolution layer may include, for each filter, a node (also referred to a neuron) for each pixel of the image assuming a stride of one with appropriate padding. Each node outputs a feature value based on a set of weights for the filter that are learned during a training phase for that node. Continuing with the example, the convolution layer may have 65,536 nodes (256*256) for each filter. The feature values generated by the nodes for a filter may be considered to form a convolution feature map with a height and width of 256. If an assumption is made that the feature value calculated for a convolution window at one location to identify a feature or characteristic (e.g., edge) would be useful to identify that feature at a different location, then all the nodes for a filter can share the same set of weights. With the sharing of weights, both the training time and the storage requirements can be significantly reduced. If each pixel of an image is represented by multiple colors, then the convolution layer may include another dimension to represent each separate color. Also, if the image is a 3D image, the convolution layer may include yet another dimension for each image within the 3D image. In such a case, a filter may input a 3D convolution window.
The ReLU layer may have a node for each node of the convolution layer that generates a feature value. The generated feature values form a ReLU feature map. The ReLU layer applies a filter to each feature value of a convolution feature map to generate feature values for a ReLU feature map. For example, a filter such as max(0, activation value) may be used to ensure that the feature values of the ReLU feature map are not negative.
The pooling layer may be used to reduce the size of the ReLU feature map by downsampling the ReLU feature map to form a pooling feature map. The pooling layer includes a pooling function that inputs a group of feature values of the ReLU feature map and outputs a feature value. For example, the pooling function may generate a feature value that is an average of groups of 2 by 2 feature values of the ReLU feature map. Continuing with the example above, the pooling layer would have 128 by 128 pooling feature map for each filter.
The FC layer includes some number of nodes that are each connected to every feature value of the pooling feature maps. For example, if an image is to be classified as being a cat, dog, bird, mouse, or ferret, then the FC layer may include five nodes whose feature values provide scores indicating the likelihood that an image contains one of the animals. Each node has a filter with its own set of weights that are adapted to the type of the animal that the filter is to detect.
Although CNNs are effective at classifying images, the training of a CNN can be computationally very expensive. For example, a high-resolution 3D image may contain 134,217,728 pixels (i.e., 5123 pixels). If the convolution layer includes eight filters, then it will include 1,073,741,824 nodes During training, each image of the training data will need to be input to the CNN multiple times and convolution windows processed for each image as the weights for the convolution layer and the FC layer are learned. Because the number of degrees of freedom and the number of images are typically very large, it would be impracticable to train such a CNN. CNNs may also have some difficulty when identifying both large and small objects within an image. For example, if a convolution window is made large to help detect large tumors, it may be difficult to also detect small tumors. It would be desirable to reduce the computational expense of training a CNN. It would also be desirable to have a CNN that could effectively and efficiently classify both large and small objects within an image.