Convolutional neural networks (CNNs) are powerful models that can be configured to be well suited for computer vision problems. CNNs typically perform best when they are large (i.e., more complex), meaning that they have more, deeper and highly-interconnected layers. A primary drawback of these CNNs is computational cost. Thus, large CNNs are typically impractically slow. In many applications, a large CNN demands more computation than is currently available on a serial computer.
Complex CNNs can, therefore, be implemented by parallelizing the network across multiple processors. For example, for image processing or classification tasks, a CNN could be implemented on several graphics processing units (GPUs).
There have been various proposals to increase the practicality of CNNs by means of parallelizing the CNN across several processors. Such approaches partition the network into parallel subnetworks in such a way as to minimize communication cost.
A first approach naively partitions the network into parallel subnetworks and communicates the state of every subnetwork's layer to all other subnetworks. This approach may be applied to both CNNs and fully connected networks.
In this approach, the network is partitioned into some number of parallel subnetworks in some manner. The subnetworks communicate their activations to all other subnetworks at every layer, which results in a parallel implementation of the feedforward neural network.
In certain implementations, however, this approach is inefficient with CNNs. Its efficiency is best for fully connected weight matrices because the amount of computation required by such matrices causes the communication-to-computation ratio to be small.
In contrast, CNN weight matrices are much sparser, so their communication-to-computation ratio is much larger. As a result, when applying this approach to CNNs, a large fraction of time is spent on communication, which makes the parallelization less useful.
A second approach partitions the network into slices that communicate with their neighbors, and is commonly only applied to convolutional or to locally connected networks. However, current implementations of this approach typically handle pooling inefficiently. Pooling is a technique for making the network's activations more invariant to small translations. While pooling increases the accuracy of the CNN, it changes the dimensions of the activation tensor in such a way that typically allows for less parallelism and requires increased communication for the second approach.
For example, one particular implementation of the second approach parallelizes CNNs into slices that communicate only with their neighbours. The approach partitions the input tensor (of size N×N×u) into m subtensors of size (N/m)×N×u and allocates a computing node to each of the m subtensors. This is efficient only when N is large and u is small, because a large N allows m, and thus the number of computing nodes, to be large, and a small u allows the neighbouring slices to not communicate much. However, when pooling is used, N is necessarily small and u is necessarily large. Since m cannot exceed N, a small N restricts the number of computing nodes which limits the attainable acceleration, while a large u requires more communication between neighbouring slices which increases the communicaiton cost.
It is an object of the following to obviate or mitigate at least one of the foregoing issues.