Deep learning (DL) is a branch of machine learning and artificial neural network based on a set of algorithms that attempt to model high level abstractions in data by using a deep graph with multiple processing layers. A typical DL architecture can include many layers of neurons and millions of parameters. These parameters can be trained from large amount of data on fast GPU-equipped computers, guided by novel training techniques that can work with many layers, such as rectified linear units (ReLU), dropout, data augmentation, and stochastic gradient descent (SGD).
Among the existing DL architectures, convolutional neural network (CNN) is one of the most popular DL architectures. Although the idea behind CNN has been known for more than 20 years, the true power of CNN has only been recognized after the recent development of the deep learning theory. To date, CNN has achieved numerous successes in many artificial intelligence and machine learning applications, such as image classification, image caption generation, visual question answering, and automatic driving cars.
However, the complexities of existing CNN systems remain quite high. For example, one of the best-known large-scale CNNs, AlexNet includes 60 millions of parameters and requires over 729 million FLOPs (FLoating-point Operations Per Second) to classify a single image, and the VGG network from Oxford University includes 19 layers and 144 millions of parameters, and the number of FLOPS involved to classify a single image is 19.6 billion. Unfortunately, implementing such high-complexity networks would often require significant amount of expensive hardware resources, such as Nvidia™ GPU cores and large-scale FPGAs. For example, the Nvidia™ TK1 chips used in some CNN systems cost at least US$80/chip, while the more powerful Nvidia™ TX1 chips cost at least US$120/chip. The cost issue can be particularly troublesome for embedded systems which often are subject to cost constraints. Moreover, these chips also require significantly more power to operate than the power constraints of traditional embedded platforms, thereby making them unsuitable for many applications where basic deep learning functionalities are required, but at the same time certain constraints on cost and power consumption should also be met.
Consequently, the existing CNN architectures and systems are not cost-effective for many embedded system applications.