The domain of Machine Learning (ML) has progressed by leaps and bounds in the last decade. Researchers are especially interested in applying the concepts of ML to solve the problem of object recognition. Many of the proposed machine-learning solutions are inspired by the complex neural processing capability of the human brain. A Convolutional Neural Network (CNN) (also referred to as a CNN algorithm, described hereinafter in more detail with reference to FIG. 17) is an example of such a system, which has exhibited human-like accuracy in relation to object recognition. CNNs are typically implemented using layers of interconnected processing neurons (also referred to as Processing Units or PUs or accelerators). Given the aforementioned high accuracy, CNNs have been used in some cutting-edge applications such as video surveillance, autonomous driving/navigation and large-scale image search engines. It is anticipated that CNN algorithms will be part of various embedded system products such as digital single-lens reflex (DSLR) cameras, mobile phones and other hand-held products.
CNNs emulate the human neural system by processing input image data through layers of strategically connected processing neurons. The layers use pre-calculated coefficients to transform the input data, thus extracting very specific features from the image. The number of coefficients and the amount of intermediate data (i.e. data produced at the end of each layer) can be huge, thus making the execution of CNN algorithms both computationally and memory intensive. Exacerbating this issue is the fact that, in order to improve the accuracy of CNNs even further, researchers have proposed using deep learning algorithms that use even higher numbers of processing layers.
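By way of illustration only, the following sketch estimates the number of coefficients, the amount of intermediate data and the number of multiply-accumulate operations for a single convolutional layer. The `ConvLayer` type, its field names and the example dimensions are hypothetical and are not taken from this specification; full connectivity between input and output feature maps and the absence of bias terms are assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    """Hypothetical description of one convolutional layer (illustrative only)."""
    in_maps: int     # number of input feature maps
    out_maps: int    # number of output feature maps
    kernel: int      # kernels are kernel x kernel coefficients
    out_height: int  # height of each output feature map
    out_width: int   # width of each output feature map


def coefficient_count(layer: ConvLayer) -> int:
    # Assumption: one kernel per (input map, output map) pair, no bias terms.
    return layer.in_maps * layer.out_maps * layer.kernel * layer.kernel


def intermediate_values(layer: ConvLayer) -> int:
    # Values produced at the end of the layer (the output feature maps).
    return layer.out_maps * layer.out_height * layer.out_width


def multiply_accumulates(layer: ConvLayer) -> int:
    # Every output value needs in_maps * kernel * kernel multiply-accumulates.
    per_output = layer.in_maps * layer.kernel * layer.kernel
    return per_output * intermediate_values(layer)


# Made-up example dimensions, chosen only to show the orders of magnitude.
example_layer = ConvLayer(in_maps=64, out_maps=128, kernel=3,
                          out_height=56, out_width=56)
print(coefficient_count(example_layer))
print(intermediate_values(example_layer))
print(multiply_accumulates(example_layer))
```

Even for this single modest layer the operation count runs to hundreds of millions, which is consistent with the observation above that CNN execution is both computationally and memory intensive.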
Research studies have shown that general-purpose computing machines are not efficient for implementing CNN algorithms. Graphics Processing Units (GPUs) are a strong candidate for implementing CNN algorithms because GPUs, which are suitable for parallel computation, are well adapted to exploit the high level of data parallelism in the CNN algorithms. However, GPUs are not suitable for integration in low-power, low-cost embedded systems. Therefore, researchers have proposed various application-specific accelerators for use as neurons (i.e. PUs) when implementing CNN algorithms, proposing both Field Programmable Gate Array (FPGA) and Application Specific Integrated Circuit (ASIC) based multi-accelerator implementations.
FIG. 17 depicts an example 1700 of how CNN algorithms may be used in the applications referred to above, in order to introduce terminology used in the present description. In the example 1700 it is desired to process an image 1702 in order to extract a number of features using a CNN algorithm 1703.
The CNN algorithm (also referred to simply as a CNN) is made up of a number of layers 1704, 1705, . . . , 1706 of feature maps 1716. Feature maps in each layer are connected, as depicted by an arrow 1717, to feature maps of a subsequent layer. The number of connections in a particular CNN algorithm depends on the behaviour of the CNN algorithm in question. For example, in one CNN algorithm all the feature maps in a layer will be connected to all the feature maps of a subsequent layer. In a different CNN algorithm the top half of the feature maps in a layer will be connected to all the top half feature maps of a subsequent layer and the bottom half of the feature maps in a layer will be connected to all the bottom half feature maps of a subsequent layer. The CNN algorithm 1703 has N layers, the last (i.e. Nth) layer of which produces the desired outputs 1707.
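The two connection patterns described above may be sketched as boolean connection matrices, where entry [i][j] indicates whether feature map i of one layer is connected to feature map j of the subsequent layer. The function names are hypothetical and the sketch is illustrative only.

```python
def fully_connected(n_in: int, n_out: int) -> list:
    """Every feature map of a layer feeds every feature map of the next layer."""
    return [[True] * n_out for _ in range(n_in)]


def half_split(n_in: int, n_out: int) -> list:
    """Top half feeds only the top half, bottom half feeds only the bottom half.

    Assumption: "top half" means indices below n // 2 in each layer.
    """
    return [
        [(i < n_in // 2) == (j < n_out // 2) for j in range(n_out)]
        for i in range(n_in)
    ]


def count_connections(matrix: list) -> int:
    # True counts as 1 when summed.
    return sum(sum(row) for row in matrix)


print(count_connections(fully_connected(4, 6)))  # 24 connections
print(count_connections(half_split(4, 6)))       # 12 connections
```

In this small example the split pattern halves the number of connections, and therefore the number of kernels to be stored and applied, relative to the fully connected pattern.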
A process 1701 (also referred to as a CNN process) comprises a sequence of process steps which, when embodied on a multi-accelerator System on a Chip (SoC) device or platform 1714 for example, execute the processing operation represented by the CNN 1703 to produce the outputs 1707 from the input 1702. In order to embody the process 1701 on the SoC 1714 it is necessary to generate, as depicted by an arrow 1719, based upon the process 1701 and applicable memory operations based on the memory architecture of the SoC platform 1714, a set 1708 of scheduling schemes each of which is mapped to (i.e. is identified as being suitable for or even optimal for use in executing) a respective layer of the SoC 1714. Thus, for example, in FIG. 17 the scheduling scheme 1722 is mapped, as depicted by a dashed arrow 1723, to the processing unit (PU) 1711 of the SoC, indicating that the PU 1711 executes the scheduling scheme 1722.
Accordingly, a scheduling scheme such as 1722 sends its set of operations to the available PUs of the SoC 1714 to process data in parallel and produce output feature maps such as those of the layer 1705. Neighbouring layers of the CNN algorithm are processed together. That is, one layer of the CNN algorithm (such as 1704) is received as an input and processed by the PUs (such as 1711, 1712, . . . , 1721) of the SoC 1714, which then produce feature maps of the next layer of the CNN algorithm as output (such as 1705). The produced layer (such as 1705) is then used as an input to generate feature maps of the subsequent layer (such as 1706) of the CNN algorithm using the available set of PUs in the SoC 1714.
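The layer-by-layer flow described above may be sketched as follows. This is a sequential software model only: the round-robin distribution of output feature maps over the PUs, the function names and the toy layer function are all assumptions made for illustration, and in hardware the PUs would operate concurrently.

```python
def assign_maps_to_pus(n_output_maps: int, n_pus: int) -> list:
    """Distribute output feature maps over PUs (round-robin is an assumption)."""
    assignment = [[] for _ in range(n_pus)]
    for m in range(n_output_maps):
        assignment[m % n_pus].append(m)
    return assignment


def run_cnn(input_maps: list, layers: list, n_pus: int) -> list:
    """Process the layers in order; each produced layer feeds the next one.

    `layers` is a list of (layer_fn, n_output_maps) pairs, where
    layer_fn(current_maps, m) computes output feature map m.
    """
    current = input_maps
    for layer_fn, n_out in layers:
        produced = [None] * n_out
        # Each inner list is the work of one PU; modelled sequentially here.
        for pu_maps in assign_maps_to_pus(n_out, n_pus):
            for m in pu_maps:
                produced[m] = layer_fn(current, m)
        current = produced  # output of this layer is input to the next
    return current


def toy_layer(maps: list, m: int) -> int:
    # Stand-in for a real convolution: sum of the inputs plus the map index.
    return sum(maps) + m


outputs = run_cnn([1, 2], [(toy_layer, 3), (toy_layer, 2)], n_pus=2)
print(outputs)
```

The number of PUs is independent of the number of layers in this model, since the same set of PUs is reused for every layer.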
The SoC 1714 is made up of a number of processing units (PUs) such as 1711, 1712, . . . , 1713 and 1721. The PUs in the SoC can be connected in any fashion or not connected at all (in the example platform 1714 depicted, the PUs are connected with a forward link to the subsequent PUs). In general, there is no correspondence between the number of layers in the CNN 1703 and the number of PUs in the SoC 1714. Furthermore, in general there is no correspondence between the interconnections 1717 in the CNN 1703 and the interconnections 1718 in the SoC 1714. Each PU such as 1721 may have an associated local memory module 1720. The SoC may also have an associated shared memory module 1710 whose storage capacity is shared by the PUs in the SoC. In one example, local memory modules such as 1720 may constitute distributed shared memory. The SoC may also have an external memory module 1709 whose storage capacity is accessible by the PUs in the SoC.
As with all embedded systems, multi-accelerator designers are challenged to maximise the performance of these accelerators, while adhering to area, power and other design constraints. The high volume of data and the large number of computational steps involved in executing a CNN algorithm make the task of mapping the process (such as 1701) associated with the CNN (such as 1703) onto a multi-accelerator based System-on-Chip (SoC) such as 1714 even more difficult. There are numerous CNN algorithms such as 1703, and there are a number of ways in which the process such as 1701 associated with a CNN algorithm such as 1703 can be mapped to accelerator hardware such as 1714.
Scheduling schemes such as 1708, also referred to in this specification as memory schedules or merely as schedules, each of which includes a sequence of operations for executing a particular layer such as 1704 of the CNN algorithm on an associated PU (or associated set of PUs) such as 1712 of the SoC, are created 1719 based upon the CNN algorithm 1703 for execution on the multi-accelerator SoC 1714. The operations embodied in the scheduling schemes such as 1722 can be computation operations and/or memory operations. For example, “convolution” is a computation operation and “read from DRAM into SRAM” is a memory operation. The terms “accelerator” and “Processing Unit (PU)” will be used interchangeably in this specification. PUs are also known as “Processing Elements (PEs)” in the industry. A unique combination of computation and communication sequences for executing a layer forms a scheduling scheme.
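One possible way to represent a scheduling scheme in software is as an ordered sequence of computation and memory operations, as sketched below. The type names, the scheme names and the particular operation sequences are hypothetical and are given only to show that two schemes built from the same operations, in a different order, are distinct schemes.

```python
from dataclasses import dataclass
from enum import Enum


class OpKind(Enum):
    COMPUTE = "compute"
    MEMORY = "memory"


@dataclass(frozen=True)
class Operation:
    kind: OpKind
    name: str


@dataclass(frozen=True)
class SchedulingScheme:
    """An ordered sequence of operations for executing one CNN layer."""
    name: str
    operations: tuple

    def count(self, kind: OpKind) -> int:
        return sum(1 for op in self.operations if op.kind is kind)


# Illustrative operations; only "convolution" and "read from DRAM into SRAM"
# are named in the text, the rest are assumptions.
read_input = Operation(OpKind.MEMORY, "read input from DRAM into SRAM")
read_coeffs = Operation(OpKind.MEMORY, "read coefficients from DRAM into SRAM")
convolve = Operation(OpKind.COMPUTE, "convolution")
write_output = Operation(OpKind.MEMORY, "write output from SRAM to DRAM")

scheme_a = SchedulingScheme(
    "scheme_a", (read_input, read_coeffs, convolve, write_output))
scheme_b = SchedulingScheme(
    "scheme_b", (read_coeffs, read_input, convolve, write_output))

print(scheme_a.count(OpKind.MEMORY), scheme_a.count(OpKind.COMPUTE))
print(scheme_a.operations != scheme_b.operations)  # order distinguishes schemes
```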
One prior-art approach for implementing a CNN algorithm such as 1703 on an SoC such as 1714 is to select a particular scheduling scheme such as 1722 for the entire CNN algorithm 1703 using design space exploration to determine the appropriate scheduling scheme. The same scheduling scheme is then applied to the PUs 1711, 1712, . . . , 1713 of the SoC. Since different layers 1704, 1705, . . . , 1706 in a CNN algorithm are different in terms of sizes and parameters, choosing one particular scheduling scheme such as 1722 may be suboptimal.
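The reason a single scheme may be suboptimal can be illustrated with a made-up cost table, in which each entry is a hypothetical design cost (for example, a count of external memory accesses) of executing a given layer with a given scheme. The numbers, scheme names and function names below are invented for illustration. The sketch also assumes that per-layer costs are independent of one another, which need not hold on a real SoC.

```python
# Hypothetical cost[layer][scheme]; all values are made up for illustration.
cost = [
    {"s1": 10, "s2": 14, "s3": 20},  # first layer
    {"s1": 18, "s2": 9, "s3": 12},   # second layer
    {"s1": 25, "s2": 22, "s3": 11},  # third layer
]


def best_uniform(cost_table: list) -> tuple:
    """Pick the one scheme that minimises total cost when used for every layer."""
    totals = {
        scheme: sum(layer[scheme] for layer in cost_table)
        for scheme in cost_table[0]
    }
    best = min(totals, key=totals.get)
    return best, totals[best]


def best_per_layer(cost_table: list) -> tuple:
    """Pick the cheapest scheme for each layer independently (an assumption)."""
    choice = [min(layer, key=layer.get) for layer in cost_table]
    total = sum(layer[s] for layer, s in zip(cost_table, choice))
    return choice, total


print(best_uniform(cost))    # ('s3', 43)
print(best_per_layer(cost))  # (['s1', 's2', 's3'], 30)
```

With these invented numbers, the best single scheme costs 43 whereas choosing a scheme per layer costs 30, because no one scheme is cheapest for all three layers.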
Another prior-art approach is to exhaustively explore and simulate all possible scheduling schemes 1708 against all the layers of the CNN algorithm 1703. This approach is time consuming and is typically not feasible within a reasonable time if the CNN algorithm is large.
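The growth of the exhaustive search space can be sketched as follows. The sketch assumes the worst case in which the cost of a scheme on one layer depends on the schemes chosen for other layers, so that every complete assignment of schemes to layers must be simulated; if the layers could be evaluated independently, only (number of schemes) x (number of layers) simulations would be needed. The function names and the example counts are hypothetical.

```python
from itertools import product


def independent_simulations(n_schemes: int, n_layers: int) -> int:
    # Each (scheme, layer) pair simulated once, assuming layers are independent.
    return n_schemes * n_layers


def joint_assignments(n_schemes: int, n_layers: int) -> int:
    # Every layer may take any scheme, so the choices multiply.
    return n_schemes ** n_layers


def enumerate_assignments(schemes: list, n_layers: int):
    """Lazily yield every assignment of one scheme per layer."""
    return product(schemes, repeat=n_layers)


print(independent_simulations(4, 20))  # 80
print(joint_assignments(4, 20))        # 1099511627776
print(len(list(enumerate_assignments(["a", "b"], 3))))  # 8
```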
In one known method, accelerators such as 1721 are customised for each layer such as 1704 of the CNN algorithm 1703 based on the level of unrolling and pipelining necessary to match the computation and communication demand of the CNN layer 1704. Since complex hardware structures are required to configure the accelerators such as 1721, uniform unroll factors are generally preferred for all the layers of the CNN algorithm.
In another known method, loop bounds for processing the CNN algorithm are determined based on the size of the given buffers such as 1720 in the SoC 1714, to reduce accesses to the external memory 1709. The utilised scheduling schemes have parameterisable loop bounds for each layer of the CNN algorithm, but have the same sequence of operations.
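A minimal sketch of deriving a loop bound from a buffer size is given below. It is not the known method itself: it assumes a simple row-wise tiling, stride 1 without padding, that all coefficients of the layer are held in the buffer together with one input tile and one output tile, and one byte per value by default. All names and dimensions are hypothetical.

```python
import math


def max_tile_rows(buffer_bytes: int, out_height: int, out_width: int,
                  in_maps: int, out_maps: int, kernel: int,
                  bytes_per_value: int = 1) -> int:
    """Largest number of output rows per tile that fits in the local buffer.

    Returns 0 if not even a single output row fits.
    """
    coeff_bytes = in_maps * out_maps * kernel * kernel * bytes_per_value
    in_cols = out_width + kernel - 1  # stride 1, no padding (assumption)
    best = 0
    for rows in range(1, out_height + 1):
        in_rows = rows + kernel - 1
        input_bytes = in_maps * in_rows * in_cols * bytes_per_value
        output_bytes = out_maps * rows * out_width * bytes_per_value
        if coeff_bytes + input_bytes + output_bytes > buffer_bytes:
            break
        best = rows
    return best


def tile_count(out_height: int, tile_rows: int) -> int:
    # Outer loop bound: number of row tiles needed to cover the output.
    return math.ceil(out_height / tile_rows)


rows = max_tile_rows(buffer_bytes=16384, out_height=32, out_width=32,
                     in_maps=8, out_maps=16, kernel=3)
print(rows, tile_count(32, rows))
```

A larger buffer permits larger tiles and hence fewer outer-loop iterations, each of which would otherwise require further transfers from the external memory.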
In another known method, optimal buffer sizes for buffers such as 1709 are determined for each layer such as 1704, 1705, . . . , 1706 of the CNN algorithm using the same scheduling scheme for all the layers 1704, 1705, . . . , 1706. Selecting the same schedule such as 1722 for all the layers in the CNN algorithm is suboptimal with respect to reducing design costs such as external memory accesses (the term “DRAM accesses” will be used interchangeably with the term “external memory accesses”), execution time of the CNN algorithm and the size of the local memory such as 1720 (the term “SRAM size” will be used interchangeably with the term “local memory size”).
Finding the best scheduling schemes 1708 for the entire CNN algorithm 1703 in a feasible time frame can have a significant impact on the overall exploration time, which greatly impacts the design efficiency and time to market of the embedded system.