Machine Learning (ML) has progressed by leaps and bounds in the last decade. Researchers are especially interested in applying the concepts of ML to solve the problem of object recognition. Many of the proposed machine-learning solutions are inspired by the complex neural processing capability of the human brain. A Convolutional Neural Network (CNN) (also referred to as a CNN process), described hereinafter in more detail with reference to FIG. 13, is an example of such a system which has exhibited human-like accuracy in relation to object recognition. CNNs are typically depicted in the form of interconnected layers of feature maps (e.g. 1304 and 1305 in FIG. 13) and can be implemented using interconnected Processing Units, i.e. PUs (also referred to as Processing Elements, i.e. PEs, or “accelerators”), which can, for example, be fabricated on a System on Chip (SoC) such as 1314 (also referred to as a CNN accelerator SoC) in FIG. 13. Given the aforementioned high accuracy, CNNs have been used in some cutting-edge applications such as video surveillance, autonomous driving/navigation and large-scale image search engines. It is anticipated that CNN processes will be part of various embedded system products such as digital single-lens reflex (DSLR) cameras, mobile phones and other hand-held products.
CNNs emulate the human neural system by processing input image data through interconnected layers. The layers use pre-determined coefficients to transform the input data, thus extracting specific features from the image. The number of coefficients and the amount of intermediate data (i.e. data produced at the output of each layer) can be very large, thus making the execution of CNN processes both computationally and memory intensive. Exacerbating this issue is the fact that in order to improve the accuracy of CNNs even further, researchers have proposed using deep learning algorithms that use even higher numbers of layers.
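To give a feel for the quantities involved, the following is a minimal sketch that counts the coefficients and the intermediate output values of a single convolution layer. The function name, the stride-1/no-padding assumption, the one-bias-per-output-map assumption and the example layer dimensions are illustrative assumptions and are not taken from this description.

```python
def conv_layer_footprint(in_maps, out_maps, kernel, in_h, in_w):
    """Return (coefficients, output_values) for one convolution layer.

    Assumptions (illustrative only): every input feature map is connected
    to every output feature map, the kernel is square, stride is 1, there
    is no padding, and each output map has one bias coefficient.
    """
    out_h = in_h - kernel + 1
    out_w = in_w - kernel + 1
    coefficients = out_maps * (in_maps * kernel * kernel + 1)
    output_values = out_maps * out_h * out_w
    return coefficients, output_values


# Hypothetical first layer: 3 input maps of 224x224, 64 output maps, 3x3 kernels.
coeffs, outputs = conv_layer_footprint(in_maps=3, out_maps=64, kernel=3,
                                       in_h=224, in_w=224)
print(coeffs, outputs)  # 1792 3154176
```

Even for this small hypothetical layer the intermediate data runs to millions of values, which is why adding further layers increases both the computational and the memory load.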
Research studies have shown that general purpose computing machines are not efficient for implementing CNN processes. Graphics Processing Units (GPUs) are a strong candidate for implementing CNN processes because GPUs, which are suitable for parallel computation, are well adapted to exploit the high level of data parallelism typically present in CNN processes. However, GPUs are not suitable for integration in low-power, low-cost embedded systems. Therefore, researchers have proposed various application-specific accelerators for use as PUs when implementing CNN processes, proposing both Field Programmable Gate Array (FPGA) and Application Specific Integrated Circuit (ASIC) based multi-accelerator implementations.
However, as with all embedded systems, designers are challenged to maximise the performance of these accelerators, while adhering to area, power and other design constraints. Typical SoCs designed with CNN accelerators contain a number of processing units (PUs). The SoC may also have an associated on-chip shared memory module whose storage capacity is shared by the PUs in the SoC. Because of the large volume of data involved in CNN processing, the SoC is also typically interfaced with external memory such as DRAM. The cost (both in terms of energy and execution time) of accessing an external memory is much higher than that of accessing an on-chip shared memory. Therefore, it is often required to maximise the use of on-chip shared memory while minimising the accesses made to an external memory.
CNN Process
FIG. 13 depicts an example 1300 of how CNN processes may be used in the applications referred to above, in order to introduce terminology used in the present description. In the example 1300 it is desired to process an image 1302 in order to extract a number of features using a CNN process 1303.
The CNN process 1303 (also referred to simply as a CNN) is made up of a number of layers 1304, 1305, . . . , 1306 of feature maps (FMs) such as 1316. Feature maps in each layer are interconnected, as depicted by an arrow 1317, to feature maps of a subsequent layer (the number of connections depends on the specific CNN process). For example, in one CNN process, all the feature maps in a layer are connected to all the feature maps of a subsequent layer. In a different CNN process, however, the top half of the feature maps in a layer are connected to all of the top-half feature maps of a subsequent layer, and the bottom half of the feature maps in the layer are connected to all of the bottom-half feature maps of the subsequent layer. The CNN process 1303 has N layers, the last (i.e. Nth) of which produces the desired outputs 1307.
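The two connection patterns described above can be sketched as lists of (input map, output map) index pairs. The function names and the index convention (lower indices form the "top half") are assumptions made purely for illustration.

```python
def full_connections(n_in, n_out):
    """Every feature map in a layer connects to every map in the next layer."""
    return [(i, o) for i in range(n_in) for o in range(n_out)]


def split_half_connections(n_in, n_out):
    """Top half connects only to top half, bottom half only to bottom half.

    Assumption: the "top half" is taken to be the lower-indexed half.
    """
    half_in, half_out = n_in // 2, n_out // 2
    top = [(i, o) for i in range(half_in) for o in range(half_out)]
    bottom = [(i, o) for i in range(half_in, n_in)
              for o in range(half_out, n_out)]
    return top + bottom


print(len(full_connections(4, 6)))        # 24
print(len(split_half_connections(4, 6)))  # 12
```

For layers with an even number of maps, the split pattern halves the number of connections, which illustrates why the number of connections depends on the specific CNN process.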
A CNN implementation 1301 (also referred to as a CNN procedure) comprises a sequence of process steps which, when embodied on (i.e. programmed onto) a multi-accelerator SoC device or platform such as 1314 for example, executes the processing operation represented by the CNN 1303 in order to produce the outputs 1307 from the input 1302. In order to embody the CNN implementation 1301 on the SoC 1314 it is necessary to generate, as depicted by an arrow 1319, based upon the CNN implementation 1301 and applicable memory operations based on the memory architecture of the SoC platform 1314, a set 1308 of predetermined scheduling schemes (also known as schedulers or schedules), each of which is mapped to (i.e. is identified as being suitable for or even optimal for use in executing) a respective PU of the SoC 1314. Thus, for example, in FIG. 13 the scheduling scheme 1322 is mapped, as depicted by a dashed arrow 1323, to the processing unit (PU) 1311 of the SoC, indicating that the PU 1311 of the SoC executes the scheduling scheme 1322 as indicated by the mapping 1323.
Accordingly, a scheduling scheme such as 1322 maps its set of operations to an available PU such as 1311 which processes input data (such as the feature maps in the layer 1304) in parallel and produces output feature maps (such as the feature maps in the layer 1305). Neighbouring layers of the CNN process (such as 1304, 1305) are, in one example, processed sequentially. That is, one layer of the CNN process (such as 1304) is received as an input, processed by the PUs (such as 1311, 1312, 1313, . . . , 1321) of the SoC 1314 in accordance with the appropriate scheduling schemes 1308, which will then produce feature maps of the next layer of CNN process as output (such as 1305). The produced layer (such as 1305) is then used as an input to generate feature maps of the subsequent layer (such as 1306) of the CNN process using the available set of PUs in the SoC 1314.
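The layer-by-layer flow described above can be sketched as follows. This is a sequential simulation under stated assumptions: output feature maps are assigned to PUs round-robin, feature maps are stood in for by plain numbers, and the names `process_cnn` and `toy_map` are hypothetical. A real SoC would execute the per-PU work in parallel according to its scheduling schemes.

```python
def process_cnn(image_maps, layers, num_pus):
    """Process layers one after another; each layer's output feeds the next.

    layers: list of (n_out, compute_map) pairs, where compute_map(o, inputs)
    returns output feature map o. The round-robin assignment of output maps
    to PUs is an illustrative assumption only.
    """
    current = image_maps
    assignment_log = []
    for n_out, compute_map in layers:
        assignment = {pu: list(range(pu, n_out, num_pus))
                      for pu in range(num_pus)}
        produced = [None] * n_out
        for pu, indices in assignment.items():  # parallel on real hardware
            for o in indices:
                produced[o] = compute_map(o, current)
        assignment_log.append(assignment)
        current = produced  # the produced layer becomes the next input
    return current, assignment_log


def toy_map(o, inputs):
    # Stand-in for a convolution: combines all input maps into one value.
    return sum(inputs) + o


outputs, log = process_cnn([1, 2], [(3, toy_map), (2, toy_map)], num_pus=2)
print(outputs)  # [12, 13]
print(log[0])   # {0: [0, 2], 1: [1]}
```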
The SoC 1314 is made up of a number of processing units (PUs) such as 1311, 1312, . . . , 1313 and 1321. PUs in the SoC can be connected in any fashion or not connected at all (an example platform is depicted in 1314 where the PUs are connected with a forward link to the subsequent PUs). In general, there need be no correspondence between the number of layers in the CNN 1303 and the number of PUs in the SoC 1314. Furthermore, in general there need be no correspondence between the interconnections 1317 in the CNN 1303 and the interconnections such as 1318 in the SoC 1314. The CNN 1303 in FIG. 13 has N layers, the last (i.e. Nth) layer of which produces the desired outputs 1307.
Each PU such as 1321 may have an associated local (i.e. on-chip) memory module 1320 (also commonly referred to as on-chip memory or SRAM). The SoC 1314 may also have an associated on-chip shared memory module 1310 whose storage capacity is shared by the PUs in the SoC. In one embodiment, local on-chip memory modules such as 1320 may constitute distributed shared memory (SM) where the PUs 1311, 1312, 1313, 1321 may share the memory available in memory module 1320 of PU 1321. The SoC may also have an external memory module (also commonly referred to as DRAM or DDR memory) 1309 whose storage capacity is accessible by the PUs in the SoC. For the purposes of this description, the term ‘on-chip memory’ refers to local on-chip memory 1315, 1320 and shared on-chip memory 1310, but does not refer to external memory modules such as 1309. The terms ‘on-chip’ and ‘local’ may be used interchangeably.
The cost (both in terms of energy and execution time) of accessing the external memory module 1309 is much higher than the cost of accessing the on-chip memory module 1310. Therefore, it is often required to maximise the use of an on-chip memory module such as 1310, 1315, 1320, while minimising the accesses made to an external memory module such as 1309. In a SoC such as 1314, the use of shared on-chip memory modules such as 1310 is specified by the programmer. For example, while processing the CNN layer 1305, the module 1310 can be used to store (a) some input data generated from the layer 1304 (which is input to the layer 1305), or (b) output data generated from layer 1305, or both. The allocation of on-chip memory modules such as 1310 to input and output data is an important design decision as it impacts both execution time and energy consumption of the SoC executing the CNN application.
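Why the split between input and output data matters can be illustrated with a toy cost model. Everything in this sketch is an assumption made for illustration: data that does not fit in its on-chip allocation is taken to spill to external memory, spilled input is re-read `input_reads` times, spilled output is written once, and all sizes are in arbitrary words. It is not the cost model of any particular SoC.

```python
def external_accesses(input_size, output_size, capacity, in_alloc,
                      input_reads=1):
    """Toy count of external-memory accesses for one layer.

    in_alloc words of the shared on-chip memory hold input data; the
    remaining (capacity - in_alloc) words hold output data.
    """
    if not 0 <= in_alloc <= capacity:
        raise ValueError("in_alloc must lie within the on-chip capacity")
    out_alloc = capacity - in_alloc
    spilled_in = max(0, input_size - in_alloc)
    spilled_out = max(0, output_size - out_alloc)
    return spilled_in * input_reads + spilled_out


def best_split(input_size, output_size, capacity, input_reads=1):
    """Exhaustively pick the input allocation with the fewest accesses."""
    return min(range(capacity + 1),
               key=lambda a: external_accesses(input_size, output_size,
                                               capacity, a, input_reads))


print(external_accesses(100, 80, 120, in_alloc=0, input_reads=3))    # 300
print(external_accesses(100, 80, 120, in_alloc=100, input_reads=3))  # 60
print(best_split(100, 80, 120, input_reads=3))                       # 100
```

Under these assumed numbers, giving the whole input a place on chip cuts external accesses by a factor of five compared with giving it none, which is the kind of sensitivity that makes the allocation an important design decision.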
As with all embedded systems, multi-accelerator designers are challenged to maximise the performance of these accelerators such as 1311, while adhering to area, power and other design constraints of the SoC 1314. The high volume of data and the large number of computational steps involved in executing a CNN implementation such as 1301 make the task of mapping the CNN implementation (such as 1301) associated with the CNN process (such as 1303) into a multi-accelerator based SoC such as 1314 even more difficult. There are numerous CNN processes such as 1303, and there are a number of ways that the CNN implementation such as 1301 associated with the CNN processes such as 1303 can be mapped to accelerator hardware such as 1314. Furthermore, the optimal allocation of an on-chip memory module such as 1310 adds another dimension to the design problem.
Scheduling schemes such as 1308, each of which specifies a sequence of computational and memory operations for executing a particular layer such as 1304 of the CNN process on an associated PU (or associated set of PUs) such as 1312 of the SoC 1314, are created 1319 based upon the CNN process 1303, for execution on the multi-accelerator SoC 1314. The operations embodied in the scheduling schemes such as 1322 are typically computation operations and/or memory operations. For example, “convolution” is a computation operation and “read from external memory into on-chip memory” is a memory operation. The memory operations in the scheduling scheme such as 1322 depend on space allocation in the on-chip memory module such as 1310. A unique combination of computation and communication sequences for executing a layer such as 1304 forms a scheduling scheme.
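One way to picture a scheduling scheme is as an ordered sequence of operations, each tagged as either a computation or a memory operation. The class names below and the "write from on-chip memory to external memory" operation are hypothetical and serve only as a sketch; the other two operation names are the examples given above.

```python
from dataclasses import dataclass
from enum import Enum


class OpKind(Enum):
    COMPUTE = "compute"
    MEMORY = "memory"


@dataclass(frozen=True)
class Operation:
    kind: OpKind
    name: str


@dataclass(frozen=True)
class SchedulingScheme:
    label: str
    operations: tuple  # ordered sequence of Operation objects

    def count(self, kind):
        """Number of operations of the given kind in this scheme."""
        return sum(1 for op in self.operations if op.kind is kind)


load = Operation(OpKind.MEMORY, "read from external memory into on-chip memory")
conv = Operation(OpKind.COMPUTE, "convolution")
# Hypothetical write-back operation, assumed for illustration.
store = Operation(OpKind.MEMORY, "write from on-chip memory to external memory")

scheme_a = SchedulingScheme("A", (load, conv, store))
scheme_b = SchedulingScheme("B", (load, conv, conv, store))

print(scheme_a.count(OpKind.MEMORY))   # 2
print(scheme_b.count(OpKind.COMPUTE))  # 2
print(scheme_a == scheme_b)            # False
```

Because the operation sequence is part of the scheme's identity, two schemes built from the same operations in a different combination compare as different schemes.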
One known method for implementing a CNN process such as 1303 on a SoC such as 1314 is to select a particular scheduling scheme such as 1322 for the entire CNN process 1303 using design space exploration to determine the appropriate scheduling scheme. The same scheduling scheme is then applied to the PUs 1311, 1312, 1313, 1321 of the SoC. Since the different layers 1304, 1305, . . . , 1306 in a CNN process differ in size and parameters, choosing one particular scheduling scheme such as 1322 may be suboptimal.
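The suboptimality of a single scheme can be seen with a small, entirely hypothetical cost table in which `cost[scheme][layer]` is some design cost (for example execution time) of running that layer with that scheme. The scheme labels, the numbers and the function names are invented for illustration, and the sketch ignores any dependency between the schemes chosen for neighbouring layers.

```python
def best_single_scheme(cost):
    """Return (scheme, total) for the best scheme applied to all layers."""
    scheme = min(cost, key=lambda s: sum(cost[s]))
    return scheme, sum(cost[scheme])


def per_layer_best(cost):
    """Return (choices, total) when each layer picks its own cheapest scheme."""
    n_layers = len(next(iter(cost.values())))
    choices = [min(cost, key=lambda s: cost[s][layer])
               for layer in range(n_layers)]
    total = sum(cost[s][layer] for layer, s in enumerate(choices))
    return choices, total


# Hypothetical per-layer costs for three schemes over three layers.
cost = {
    "A": [10, 40, 30],
    "B": [25, 15, 35],
    "C": [30, 30, 12],
}

print(best_single_scheme(cost))  # ('C', 72)
print(per_layer_best(cost))      # (['A', 'B', 'C'], 37)
```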
In another known method, all possible scheduling schemes 1308 are exhaustively explored and simulated against all the layers of the CNN process 1303. This approach is time-consuming and is typically not feasible within a reasonable time if the CNN process is large.
In another known method, accelerators such as 1321 are customised for each layer such as 1304 of the CNN process 1303 based on the level of unrolling and pipelining necessary to match the computation and communication demand of the CNN layer 1304. Since complex hardware structures are required to configure the accelerators such as 1321, uniform unroll factors are generally preferred for all the layers of the CNN process. Furthermore, this method adds only enough memory to the system to act as a buffer between PUs such as 1311, 1312, 1313, 1321 and an external memory module such as 1309. A disadvantage of this method is that unique hardware designs must be made for the accelerators for each CNN layer rather than using more generalised PUs, resulting in a large design and testing cost and increased chip area devoted to the many customised accelerators.
In another known method, loop bounds for processing the CNN process are determined based on the size of the given buffers such as 1320 in the SoC 1314, to reduce accesses to the external memory 1309. The utilised scheduling schemes have parameterisable loop bounds for each layer of the CNN process, but have the same operations sequence.
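A minimal sketch of deriving a loop bound from a buffer size is given below. It assumes that a tile of output rows, together with the input rows needed to compute it, must fit in one buffer; that the convolution has stride 1 and no padding; and that sizes are counted in words. The function names and all numbers are illustrative assumptions rather than the method referred to above.

```python
def max_tile_rows(buffer_words, in_maps, out_maps, in_h, in_w, kernel):
    """Largest number of output rows whose input and output data fit.

    A tile of T output rows needs (T + kernel - 1) input rows from every
    input map plus T output rows for every output map.
    """
    out_h = in_h - kernel + 1
    out_w = in_w - kernel + 1
    per_row = in_w * in_maps + out_w * out_maps  # cost of one extra row
    halo = (kernel - 1) * in_w * in_maps         # extra input rows per tile
    rows = (buffer_words - halo) // per_row
    return max(0, min(rows, out_h))


def tile_loop_bounds(out_h, tile_rows):
    """(start, end) row bounds of each tile of the outer loop."""
    if tile_rows <= 0:
        raise ValueError("buffer too small for even one output row")
    return [(start, min(start + tile_rows, out_h))
            for start in range(0, out_h, tile_rows)]


rows = max_tile_rows(buffer_words=4096, in_maps=2, out_maps=4,
                     in_h=32, in_w=32, kernel=3)
print(rows)                        # 21
print(tile_loop_bounds(30, rows))  # [(0, 21), (21, 30)]
```

Only the bounds change with the buffer size and layer dimensions; the sequence of operations inside the loop stays the same, which matches the parameterisable-bounds, fixed-sequence character of the schemes described above.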
In another known method, optimal sizes for buffers such as 1309 are determined for each layer such as 1304, 1305, . . . 1306 of the CNN process using the same scheduling scheme for all the layers 1304, 1305, . . . 1306. Selecting the same schedule such as 1322 for all the layers in the CNN process is suboptimal with respect to reducing design costs such as external memory accesses, execution time of the CNN process and the size of the on-chip memory such as 1320.
In another known method, a scheduling scheme from a set of scheduling schemes such as 1308 is chosen for each CNN layer such as 1304, 1305, . . . 1306 on a per-layer basis such that external memory accesses are minimised. The selection of a scheduling scheme for a layer, such as 1305, is dependent on the scheduling scheme selected for a previous CNN layer, such as 1304. The dependency is caused by the storage locations (such as the module 1309 or 1310) of data generated from a previous layer such as 1304, as this method selects a scheduling scheme for a layer, such as 1305, that can efficiently use the output data. The aforementioned dependency limits the set of scheduling schemes that can be applied to a certain layer such as 1304, 1305, . . . 1306. However, in this method, it is assumed that each PU has adequate local memory, such that the intra-layer external memory accesses are small compared to the memory accesses required to transform the local memory maps between layers. In cost-constrained implementations, this assumption will typically not be realised.
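The effect of such an inter-layer dependency can be sketched with a small dynamic-programming search. In this hypothetical model each scheme is described only by where it reads its input and where it writes its output, a scheme is allowed for a layer only if it reads from the location the previous layer's scheme wrote to, and the per-layer costs are invented numbers standing in for external memory accesses. The search itself is a generic technique swapped in for illustration, not the selection procedure of the method referred to above.

```python
def select_schemes(schemes, layer_costs):
    """Cheapest compatible chain of schemes, one per layer.

    schemes: name -> (reads_from, writes_to)
    layer_costs: one dict per layer mapping scheme name -> cost
    Returns (total_cost, [scheme per layer]).
    """
    best = {s: (c, [s]) for s, c in layer_costs[0].items()}
    for costs in layer_costs[1:]:
        nxt = {}
        for s, c in costs.items():
            # Keep only predecessors whose output location matches our input.
            candidates = [(prev_cost + c, path + [s])
                          for p, (prev_cost, path) in best.items()
                          if schemes[p][1] == schemes[s][0]]
            if candidates:
                nxt[s] = min(candidates, key=lambda t: t[0])
        best = nxt
    if not best:
        raise ValueError("no compatible chain of scheduling schemes")
    return min(best.values(), key=lambda t: t[0])


schemes = {
    "A": ("external", "on_chip"),
    "B": ("on_chip", "external"),
    "C": ("on_chip", "on_chip"),
    "D": ("external", "external"),
}
layer_costs = [
    {"A": 5, "D": 3},
    {"B": 4, "C": 6, "D": 2},
    {"A": 3, "B": 1, "C": 2},
]

print(select_schemes(schemes, layer_costs))  # (8, ['D', 'D', 'A'])
```

Picking the cheapest scheme for each layer in isolation would give D, D, B at a cost of 6, but B reads from on-chip memory while the preceding D writes to external memory, so that chain is not allowed. This illustrates how the dependency limits the set of schemes applicable to a layer.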
In another known method, different loop transformations are applied to multi-loop computer code and some of the data arrays are assigned to an on-chip memory module, such as 1310, to reduce accesses made to external memory such as 1309. This method uses a heuristic algorithm to decide on the preference for different data arrays to be completely stored in an on-chip memory module such as 1310.
Finding the best scheduling schemes 1308 and best allocation of on-chip memory modules such as 1310 and external memory 1309 for the entire CNN process 1303 in a feasible time frame is a difficult problem, and can have a significant impact on the overall design exploration time, which greatly impacts the design efficiency and time to market of the embedded system.