Embedded systems are computer systems that are part of larger systems and are dedicated to executing specific functions, generally under real-time constraints. They are used for control as well as for data or signal processing. They are present in many application fields, among others telecommunications, automotive, industrial control, power conversion, military, avionics, aerospace, household appliances and consumer electronics. Examples of embedded systems are cell phone handsets and base stations, on-board radars, network routers, modems, software-defined radio communication terminals, engine controllers, satellite flight computers, GPS positioning terminals, set-top boxes and digital cameras. Embedded systems are generally highly constrained by their operating environment or reliability requirements. Engine control, oil drilling or military applications can have severe temperature requirements, while avionics and aerospace systems are exposed to radiation. Cell phone handsets are constrained by battery life and their base stations by cooling.
Embedded applications are part of products offering functionalities to end users. Those functionalities are generally defined by roadmaps that drive market demand by providing more features or increased throughput with each new product generation. This enhancement at each generation leads to more complex programs to be executed on embedded platforms, but also to a wide variety of programs to be supported on a single platform. New standards require higher data throughput, meaning more computation capability; they also rely on advanced algorithms to reach higher throughput requirements. An example is the evolution of telecommunication standards, from simple phase-shift keying waveforms for low data rate transmission to the far more complex multiple-input multiple-output orthogonal frequency-division multiplexing with adaptive channel capability for the highest throughput. This support for advanced applications and algorithms makes computation more complex. Indeed, in the simple image and signal processing applications of early standards, nearly all the computation load is executed in small kernels having few instructions but a very high iteration count and very simple control paths. Those simple kernel-oriented algorithms make it easy to exploit a high level of parallelism and are easy to implement in dedicated hardware accelerators. With new, more complex standards, the control part has become significant, leading to large portions of sequential code that are difficult to parallelize. Furthermore, complex control paths can even be present inside highly computational kernels, making them difficult to implement in dedicated hardware. Another major shift is toward software-defined applications, where standards are not fully defined by hardware implementations but are composed dynamically in software. The most advanced example is software-defined radio, which copes with the large number of telecommunication standards.
Software-defined radio aims to provide a standard software interface, called by the application, that allows services to be composed dynamically to realize custom functions.
In summary, supporting future embedded applications requires supporting more complex functionalities with higher computational throughput. It also requires a high degree of programmability to support advanced functions and sophisticated algorithms, up to fully software-defined applications, all this under real-time constraints.
Embedded platforms, which are dedicated to hosting embedded systems, are constrained by their environment. They are not limited by the computing capability of silicon chips, since one square centimeter of silicon can already contain a desktop multicore processor. Rather, embedded systems are severely constrained by their total power consumption. Indeed, most of them are battery powered, with limited battery capacity and little improvement at each new product generation. For systems that are not battery powered, the heat caused by power consumption leads to cooling issues that are difficult to handle in integrated environments. This is for example the case with cell phone base stations, which have to handle thousands of communications at the same time, requiring a very intensive computation load, while being integrated close to the antennas. In high-temperature environments, as for engine control in automotive applications, the cooling capability is further limited. Due to those issues, power consumption is the main constraint that future embedded computing platforms have to deal with.
The silicon technology used to implement embedded platforms also faces limitations. With technology shrink, the number of transistors doubles at every new technology node, about every 18 months to two years. The problem is that transistor power consumption does not scale down at the same pace as transistor size. This can easily be observed in high-end FPGA platforms, which offer twice the gate resources at each new generation with no substantial reduction in transistor power consumption, even when operating at the same frequency, causing an overall increase in component power consumption that dramatically limits their usage. This poor reduction in transistor power is even worse in deep sub-micron technology nodes below 65 nm. Beyond this node, one cannot count on technology scaling anymore to compensate for the power consumption increase due to platform enhancements. Moreover, deep sub-micron technology nodes raise further limitations on their usage as an easy gate count provider, as was the case during previous decades. Those limitations are process variations and leakage. Process variations are due to manufacturing hazards and lead to important variations in the electrical characteristics of transistors across a single component. At platform level, this forces a single wide synchronous design spanning an entire chip to operate at a very conservative, low frequency. Leakage increases transistor power consumption even when transistors are not used. It imposes the use of a high threshold voltage (Vth), especially in power-constrained embedded applications. The power supply voltage (Vdd) is also reduced as much as possible in order to reduce the dynamic power consumption, which is proportional to the square of Vdd. This reduction of Vdd while maintaining a high Vth strongly mitigates operating frequency increases at new technology nodes. Indeed, embedded processes have seen barely any frequency improvement since the 90 nm node.
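The quadratic dependence of dynamic power on Vdd can be made concrete with a small numeric sketch. The activity factor, capacitance and frequency values below are arbitrary, chosen only to illustrate the scaling, not measurements of any real process:

```python
# Dynamic power of a CMOS circuit: P_dyn = alpha * C * Vdd^2 * f
# alpha: switching activity factor, C: switched capacitance, f: clock frequency.
# All numbers below are illustrative only.

def dynamic_power(alpha, c_farads, vdd_volts, f_hertz):
    return alpha * c_farads * vdd_volts ** 2 * f_hertz

p_nominal = dynamic_power(0.1, 1e-9, 1.2, 500e6)   # Vdd = 1.2 V
p_scaled  = dynamic_power(0.1, 1e-9, 0.9, 500e6)   # Vdd = 0.9 V, same frequency

# Lowering Vdd from 1.2 V to 0.9 V cuts dynamic power by (0.9/1.2)^2 = 0.5625,
# i.e. about 44 % savings, without touching frequency or capacitance.
ratio = p_scaled / p_nominal
```

This is why Vdd reduction is so attractive despite the frequency penalty it implies when Vth stays high.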
Applications require higher computation throughput with a high level of programmability, while the technology still provides a higher transistor count but without significantly reducing transistor power consumption. This obviously does not match the embedded constraint of reducing total power consumption under a limited power budget. The impact of those conflicting constraints on future embedded processing platforms leads to the following requirements:
- High programmability, to support complex algorithms with complex control paths and software-defined applications
- A high level of parallelism, to support the high computation needs with a limited operating frequency
- High power efficiency, in terms of operations per watt, to support a high computation load within a limited power budget while using future technology nodes.
Existing Approaches
The main approach used today to fulfill embedded platform requirements is the heterogeneous multicore. Here, cores are processing resources that can be GPPs (general purpose processors), digital signal processors and dedicated accelerators. Multicore is used to increase overall execution parallelism, since the limited frequency does not allow complete applications to be supported by a single processor core, even with coprocessor support. Heterogeneity comes from the use of domain-specific accelerators to improve power efficiency. A platform is always built around a GPP surrounded by accelerators connected by a bus. Accelerators are mostly dedicated hardware implementations of fixed functions, or have limited configurability within a specific algorithm domain.
This approach raises four major issues that limit its use for future embedded computing platforms. The first is that there are many domains, and even many standards within a domain, leading to a very high dedicated accelerator count [REF]. Different accelerators can even be used within a single domain depending on throughput and real-time constraints. The second issue with heterogeneous multicores is that they are complex platforms designed for a precise set of applications. It is therefore difficult to efficiently port new applications onto existing platforms, especially for more advanced standards. This leads to frequent redesigns when the functionality set changes, as is the case for example with cell phone handset platforms. The third issue is silicon area, which increases with the accelerator count. Heterogeneous multicores have a poor silicon utilization, since few accelerators are actually used at the same time. The fourth and last issue arises when programming those heterogeneous platforms. Since they group heterogeneous components, they require costly manual intervention to partition applications over the available resources. Moreover, this partitioning is platform dependent and needs to be very accurate to take advantage of all the resource capabilities without incurring the prohibitive cost of executing a task on an inappropriate resource. As a consequence, when the platform changes, the partitioning needs to be redone, from the application level down to the assembly level. Platform-dependent partitioning therefore causes reusability issues and cost overhead.
Together with heterogeneous multicores, other low-power techniques are used. The most important one from an architectural point of view is island-based Vdd scaling. With this approach, a chip is partitioned into islands that can operate at different Vdd and speed, to further minimize power consumption. The Vdd is dynamically adjusted depending on the real-time constraints of each island. The variable speed of each island introduces latency issues in the inter-island communication network. In order to be latency tolerant, the different islands of the chip are connected through FIFO (first in, first out) communication links supporting mesochronous clock synchronization. The island-based approach is also foreseen as a main architectural solution to cope with process variations in large chips.
The current heterogeneous multicore approach is very difficult to sustain with the fast growth of standards, requirements and services. Even cell phone handset platforms, which are today implemented using heterogeneous multicore solutions, face those limitations, even though handsets benefit from very high production volumes that allow design costs to be amortized. Other markets not driven by very high volumes, as is the case in professional electronics applications, cannot take this approach due to prohibitive silicon, design and programming costs. For those reasons, substantial research effort is devoted to making accelerators more flexible, by improving their programmability while keeping them low power.
A mature solution for improving accelerator flexibility is the SIMD (single instruction, multiple data) vector processor architecture. It has been used for a while as a multimedia accelerator in desktop processors, and it is used today in embedded platform products for video and baseband processing. SIMD offers a good compromise between programmability and power efficiency. It allows a wide variety of standards to be implemented with a single accelerator. Nevertheless, it is very difficult to program, since algorithms need to be vectorized and compiler support is either experimental or limited to very specific language constructs. Furthermore, it does not support sequential code at all. For that reason, it is always used in association with a GPP or a VLIW, leading to a heterogeneous accelerator. It needs to be programmed manually and very accurately to obtain the expected performance. As applications become more complex, the difficulty of vectorization limits the performance and power efficiency of this solution.
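The constraint that makes vectorization hard can be sketched with a toy lane model (pure Python, not real SIMD intrinsics; lane width and function names are illustrative): one instruction acts on all lanes at once, so the loop body must be free of lane-dependent control flow.

```python
# Toy 4-lane SIMD execution model: one 'instruction' applies the same
# operation to every lane of a fixed-width vector simultaneously.

LANES = 4

def simd_add(a, b):
    """One vector instruction: element-wise add over LANES-wide operands."""
    assert len(a) == len(b) == LANES
    return [x + y for x, y in zip(a, b)]

def vector_sum(xs, ys):
    """Vectorized loop: slice long arrays into LANES-wide chunks.
    Assumes len(xs) is a multiple of LANES, for simplicity."""
    out = []
    for i in range(0, len(xs), LANES):
        out.extend(simd_add(xs[i:i + LANES], ys[i:i + LANES]))
    return out

result = vector_sum(list(range(8)), [10] * 8)
```

A loop with a data-dependent branch per element cannot be expressed as a single `simd_add`-style instruction, which is why sequential and control-heavy code falls back to the companion GPP or VLIW.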
A more advanced research approach to building power-efficient and programmable accelerators is coarse-grain reconfigurable processing arrays. They allow custom datapaths to be implemented that are configured at runtime. The configuration can even be rewritten at each clock cycle, depending on the configuration controller and the configuration memory capacity. They have limited branch support, achieved in the configuration controller or by using predicated execution. It is therefore not possible to execute sequential code or kernels with complex control paths on those accelerators. As with the SIMD architecture, the control and sequential parts of the code are executed on a GPP or a VLIW next to the accelerator, making the platform heterogeneous. These arrays have almost no compiler support, requiring tedious manual programming. They support only limited kernel sizes when using a single configuration, or they face an important power consumption overhead from continually loading the configuration memory. The power consumption of the configurable interconnection fabric is high due to the overhead of reconfigurability. Furthermore, the reconfigurable interconnection fabric introduces latencies due to wire length and high fanout, which cannot easily be pipelined. Their operating frequency is therefore not very high.
The last main solution to the issues of heterogeneous multicore platforms is the homogeneous multicore approach. It is the approach used in this work. Homogeneous multicores are made of an array of highly programmable processor cores, such as optimized scalar RISC, DSP or VLIW processors. The processors in the array are connected together through a dedicated programmable or reconfigurable communication network. They benefit from good compiler support, allowing applications to be ported with very little manual intervention compared with the other approaches. Their uniform ISA (instruction set architecture) does not require precise partitioning. They can execute sequential code and support kernels with complex control paths. Thanks to their scalable communication network, they can exploit a very high level of parallelism, reaching hundreds of cores for some platforms. Nevertheless, besides their very good programmability and scalability, they are limited in power efficiency due to their use of fully programmable processor cores. When homogeneous multicores are built from simple scalar RISC cores to be low power, they cannot exploit ILP in the sequential parts. Indeed, they have to use the communication network for inter-core communication, which is inefficient for ILP, limiting their performance when executing the sequential parts of applications. Their power efficiency is also lessened by the inter-core communication network overhead.
In conclusion, the homogeneous multicore approach solves many issues raised by heterogeneous multicores. The same computing fabric is used for all application domains. The platform design is simple and generic, and is reused for many applications. All the cores can be used whatever the application, leading to good silicon utilization. They are easily programmable and benefit from good compiler and tool support, requiring very little manual intervention. Regarding the processing platform constraints raised by future embedded systems, they have a very high programming capability, their scalability allows a very high level of parallelism to be exploited, and their relative simplicity makes it easy to bound their WCET (worst case execution time), which is needed to guarantee quality of service under real-time constraints. Nevertheless, their use of fully programmable processor cores and an inefficient communication network leads to a low power efficiency, which is a major drawback for their use in embedded platforms.
When solving the current heterogeneous multicore platform limitations with the homogeneous multicore approach, the problem to be addressed is to design a fully programmable processor core that can exploit ILP with a very high power efficiency and efficient inter-core communication support.
Related Work: Dataflow Architectures
Dataflow architectures have been studied and used for several decades [78, 79, 80, 81, 82]. The first known implementation as a complete processor core was achieved by Dennis in the early seventies [78]. Dataflow architectures are used to exploit fine-grain instruction-level parallelism and obtain a high level of parallelism during execution. Some architectures even use dataflow to automatically extract parallelism out of a sequential thread of code.
In the general dataflow model, each manipulated datum is decorated with a tag to form a token. In this way, data can be manipulated atomically and distinguished from each other without using a central controller. Instructions consume data tokens as operands and produce tokens as results. They are executed asynchronously on independent processing elements (PEs) following the dataflow firing rule: an instruction can be executed when all its operands are available. After being produced, data tokens are stored and wait until they are consumed by instructions.
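The firing rule can be sketched in a few lines of Python. This is a minimal illustrative model, not any particular machine: instructions fire purely on operand availability, with no central controller deciding the order.

```python
# Minimal sketch of the dataflow firing rule: an instruction fires as soon
# as all of its operand tokens are available. Names and structures are
# illustrative only.
import operator

def run_dataflow(instructions, initial_tokens):
    """instructions: list of (dst, op, src_a, src_b); tokens keyed by name."""
    tokens = dict(initial_tokens)      # produced tokens waiting to be consumed
    pending = list(instructions)
    while pending:
        fired = []
        for inst in pending:
            dst, op, a, b = inst
            if a in tokens and b in tokens:        # firing rule: operands ready
                tokens[dst] = op(tokens[a], tokens[b])
                fired.append(inst)
        if not fired:                              # nothing can fire: deadlock
            raise RuntimeError("no fireable instruction")
        pending = [i for i in pending if i not in fired]
    return tokens

# Compute (x + y) * (x - y) purely by data availability:
prog = [("s", operator.add, "x", "y"),
        ("d", operator.sub, "x", "y"),
        ("p", operator.mul, "s", "d")]
result = run_dataflow(prog, {"x": 7, "y": 3})["p"]
```

Note that the add and the subtract are independent and may fire in either order (or in parallel on real hardware); only the multiply is ordered after them, by its data dependencies.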
There are two main architectural models implementing the dataflow execution model. The first performs dynamic tag comparisons in content-addressable memories to match produced data with their consumer instructions. The second model uses explicit token communications in register tables; registers are associated with instructions and accessed by indexes. Most other dataflow architectures can be derived from those two models, which are detailed in the following.
The first dataflow architectural model is used in superscalar processors [83, 84, 85, 86], described in HPS [83] and by Weiss and Smith [85]. Superscalar uses Tomasulo scheduling [87], introduced in the 1960s in IBM computers. It was used to hide the execution latencies of floating-point units by executing, in parallel, selected instructions ahead in the sequential execution flow of the program while the current instruction is executed. Superscalar uses queues of instructions waiting to be executed on their dedicated unit. When a result is produced by a PE, the data, decorated with its tag, is broadcast to all instruction queues by means of the common data bus. If the tag matches an instruction operand tag in a queue, the data is copied into the operand slot. When an instruction has all its operands ready, it can be executed on its PE. The oldest instructions in the queues have a higher priority. A variant scheme uses a separate data register file and stores only the tags in the instruction queues in order to reduce their complexity. This latter approach requires an extra pipeline stage. It is used for example in the Alpha 21264 processor [88].
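The tag-matching mechanism on the common data bus can be sketched as follows. This is a simplified illustration in the spirit of Tomasulo scheduling, not a cycle-accurate model; class and tag names are invented for the example.

```python
# Sketch of tag matching on a common data bus: a produced value is broadcast
# with its tag to every waiting instruction; any operand slot holding that
# tag captures the value. Illustrative only.

class WaitingInstruction:
    def __init__(self, op, tag_a, tag_b):
        self.op = op
        self.operands = {tag_a: None, tag_b: None}   # tag -> captured value

    def snoop(self, tag, value):
        """Watch the common data bus; capture a matching broadcast."""
        if tag in self.operands and self.operands[tag] is None:
            self.operands[tag] = value

    def ready(self):
        return all(v is not None for v in self.operands.values())

queue = [WaitingInstruction(lambda a, b: a + b, "t1", "t2")]

# Two producers complete and broadcast (tag, value) on the bus:
for tag, value in [("t1", 5), ("t2", 8)]:
    for inst in queue:            # the broadcast reaches every queue entry
        inst.snoop(tag, value)

inst = queue[0]
result = inst.op(*inst.operands.values()) if inst.ready() else None
```

The key cost is visible in the nested loop: every broadcast must be presented to every entry of every queue, which in hardware means long, high-fanout wires.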
This dataflow model has two important particularities with respect to the dataflow model presented in this work. The first is that a produced token needs to be presented to all entries of all queues in order to guarantee that all potential consumers are matched. The common data bus is therefore made of long wires with a high fanout, leading to important power consumption. The second particularity is that once instructions are in the queues, branches can only be supported by using predicates that nullify instructions belonging to the wrong path. Moreover, using predicates has the disadvantage of loading the instructions of both paths following a conditional branch. Instructions in the queues are meant to be executed as part of a continuous sequential flow; branches are therefore weakly supported by this dataflow execution engine. In order to mitigate this issue, this model uses a branch predictor [89] together with a state recovery mechanism in case of misprediction [90]. The branch predictor provides a single-path instruction flow to the dataflow engine.
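Predicated execution, and its both-paths cost, can be sketched like this. The encoding below is invented for illustration; real ISAs encode predicates in instruction bits.

```python
# Sketch of predicated execution: instructions from both sides of a branch
# are issued, each guarded by a predicate; those on the wrong path are
# nullified rather than skipped. Illustrative only.

def execute_predicated(program, predicates, state):
    for pred_name, negate, dst, fn in program:
        p = predicates[pred_name]
        if negate:
            p = not p
        if p:                  # predicate true: instruction takes effect
            state[dst] = fn(state)
        # predicate false: instruction was still fetched and issued,
        # occupying a slot, but its result is nullified
    return state

# if (x > 0) y = x * 2; else y = -x;  -- both paths are in the stream
state = {"x": -4}
predicates = {"p": state["x"] > 0}
program = [("p", False, "y", lambda s: s["x"] * 2),   # then-path
           ("p", True,  "y", lambda s: -s["x"])]      # else-path (negated)
state = execute_predicated(program, predicates, state)
```

Both path instructions consume issue slots regardless of the branch outcome, which is exactly the overhead the text describes.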
The Wavescalar architecture [91, 92] uses this dataflow model with tags and instruction queues to exploit instruction-level parallelism. It proposes a hierarchical architecture where computation is distributed and executed directly in caches called WaveCaches. Unlike superscalar, the architecture does not execute sequential programs but uses a dedicated ISA to directly execute dataflow code. It loads blocks of dataflow instructions called Waves into queues associated with PEs. Branches are achieved by predicates, but also at Wave boundaries when loading new Waves. Wavescalar has the advantage of exploiting a very high level of parallelism, and the dedicated dataflow ISA strongly reduces its complexity compared to superscalar. Nevertheless, token broadcasting and tag comparisons still need to be done in all queues, which consumes much power and lessens its power efficiency.
The second main dataflow architectural model was published as scheduled dataflow [93] and in TRIPS [94]. It does not use instruction queues but operand registers reserved by instructions in tables that are addressed by indexes. Producer instructions write their results explicitly into the operand registers of consumer instructions. For this, two destination addresses are recorded in producer instructions, corresponding to the consumer register addresses in the tables. When a datum is used more than twice, a copy operation needs to be issued. Instructions having their operands ready can be selected for execution on their PE; a pipeline cycle is dedicated to instruction selection. TRIPS implements this model with a set of tables associated with their PEs, a set of register files and data memory banks. The instruction tables of all PEs are visible through an operand NoC and are part of a single address space. This distributed address space allows instructions to communicate with each other even if they belong to different PEs. The architecture has several memory ports with load/store queues and supports multithreading. It aims to support instruction-level, memory-level and thread-level parallelism, to be a polymorphous execution platform with a very high level of parallelism. A drawback is the NoC latency, which penalizes data communication between instructions belonging to different PEs. The use of separate operand tables, the NoC and separate register files reduces the power efficiency of this architecture.
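The explicit-token model can be contrasted with the tag-broadcast model in a short sketch: each producer writes its result directly to the slot addresses of its (at most two) consumers, so no broadcast or associative search is needed. Slot layout and instruction encoding below are invented for the example.

```python
# Sketch of the explicit-token model: instructions own operand slots in a
# table, and producers write results directly to consumer slot addresses.
# Illustrative only, not the actual TRIPS encoding.

# Operand table: slot index -> value (None until written).
table = {0: None, 1: None, 2: None, 3: None}

# Each instruction: (name, its own input slots, destination slot addresses).
instructions = [
    ("lit3", (),     (0, 2)),   # produce 3 for consumers at slots 0 and 2
    ("lit5", (),     (1, 3)),   # produce 5 for consumers at slots 1 and 3
    ("add",  (0, 1), ()),       # consumes slots 0 and 1
    ("mul",  (2, 3), ()),       # consumes slots 2 and 3
]

def produce(value, destinations):
    for slot in destinations:    # explicit writes, addressed by index
        table[slot] = value

produce(3, instructions[0][2])
produce(5, instructions[1][2])

add_result = table[0] + table[1]   # 'add' fires once slots 0 and 1 are full
mul_result = table[2] * table[3]   # 'mul' fires once slots 2 and 3 are full
```

With only two destination addresses per producer, a value needed by a third consumer would require issuing an explicit copy instruction, as noted above.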
The dataflow execution model presented in this work uses neither tags nor indexes. Instead of having instruction queues or tables associated with each PE, the dataflow engine has only one instruction slot per PE. Program execution is achieved by first selecting instructions from a sequential sub-thread and then executing them one by one in the dataflow engine. An instruction is taken from its local sequential flow and presented to the dataflow engine as the next to be executed on its PE. Since dynamic selection only follows a sequential flow, instruction scheduling is done statically in order to form local sequential sub-threads based on data dependencies and resource availability. Even if instructions are executed in dataflow, their execution order in a tile is determined at compile time. This removes the need for large tables or queues, making it possible to use only one instruction slot per PE. Nevertheless, the out-of-order execution capability of this model is very limited and cannot be used to automatically hide execution unit latencies without compiler support.
The dataflow engine itself is made of fixed point-to-point DFLs. Instructions select the dataflow links from which they take their operands and to which they send their results. The dataflow engine has minimal complexity for achieving communications between PEs: it needs neither queues with content-addressable memories nor indexed tables with operand networks. This reduces latencies to their minimum, corresponding to the wire latencies between PEs. Furthermore, it provides a higher power efficiency, since it reduces switching activity and wire capacitance. Instruction selection in sub-threads is like executing a local sequential flow, which supports branches. The flags used by conditional branches are provided directly by the dataflow engine to the local sub-thread controller, in the same way as data are communicated between PEs. The sequential controllers of the tiles belonging to a cluster are part of the same control domain, allowing a cluster to execute any control path as in classical RISC machines. This mitigates the branch handling issues encountered in the other dataflow models.
Homogeneous Architectures for Embedded Systems
Homogeneous architectures are good candidates to cope with the challenges raised by future embedded systems. In addition to their relative simplicity, numerous architectures have already been proposed. An early contribution in homogeneous parallel machines is the Transputer [95]. It was proposed in the early 1980s as a single processor chip with inter-core logic intended for building highly parallel multiprocessor systems. It uses serial point-to-point connections between neighboring processors in a mesh topology. Inter-core communications are achieved by issuing dedicated move instructions.
More recently, numerous homogeneous platforms have been proposed specifically for embedded systems requiring massively parallel execution of very computationally intensive workloads. Those platforms are called MPPAs, for massively parallel processor arrays. Their goal is to offer as much parallelism as possible on a single chip, reaching several hundreds of cores. They are made of simple RISC cores with their local memories, connected by a dedicated communication network. Almost all those architectures use FIFOs as communication primitives. FIFOs allow cores to communicate and synchronize execution with moderate overhead, without using a central controller, which would be impracticable in massively parallel platforms. The main contributions in MPPAs are presented here.
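The FIFO-based programming model can be sketched with two cores modeled as threads synchronizing through a bounded queue. This is purely an illustration of the model, assuming nothing about any particular MPPA:

```python
# Sketch of FIFO-based inter-core communication: two cores (modeled as
# threads) synchronize implicitly through a bounded FIFO, with no central
# controller. Illustrative of the programming model only.
import queue
import threading

link = queue.Queue(maxsize=4)      # bounded FIFO between producer and consumer

def producer_core():
    for i in range(8):
        link.put(i * i)            # blocks when the FIFO is full

results = []

def consumer_core():
    for _ in range(8):
        results.append(link.get()) # blocks when the FIFO is empty

t1 = threading.Thread(target=producer_core)
t2 = threading.Thread(target=consumer_core)
t1.start(); t2.start()
t1.join(); t2.join()
```

The blocking `put`/`get` pair is what provides synchronization for free: neither core needs to know the other's clock or progress, which is exactly why FIFOs tolerate multiple clock domains so well.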
The PicoArray processors from picoChip provide wide multicores with up to 300 cores on a single chip [96]. Each core is a 16-bit 3-issue VLIW. They are connected by a bus using time division multiplexing implemented with programmable switches. The weakly programmable processor array (WPPA) is a research platform [67]. It uses a reconfigurable network based on FIFOs. The AsAP architecture [97] uses the same reconfigurable communication model but has the particularity of locally adjusting voltage and frequency depending on the workload, in order to further reduce power consumption. It takes advantage of FIFO-based communications, which easily cope with the retiming issues raised by multiple clock domains.
Ambric proposes a multicore made of hundreds of cores connected by a reconfigurable dataflow network [98, 35, 66]. It uses communication links similar to small two-register FIFOs, called CHANNELS. The difference with FIFOs is that the links embed special control logic able to manage data production and consumption of connected CHANNELS without having to implement a separate control process, as would be needed with FIFOs. They also allow instructions to be streamed and duplicated over the data CHANNELS, controlling several distant cores from the same instruction source.
Finally, the Tilera processor [99], implementing the RAW architecture [37, 64, 100], is made of 64 3-issue VLIW cores. Inter-core communications are handled in two ways. The first is a network using programmable routers able to implement time division multiplexing. The second is achieved by FIFOs connecting neighboring cores; these FIFOs are accessed in the register file through register indexes.
The difference between those MPPA platforms and the proposed architecture is that even though they use FIFOs, these are connected between cores and are not used inside a parallel core to communicate data. The use of FIFOs between cores introduces latencies. FIFOs are connected between neighbors in a mesh topology, or require a network to be dynamically reconfigured. When the network is properly configured, the minimum latency between cores is one cycle, but in that case communications are fixed and programmability is lost. Those architectures cannot exploit fine-grain parallelism spanning several cores, while the proposed architecture can exploit fine-grain parallelism between tiles inside a cluster. Moreover, since those multicore architectures are made of separate cores, the cores all belong to different control domains. This means that a flag produced in one core by a comparison instruction cannot trigger conditional branch execution in other cores. The flag has to be communicated as a full data word through the reconfigurable network to register locations of other cores before it can be used for a branch. All this takes several cycles. Moreover, if several cores need to branch, copy instructions have to be issued. The proposed architecture allows a tile to branch on the very next cycle after a comparison instruction has been executed on any other tile of the cluster.
Due to those limitations, MPPAs are mainly used to execute streaming applications without complex control requirements. Indeed, an important change in the computation layout requires a costly reconfiguration process. They are therefore well suited for applications that can benefit from spatial computation on the array and whose application phases are executed over relatively long periods. Regarding power consumption, a high number of cores increases throughput but does not increase the overall power efficiency of the platform. The large MPPA components cited above consume around 10 watts on average, which is relatively high for highly constrained embedded systems.
Another main difference with MPPAs is that LiteFlow does not use local register files accessed by indexes in instructions. The reason is that reading such a register file to fetch operands introduces an extra pipeline stage. This increases the conditional branch delay penalty, reducing speedup. It is an important issue in widely parallel spatial computation, where the number of instructions in loops approaches one. Mitigating the branch delay by unrolling causes an important increase in kernel sizes, which limits the use of local instruction buffers. Those issues are detailed in the following chapter.
With a register file, values can always be read, which is not the case with dataflow, where a datum used by several consumers has to be duplicated in some way. Using a local register file causes instructions to manipulate data by indexes. It is therefore not suitable to mix local register file-based computation with dataflow. Indeed, with register indexes, it is not possible to choose whether or not to consume a value, as in LiteFlow. The result of a register-based ISA instruction is written to one, or sometimes two, registers, requiring data to be duplicated explicitly. In LiteFlow, dedicated bits are used to select destination DFLs. This allows data to be broadcast to all potential consumers without issuing duplication instructions.
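The destination-selection mechanism can be sketched with a bit mask, in the spirit of the DFL destination bits described above. The link count and names here are hypothetical, for illustration only:

```python
# Sketch of destination selection by bit mask: one bit per outgoing link
# lets a single result reach every selected consumer without issuing copy
# instructions. Link count and encoding are hypothetical.

NUM_LINKS = 4                      # outgoing point-to-point links of a tile

def broadcast(value, dest_mask, links):
    """Write 'value' to every link whose bit is set in dest_mask."""
    for i in range(NUM_LINKS):
        if dest_mask & (1 << i):
            links[i] = value
    return links

links = [None] * NUM_LINKS
# Send the result 42 to links 0 and 2 in a single operation (mask 0b0101):
links = broadcast(42, 0b0101, links)
```

Contrast this with a register-indexed ISA, where reaching two consumers beyond the one or two encodable destinations would require issuing explicit copy instructions.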
Dedicated Accelerator Architectures
Dedicated architectures have been proposed to increase the parallelism and power efficiency of computational tasks in embedded systems. The transport-triggered architecture (TTA), implemented in the MOVE framework, makes only data communications to computation resources explicit [101, 102]. Here, transporting data to the inputs of an execution unit triggers the execution of a particular operation. The architecture uses a communication bus connected to functional units by means of sockets. Long instructions explicitly specify the communications between sockets that trigger the parallel execution of operations.
Dedicated architectures for communications in multicore platforms have also been proposed. The CPA (co-processor array) model is made of processing elements connected to a programmable network by means of FIFOs [103]. Here, PEs can be processor cores or custom heterogeneous accelerators. An enhanced NoC has been proposed with Aethereal [104], which puts the emphasis on quality of service, a key issue in embedded hard real-time applications.
The ADRES platform is made of an accelerator strongly coupled with a VLIW [31]. The accelerator is a coarse-grain reconfigurable array (CGRA) introduced in the MORPHOSYS architecture [32]. The VLIW is used for the control-intensive part of applications, and kernels are accelerated in the CGRA. The CGRA is made of interconnected PEs, each having a local register file. It is completely synchronous and does not use dataflow. The entire CGRA is controlled synchronously by a configuration vector loaded at each cycle from the configuration memory. Branches are achieved by using predicates.
Finally, two main contributions in coarse-grain reconfigurable platforms are Montium [34] and PACT-XPP [33]. They target very high power efficiency for computationally intensive embedded platforms. Montium is made of simple PEs, namely ALUs and memory blocks, interconnected by a programmable crossbar. The configuration vector is loaded at each cycle, and the controller allows branching in the configuration memory. The XPP architecture is made of ALUs and small memories connected to each other by a reconfigurable dataflow network. Both architectures target kernel acceleration for streaming applications.