Signal processing in general, and image processing in particular, requires significant computational power, all the more so over the last few years given the rapid increase in the resolution of image sensors. In the field of embedded applications aimed at the general public, heavy constraints on fabrication cost are added to the constraints on electrical consumption (of the order of a few hundred milliwatts). To meet these constraints, image processing is commonly carried out by dedicated computation modules operating in data flow mode. The “data flow” mode, as it is commonly known in the literature, is a data processing mode in which the data entering the computation module are processed as and when they arrive, at the rate of their arrival, a result being provided at the output of the computation module at the same rate, optionally after a latency time. Dedicated computation modules make it possible to comply both with the fabrication cost constraints, on account of their small silicon area, and with the performance constraints, notably as regards computational power and electrical consumption. However, such modules suffer from a lack of flexibility, since the processing operations they support cannot be modified after the modules have been constructed. At best, these modules are parametrizable; in other words, a certain number of processing-related parameters may be modified after construction.
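By way of illustration only, the data flow mode described above may be sketched in software as follows. This is a minimal sketch under stated assumptions, not a description of any particular module: the function name `moving_sum` and the parameter `taps` are hypothetical, and a simple sliding sum stands in for the processing operation. One result is produced per incoming sample, at the rate of arrival, after an initial latency of `taps - 1` samples.

```python
from collections import deque
from typing import Iterable, Iterator


def moving_sum(stream: Iterable[int], taps: int = 3) -> Iterator[int]:
    """Hypothetical data flow module: consume samples as they arrive and
    emit one result per sample once `taps` samples have been received."""
    window = deque(maxlen=taps)  # holds only the most recent `taps` samples
    for sample in stream:
        window.append(sample)
        if len(window) == taps:  # initial latency of taps - 1 samples
            yield sum(window)


# Example: five samples in, three results out after a two-sample latency.
results = list(moving_sum([1, 2, 3, 4, 5]))
```

In this sketch the processing operation is fixed when the function is written, and only `taps` can be changed afterwards, which mirrors the parametrizable but otherwise inflexible character of a dedicated module.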
A solution to this lack of flexibility consists in using completely programmable processors. The processors most commonly used are signal processors, well known in the literature under the acronym “DSP” for “Digital Signal Processor”. The drawbacks of these processors are their significant silicon footprint and their electrical consumption, which often render them ill-suited to highly constrained embedded applications.
Compromises between dedicated computation modules and completely programmable processors are currently under development. According to a first compromise, a circuit comprises a data processing unit having very long instruction words, called a VLIW (“Very Long Instruction Word”) unit, and a unit making it possible to execute one and the same instruction on several computation units, called an SIMD (“Single Instruction Multiple Data”) unit. In certain current constructions, computation units of VLIW and/or SIMD type are integrated in the circuit as a function of the necessary computational power. The type of unit to be included in the circuit, their number and the way they are chained together are decided before the construction of the circuit by analyzing the application code and the necessary resources. The order in which the units are chained together is fixed, so that the chaining of the processing operations cannot subsequently be changed. Moreover, the units are on the whole fairly complex, since the control code for the application is not separate from the processing code. Thus, the processing operators of these units are of significant size, thereby leading to an architecture whose silicon area and electrical consumption are greater for equal computational power.
According to a second compromise, a C-language code may be transformed into a set of elementary instructions by a specific compiler. The set of instructions is then mapped onto a configurable matrix of predefined operators. This technology may be compared with that of the FPGA, which is the acronym for “Field Programmable Gate Array”, the computation grain being coarser. It therefore does not make it possible to obtain programmable circuits, but only circuits that can be configured by code compilation. If it is desired to integrate parts of program code that were not provided for at the outset, computation resources which are not present in the circuit are then necessary. It therefore becomes difficult or indeed impossible to implement this code.
According to a third compromise, the data are processed by a so-called parallel architecture. Such an architecture comprises several computation tiles linked together by an interconnection bus. Each computation tile comprises a storage unit making it possible to store the data locally, a control unit providing instructions for carrying out processing on the stored data, processing units carrying out the instructions received from the control unit on the stored data and an input/output unit conveying the data either between the interconnection bus and the storage unit, or between the processing units and the interconnection bus. This architecture presents several advantages. A first advantage is the possibility of modifying the code to be executed by the processing units, even after the construction of the architecture. Furthermore, the code to be executed by the processing units generally comprises only computation instructions, and no control or address computation instructions. A second advantage is the possibility of carrying out in parallel either an identical processing operation on several data, or more complex processing operations in the same number of clock cycles, by taking advantage of the parallel arrangement of the processing units. A third advantage is that the computation tiles may be chained together according to the processing operations to be carried out on the data, the interconnection bus conveying the data between the computation tiles in a configurable order. Moreover, the parallel architecture may be extended by adding further computation tiles, so as to adapt its processing capabilities to the processing operations to be carried out. However, the management of the data in the computation tiles is complex and generally requires significant memory resources.
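Purely as an illustration, the first and third advantages mentioned above (code that can be modified after construction, and tiles chained in a configurable order) may be modelled in software as follows. This is a simplified sketch under stated assumptions: the names `Tile`, `run_chain`, `tiles`, "scale" and "offset" are hypothetical, each tile applies a single operation to every stored datum, and a plain function call stands in for the interconnection bus.

```python
from typing import Callable, Dict, List, Sequence


class Tile:
    """Hypothetical software stand-in for one computation tile."""

    def __init__(self, program: Callable[[int], int]) -> None:
        self.program = program        # processing code, replaceable later
        self.storage: List[int] = []  # local storage unit

    def load(self, data: Sequence[int]) -> None:
        self.storage = list(data)

    def run(self) -> List[int]:
        # The same operation is applied to every locally stored datum.
        return [self.program(x) for x in self.storage]


def run_chain(tiles: Dict[str, Tile], order: Sequence[str],
              data: Sequence[int]) -> List[int]:
    """Model of the bus: convey the data from tile to tile in `order`."""
    current = list(data)
    for name in order:
        tiles[name].load(current)
        current = tiles[name].run()
    return current


tiles = {"scale": Tile(lambda x: 2 * x), "offset": Tile(lambda x: x + 1)}
scaled_then_offset = run_chain(tiles, ["scale", "offset"], [1, 2, 3])
offset_then_scaled = run_chain(tiles, ["offset", "scale"], [1, 2, 3])
```

Changing the `order` argument changes the chaining without touching the tiles, and assigning a new function to `tiles["scale"].program` changes the processing without rebuilding the chain.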
In particular, when a computation tile performs a processing operation on a data neighborhood, all the data of this neighborhood must be available to it simultaneously, whereas the data arrive in the form of a continuous stream. The storage unit of the computation tile must then store a significant part of the data of the stream before a processing operation can be performed on a neighborhood. This storage and the management of the stored data require optimization so as to limit the silicon area and the electrical consumption of the parallel architecture while offering computational performance adapted to the processing of a data flow.
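The storage requirement described above may be illustrated by the following sketch, which assumes (these assumptions are not taken from the text) that the stream is a raster scan of an image of `width` pixels per line and that the neighborhood is a square of `k` by `k` pixels. The function name `neighborhoods` is hypothetical. Under these assumptions, `(k - 1) * width + k` pixels must be held before the first neighborhood is complete.

```python
from collections import deque
from typing import Iterable, Iterator, List


def neighborhoods(pixels: Iterable[int], width: int,
                  k: int = 3) -> Iterator[List[List[int]]]:
    """Yield each complete k x k neighborhood of a raster-scanned stream."""
    # Only the most recent (k - 1) * width + k pixels are needed at once.
    buffer = deque(maxlen=(k - 1) * width + k)
    for index, pixel in enumerate(pixels):
        buffer.append(pixel)
        column = index % width
        # A neighborhood ends at this pixel only if the buffer is full and
        # the window does not wrap around the end of a line.
        if len(buffer) == buffer.maxlen and column >= k - 1:
            yield [[buffer[r * width + c] for c in range(k)]
                   for r in range(k)]


# Example: a 4 x 4 image streamed as pixel values 0 to 15.
windows = list(neighborhoods(range(16), width=4))
```

For a 3 by 3 neighborhood, the buffer in this sketch holds a little over two full image lines, which gives an idea of why the memory needed grows with the line length and hence with the resolution of the sensor.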