As is well known by those skilled in this specific technical field, a classic architectural solution to achieve high processing performance when dealing with critical algorithmic kernels is to enhance a general purpose microcontroller with application-specific signal processors and peripherals for the most time-critical functions.
As a matter of fact, in order to achieve sufficient volumes of transactions in the presence of standards with a variable level of compliance, these platforms must often be over-designed to cover the worst case of all requirements.
A further, more fine-grained solution provides reconfigurability at the instruction-set level, also improving the ease of interfacing peripherals. Another solution, developed by the company Tensilica, offers a configurable processor, “Xtensa”, where instructions can be easily added at design time within the pipeline; see in this respect the article by R. E. Gonzalez, “Xtensa: a configurable and extensible processor”, IEEE Micro, Volume 20, Issue 2, March-April 2000.
However, the computational logic for new instructions is hardwired at design time with an ASIC-like flow, hence the processor cannot be reconfigured after fabrication. This approach, although very successful, is still an application-specific solution with high non-recurring engineering costs due to design and mask production.
An appealing alternative option is that of exploiting Field Programmable Gate Array (FPGA) technology, combining standard processors with embedded FPGA devices. This further solution allows exactly the required peripherals to be configured into the FPGA at deployment time, and exploits temporal re-use by dynamically reconfiguring the instruction set at run time based on the currently executed algorithm.
This solution is disclosed in the U.S. Pat. No. 5,956,518 to A. De Hon, E. Mirsky, J. Knight, F. Thomas, assigned to the Massachusetts Institute of Technology and having title: “Intermediate-grain reconfigurable processing device”.
The existing models for designing FPGA/processor interaction can be grouped in two main categories:
    the FPGA is a co-processor communicating with the main processor through a system bus or a specific I/O channel;
    the FPGA is described as a function unit of the processor pipeline.
The first group includes the GARP processor, known from the article by T. Callahan, J. Hauser, and J. Wawrzynek having title: “The Garp architecture and C compiler” IEEE Computer, 33(4): 62-69, April 2000. A similar architecture is provided by the A-EPIC processor that is disclosed in the article by S. Palem and S. Talla having title: “Adaptive explicit parallel instruction computing”, Proceedings of the fourth Australasian Computer Architecture Conference (ACOAC), January 2001.
In both cases the FPGA is addressed via dedicated instructions, moving data explicitly to and from the processor. Control hardware is kept to a minimum, since no interlocks are needed to avoid hazards, but a significant overhead in clock cycles is required to implement communication.
The communication overhead may be considered negligible only when the number of cycles per execution of the FPGA is relatively high.
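The amortization argument above can be illustrated with a minimal sketch. The cost model is an assumption made here for illustration and is not taken from the cited articles: each co-processor call pays a fixed number of transfer cycles to move data to and from the FPGA, in addition to the cycles the FPGA spends computing. The function names and the cycle counts are hypothetical.

```python
def overhead_fraction(fpga_cycles: int, transfer_cycles: int) -> float:
    """Share of one co-processor call spent moving data rather than computing.

    Assumed model: total call cost = fpga_cycles + transfer_cycles.
    """
    if fpga_cycles < 0 or transfer_cycles < 0:
        raise ValueError("cycle counts must be non-negative")
    total = fpga_cycles + transfer_cycles
    if total == 0:
        raise ValueError("a call must take at least one cycle")
    return transfer_cycles / total


def effective_speedup(cpu_cycles: int, fpga_cycles: int, transfer_cycles: int) -> float:
    """Speed-up over a software-only kernel once transfers are accounted for.

    A value below 1.0 means that off-loading the kernel is slower than
    running it on the processor.
    """
    total = fpga_cycles + transfer_cycles
    if total <= 0:
        raise ValueError("a call must take at least one cycle")
    return cpu_cycles / total


if __name__ == "__main__":
    TRANSFER = 20  # hypothetical cycles to move operands and results
    for fpga_cycles in (10, 100, 1000):
        share = overhead_fraction(fpga_cycles, TRANSFER)
        print(f"{fpga_cycles:5d} FPGA cycles: {share:.1%} of the call is communication")
```

Under this model the communication share shrinks as the work per call grows, which is why short kernels gain little from a co-processor style of coupling.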
In the commercial world, FPGA suppliers such as Altera Corporation offer digital architectures based on the U.S. Pat. No. 5,968,161 to T. J. Southgate, “FPGA based configurable CPU additionally including second programmable section for implementation of custom hardware support”.
Other suppliers (Xilinx, Triscend) offer chips containing a processor embedded on the same silicon IC with embedded FPGA logic. See for instance the U.S. Pat. No. 6,467,009 to S. P. Winegarden et al., “Configurable Processor System Unit”, assigned to Triscend Corporation.
However, those chips are generally loosely coupled by a high speed dedicated bus, performing as two separate execution units rather than being merged in a single architectural entity. In this manner the FPGA does not have direct access to the processor memory subsystem, which is one of the strengths of the academic approaches outlined above.
In the second category (FPGA as a function unit) we find some disclosed architectures known as:
    “PRISC” by R. Razdan and M. Smith, “A high-performance microarchitecture with hardware-programmable functional units”, Proceedings of the 27th Annual International Symposium on Microarchitecture, November 1994;
    “Chimaera” by Z. A. Ye, A. Moshovos, S. Hauck, P. Banerjee, “Chimaera: A High-Performance Architecture with Tightly-Coupled Reconfigurable Functional Unit”, Proceedings of the 27th International Symposium on Computer Architecture, 2000, pp. 225-235;
    “ConCISe” by B. Kastrup, A. Bink, and J. Hoogerbrugge, “ConCISe: A compiler-driven CPLD-based instruction set accelerator”, Proceedings of the Seventh Annual IEEE Symposium on Field-Programmable Custom Computing Machines, April 1999.
In all these models, data are read from and written to the processor register file directly, minimizing the overhead due to communication. In most cases, to minimize control logic and hazard handling and to fit in the processor pipeline stages, the FPGA is limited to combinatorial logic only, thus severely limiting the performance boost that can be achieved.
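The function-unit model can be sketched as follows. This is a simplified illustration under stated assumptions, not the design of any of the cited architectures: the classes RegisterFile and FunctionUnit, the two-source and one-destination operand format, and the Hamming-distance example are all hypothetical. The reconfigurable logic is modelled as a stateless function, which corresponds to the restriction to combinatorial logic.

```python
from typing import Callable, List, Optional

MASK32 = 0xFFFFFFFF  # assumed 32-bit registers


class RegisterFile:
    """Hypothetical processor register file shared with the function unit."""

    def __init__(self, size: int = 16) -> None:
        self.regs: List[int] = [0] * size

    def read(self, index: int) -> int:
        return self.regs[index]

    def write(self, index: int, value: int) -> None:
        self.regs[index] = value & MASK32


class FunctionUnit:
    """Reconfigurable unit limited to combinatorial logic.

    The configured logic keeps no state between calls: its result depends
    only on the two source operands read from the register file.
    """

    def __init__(self) -> None:
        self._logic: Optional[Callable[[int, int], int]] = None

    def configure(self, logic: Callable[[int, int], int]) -> None:
        # Stands in for loading a new configuration into the FPGA.
        self._logic = logic

    def execute(self, rf: RegisterFile, dst: int, src_a: int, src_b: int) -> None:
        if self._logic is None:
            raise RuntimeError("function unit has not been configured")
        # Operands come straight from the register file and the result goes
        # straight back, with no explicit data transfer instructions.
        rf.write(dst, self._logic(rf.read(src_a), rf.read(src_b)))


def hamming_distance(a: int, b: int) -> int:
    """Example custom operation: number of differing bits."""
    return bin((a ^ b) & MASK32).count("1")
```

A possible use is to configure the unit once and then issue the custom operation like any other register-to-register instruction:

```python
rf = RegisterFile()
rf.write(1, 0b1010)
rf.write(2, 0b0110)
fu = FunctionUnit()
fu.configure(hamming_distance)
fu.execute(rf, dst=0, src_a=1, src_b=2)  # r0 now holds the Hamming distance of r1 and r2
```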
Later attempts, like the “OneChip” solution by R. Wittig and P. Chow, “OneChip: An FPGA Processor With Reconfigurable Logic”, Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 126-135, Napa Valley, Calif., March 1996, or the processor architectures proposed in the already cited U.S. Pat. No. 5,956,518 and in U.S. Pat. No. 6,026,481, address the communication problem effectively by sharing registers between a processor core and an independently embedded FPGA device.
These solutions represent a significant step toward a low-overhead interface between the two entities. Nevertheless, due to the granularity of FPGA operations and its hardware-oriented structure, their approach is still very coarse-grained, reducing the possible resource usage parallelism and again including hardware issues that are neither familiar nor friendly to software compilation tools and algorithm developers.
Thus, a relevant drawback of this approach is the memory data access bottleneck, which often forces long stalls on the FPGA device in order to fetch into the shared registers enough data to justify its activation.
A more recent architecture exploiting a remarkable trade-off between the models cited above is known as the “Molen” processor, developed at TU Delft; see in this respect the article: “The MOLEN ρμ-coded Processor”, Proceedings of the 11th International Conference on Field-Programmable Logic and Applications 2001 (FPL2001), Belfast, Northern Ireland, UK, August 2001.
The main advantage of “Molen” is that it uses commercially available FPGA devices to build an embedded reconfigurable architecture that couples existing processor models (Altera Nios, IBM PowerPC) with well known gate-array technology (Altera Apex 20KE, Xilinx Virtex II Pro), obtaining a significant performance speed-up for a broad range of DSP algorithms.
However, even this solution presents some drawbacks due to the fact that the extensions to the processor instruction set are designed by the architecture designers and “microcoded” in the architecture itself, rather than developed at compilation time by the user.
Moreover, due to the coarse grain of the tasks involved in the instruction set extension, the size of the introduced reconfigurable logic can severely affect the energy consumption for a given algorithm.