Using field programmable gate array (FPGA) devices as coprocessors to accelerate high-volume data-center applications is a well-established solution in today's high performance computing (HPC) field. This approach not only offloads a significant amount of computational workload from a central processing unit (CPU) to an FPGA, freeing the CPU for other tasks, but can also achieve better performance thanks to the FPGA's high parallelism, high throughput, low power consumption, and low, deterministic latency.
State-of-the-art FPGA design methodology typically uses one of two types of design flows. Both are still very hardware oriented and suffer from long development cycles, limited scalability, and inflexibility. In the first type of design flow, after receiving the functional requirements for a software application, FPGA designers come up with a hardware design architecture and write it in a hardware description language (HDL), such as Verilog™ or VHDL, as the design entry, building an FPGA design from the block level up to the system level. This method requires specialized knowledge of parallel programming and HDL, and it requires a highly detailed design that covers every aspect of the hardware circuits. Due to the complexity and long development cycle of HDL, this method usually isolates hardware development from software evolution. It therefore cannot match the fast pace of software progress, nor can it support fast iteration of software optimization. This method also means that a significant amount of work is required when porting applications from one generation of FPGA technology to the next.
In the second type of design flow, after identifying and extracting some functional blocks from the software application, FPGA designers use a high-level synthesis (HLS) tool to perform some level of code transformation and generate HDL from the original C/C++ source code, then plug the resulting HDL blocks into a system-level design written in HDL. This method somewhat shortens HDL development time thanks to the HLS tool's C-to-register-transfer-level (RTL) compilation flow. However, it usually predefines a fixed top-down hardware architecture, including fixed data paths, data flows, and topology among the multiple functional blocks. Such a fixed architecture cannot easily or effectively support software re-partitioning or data flow rearrangement, because the interfaces and interconnects among those functional blocks are still written in HDL. Writing and verifying HDL code remains a time-consuming task.
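As an illustration only, the kind of functional block that might be extracted from a software application and handed to an HLS tool can be sketched in plain C++. The function name `dot_product` and the constant `kBlockSize` are hypothetical and do not come from any particular application or tool. The sketch uses fixed loop bounds and no dynamic memory allocation, which is the general style HLS tools tend to expect. Vendor-specific directives are deliberately omitted.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical block size, chosen for illustration only.
constexpr std::size_t kBlockSize = 8;

// A candidate functional block: a fixed-size dot product.
// The loop bound is a compile-time constant and the function allocates
// no memory, so a synthesis tool could in principle unroll or pipeline
// the loop when mapping it to hardware.
std::int32_t dot_product(const std::int16_t a[kBlockSize],
                         const std::int16_t b[kBlockSize]) {
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        // Widen to 32 bits before multiplying so the product is
        // accumulated in the wider accumulator type.
        acc += static_cast<std::int32_t>(a[i]) *
               static_cast<std::int32_t>(b[i]);
    }
    return acc;
}
```

In the flow described above, only the body of such a block would be generated from C/C++. The ports, buffers, and interconnects that tie it to other blocks would still be written by hand in HDL.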
Furthermore, an HPC system also requires multiple FPGAs working together as a scale-up solution that expands computation capacity and supports large-scale software routines. In both design flows described above, inter-FPGA communication is predefined as part of the hardware architecture rather than exposed as a software abstraction, and software/hardware co-design is often required to settle on a fixed architecture. An architecture created in this way and dedicated to a particular software application does not give software developers a flexible, adaptive solution that allows them to iterate over different data flows.