Field of the Invention
This invention relates to a method for improving the hardware execution efficiency of Open Computing Language (OpenCL) programs, and in particular to a method for implementing cooperation between OpenCL software code and a high-efficiency Field-Programmable Gate Array (FPGA) hardware platform.
Description of the Related Art
Open Computing Language (OpenCL) is the first open, royalty-free programming framework that provides a unified programming environment across heterogeneous systems. OpenCL allows users to develop cross-platform programs based on the C programming language, targeting CPUs, GPUs, DSPs, FPGAs and other devices. It is both a programming model for software engineers and a design methodology for system architects. Compatible with the ANSI C standard (C99), OpenCL provides parallel computation mechanisms based on task division and data division for various heterogeneous platforms. The host program communicates with each heterogeneous platform through the Application Program Interface (API) to ensure that task scheduling is accomplished efficiently and evenly.
For example, the Altera SDK for OpenCL (AOCL), developed by Altera, is a development environment for Field-Programmable Gate Array (FPGA) hardware platforms. AOCL provides an OpenCL compiler and high-level synthesis technology that convert OpenCL code to Verilog. It maps the high-level OpenCL code directly to FPGA platforms, thereby reducing the workload of FPGA hardware coding.
In the AOCL environment, an OpenCL program is divided into two parts: one part is executed on the host, and the other (i.e., the kernel) is implemented on the hardware platform. The host code is processed by a standard C compiler, which generates an executable program for the x86 platform; the kernel code is processed by the OpenCL compiler, which generates Verilog code by means of high-level synthesis technology. Altera's physical synthesis tool Quartus II is then called to perform the subsequent implementation, placement and routing steps and to generate a downloadable FPGA configuration file. For reference, FIG. 2 shows a flow diagram of the above process.
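The host/kernel split described above can be illustrated with a minimal sketch. It is not taken from AOCL; it only assumes that an OpenCL C compiler predefines `__OPENCL_VERSION__`, so the same kernel source can also be built as plain C for a quick functional check. The names `vec_add`, `run_vec_add`, `emu_gid`, `KERNEL` and `GLOBAL` are hypothetical and chosen for illustration.

```c
/* Illustrative sketch only: one kernel source that is OpenCL C for the
 * kernel compiler and, via the fallback below, plain C99 for the host. */
#ifdef __OPENCL_VERSION__
#define KERNEL __kernel
#define GLOBAL __global
#else
#include <stddef.h>
#define KERNEL                 /* qualifiers vanish in plain C */
#define GLOBAL
static size_t emu_gid;         /* emulated work-item index (assumption) */
static size_t get_global_id(unsigned dim) { (void)dim; return emu_gid; }
#endif

/* Kernel part: each work item handles one element (data division). */
KERNEL void vec_add(GLOBAL const float *a, GLOBAL const float *b,
                    GLOBAL float *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}

#ifndef __OPENCL_VERSION__
/* Host-side stand-in for a 1-D kernel launch over n work items. */
void run_vec_add(const float *a, const float *b, float *c, size_t n)
{
    for (emu_gid = 0; emu_gid < n; ++emu_gid)
        vec_add(a, b, c);
}
#endif
```

In an actual deployment the host part would not loop over work items itself; it would launch the kernel through the OpenCL API (for example `clEnqueueNDRangeKernel`), and only the kernel function would pass through the OpenCL compiler and high-level synthesis.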
The direct mapping from OpenCL code to FPGA platforms shortens the time to market and lowers the difficulty of developing designs for FPGA hardware platforms, which greatly facilitates the adoption of FPGAs among software engineers. In other words, design and programming for an FPGA can be accomplished without deep knowledge of FPGA hardware or Verilog. However, FPGA performance is sacrificed, because the Verilog code is auto-generated by the high-level synthesis tool, which makes extensive use of templates, unified interfaces and buffers, and a large amount of FPGA resources has to be reserved for timing closure. In addition, as the complexity of the functions increases, the optimization efficiency of the high-level synthesis tool decreases. Further, the more FPGA resources are used, the lower the frequency the synthesized design achieves. Therefore, if the OpenCL code is too complex to optimize, the performance is unsatisfactorily low compared with designs written directly in Verilog.
There are various methods for improving OpenCL execution efficiency on FPGA hardware platforms, including optimizations in task scheduling, algorithm structures, compiling parameters, and hardware platforms, as described below:
In respect of task scheduling, developers can allocate to the FPGA the types of acceleration tasks that suit it, such as logic operations (AND, OR), shifts and comparisons, because FPGA hardware contains dedicated components that process these tasks efficiently. Computationally intensive tasks, such as fixed-point and floating-point multiply-add operations, can also be accomplished on the FPGA, because its abundant DSP components process data in parallel.
In respect of algorithm structure, developers can take full advantage of the special resources in the FPGA, such as internal storage units, embedded peripherals, hardened floating-point DSP blocks, and shift registers, to design an appropriate structure and data processing flow targeted at the FPGA.
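As a plain C reference sketch of the two task types named above, the first function below uses only comparison, shift and OR, and the second is a fixed-point multiply-accumulate of the kind that maps onto DSP blocks. The function names and the Q15 number format are assumptions made for illustration; saturation and rounding are omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Logic-style task: compare each sample with a threshold and pack the
 * results into a bit mask (bit i is set when x[i] > thr).
 * Only the first 32 samples are considered. */
uint32_t pack_above_threshold(const int16_t *x, size_t n, int16_t thr)
{
    uint32_t mask = 0;
    for (size_t i = 0; i < n && i < 32; ++i)
        mask |= (uint32_t)(x[i] > thr) << i;
    return mask;
}

/* Arithmetic-style task: Q15 fixed-point dot product. Each product is
 * Q30; the accumulated sum is scaled back to Q15 at the end. */
int32_t dot_q15(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += (int32_t)a[i] * b[i];
    return (int32_t)(acc >> 15);
}
```

For instance, with samples {1, 5, 3, 9} and threshold 4, bits 1 and 3 are set, giving a mask of 10; the dot product of {0.5, 0.5} and {0.5, 0.25} in Q15 is 12288, i.e. 0.375.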
In respect of compiling parameters, developers can guide the compiler by setting appropriate parameters for task scheduling and resource allocation. For example, the compiler directive #pragma unroll is used for loop unrolling, and the kernel attributes num_compute_units and num_simd_work_items respectively set the number of compute units and the number of parallel work items in the kernel. Currently, most OpenCL compilers for FPGA support these parameters, and performance may differ significantly with different parameter values.
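A minimal sketch of how such parameters might be attached to a kernel is given below. The attribute spellings follow AOCL usage (`num_compute_units`, `num_simd_work_items`, with `reqd_work_group_size` stated alongside the SIMD setting); the numeric values, the kernel `fir4`, and the host-side helpers `run_fir4` and `emu_gid` are assumptions for illustration only. When the file is not compiled as OpenCL C, the hints are dropped so the logic can be checked as ordinary C.

```c
#ifdef __OPENCL_VERSION__
#define KERNEL __kernel
#define GLOBAL __global
/* Tuning hints; the values 64, 4 and 2 are illustrative assumptions. */
#define TUNING_ATTRS \
    __attribute__((reqd_work_group_size(64, 1, 1))) \
    __attribute__((num_simd_work_items(4)))         \
    __attribute__((num_compute_units(2)))
#else
#include <stddef.h>
#define KERNEL
#define GLOBAL
#define TUNING_ATTRS           /* hints are ignored in plain C */
static size_t emu_gid;         /* emulated work-item index (assumption) */
static size_t get_global_id(unsigned dim) { (void)dim; return emu_gid; }
#endif

#define TAPS 4

/* 4-tap FIR: work item i computes out[i] from in[i .. i+TAPS-1]. */
TUNING_ATTRS
KERNEL void fir4(GLOBAL const float *in, GLOBAL const float *coef,
                 GLOBAL float *out)
{
    size_t i = get_global_id(0);
    float acc = 0.0f;
#ifdef __OPENCL_VERSION__
#pragma unroll                 /* ask the compiler to unroll the tap loop */
#endif
    for (int t = 0; t < TAPS; ++t)
        acc += coef[t] * in[i + t];
    out[i] = acc;
}

#ifndef __OPENCL_VERSION__
/* Host-side stand-in for a launch over n_out work items;
 * in must hold n_out + TAPS - 1 samples. */
void run_fir4(const float *in, const float *coef, float *out, size_t n_out)
{
    for (emu_gid = 0; emu_gid < n_out; ++emu_gid)
        fir4(in, coef, out);
}
#endif
```

Changing the hint values does not alter the computed result; it only changes how much hardware the compiler replicates, which is why the same kernel can show very different performance and resource usage under different settings.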
In respect of hardware platform, developers can choose among different platforms, such as GPU, CPU or FPGA, according to the characteristics of the acceleration tasks. For example, FPGA devices with large logic resources and powerful DSP blocks are suitable for computationally intensive tasks, while devices with high bandwidth and abundant interfaces are suitable for real-time and streaming data processing.
The above methods for improving OpenCL execution efficiency on FPGA hardware platforms all work at relatively coarse-grained levels, because users are unable to intervene in the implementation of the underlying hardware beyond the high-level synthesis tool. These methods are therefore bound by the disadvantages of high-level synthesis technology: the auto-generated code contains redundant logic, uses many buffers, and is inferior to hand-coded Verilog in operating frequency and resource utilization. At present, it is impossible to optimize the kernel at the logic level for a specific goal, and high-level synthesis tools do not exploit the hardware efficiency of the FPGA to its full potential.