Processors (or microprocessors) have embraced parallelism to increase their performance. For example, Central Processing Units (CPUs) have added multiple cores, and Graphics Processing Units (GPUs) have evolved from fixed-function rendering components into parallel processors. As these parallel platforms become available, software developers need a way to take full advantage of them.
Open Computing Language (OpenCL) is an open standard programming framework for writing parallel programs that can be executed across these parallel processing platforms. It supports both task-based and data-based parallelism. OpenCL is managed by the Khronos Group, a non-profit technology consortium.
OpenCL separates execution program code (i.e., kernel code) from management program code (i.e., host code). Host code is standard C language code that can be executed on any OpenCL-supported parallel processing platform. Kernel code is written in a C-based programming language and specifies functions, with restrictions and extensions that allow for the specification of parallelism and memory hierarchy.
Barrier synchronization is a required feature of the OpenCL programming model. It refers to a synchronization mechanism that halts execution of any thread within a group that reaches the barrier point until all other threads of the group reach the same barrier point. A thread (or thread of execution) is the smallest execution unit that can be scheduled by an operating system. Barrier synchronization is typically provided by a built-in work-group barrier function that a kernel executing on a target platform can call to synchronize the threads in a group executing the kernel. In other words, all the threads of the group must execute the barrier construct before any of them is allowed to continue execution beyond the barrier.
General CPUs and GPUs carry out barrier synchronization with their fixed register set (or processor registers). Using the fixed register set to store bounded context data in known locations, the processor (e.g., CPU or GPU) can perform a context switch between threads (i.e., halt one thread and swap in another for execution) to achieve barrier synchronization. Bounded context data refers to the data representing a thread's context that must be saved to known locations and later restored.
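One way to picture barrier synchronization by context switching is a single processor that runs each thread up to the barrier, saves its bounded context, and resumes the threads only after all of them have arrived. The following is a simplified single-threaded simulation under stated assumptions, not a description of any particular CPU or GPU: the register count, the struct context layout, the pc encoding, and all function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define NUM_THREADS 4   /* hypothetical group size */
#define NUM_REGS    2   /* hypothetical size of the fixed register set */

/* Bounded context: a fixed-size record holding everything needed to resume a thread. */
struct context {
    int pc;              /* 0 = not started, 1 = stopped at the barrier, 2 = finished */
    int regs[NUM_REGS];  /* saved copy of the register set */
};

static struct context saved[NUM_THREADS]; /* known save location for each thread */
static int regs[NUM_REGS];                /* the processor's single register set */
static int shared[NUM_THREADS];           /* memory shared by the group */
static int output[NUM_THREADS];

/* Swaps thread `id` in, runs it until it reaches the barrier or finishes, swaps it out. */
static void run_until_barrier_or_end(int id)
{
    memcpy(regs, saved[id].regs, sizeof regs);     /* restore context */

    if (saved[id].pc == 0) {
        regs[0] = id + 1;                          /* live value needed after the barrier */
        shared[id] = regs[0] * 10;
        saved[id].pc = 1;                          /* halt at the barrier */
    } else if (saved[id].pc == 1) {
        regs[1] = shared[(id + 1) % NUM_THREADS];  /* neighbor's pre-barrier write */
        output[id] = regs[0] + regs[1];
        saved[id].pc = 2;                          /* finished */
    }

    memcpy(saved[id].regs, regs, sizeof regs);     /* save context */
}

/* Pass 0 brings every thread to the barrier; pass 1 resumes each thread past it. */
static void run_group(void)
{
    memset(saved, 0, sizeof saved);
    for (int pass = 0; pass < 2; pass++)
        for (int id = 0; id < NUM_THREADS; id++)
            run_until_barrier_or_end(id);

    for (int id = 0; id < NUM_THREADS; id++)
        assert(saved[id].pc == 2);
}
```

Because the register set is overwritten each time another thread is swapped in, the value in regs[0] survives the barrier only through the save and restore steps. The scheme works because the context has a fixed size and a known location, which is the property the paragraph above attributes to a fixed register set.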
However, processors configured from field-programmable gate array (FPGA) devices have no such inherent architecture to support barrier synchronization, because FPGAs include only transistor gates that can be programmed into state machines, data paths, arbitration logic, and buffers. There is no fixed register set. Instead, there are only live values distributed throughout the hardware, and state machines that control activity in the distributed data paths.