Over the years, GPUs have evolved into a computational workhorse for embarrassingly data-parallel computations. Together with the CPU, they form a heterogeneous computing model wherein the sequential part of an application runs on the CPU while the data-parallel portion of the code is executed on the GPU. Compared to traditional high performance computing solutions, GPU computing provides massively data-parallel hardware at a fraction of the cost. However, building a parallel application of the kind used in areas such as finance and seismology is not easy. Even after the advent of the Compute Unified Device Architecture (CUDA) programming platform, it still takes considerable effort on the part of an application programmer to write a highly optimized GPU kernel. Further, there are no tools that assist developers in transforming sequential code into a highly optimized GPU-ready version. Hence, it is essential to build tools that assist a developer in transforming sequential code to Single Instruction, Multiple Threads (SIMT) form and gaining significant speedups without having to worry much about the underlying GPU architecture on which the code will execute. Currently, building such a tool requires an elaborate analysis of the run-time behavior of the code. Clients/customers typically make available only a partial version of the entire sequential implementation to IT service vendors who want to perform such an analysis.
The current scenario in the co-processor development domain is that Intel and AMD are working on their next generation of processors, which combine powerful CPU clusters with vector processors. For instance, Intel's recent Sandy Bridge, Ivy Bridge and Haswell CPUs, including the Xeon processors based on them, can all perform vector operations. The most advanced vector offering in this line is the Xeon Phi, a vector co-processor that works in conjunction with the CPU over a PCIe bus. From this perspective, the Xeon Phi arrangement is quite similar to that of a Xeon-class CPU working with a GPU co-processor.
Intel provides a tool called Intel Parallel Studio, which helps developers identify the parallel portions of a code and the potential performance gain from running the program on 2, 4, 8 or 16 cores. The tool also helps identify memory and threading errors. However, Intel Parallel Studio has no feature to estimate, for a given single-threaded program, the speedup required from each data-parallel portion of the code in order to achieve a certain overall end-to-end speedup of the program. Nor does the tool have any feature to evaluate the overall speedup of the code taking into account data communication costs over the PCIe bus.
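The kind of estimate that is described above as missing can be illustrated with an Amdahl's-law style calculation. The following is only a minimal sketch under stated assumptions, not the method of any existing tool: all run times are normalized to a sequential run time of 1.0, the program is assumed to have a single data-parallel portion, and the function and parameter names (`overall_speedup`, `required_kernel_speedup`, `transfer_fraction`) are hypothetical.

```python
def overall_speedup(parallel_fraction, kernel_speedup, transfer_fraction=0.0):
    """End-to-end speedup when only the data-parallel portion is accelerated.

    parallel_fraction: share of the sequential run time spent in the
        data-parallel portion (0..1).
    kernel_speedup: speedup achieved on that portion alone.
    transfer_fraction: assumed PCIe transfer time, expressed as a share of
        the original sequential run time (hypothetical parameter).
    """
    serial_time = 1.0 - parallel_fraction
    accelerated_time = parallel_fraction / kernel_speedup
    return 1.0 / (serial_time + accelerated_time + transfer_fraction)


def required_kernel_speedup(parallel_fraction, target_speedup, transfer_fraction=0.0):
    """Speedup the data-parallel portion must reach to hit the overall target.

    Returns None when the target is unreachable, i.e. when the serial part
    plus the transfer cost already exceed the allowed run time.
    """
    serial_time = 1.0 - parallel_fraction
    time_budget = 1.0 / target_speedup - serial_time - transfer_fraction
    if time_budget <= 0.0:
        return None
    return parallel_fraction / time_budget


# Example: 90% of the run time is data parallel and transfers cost 1%.
achieved = overall_speedup(0.9, 10.0, transfer_fraction=0.01)
needed = required_kernel_speedup(0.9, 5.0, transfer_fraction=0.01)
unreachable = required_kernel_speedup(0.9, 20.0)
print(achieved, needed, unreachable)
```

With these assumed figures, a tenfold speedup of the data-parallel portion yields roughly a fivefold overall speedup, and an overall target of twentyfold is unreachable because the serial 10% alone exceeds the time budget.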
As far as processors from AMD are concerned, there are two types: the APU and the GPU. The AMD GPU is a co-processor to the CPU and communicates with it via the PCIe bus, so the communication latencies that exist with Xeon Phi and NVIDIA GPUs exist with AMD GPUs too. With its APUs, AMD has brought graphics capabilities to the desktop processor. The latest in this class is the Kaveri APU, which was unveiled in January 2014. With this architecture, the CPU and GPU are able to access the same memory address space. The GPU can access cached data from coherent memory regions in the system memory and can also reference data from the CPU's cache, so cache coherency is maintained. The GPU is also able to take advantage of the virtual memory shared between the CPU and GPU, and pageable system memory can now be referenced directly by the GPU instead of being copied or pinned before access. The limitation is that the maximum memory throughput is about 51.2 GB/s. Though the APU improves the graphics capabilities of desktop processors, such memory throughput is quite low when compared to NVIDIA's co-processors (177 GB/s for the GTX 480 Fermi and 208 GB/s for the Kepler K20) or Intel's Xeon Phi.
AMD has come up with the Accelerated Parallel Processing SDK (APP SDK), which helps developers identify performance bottlenecks in OpenCL code. In this respect it is similar to NVIDIA's Parallel Nsight or CUDA Visual Profiler, which come into play after a basic version of the parallel code has been written. However, both AMD APP and NVIDIA Parallel Nsight assume that one has already ported (or newly developed) an application for the underlying platform. As the application runs, the tool collects various run-time profile information and provides different insights. These tools have no capability to predict the speedup before porting.
Limitations of the existing technology range from non-availability of data to the lack of a proper approach for handling such scenarios. These limitations are explained in further detail below.
Non-availability of test data for dynamic analysis: Performing a run-time analysis on a partial implementation using limited test input does not always give a correct analysis. Further, due to business demands, it is not always possible for a client/customer to undertake a proper run-time analysis of an entire sequential implementation.
Inaccuracy of static analysis: On the other hand, a static analysis cannot predict the complexity of a program whose behavior is data dependent. Therefore, the analysis may not always be accurate.
Non-availability of code: The tools available for analyzing a program for parallelization assume that the programmer will run them on the entire piece of sequential code. In business, the owner of the sequential code is often unavailable for such analysis, or lacks the expertise for it, and expects an external expert to perform the analysis on the owner's behalf. Furthermore, the owner does not want the code to leave the premises. Consequently, having a third-party expert perform such an analysis on the premises becomes a costly proposition for the owner. As a result, owners often do not undertake such an exercise and instead try to port existing code with minimal changes onto a new platform such as the GPU. This approach does not lead to optimal exploitation of the data-parallel infrastructure.
Holistic analysis approach is missing: In order to accurately analyze a code for parallelism, it is essential to consider different dimensions, namely loop complexity, loop volume, the nature of the input data that the program is supposed to handle, and the nature of the program variables. A loop complexity analysis can indicate the amount of control flow complexity that exists inside a loop and whether the loop is worth parallelizing. A loop volume estimation indicates how many times the loop will be executed. The nature of the input data can often give important clues regarding the run-time behavior of the application, specifically, how the control paths will be executed. In the absence of real data, knowledge of the nature of the data plays an important role. The nature of the program variables can give important hints related to optimal usage of memory.
All the tools that deal with the above aspects work in silos. Unless they are properly integrated, so that they interact with each other and influence each other's analysis, the overall analysis will not be effective. All the leading platform vendors as well as researchers acknowledge that there should be enough supporting tools to assist the application developer in building efficient code that can exploit the underlying hardware's processing power.
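The loop complexity and loop volume dimensions described above can be illustrated with a small sketch. It is a simplified illustration under stated assumptions and does not represent any existing tool: Python's standard `ast` module stands in for a real front end for the sequential code, complexity is approximated by counting decision points inside the loop body, and the trip counts used for the volume are assumed to be known. The names `loop_complexity`, `loop_volume` and `SAMPLE` are hypothetical.

```python
import ast
from math import prod

# Node types treated as decision points inside a loop body (an assumption;
# a real analysis would use a richer control-flow metric).
DECISION_NODES = (ast.If, ast.IfExp, ast.For, ast.While)


def loop_complexity(loop_node):
    """Count the decision points nested anywhere inside the loop body."""
    return sum(
        isinstance(node, DECISION_NODES)
        for statement in loop_node.body
        for node in ast.walk(statement)
    )


def loop_volume(trip_counts):
    """Total iterations of a loop nest, given the trip count of each level."""
    return prod(trip_counts)


# A small loop nest used purely for illustration; it is parsed, not executed.
SAMPLE = """
for i in range(rows):
    for j in range(cols):
        if grid[i][j] > threshold:
            total += grid[i][j]
"""

outer_loop = ast.parse(SAMPLE).body[0]
complexity = loop_complexity(outer_loop)  # inner loop and one branch
volume = loop_volume([1024, 1024])        # assumed trip counts per level
print(complexity, volume)
```

A low decision-point count combined with a large volume suggests a loop that may be worth parallelizing, whereas heavy branching points to control flow divergence that can limit the gain.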
In essence, the state of the art has the following limitations:
i) Inaccurate estimation;
ii) Ad-hoc strategy to port;
iii) Massive effort to port even a simple version; and
iv) Discovery of latency and data-transfer issues happens only while testing the application, causing the team to rewrite the application repeatedly to achieve the speedup.