Modern processors often include instructions for operations that are computationally intensive but offer a high level of data parallelism, which an efficient implementation can exploit using various data storage devices such as, for example, single instruction multiple data (SIMD) vector registers. The central processing unit (CPU) may then provide parallel hardware to support processing vectors. A vector is a data structure that holds a number of consecutive data elements. A vector register of size M may contain N vector elements of size O, where N=M/O. For instance, a 64-byte vector register may be partitioned into (a) 64 vector elements, each holding a data item that occupies 1 byte, (b) 32 vector elements, each holding a data item that occupies 2 bytes (or one “word”), (c) 16 vector elements, each holding a data item that occupies 4 bytes (or one “doubleword”), or (d) 8 vector elements, each holding a data item that occupies 8 bytes (or one “quadword”).
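The relationship N=M/O can be illustrated with a minimal sketch. The names element_count, REGISTER_BYTES, ELEMENT_BYTES, COUNTS, and VIEW_LENGTHS are hypothetical and chosen only for this illustration. A plain byte buffer stands in for a vector register, and the sketch assumes a platform on which the native unsigned int is 4 bytes wide.

```python
def element_count(register_bytes: int, element_bytes: int) -> int:
    """Return N = M / O: how many elements of size O fit in a register of size M."""
    if element_bytes <= 0 or register_bytes % element_bytes != 0:
        raise ValueError("element size must evenly divide the register size")
    return register_bytes // element_bytes


# The 64-byte register and the four element widths named in the text.
REGISTER_BYTES = 64
ELEMENT_BYTES = {"byte": 1, "word": 2, "doubleword": 4, "quadword": 8}

COUNTS = {
    name: element_count(REGISTER_BYTES, size)
    for name, size in ELEMENT_BYTES.items()
}

# The same 64 bytes viewed at each width. The format codes are the native
# unsigned types: B (1 byte), H (2 bytes), I (assumed 4 bytes), Q (8 bytes).
register = memoryview(bytearray(REGISTER_BYTES))
VIEW_LENGTHS = {code: len(register.cast(code)) for code in ("B", "H", "I", "Q")}

if __name__ == "__main__":
    for name, count in COUNTS.items():
        print(f"{name}: {count} elements")
```

The storage is the same 64 bytes in every case; only the element width, and therefore the element count, changes.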
Vectorizing an application or software code may include making the application compile, install, and/or run on specific systems or instruction-set architectures such as, for example, a wide or large-width vector architecture.
The computing industry has developed various programming benchmarks to test the efficiency of architectures and computation techniques, such as vectorization, simultaneous multithreading, predication, etc. One suite of such benchmarks comes from the Standard Performance Evaluation Corporation (SPEC). The SPEC benchmarks are widely used to “benchmark” the performance of processor and platform architectures. The programs that make up the SPEC benchmarks are profiled and analyzed by industry professionals in an attempt to discover new compilation and computation techniques that improve computer performance. One of the SPEC benchmark suites, called CPU2006, includes integer and floating-point CPU-intensive benchmarks chosen to stress a system's processor, memory subsystem, and compiler. CPU2006 includes a program called 444.NAMD, which is derived from the data layout and inner loop of NAMD, a parallel program for the simulation of large biomolecular systems developed by Jim Phillips of the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign. Almost all of the runtime of NAMD is spent calculating inter-atomic interactions in a small set of functions. This set was separated from the bulk of the code to form a compact benchmark for CPU2006. The computational core achieves good performance on a wide range of machine architectures, but contains no platform-specific optimizations.
The program, NAMD, was a winner of a 2002 Gordon Bell award for parallel scalability, but serial performance is equally important. For example, once the most parallel portions of the benchmark have been vectorized, the non-vectorizable, serial portions typically account for an even larger share of the benchmark's runtime. This situation is typical of computationally intensive programs with high parallel scalability in general. After vectorization has been used to speed up the most parallel portions, the hard work remains of removing performance-limiting issues and bottlenecks in the otherwise non-vectorizable or serial portions of the program.
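The effect described above can be sketched numerically using the standard fixed-workload model commonly known as Amdahl's law, which the text does not name but which captures the same reasoning. The function names and the example figures (90% of runtime vectorizable, an 8x speedup of that portion) are hypothetical and are not measurements of NAMD or of any SPEC benchmark.

```python
def overall_speedup(parallel_fraction: float, vector_speedup: float) -> float:
    """Whole-program speedup when only the parallel fraction is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / vector_speedup)


def serial_share_after(parallel_fraction: float, vector_speedup: float) -> float:
    """Fraction of the new, shorter runtime spent in the unaccelerated serial code."""
    serial_fraction = 1.0 - parallel_fraction
    new_runtime = serial_fraction + parallel_fraction / vector_speedup
    return serial_fraction / new_runtime


if __name__ == "__main__":
    # Hypothetical workload: 90% of the original runtime is vectorizable.
    for speedup in (1, 2, 4, 8, 16):
        print(
            f"vector speedup {speedup:>2}x: "
            f"overall {overall_speedup(0.9, speedup):.2f}x, "
            f"serial share {serial_share_after(0.9, speedup):.0%}"
        )
```

Under these assumed figures the serial code grows from 10% of the runtime to roughly 47% once the parallel portion runs 8 times faster, which is why the remaining serial bottlenecks come to dominate.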
To date, potential solutions to such performance limiting issues and bottlenecks have not been adequately explored.