The high-performance computing (HPC) community, including both hardware vendors and software developers, relies on an accurate count of the floating-point operations executed. These measurements are used in a variety of ways, including comparing a system's actual floating-point operation (FLOP) performance with its advertised peak FLOP performance, and analyzing applications for the percentage of scalar FLOPs compared with packed FLOPs. Static analysis of an application to obtain this information can be difficult because, during execution, code paths through the application may vary based on dynamic conditions, such as array alignment in memory, loop iteration counts that depend on the input problem size, and loop iteration counts that depend on algorithmic convergence requirements. Scalar operations are often used when data packing is not possible due to memory communication between loop iterations, and are also used to “peel” iterations of a loop to achieve a particular memory alignment for packed memory operations.
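As a rough illustration of why the scalar/packed mix depends on run-time conditions, the following sketch splits a loop's iterations into peeled scalar iterations, packed operations, and a scalar remainder. The function name `split_iterations`, its parameters, and the default sizes (4-byte elements, 32-byte vectors) are hypothetical and chosen only for this example; the sketch also assumes the start address is a multiple of the element size and does not describe any particular compiler's behavior.

```python
def split_iterations(start_addr, n, element_bytes=4, vector_bytes=32):
    """Split n loop iterations into (peeled scalar, packed ops, scalar remainder).

    Hypothetical helper: assumes start_addr is a multiple of element_bytes.
    """
    lanes = vector_bytes // element_bytes          # elements per packed operation
    misalign = start_addr % vector_bytes           # bytes past the last aligned boundary
    # Scalar iterations needed to reach the next vector-aligned address.
    peel = 0 if misalign == 0 else (vector_bytes - misalign) // element_bytes
    peel = min(peel, n)                            # a short loop may never reach alignment
    packed_ops = (n - peel) // lanes               # full-width packed operations
    remainder = (n - peel) % lanes                 # leftover scalar iterations
    return peel, packed_ops, remainder


# Same loop length, different array alignment, different scalar/packed mix.
print(split_iterations(0x1000, 100))  # aligned start
print(split_iterations(0x1008, 100))  # start is 8 bytes past an aligned boundary
```

With these assumptions, the same 100-iteration loop yields a different number of scalar and packed operations depending only on where the array happens to sit in memory, which a static count cannot know in advance.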
FLOP has a precise definition within the HPC community: it refers to single- or double-precision arithmetic operations (i.e., add, subtract, multiply, and divide), and does not include memory or logical operations. Some compound instructions, such as Fused Multiply Add (FMA) instructions, count as multiple FLOPs; an FMA, for example, counts as two FLOPs, one for the multiply and one for the add. Each element in a packed single-instruction-multiple-data (SIMD) arithmetic operation counts as a FLOP (two in the case of an FMA). For example, a 256-bit packed single-precision (32-bit) floating-point add operates on 8 elements, and thus counts as 8 FLOPs. Scalar operations use the full SIMD register data path, but only operate on a single element, and therefore count as only 1 FLOP (2 in the case of an FMA). There has been a lack of an efficient mechanism that can accurately count FLOPs in such an operating environment.
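The counting rules above can be written as a short sketch. The function name `flops_per_instruction` and its parameters are hypothetical, introduced only to illustrate the arithmetic; this is not a counting mechanism in itself.

```python
def flops_per_instruction(vector_bits, element_bits, packed, fma=False):
    """Return the FLOPs attributed to one floating-point arithmetic instruction.

    Hypothetical helper that applies the definition given in the text:
    one FLOP per element operated on, doubled for a fused multiply-add.
    """
    if element_bits not in (32, 64):
        raise ValueError("only single (32-bit) or double (64-bit) precision")
    if vector_bits % element_bits != 0:
        raise ValueError("vector width must be a multiple of the element width")
    # A scalar operation touches one element even though it uses the full register.
    elements = vector_bits // element_bits if packed else 1
    ops_per_element = 2 if fma else 1
    return elements * ops_per_element


print(flops_per_instruction(256, 32, packed=True))             # 8: packed single-precision add
print(flops_per_instruction(256, 32, packed=True, fma=True))   # 16: packed single-precision FMA
print(flops_per_instruction(256, 64, packed=False))            # 1: scalar double-precision add
print(flops_per_instruction(256, 64, packed=False, fma=True))  # 2: scalar double-precision FMA
```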