A computer system may implement parallel processing to execute applications faster than traditional sequential processing. For example, single instruction multiple data (SIMD) is a form of parallel processing in which a single instruction is performed simultaneously on multiple data elements. Accordingly, SIMD may help speed up data processing in applications including multimedia, video, audio encoding/decoding, 3-dimensional (3-D) graphics, and image processing. Program operations that access the same memory cell in parallel, however, may need to be synchronized to avoid unintended results such as data corruption. Thus, a plurality of atomic operations may be implemented, wherein a computing operation (e.g., a read, modify, and write to a memory cell) is forced to complete before a subsequent computing operation executes.
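The need for synchronization can be illustrated with a minimal sketch. It is a sequential emulation, not real parallel hardware, and the function names `unsynchronized_adds` and `atomic_adds` are hypothetical. The first function models lanes that all read the memory cell before any of them writes, so updates are lost. The second models atomic adds, where each read, modify, and write completes before the next begins.

```python
def unsynchronized_adds(cell, values):
    """Model lanes that all read the cell before any lane writes back.

    Every lane computes its result from the same stale value, so each
    write overwrites the previous one and only the last update survives.
    """
    stale_reads = [cell for _ in values]      # all lanes read first
    for old, value in zip(stale_reads, values):
        cell = old + value                    # later writes clobber earlier ones
    return cell


def atomic_adds(cell, values):
    """Model atomic adds: each read-modify-write finishes before the next starts.

    Returns the final cell value and the value each operation observed
    before its own add (the per-operation return value).
    """
    returned = []
    for value in values:
        returned.append(cell)                 # value seen before this add
        cell += value
    return cell, returned


lost = unsynchronized_adds(0, [1, 1, 1, 1])
final, returned = atomic_adds(0, [1, 1, 1, 1])
```

In this emulation the unsynchronized version leaves the cell at 1 after four adds of 1, while the atomic version leaves it at 4 and hands each operation a distinct return value.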
A compiler may apply a process to execute a plurality of atomic operations. In a uniform source value operation, where the destination address is uniform (e.g., all SIMD lanes write to the same address) and the source value is uniform (e.g., all SIMD lanes write the same value), the compiler generates instructions to determine the number of active SIMD lanes in an execution mask, to execute a single atomic operation whose operand is the number of active lanes multiplied by the source value, and to propagate a return value to each lane by maintaining a counter and right-shifting the execution mask until it becomes 0. Thus, the value stored at the memory cell appears to change after each of the atomic operations (e.g., adds), and a different return value is propagated to the corresponding lane for each of the atomic operations. In a non-uniform source value operation (e.g., the SIMD lanes do not all write the same value), the compiler may apply an even more complex process to execute the atomic operations, since the calculations are more involved.
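The uniform source value process described above can be sketched as a sequential emulation. This is an illustration under stated assumptions, not the instruction sequence a compiler actually emits: the function name `uniform_atomic_add`, the list used as memory, and the choice of an add as the atomic operation are all assumptions made for the example.

```python
def uniform_atomic_add(memory, addr, value, exec_mask, simd_width):
    """Emulate an aggregated atomic add with a uniform address and value.

    memory     -- list standing in for memory cells (assumption)
    exec_mask  -- integer bit mask; bit i set means lane i is active
    Returns a list of per-lane return values (None for inactive lanes).
    """
    # Step 1: count the active lanes in the execution mask.
    active_lanes = bin(exec_mask).count("1")

    # Step 2: one atomic add of (active lanes * source value).
    old = memory[addr]
    memory[addr] = old + active_lanes * value

    # Step 3: propagate a distinct return value to each active lane by
    # keeping a counter and right-shifting the mask until it becomes 0.
    results = [None] * simd_width
    counter = 0
    lane = 0
    mask = exec_mask
    while mask != 0:
        if mask & 1:
            results[lane] = old + counter * value
            counter += 1
        mask >>= 1
        lane += 1
    return results


memory = [10]
lane_results = uniform_atomic_add(memory, 0, 3, 0b1011, 4)
```

With lanes 0, 1, and 3 active and a source value of 3, the cell goes from 10 to 19, and the active lanes receive 10, 13, and 16, as if three separate atomic adds had run in lane order. When every lane is active, the propagation loop runs once per lane.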
The compiler, however, generates a large number of instructions, which may reduce the performance of a parallel processing computing system. For example, the compiler may apply a process that generates a set of instructions to be executed in a loop. In a uniform source value operation where all lanes are enabled, the loop may execute 8 times (SIMD8) or 16 times (SIMD16), which amounts to about 98 (SIMD8)/98 (SIMD16) instructions being executed. The impact on performance becomes more dramatic in a non-uniform source value operation, since the compiler generates even more instructions as a result of the more complex calculations, wherein multiple loops (e.g., two loops) may be introduced that offset any gain. In addition, the number of instructions may increase as the number of SIMD lanes increases. For example, the compiler may generate about 710 instructions (SIMD32) in a non-uniform source value operation.
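To show why a non-uniform source value operation may need more than one loop, the following sketch emulates one possible two-loop arrangement. The text does not specify the compiler's actual process, so this structure is an assumption: the first loop sums the source values of the active lanes, and the second hands each active lane a running total as its return value. The function name `nonuniform_atomic_add` and the list used as memory are hypothetical.

```python
def nonuniform_atomic_add(memory, addr, values, exec_mask):
    """Emulate an aggregated atomic add where each lane has its own value.

    values     -- per-lane source values; len(values) is the SIMD width
    exec_mask  -- integer bit mask; bit i set means lane i is active
                  (assumed to have no bits set beyond the SIMD width)
    Returns a list of per-lane return values (None for inactive lanes).
    """
    # Loop 1: sum the source values of the active lanes.
    total = 0
    mask, lane = exec_mask, 0
    while mask != 0:
        if mask & 1:
            total += values[lane]
        mask >>= 1
        lane += 1

    # One atomic add of the combined total.
    old = memory[addr]
    memory[addr] = old + total

    # Loop 2: give each active lane the value it would have observed had
    # the adds run one at a time in lane order (an exclusive running sum).
    results = [None] * len(values)
    running = old
    mask, lane = exec_mask, 0
    while mask != 0:
        if mask & 1:
            results[lane] = running
            running += values[lane]
        mask >>= 1
        lane += 1
    return results


memory = [100]
lane_results = nonuniform_atomic_add(memory, 0, [1, 2, 3, 4], 0b1101)
```

Both loops in this emulation iterate once per lane position up to the highest active lane, which is consistent with the instruction count growing as the SIMD width grows.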