Vectorizing a sparse update pattern with a conflict detection package has its limitations. Consider the following example of a sparse update:
for (i = 0; i < N; i++) { A[idx[i]] += B[i]; }
This loop cannot be vectorized in a straightforward way because it may carry data dependencies: when idx[i] takes the same value on different iterations of the loop, those iterations reference the same memory address.
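To make the hazard concrete, the following is a minimal sketch in scalar C that emulates a naive gather / add / scatter over one vector-width chunk and compares it with the original loop. The names sparse_update_scalar, sparse_update_naive_chunk, and VLEN are hypothetical and chosen only for illustration. The emulation assumes that a scatter writes lanes in order, so the highest lane with a given index wins.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 16  /* assumed vector width in elements (hypothetical) */

/* Reference semantics: the scalar loop from the text. */
void sparse_update_scalar(float *A, const int *idx, const float *B, size_t n)
{
    for (size_t i = 0; i < n; i++)
        A[idx[i]] += B[i];
}

/* Scalar emulation of a naive gather / add / scatter over one chunk of
 * at most VLEN elements. Every lane reads A before any lane writes, so
 * lanes that share an index overwrite each other's result. */
void sparse_update_naive_chunk(float *A, const int *idx, const float *B, size_t n)
{
    float lanes[VLEN];
    assert(n <= VLEN);
    for (size_t i = 0; i < n; i++)   /* gather */
        lanes[i] = A[idx[i]];
    for (size_t i = 0; i < n; i++)   /* vector add */
        lanes[i] += B[i];
    for (size_t i = 0; i < n; i++)   /* scatter, lanes written in order */
        A[idx[i]] = lanes[i];
}
```

With idx = {2, 0, 2, 2} and B = {1, 2, 3, 4} applied to a zeroed A, the scalar loop leaves A[2] = 8, while the naive emulation leaves A[2] = 4 because only the last conflicting lane survives.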
A conventional way to vectorize the loop is to check the indexes for conflicts with a conflict instruction, which compares each index in a vector with the others. Based on this result, values are loaded from B[ ] into a vector, permuted, accumulated, and stored to A[ ]. Accumulation is usually done in an inner while loop by permuting values according to a special permute control, which is generated from the conflict result. This process is iterative and is repeated as shown below:
zmm_A = Gather(A + zmm_index)
zmm1 = VCONFLICT(zmm_index)
zmm_control = generate_perm_control(zmm1)
mask_completion = full_mask
while (mask_completion != 0) {
    mask_todo = compute_new_mask_todo(mask_completion, mask_todo)
    zmm_values = Permute(zmm_values, zmm_control)
    zmm_res = Add(zmm_res, zmm_values)
    mask_completion = compute_new_mask_completion(mask_completion, mask_todo)
}
zmm_A = VADD(zmm_A, zmm_res)
Scatter(A, zmm_A, zmm_index)
The body and the number of iterations of the inner while loop vary depending on the available instruction set and the algorithm implementation. For example, in the corner case where all 16 indexes in a vector are equal, a simple algorithm requires 15 permutations and 15 additions.
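The cost in the corner case can be illustrated with a scalar emulation. The sketch below is not the algorithm described above: it assumes a conflict result in which bit j of lane i is set when an earlier lane j holds the same index, and it replaces the permute/add steps of the inner while loop with a simple serial fold of each duplicate lane into the first lane with that index. The names conflict_emulate, merge_duplicates, and VLEN are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VLEN 16  /* assumed lanes per vector (hypothetical) */

/* Emulates conflict detection: bit j of conf[i] is set when an
 * earlier lane j < i holds the same index as lane i. */
void conflict_emulate(const int *idx, size_t n, uint32_t *conf)
{
    assert(n <= VLEN);
    for (size_t i = 0; i < n; i++) {
        conf[i] = 0;
        for (size_t j = 0; j < i; j++)
            if (idx[j] == idx[i])
                conf[i] |= (uint32_t)1 << j;
    }
}

/* Folds every duplicate lane into the first lane with the same index,
 * one addition per duplicate. Returns a mask of the lanes that hold the
 * first occurrence of each index; *adds receives the addition count. */
uint32_t merge_duplicates(float *vals, const uint32_t *conf, size_t n,
                          unsigned *adds)
{
    uint32_t store_mask = 0;
    *adds = 0;
    for (size_t i = 0; i < n; i++) {
        if (conf[i] == 0) {              /* first occurrence of this index */
            store_mask |= (uint32_t)1 << i;
            continue;
        }
        size_t first = 0;                /* lowest set bit = first occurrence */
        while (((conf[i] >> first) & 1u) == 0)
            first++;
        vals[first] += vals[i];
        (*adds)++;
    }
    return store_mask;
}
```

When all 16 lanes carry the same index, merge_duplicates performs 15 additions and returns a mask with only lane 0 set, which matches the addition count given for the simple algorithm. In a vector implementation each of those additions is preceded by a permutation that moves the value into the target lane.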