Performance of a memory-bound algorithm on a processor with parallel processing capabilities (hereinafter a “parallel processor”), such as a graphics processing unit (GPU), a central processing unit (CPU), or an accelerated processing unit (APU), may suffer from poor memory locality of the input data. Pre-processing the input data before executing the algorithm may itself worsen performance by, for example, increasing overall execution time.