Random read accesses within a block of data are difficult to parallelize for vector SIMD operations because the data must be fetched from non-sequential locations. Vector load instructions can typically fetch only sequential data locations and, in some cases, certain predictable non-sequential patterns. When the access pattern is completely random, vector load instructions are insufficient to fetch the data in parallel.
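The contrast between the two access patterns can be illustrated with a minimal scalar sketch in C. The function names `sum_sequential` and `sum_indexed` and the index table `idx` are hypothetical and chosen only for illustration; they do not come from any particular instruction set or library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequential access: element i is read from block[i], so the addresses
 * are consecutive and a single vector load could cover several
 * iterations at once. */
uint32_t sum_sequential(const uint8_t *block, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += block[i];
    return sum;
}

/* Random access: element i is read from block[idx[i]]. The addresses
 * follow no fixed stride, so a plain vector load cannot fetch them
 * together. idx is a hypothetical table of offsets into the block. */
uint32_t sum_indexed(const uint8_t *block, size_t block_len,
                     const uint16_t *idx, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        assert(idx[i] < block_len); /* every offset must stay in the block */
        sum += block[idx[i]];
    }
    return sum;
}
```

In the first loop the read addresses are known from the loop counter alone. In the second they depend on the contents of `idx`, which is the situation the paragraph above describes as insufficiently served by vector load instructions.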
The strength of vector SIMD engines lies in their ability to perform the same computation on multiple data points simultaneously. If an algorithm's computations cannot be parallelized over multiple data points, its performance drops drastically and the vector engine offers no advantage over a scalar CPU core. Random read access within a block of data located in memory is one such problem, where the lack of parallelization causes a performance bottleneck. This kind of access pattern is present in key vision kernels such as the BRIEF feature descriptor (described in M. Calonder, V. Lepetit, P. Fua, “BRIEF: Binary Robust Independent Elementary Features”, ECCV, 778-792, 2010) and the ORB feature detection and description algorithms (described in E. Rublee, V. Rabaud, K. Konolige, G. Bradski, “ORB: An Efficient Alternative to SIFT or SURF”, ICCV, 2564-2571, 2011). In vision kernels such as the FAST feature detector (described in E. Rosten and T. Drummond, “Machine Learning for High-Speed Corner Detection”, Computer Vision – ECCV 2006, Lecture Notes in Computer Science, Volume 3951, 2006, pp. 430-443) and CoHOG (described in T. Watanabe, S. Ito, K. Yokoi, “Co-occurrence histograms of oriented gradients for pedestrian detection”, PSIVT, pages 37-47, 2009), the access pattern per block is fixed, but it is still fairly complex.
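To make the access pattern concrete, the following is a simplified scalar sketch, in C, of a BRIEF-style binary test over an image patch. It is an illustration under stated assumptions, not the implementation from the cited papers: the type `test_pair_t`, the function `brief_like_descriptor`, the 32-bit descriptor length, and the flattened-offset representation of the sampling locations are all hypothetical, and the smoothing of the patch is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One binary test: two offsets into a flattened image patch.
 * Hypothetical layout, used only for this illustration. */
typedef struct {
    uint16_t a;
    uint16_t b;
} test_pair_t;

/* Build a descriptor of up to 32 bits. Bit i is set when the pixel at
 * pairs[i].a is darker than the pixel at pairs[i].b. Each iteration
 * reads two pixels at unrelated offsets, which is the random read
 * pattern that plain vector loads cannot fetch in parallel. */
uint32_t brief_like_descriptor(const uint8_t *patch, size_t patch_len,
                               const test_pair_t *pairs, size_t n_pairs)
{
    assert(n_pairs <= 32); /* the result must fit in 32 bits */
    uint32_t desc = 0;
    for (size_t i = 0; i < n_pairs; i++) {
        assert(pairs[i].a < patch_len && pairs[i].b < patch_len);
        if (patch[pairs[i].a] < patch[pairs[i].b])
            desc |= (uint32_t)1 << i;
    }
    return desc;
}
```

The comparison and bit-packing are cheap and would vectorize well on their own. The cost is dominated by the two data-dependent reads per test, which a scalar loop must issue one at a time.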