Histogram computations and related statistical operations performed on a D-dimensional numerical set, S, such as min(S), max(S), mean X̄(S), standard deviation σ(S), and mode(S), are common operations employed in image processing systems. Histogram computations have also been employed in parallel execution problems involving large data sets, rapid throughput, or both. By way of example, the system and method taught in U.S. Pat. No. 8,451,384 utilize multiple histograms and their intersection to provide one of several measures for shot change detection in high-resolution video. Unfortunately, an efficient way of performing these types of computations while leveraging massively parallel hardware, which may include graphics processing units (GPUs) and massively multi-core SIMD or MIMD vector processing systems, is lacking.
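By way of illustration only, the relationship between a histogram and the statistical operations named above can be sketched as follows. The sketch assumes a one-dimensional set of integer values in the range [0, num_bins) and one bin per value; the function names `histogram` and `stats_from_histogram` are hypothetical and do not appear in any cited reference. It is a serial sketch of the arithmetic, not a parallel implementation.

```python
import math


def histogram(values, num_bins):
    """Count occurrences of each integer value in [0, num_bins)."""
    counts = [0] * num_bins
    for v in values:
        counts[v] += 1
    return counts


def stats_from_histogram(counts):
    """Derive min, max, mean, standard deviation, and mode from bin counts.

    Assumes bin index i holds the count of the value i (one bin per value).
    The standard deviation is the population form (divides by n).
    """
    n = sum(counts)
    if n == 0:
        raise ValueError("histogram is empty")
    occupied = [i for i, c in enumerate(counts) if c > 0]
    mean = sum(i * c for i, c in enumerate(counts)) / n
    variance = sum(c * (i - mean) ** 2 for i, c in enumerate(counts)) / n
    # On ties, max() returns the lowest bin index with the highest count.
    mode = max(range(len(counts)), key=lambda i: counts[i])
    return {
        "min": occupied[0],
        "max": occupied[-1],
        "mean": mean,
        "std": math.sqrt(variance),
        "mode": mode,
    }


if __name__ == "__main__":
    S = [2, 2, 3, 5, 5, 5, 7]
    print(stats_from_histogram(histogram(S, 8)))
```

Once the bin counts are available, each statistic requires only a pass over the bins rather than a further pass over the full data set.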
Early attempts to perform GPU-based histogram computations suffered from the poor performance of recursive reduction operations, for example, as taught in U.S. Pat. No. 7,889,922 (hereinafter the '922 patent). Such recursive reduction operations either require many repeated recursions with a small tile size, or suffer from cache misses with a large tile size and fewer recursions. This limits the utility and practical performance of recursive reduction operations for large data sets as taught by the '922 patent.
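The relationship between tile size and the number of recursions can be illustrated with a generic tiled reduction. The following is a minimal sketch under stated assumptions and is not the method of the '922 patent: partial histograms are assumed to be plain lists of equal length, and the function name `reduce_histograms` and the parameter `tile` are hypothetical. The sketch counts passes only and does not model cache behavior.

```python
def reduce_histograms(partials, tile):
    """Merge partial histograms in groups of `tile` until one remains.

    Returns (merged_histogram, number_of_passes). Assumes `partials` is a
    non-empty list of equal-length lists of counts.
    """
    if tile < 2:
        raise ValueError("tile must be at least 2")
    if not partials:
        raise ValueError("need at least one partial histogram")
    passes = 0
    while len(partials) > 1:
        # Each pass sums `tile` neighbouring partial histograms bin by bin.
        partials = [
            [sum(column) for column in zip(*partials[i:i + tile])]
            for i in range(0, len(partials), tile)
        ]
        passes += 1
    return partials[0], passes


if __name__ == "__main__":
    parts = [[1, 0, 2] for _ in range(16)]
    for tile in (2, 4, 16):
        merged, passes = reduce_histograms(parts, tile)
        print(tile, passes, merged)
```

In this sketch a smaller `tile` yields more passes over the intermediate data, while a larger `tile` yields fewer passes, each of which touches more data per merge.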
Other prior art methods avoid recursion by performing reduction in a single step using a feature of current GPU hardware, namely the reading of texture buffer values within a vertex shader, as disclosed in Scheuermann, T. and Hensley, J., 2007, “Efficient histogram generation using scattering on GPUs,” Proceedings of the 2007 symposium on Interactive 3D graphics and games (I3D '07), pp. 33-37 (hereinafter “Scheuermann and Hensley”). The reading of texture buffer values within a vertex shader as taught by Scheuermann and Hensley permits “scatter” operations, i.e., operations in which a destination write location is not fixed but varies based upon decisions that rely on the input texture.
In contrast, the recursive reduction operations taught in the '922 patent only permit “gather” operations, in which the write location is fixed but the read location is variable.
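The distinction between scatter and gather operations can be illustrated with two serial sketches of a histogram computation. These are simplified illustrations only and reproduce neither the method of Scheuermann and Hensley nor that of the '922 patent. They assume integer values in [0, num_bins); the function names are hypothetical, and each loop iteration stands in for one parallel work item.

```python
def histogram_scatter(values, num_bins):
    """Scatter form: one work item per input element.

    The write location (the bin) is chosen by the data that was read.
    """
    counts = [0] * num_bins
    for v in values:
        counts[v] += 1  # destination depends on the input value
    return counts


def histogram_gather(values, num_bins):
    """Gather form: one work item per output bin.

    Each work item writes only its own fixed bin and reads whatever
    input it needs (here, the whole input) to compute that bin.
    """
    counts = [0] * num_bins
    for b in range(num_bins):
        counts[b] = sum(1 for v in values if v == b)  # fixed write location
    return counts


if __name__ == "__main__":
    data = [0, 3, 3, 1, 3, 0]
    print(histogram_scatter(data, 4))
    print(histogram_gather(data, 4))
```

Both forms produce the same counts. In the scatter form several work items may target the same bin at once, whereas in the gather form each bin has exactly one writer at the cost of additional reads.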
It should be noted that while the method of Scheuermann and Hensley exhibits good parallelism and the further benefit of performance that scales only with the input data set size and not with the histogram bin size, it suffers from an inversion of performance wherein large bin sizes exhibit superior performance to smaller bin sizes. This unpredictability is due to serialization of memory write requests to a GPU cache, especially in data sets with high modalities, rendering such methods and systems wholly unsuitable for real-time stream processing applications where predictability is a necessity.
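The effect of write serialization can be illustrated with a toy model, offered as a sketch under stated assumptions rather than as a measurement of any GPU. The model assumes that writes targeting the same bin are applied one at a time, so the count in the most heavily loaded bin is a lower bound on the number of serialized writes. The function name `hottest_bin_load` and the parameter `value_range` are hypothetical.

```python
def hottest_bin_load(values, value_range, num_bins):
    """Return the largest number of writes that land on any single bin.

    Values are assumed to be integers in [0, value_range) and are mapped
    to bins by uniform quantization.
    """
    counts = [0] * num_bins
    for v in values:
        counts[v * num_bins // value_range] += 1
    return max(counts)


if __name__ == "__main__":
    uniform = list(range(256))  # every value occurs exactly once
    for bins in (256, 16, 4):
        print(bins, hottest_bin_load(uniform, 256, bins))
```

Under this model, the same input concentrates more writes on each bin as the number of bins decreases, and an input concentrated on a single value places every write on one bin regardless of the number of bins.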
In Nugteren, Cedric, et al., “High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs,” Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, ACM, 2011 (hereinafter “Nugteren”), two histogram computation methods are disclosed that address the cache-collision problem, but both employ a proprietary API (CUDA) that is only available from a single vendor of GPU hardware. Further, these prior art methods are directed to a single purpose, namely, the computation of a binned histogram using a GPU, and not to any allied statistical functions. Additionally, for image and video processing, histogram functions have typically been performed off-GPU, such as on the CPU, introducing pipeline stalls and wait-states. These stalls render such systems and methods unsuitable for real-time image and video processing.
Accordingly, what would be desirable, but has not yet been provided, is a high throughput, memory efficient, GPU-vendor-independent, and flexible method and system for computing histograms and related statistics that exhibits consistent performance.