Many important software applications exhibit large amounts of data parallelism, and modern computer systems are designed to take advantage of it. While much of the computation in the multimedia and scientific application domains is data parallel, certain operations must be serialized, which is costly and increases run time. Examples include superposition-type updates in scientific computing and histogram computations in media processing. A more specific example of such an operation is known as scatter-add. The term scatter-add refers to a data-parallel form of the well-known scalar fetch-and-op, specifically tuned for Single Instruction, Multiple Data (SIMD)/vector/stream-style memory systems. Typically, a scatter-add mechanism scatters a set of data values to a set of memory addresses and adds each data value to its referenced memory location instead of overwriting it.
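The difference between a plain scatter and a scatter-add can be illustrated with a minimal sequential sketch. The function names `scatter` and `scatter_add` and the sample values are hypothetical and chosen only for illustration; the sketch shows the intended result of the operation, not any particular hardware mechanism.

```python
def scatter_add(memory, addresses, values):
    """Add values[i] to memory[addresses[i]] for every i, in place."""
    if len(addresses) != len(values):
        raise ValueError("addresses and values must have the same length")
    for addr, val in zip(addresses, values):
        memory[addr] += val  # accumulate instead of overwriting
    return memory


def scatter(memory, addresses, values):
    """Plain scatter, for comparison: a later write replaces an earlier one."""
    if len(addresses) != len(values):
        raise ValueError("addresses and values must have the same length")
    for addr, val in zip(addresses, values):
        memory[addr] = val
    return memory


# Address 1 is referenced twice, so the two operations give different results.
addresses = [1, 3, 1, 0]
values = [10, 20, 30, 40]
mem_add = scatter_add([0, 0, 0, 0], addresses, values)   # [40, 40, 0, 20]
mem_plain = scatter([0, 0, 0, 0], addresses, values)     # [40, 30, 0, 20]
```

With scatter-add, both values sent to address 1 contribute to the final contents, whereas the plain scatter keeps only the last one.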
One commonly used algorithm related to operations that involve serialization is a histogram or binning operation. Given a data set, a histogram is simply the count of how many elements of the data set map to each bin.
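As a minimal sketch of the binning operation just described, the following serial code counts how many elements fall into each bin. The function name `histogram`, the `bin_of` mapping, and the sample pixel values are assumptions made for illustration.

```python
def histogram(data, num_bins, bin_of):
    """Count how many elements of data map to each of num_bins bins."""
    counts = [0] * num_bins
    for element in data:
        counts[bin_of(element)] += 1  # one read-modify-write per element
    return counts


# Hypothetical example: 8-bit pixel intensities grouped into 4 bins of width 64.
pixels = [0, 17, 64, 65, 130, 255, 200, 64]
pixel_counts = histogram(pixels, 4, lambda p: p // 64)  # [2, 3, 1, 2]
```

Each element triggers an increment of one bin, which is the step that becomes problematic when several elements are processed at once.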
Histograms are commonly used in signal and image processing applications, for example, to perform equalization and active thresholding. An inherent problem with parallelizing the histogram computation is memory collisions. Memory collisions occur when multiple computations performed on the data set update the same element in memory, and they often produce erroneous data, depending on the order in which the operations are carried out. For this reason, many systems do not permit certain operations (or requests for them) to be performed in parallel or concurrently with other operations on the same memory location(s). Requests for operations that must be performed in this manner are sometimes called “atomic” requests.
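The effect of a memory collision can be reproduced deterministically with a small simulation. The sketch below assumes a simplified lockstep model in which every lane first reads its bin and then every lane writes back; the function name `lockstep_histogram_step` and the lane assignment are hypothetical.

```python
def lockstep_histogram_step(counts, lane_bins):
    """Simulate one non-atomic SIMD step: all lanes read, then all lanes write."""
    loaded = [counts[b] for b in lane_bins]   # every lane reads the old value
    for b, old in zip(lane_bins, loaded):     # every lane writes old + 1
        counts[b] = old + 1                   # colliding lanes overwrite each other
    return counts


# Three of the four lanes target bin 2, so two of their updates are lost.
lane_bins = [2, 0, 2, 2]
racy_counts = lockstep_histogram_step([0, 0, 0], lane_bins)  # [1, 0, 1]
true_counts = [lane_bins.count(b) for b in range(3)]         # [1, 0, 3]
```

Bin 2 ends up with a count of 1 instead of 3 because all three colliding lanes observed the same old value before any of them wrote.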
One conventional approach that attempts to address this problem is to introduce expensive synchronization. Before a hardware processing element (PE) updates a location in memory that holds a histogram bin, it acquires a lock on the location. This lock ensures that no other PE will interfere with the update and that the result is correct. Once the lock is acquired, the PE updates the value by first reading it from memory, incrementing it, and then writing back the result. Finally, the lock is released so that future updates can occur.
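The acquire, read, write-back, release sequence can be sketched in software with one lock per bin. This is only an analogy: Python threads stand in for PEs, and the function name `locked_histogram` and the worker count are assumptions made for illustration.

```python
import threading


def locked_histogram(bin_indices, num_bins, num_workers=4):
    """Histogram in which each bin update is protected by a per-bin lock."""
    counts = [0] * num_bins
    locks = [threading.Lock() for _ in range(num_bins)]

    def worker(chunk):
        for b in chunk:
            with locks[b]:             # acquire the lock on this bin
                value = counts[b]      # read the current value
                counts[b] = value + 1  # write back the updated value
            # the lock is released when the with-block exits

    # Give each worker (standing in for a PE) an interleaved share of the input.
    chunks = [bin_indices[i::num_workers] for i in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts


bin_indices = [i % 3 for i in range(3000)]
locked_counts = locked_histogram(bin_indices, 3)  # [1000, 1000, 1000]
```

The result is correct, but every update pays for a lock acquisition and release, which is the overhead the passage refers to.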
This seemingly straightforward approach can sometimes be complicated by the SIMD nature of many architectures, which calls for very fine-grained synchronization, as no useful work is performed until all PEs have acquired and released their locks. To overcome this limitation, parallel software constructs have been developed. One such construct is known as segmented scan and involves analyzing the targeted data in a segmented manner in order to better control access to memory under the lock-and-release approach.
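One common way to apply a segmented scan to the histogram problem is sketched below. It assumes a sort-then-scan formulation: the bin indices are sorted so that equal indices form contiguous segments, a segmented sum scan is run over a vector of ones, and the last element of each segment holds that bin's count. The function names are hypothetical, the scan is written serially for clarity, and this is not necessarily the exact construct referred to above.

```python
def segmented_scan(values, head_flags):
    """Inclusive sum scan that restarts wherever head_flags[i] is True."""
    out = []
    running = 0
    for value, is_head in zip(values, head_flags):
        running = value if is_head else running + value
        out.append(running)
    return out


def histogram_by_segmented_scan(bin_indices, num_bins):
    """Histogram without colliding writes: sort, segmented scan, one write per bin."""
    ordered = sorted(bin_indices)
    n = len(ordered)
    # A segment starts wherever the bin index changes.
    heads = [i == 0 or ordered[i] != ordered[i - 1] for i in range(n)]
    sums = segmented_scan([1] * n, heads)
    counts = [0] * num_bins
    for i, b in enumerate(ordered):
        is_tail = i == n - 1 or ordered[i + 1] != b
        if is_tail:
            counts[b] = sums[i]  # only the segment tail writes, so no collisions
    return counts


scan_demo = segmented_scan([1, 1, 1, 1, 1], [True, False, True, False, False])
seg_counts = histogram_by_segmented_scan([2, 0, 2, 2, 1, 0], 3)
```

In this formulation each bin receives exactly one write, so no locks are needed, at the cost of the extra sorting and scanning work.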
Previous processor-in-memory circuits suggest a fetch-and-add mechanism for atomically updating a memory location based on multiple concurrent requests from different processors. For example, an integer-only adder is placed in each network switch to serve as a gateway to the distributed shared memory. While fetch-and-add could be used to perform general integer operations, its main purpose is to provide an efficient mechanism for implementing various synchronization primitives. For this reason, fetch-and-add has been implemented in a variety of ways and is a standard hardware primitive in large-scale multi-processor systems.
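The semantics of fetch-and-add, and its use as a building block for synchronization, can be sketched as follows. The class `FetchAndAddCell` is hypothetical, and its internal lock merely stands in for the atomicity that hardware would provide; the ticket-dispensing usage is one illustrative synchronization pattern, not one named in the text above.

```python
import threading


class FetchAndAddCell:
    """A memory cell whose fetch_and_add returns the old value atomically."""

    def __init__(self, initial=0):
        self._value = initial
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def fetch_and_add(self, increment):
        with self._lock:
            old = self._value
            self._value = old + increment
            return old  # the caller sees the value from before its own update

    def load(self):
        with self._lock:
            return self._value


# Usage: hand out unique ticket numbers to concurrent workers.
ticket_counter = FetchAndAddCell()
tickets = []
tickets_lock = threading.Lock()


def take_tickets(how_many):
    for _ in range(how_many):
        ticket = ticket_counter.fetch_and_add(1)
        with tickets_lock:
            tickets.append(ticket)


workers = [threading.Thread(target=take_tickets, args=(250,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Because each call returns a distinct old value, the four workers together receive every ticket number from 0 to 999 exactly once.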
Several designs for aggregate and combining networks have also been suggested. For example, one suggested fetch-and-add mechanism includes a combining operation at each network switch, not just at the target network interface. Another design uses a control network, not based on memory location, that performs reductions and scans on integral data from the different processors in the system.
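The idea of combining fetch-and-add requests before they reach memory can be sketched as follows. The sketch assumes a single combining stage and models requests as hypothetical `(address, increment)` pairs; the function names are illustrative, and real combining networks perform this pairwise at each switch rather than in one pass.

```python
def combine(requests):
    """Merge (address, increment) requests so each address is sent to memory once."""
    combined = {}
    for addr, inc in requests:
        combined[addr] = combined.get(addr, 0) + inc
    return combined


def serve_with_combining(memory, requests):
    """Apply combined requests, then reconstruct each requester's reply."""
    old_values = {}
    for addr, total in combine(requests).items():
        old_values[addr] = memory[addr]  # a single memory access per address
        memory[addr] += total
    # Decombine: each requester gets the value it would have seen had the
    # requests been served one at a time in their original order.
    replies = []
    offsets = {}
    for addr, inc in requests:
        offset = offsets.get(addr, 0)
        replies.append(old_values[addr] + offset)
        offsets[addr] = offset + inc
    return replies


shared_memory = [100, 200]
replies = serve_with_combining(shared_memory, [(0, 1), (1, 5), (0, 2), (0, 3)])
# replies == [100, 200, 101, 103]; shared_memory == [106, 205]
```

Three requests to address 0 cost only one memory update, yet each requester still receives a reply consistent with some serial ordering.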
While each of these conventional approaches has its merits, improvement therein can be realized in terms of, among others, processing efficiency (speed) and robustness. For example, where atomic memory operations are implemented by serializing the requests or by software constructs involving excessive processing overhead, both processing efficiency and robustness can be adversely impacted. Robustness can also be a concern for conventional processor-in-memory architectures that use highly customized internal functional units to provide control over such request-processing. These and other issues have presented challenges to atomic-memory-request processing.