The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
Computer vision is a field that involves acquiring, processing, and analyzing digital images. Computer vision workloads are often data-intensive and are memory bandwidth-bound rather than compute-bound. For this reason, a common approach to parallelizing computer vision workloads is to split the images into “tiles” and distribute the tiles across multiple cores within a single node. Commodity hardware provides the high memory bandwidth and low memory access latency that such computer vision workloads require. The memory bandwidth requirement stems from the need for several stages of synchronization of computation across cores. However, extending the same approach to parallelism across multiple nodes is inefficient because, in multi-node systems, the inter-node network communication latency is orders of magnitude higher than the memory latency within a single node.
For example, computing disparity for stereo vision involves sequences of computer vision kernels, each performing a few basic integer arithmetic and comparison operations per pair of pixels from two images. An example operation is computing the absolute difference between two grayscale pixels. Other data-intensive computer vision workloads, such as feature tracking and motion estimation, have similar problems. Parallelizing such memory-intensive or data-intensive workloads across multiple nodes is often challenging and non-trivial.
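The low amount of arithmetic per pixel pair in such a kernel can be illustrated with a minimal sketch. The function name `absolute_differences` and the list-of-rows image representation are assumptions made only for illustration; they are not taken from any particular library or from the approaches described herein.

```python
def absolute_differences(left, right):
    """Per-pixel absolute difference of two equally sized grayscale images.

    Each image is a list of rows, and each pixel is an integer intensity
    (hypothetical representation chosen for illustration).
    """
    same_rows = len(left) == len(right)
    if not same_rows or any(len(a) != len(b) for a, b in zip(left, right)):
        raise ValueError("images must have the same dimensions")
    # One subtraction and one absolute value per pair of pixels:
    # very little compute relative to the data that is read and written.
    return [
        [abs(a - b) for a, b in zip(row_left, row_right)]
        for row_left, row_right in zip(left, right)
    ]


# Two tiny 2x3 grayscale images with made-up intensities.
left_image = [[10, 200, 30], [40, 50, 60]]
right_image = [[12, 190, 30], [0, 55, 255]]
diff = absolute_differences(left_image, right_image)
```

Each output pixel requires reading two input pixels and writing one result, while performing only a subtraction and an absolute value, which is why such kernels tend to be limited by data movement rather than by arithmetic.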
Data-intensive computer vision workloads (such as stereo vision and object and feature tracking) take as input one or more images, together with workload parameters that define how the one or more images should be processed. There are several challenges in efficiently parallelizing such a workload across multiple nodes.
For example, one approach splits the one or more images into tiles, distributes the tiles across several nodes, and performs parallel computation on the tiles. However, this requires multiple stages of synchronization (including communication of instructions) and non-negligible data communication between nodes. This may be manageable when the workload is parallelized across multiple cores within a single node, where performance is limited only by memory bandwidth and access latency, or when only a small cluster of nodes (e.g., 2-4 nodes) is involved. However, in a larger multi-node setup, the amount of synchronization and data communication, and the resulting overhead, become a significant bottleneck. Furthermore, shared network performance is often unreliable, with communication latencies orders of magnitude higher than the intra-node off-chip memory latency.
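The tile-splitting step of this approach can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the helper names `split_into_tiles` and `assign_round_robin`, the rectangular tile bounds, and the round-robin assignment policy are all hypothetical choices, and the synchronization and data communication between workers (the costly part discussed above) are deliberately not modeled.

```python
def split_into_tiles(height, width, tile_h, tile_w):
    """Return (top, left, bottom, right) bounds of tiles covering an image.

    Bounds are half-open: rows top..bottom-1 and columns left..right-1.
    Tiles on the bottom and right edges may be smaller than tile_h x tile_w.
    """
    tiles = []
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            bottom = min(top + tile_h, height)
            right = min(left + tile_w, width)
            tiles.append((top, left, bottom, right))
    return tiles


def assign_round_robin(tiles, num_workers):
    """Assign tiles to workers (cores or nodes) in round-robin order."""
    assignment = {worker: [] for worker in range(num_workers)}
    for index, tile in enumerate(tiles):
        assignment[index % num_workers].append(tile)
    return assignment


# A hypothetical 4x6 image split into tiles of at most 2x4 pixels,
# distributed across two workers.
tiles = split_into_tiles(height=4, width=6, tile_h=2, tile_w=4)
assignment = assign_round_robin(tiles, num_workers=2)
```

Within a single node, each worker could read its tiles directly from shared memory. Across nodes, every tile and every intermediate result that another worker needs would have to cross the network instead.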
Parallelization is also particularly difficult when there is a high data movement-to-compute ratio. Although synchronization overhead exists for any workload, it is exacerbated in data-intensive computer vision workloads, where the amount of compute per unit of data moved is very low. In such cases, the very small packets exchanged between nodes do not carry enough associated compute to hide the high network latency.
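The latency-hiding argument can be made concrete with a simple back-of-the-envelope model. The sketch below assumes that a transfer can be hidden only if computing on a tile takes at least as long as moving it, and that transfer time is a fixed latency plus size divided by bandwidth. The function names and every numeric parameter are hypothetical values chosen purely for illustration, not measurements.

```python
def compute_time_s(num_pixels, ops_per_pixel, ops_per_second):
    """Time to process a tile, assuming a fixed operation throughput."""
    return num_pixels * ops_per_pixel / ops_per_second


def transfer_time_s(num_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to move a tile: fixed latency plus size over bandwidth."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


def can_hide_transfer(num_pixels, ops_per_pixel, ops_per_second,
                      bytes_per_pixel, latency_s, bandwidth_bytes_per_s):
    """True if compute on a tile lasts at least as long as its transfer."""
    compute = compute_time_s(num_pixels, ops_per_pixel, ops_per_second)
    transfer = transfer_time_s(num_pixels * bytes_per_pixel,
                               latency_s, bandwidth_bytes_per_s)
    return compute >= transfer


# Hypothetical parameters: a 64x64 tile of 1-byte pixels, 1e9 ops/s,
# 50 microseconds of network latency, and 1e9 bytes/s of bandwidth.
common = dict(num_pixels=64 * 64, ops_per_second=1e9, bytes_per_pixel=1,
              latency_s=50e-6, bandwidth_bytes_per_s=1e9)

# A kernel with 2 operations per pixel cannot hide the transfer.
low_compute_hidden = can_hide_transfer(ops_per_pixel=2, **common)

# A kernel with 1000 operations per pixel can.
high_compute_hidden = can_hide_transfer(ops_per_pixel=1000, **common)
```

Under these assumed parameters, the fixed latency term alone exceeds the compute time of the low-compute kernel, which is the situation described above for small packets with little associated compute.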
Another approach involves extending single-node fine-grained parallelism across multiple nodes using general-purpose inter-node communication primitives such as send, receive, broadcast, scatter, gather, and reduce. This type of extension is inefficient, especially for computer vision workloads that exhibit regular and large data communication patterns that cannot be efficiently implemented using these primitives. Thus, achieving scalability across multiple nodes becomes challenging and non-trivial for such data-intensive computer vision workloads.
Thus, there is a need for an approach that utilizes parallelism across both cores and nodes, together with a specialized set of inter-node communication primitives, to build a scalable platform for parallelizing data-intensive computer vision workloads.
While each of the drawing figures depicts a particular embodiment for purposes of depicting a clear example, other embodiments may omit, add to, reorder, and/or modify any of the elements shown in the drawing figures. For purposes of depicting clear examples, one or more figures may be described with reference to one or more other figures, but using the particular arrangement depicted in the one or more other figures is not required in other embodiments.