Unlike generic data processing, image and video processing demands a high degree of parallelism. Achieving this parallelism requires domain-specific knowledge to develop complex computer software. Domain experts, however, focus on the overall performance of the application and often lack knowledge of system-level concerns such as parallelism or flexibility for user-defined applications. Traditional sequential instruction processing, as performed in microprocessors, is therefore inefficient for this workload and greatly reduces processing efficiency. This causes various unwanted results, such as data loss, low frame rates, delayed image/video analytics results for intelligent decision making, and inefficient power usage in applications that demand low power.
Conventionally, multiple approaches have been developed to solve these problems. Broadly, these approaches fall into three categories. All three incorporate a main processor (or processors) for running the main application software, configuring the other compute resources, and handling the results produced by the overall compute system.
The first approach is to use non-standard, highly parallel, hardcoded hardware structures implemented in Application Specific Integrated Circuits (ASICs), Systems on Chip (SoCs), or Field Programmable Gate Arrays (FPGAs) using Register Transfer Level (RTL) implementation methods. These RTL-generated digital gate structures work alongside the main CPUs. This approach directly implements hardcoded data processing structures and achieves a very high degree of parallelism. It is efficient in terms of performance and power, but it comes at the cost of flexibility and programmability. The scheme lacks the capability to add heterogeneous compute engines in a plug-and-play fashion. This capability is very important in physical implementations such as custom ASICs and FPGAs, which allow programmability for application developers through EDA tools. The currently popular approach in ASICs and FPGAs is to stitch together new or reusable computational blocks using RTL or schematic design entry methods. This demands major design skill from the application developer to create an ASIC or FPGA design that meets their needs.
The second approach is to use multiple symmetric compute structures, such as Single Instruction Multiple Data (SIMD) or Very Long Instruction Word (VLIW) processors, working in parallel to achieve a certain degree of parallelism. The SIMD and VLIW processors work alongside the main CPU. This approach is simple from a programming-model point of view but inefficient in terms of parallelism. Given the volume and organisation of image data, the number of SIMD and VLIW processors required to achieve reasonable frame rates could be very high. Communication and data sharing between the processors take place through interrupts and traditional inter-processor communication mechanisms such as mailboxes, which often results in performance bottlenecks due to communication overhead and programming complexity.
The third approach is a mixed one. Here, main CPUs and SIMD/VLIW processors coexist with a fixed pipeline of hardware blocks specialised to perform certain tasks efficiently. The SIMD processors address programmability and flexibility, while the fixed-function pipeline blocks help achieve the high degree of image/video-specific parallelism needed for certain functions (e.g., convolution). This approach solves the majority of the issues related to image/video processing.
Though this third approach is efficient compared to the other two, most of its implementations rely on data flow built on traditional inter-processor and inter-fixed-function communication. The main processor(s) or SIMD/VLIW processors use fixed-function hardware pipelines for specific tasks. Because of this rigid communication and dataflow management, combining the fixed-function blocks with SIMD/VLIW processors limits the overall performance of the system. The approach is also inefficient in terms of silicon area and power utilization. Further, this scheme lacks the capability to add heterogeneous compute engines in a plug-and-play fashion for an FPGA or ASIC hardware implementation.
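As an illustration of why a function such as convolution suits a fixed-function parallel block, the following is a minimal sketch in Python. The function name `convolve3x3`, the sample image, and the box kernel are hypothetical and chosen only for illustration; they are not taken from any of the systems discussed here. The sketch shows that each output pixel depends only on a small neighbourhood of the 2-dimensional input array, so all output pixels could in principle be computed independently and in parallel.

```python
def convolve3x3(image, kernel):
    """Apply a 3x3 kernel to a 2D image stored as a list of rows.

    Border pixels are left at 0 for simplicity. The kernel is not
    flipped, so strictly this is a correlation; for a symmetric
    kernel the result is the same as a convolution.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Each output pixel reads only its own 3x3 neighbourhood,
            # so every (y, x) iteration is independent of the others.
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += image[y + ky - 1][x + kx - 1] * kernel[ky][kx]
            out[y][x] = acc
    return out


# Hypothetical 4x4 test image and an unnormalised 3x3 box kernel.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
box = [[1] * 3 for _ in range(3)]
smoothed = convolve3x3(image, box)
```

A sequential processor visits the output pixels one at a time, whereas a hardcoded pipeline can evaluate many neighbourhoods concurrently, which is the kind of parallelism the fixed-function blocks are meant to provide.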
Typical GPUs (graphics processing units) use SIMD/VLIW-based parallel processing; GPUs from Intel and NVIDIA are examples. However, their communication flows and methods are general purpose or targeted at graphics-type data processing. Most of them rely on a direct SIMD-to-SIMD relationship and traditional communication schemes (mailbox, interrupt, etc.) to achieve parallel processing. Image/video processing is more compute intensive, and the nature of its data (arranged as a 2-dimensional array in main memory) differs from that of graphics processing. It therefore demands a completely different approach. Some newer image processing architectures include fixed-function blocks (image processing functions) along with SIMD/VLIW processors and introduce communication methods intended to achieve high performance. These also have disadvantages: their programming models are still very restrictive, and they cannot achieve the amount of parallelism that image/video processing demands. Another approach is to use FPGAs or ASICs to address the high performance requirements of the computation. The currently popular approach is to stitch together new or reusable computational blocks using RTL or schematic design entry methods. The main drawback of this approach is that it demands major design skill from the application developer to create a design that meets their needs.
Many systems and methods are known in the existing art that use graph-based execution on heterogeneous compute resources for image or video processing. U.S. Pat. No. 9,710,876 to Stewart N. Taylor, entitled “Graph-based application programming interface architectures with equivalency classes for enhanced image processing parallelism”, deals with an image graph executor that schedules heterogeneous hardware resources in parallel. The heterogeneous devices have different instruction set architectures, and the pipeline of image processing operations is performed across these different instruction set architectures. In this art, a mechanism is used for implementing the graph on the heterogeneous hardware resources. The graph executor optimizes the tiling of an image.
U.S. Pat. No. 9,348,560 to Binglong Xie et al., entitled “Efficient execution of graph-based programs”, deals with heterogeneous hardware resources using graph-based execution for image/video processing. A mix of various processing units, such as RICA, is used for parallel computing based on a program. This program is based on the graph associated with the parallel hardware configuration. The heterogeneous hardware resources are programmed for physical implementation, such as an FPGA or ASIC, using computer files (e.g., RTL, GDSII, GERBER, etc.) provided to fabrication handlers.
U.S. Pat. No. 9,569,221 to Apoorv Chaudhri et al., entitled “Dynamic selection of hardware processors for stream processing”, deals with heterogeneous parallel computing systems that process streams of data using graph-based execution. Here, the SoC has multiple hardware processors, such as SIMD units and hardware kernels. To perform complex tasks that require sequences of analytical and processing operations, stream processing tools may be logically arranged in sequences referred to as processing pipelines or tool chains.
U.S. Patent Application No. 20160147571 to Rèmi Barrere et al., entitled “Method for optimizing the parallel processing of data on a hardware platform”, deals with a system that includes a plurality of processing units. An application graph defines the processing of data on the plurality of processing units. The parallel processing of data is optimized by programming code instructions, which are implemented on the hardware platform. The hardware platforms communicate using message passing.
U.S. Pat. No. 9,430,807 to Alexei V. Bourd et al., entitled “Execution model for heterogeneous computing”, deals with a graph-based pipeline execution topology in heterogeneous computing systems for image/video processing. The heterogeneous computing system includes a GPU configuration that includes SIMD units and hardware kernels. The processor receives an indication of the pipeline topology and generates instructions for the GPU to execute.
U.S. Patent Application No. 20160093012 to Jayanth N. RAO et al., entitled “Method and apparatus for a highly efficient graphics processing unit (GPU) execution model”, deals with heterogeneous cores having a plurality of child workloads interconnected in a logical graph structure for image/video processing. The compute engines include SIMD units, pipelines, and kernels. The graphics pipeline is configured based on pipeline control commands. Message passing is performed among the major components of the graphics engine. The code and data are stored in the hardware unit for executing work.
Chinese Patent Application No. 102707952 to Zhou Jun et al., entitled “User description based programming design method on embedded heterogeneous multi-core processor”, deals with a heterogeneous multi-core processor with different processor cores that uses a task relation graph (a directed acyclic graph (DAG)) for task execution. The parallelism in the heterogeneous multi-core processor is based on kernel frame code. Tasks communicate through message queues, and the heterogeneous multi-core processor is programmed through embedded programming based on a user description.
PCT Application No. 2016177405 to Natan Peterfreund et al., entitled “Systems and methods for transformation of a dataflow graph for execution on a processing system”, deals with a heterogeneous system that includes SIMD units and uses a dataflow graph for processing, where the parallel execution of processes is based on the dataset. The dataflow graph of a computer program is implemented on the processor.
A non-patent literature reference, “Exploiting the Parallelism of Heterogeneous Systems using Dataflow Graphs on Top of OpenCL” by Lars Schor et al., relates to a heterogeneous system that includes SIMD units and kernels and uses synchronous dataflow graphs for video processing, where the parallel execution of processes is based on the program. This high-level programming framework is implemented on the heterogeneous system.
A non-patent literature reference, “Supporting Real-Time Computer Vision Workloads using OpenVX on Multicore+GPU Platforms” by Glenn A. Elliott et al., relates to a heterogeneous hardware platform with a pipeline architecture and kernels that uses graph-based computation, where the pipeline execution is based on software. The graph-based software architecture is designed for heterogeneous platforms.
Although the existing systems and methods relate to heterogeneous systems that use graph-based dataflow execution for parallel image/video processing, and optimize the parallel processing of data through programming code instructions, none of the prior art discloses nodes that act as proxies at intermediate stages between compute engines. Nor does any of the prior art disclose the reusability of compute engines.
Hence, a need exists in the art for a system or technique that utilizes silicon efficiently to achieve high performance at low power without compromising flexibility and programmability.
The present invention proposes a system and method to achieve a graph-based execution flow for image and video processing. The inventive system comprises a scheme to interconnect heterogeneous compute resources in a user-programmable, graph-based data flow, reusing compute resources through time multiplexing and context switching. During this reuse, commit and release messages are generated to indicate the start and stop of a particular instruction, which increases the reusability of the compute engines.
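The idea of a user-programmable graph in which a compute engine is reused by time multiplexing, with commit and release messages marking the start and stop of each use, can be sketched in software as follows. This is a minimal illustration under stated assumptions and not the claimed implementation: the names `Engine`, `Node`, `run_graph`, `conv0`, and `simd0` are hypothetical, execution is single-threaded, and the commit and release messages are modelled simply as log entries.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Engine:
    """Hypothetical compute engine; 'busy' models its committed state."""
    name: str
    busy: bool = False


@dataclass
class Node:
    """Hypothetical graph node: one operation bound to one engine."""
    name: str
    engine: Engine
    func: Callable
    inputs: List[str]  # names of upstream nodes, or "source"


def run_graph(nodes, source):
    """Run nodes in dependency order and log commit/release messages."""
    results = {"source": source}
    log = []
    pending = deque(nodes)
    stalled = 0
    while pending:
        node = pending.popleft()
        if any(i not in results for i in node.inputs):
            # Inputs not ready yet: defer this node and try the next one.
            pending.append(node)
            stalled += 1
            if stalled > len(pending):
                raise ValueError("graph has an unsatisfiable dependency")
            continue
        stalled = 0
        if node.engine.busy:
            raise RuntimeError("engine committed twice without release")
        # Commit: the engine is reserved for this node's instruction.
        node.engine.busy = True
        log.append(("commit", node.name, node.engine.name))
        results[node.name] = node.func(*(results[i] for i in node.inputs))
        # Release: the engine becomes available for reuse by another node.
        node.engine.busy = False
        log.append(("release", node.name, node.engine.name))
    return results, log


conv0 = Engine("conv0")  # shared by two nodes (time multiplexed)
simd0 = Engine("simd0")
nodes = [
    Node("merge", simd0, lambda a, b: [x + y for x, y in zip(a, b)],
         ["scale", "offset"]),
    Node("scale", conv0, lambda img: [2 * p for p in img], ["source"]),
    Node("offset", conv0, lambda img: [p + 1 for p in img], ["source"]),
]
results, log = run_graph(nodes, [1, 2, 3])
```

In this sketch the single engine `conv0` serves both the `scale` and `offset` nodes in turn, and each use is bracketed by a commit entry and a release entry in `log`. A hardware realisation would exchange such messages between engines rather than append them to a list.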