A video algorithm can often be broken down into a series of simple, basic algorithms. For example, an edge detection algorithm can be broken down into ‘convolve’, ‘add’, etc. A video accelerator library may accelerate these basic algorithms when they are executed on a graphics processing unit (GPU).
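As a rough illustration of such a decomposition, the following sketch expresses edge detection as a chain of basic operations. It runs on the CPU in plain Python and is not an actual accelerator library: the function names (`convolve`, `absolute`, `add`, `edge_detect`) are hypothetical, and the choice of Sobel kernels is an assumption made for the example, since the text does not name a specific edge detection method.

```python
# Illustrative sketch only: pure-Python stand-ins for the basic operations.
# All names here are hypothetical, not the API of a real accelerator library.

def convolve(image, kernel):
    """Slide the kernel over a list-of-lists image (valid region only).

    The kernel is applied without flipping (cross-correlation), which is
    how 'convolve' is commonly implemented in image processing.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def absolute(image):
    """Element-wise absolute value."""
    return [[abs(v) for v in row] for row in image]

def add(a, b):
    """Element-wise sum of two images of the same size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Assumed kernels for the example (Sobel operators).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def edge_detect(image):
    """Edge detection composed from basic steps: two convolves, then an add."""
    gx = convolve(image, SOBEL_X)  # horizontal gradient
    gy = convolve(image, SOBEL_Y)  # vertical gradient
    return add(absolute(gx), absolute(gy))
```

In a GPU setting, each of these basic steps would typically be launched as its own GPU task, which is what makes the decomposition convenient for a library but potentially costly at run time.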
At least two factors can affect performance when one GPU task is broken down into many GPU tasks. A first factor is the overhead associated with data transfer between the GPU and a host processor. A second factor is the overhead associated with setting up the GPU tasks. For example, a single GPU task needs only one setup. After that task is broken down into several tasks, each of those tasks needs its own setup, even though each task is very small. Each factor can increase the latency associated with task performance.
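The effect of the two factors can be sketched with a simple cost model. This is a simplification under stated assumptions: every task is assumed to pay one setup and one host-GPU transfer, the total compute work is assumed to stay the same however it is split, and the function name and cost values are hypothetical placeholders in arbitrary time units, not measurements.

```python
def total_latency(num_tasks, setup_cost, transfer_cost, compute_cost):
    """Estimate latency when a fixed amount of work is split into num_tasks GPU tasks.

    Assumptions (not from any measurement): each task incurs one setup and
    one host<->GPU transfer; compute_cost is the total work and does not
    change with the number of tasks.
    """
    per_task_overhead = setup_cost + transfer_cost
    return num_tasks * per_task_overhead + compute_cost

# Placeholder costs in arbitrary units, chosen only to show the trend.
single_task = total_latency(num_tasks=1, setup_cost=5, transfer_cost=10, compute_cost=100)
four_tasks = total_latency(num_tasks=4, setup_cost=5, transfer_cost=10, compute_cost=100)
```

Under this model the overhead term grows linearly with the number of tasks while the compute term is unchanged, so the smaller each task becomes, the larger the share of latency spent on setup and transfer.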