Conventional general purpose programmable processors executing sequential streams of instructions address different criteria across diverse problem domains. Aggressive pipelining and out-of-order execution have been applied to extract parallelism from the instruction streams, but the gains are limited by inherent data and control dependencies. To further improve performance, single instruction, multiple data (SIMD) instructions and multiple cores executing parallel instruction streams have been used. However, the resulting implementations tend to be inefficient and do not exploit detailed understandings of the problem domain to optimize data paths and dynamic random access memory bandwidth. Furthermore, the implementations are difficult for compilers to extract the inherent parallelism of processes expressed in conventional programming languages. The demanding specifications of real time computer vision processing drives toward a more power and cost efficient implementation using specialized engines for computer vision tasks. However, computer vision tasks are rapidly evolving and the specialized engines are not commonly flexible to support new or enhanced tasks.
It would be desirable to implement a scheduler and memory hierarchy in a coprocessor.