Many processors are designed to handle highly repetitive, complex tasks. Examples include media processors and digital signal processors (DSPs), which are used for repetitive signal processing computations such as filtering or scaling. Typically, these processors operate on multiple chunks of data at once using SIMD and/or VLIW techniques, which are well known and understood in the industry. For example, the MMX instruction set used in Intel's Pentium architecture can operate on data at 8-, 16-, or 32-bit resolution. Because its registers are 64 bits wide, it can perform up to eight 8-bit operations in parallel for some operations, improving execution efficiency accordingly. Most DSPs and media processors in the industry offer similar operations.
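The packed-data idea described above can be illustrated with a minimal software sketch. It assumes a 64-bit word holding eight independent 8-bit lanes and emulates a lane-wise wraparound add using ordinary integer arithmetic. It is not MMX code, and the names `pack`, `unpack`, and `packed_add_u8` are hypothetical helpers introduced only for this illustration.

```python
# Illustrative emulation of a packed 8-bit add on one 64-bit word.
# Assumption: eight unsigned 8-bit lanes per word, wraparound on overflow.
LANES = 8
LANE_BITS = 8
WORD_MASK = (1 << (LANES * LANE_BITS)) - 1
HIGH_BITS = int.from_bytes(b"\x80" * LANES, "little")  # top bit of every lane
LOW_BITS = WORD_MASK ^ HIGH_BITS                         # low 7 bits of every lane


def pack(values):
    """Pack eight integers in 0..255 into one 64-bit word (lane 0 lowest)."""
    return int.from_bytes(bytes(values), "little")


def unpack(word):
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return list(word.to_bytes(LANES, "little"))


def packed_add_u8(a, b):
    """Add all eight lanes at once without letting carries cross lanes."""
    # Adding only the low 7 bits of each lane cannot overflow the lane.
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    # The top bit of each lane is then fixed up with a carry-free XOR.
    return (low_sum ^ ((a ^ b) & HIGH_BITS)) & WORD_MASK


a = [1, 2, 3, 250, 255, 128, 127, 0]
b = [1, 1, 1, 10, 1, 128, 1, 0]
result = unpack(packed_add_u8(pack(a), pack(b)))
print(result)
```

A single call to `packed_add_u8` replaces eight separate byte additions, which is the efficiency gain that SIMD hardware provides with one instruction.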
This approach has several disadvantages, especially for repetitive tasks that span multiple operations. First, every task takes a few instructions to define, and some tasks require several instructions just to configure the system. Even though the sequence of instructions is repeated over a frame of data, these processors cannot take full advantage of that usage pattern, because the processor is configured for each individual operation regardless of the repetition. The only advantages exploited are features such as caching and zero-overhead loops. Furthermore, every instruction in the sequence passes through every pipeline stage defined by the respective processor architecture (instruction fetch, decode, etc.), regardless of the fact that a very small number of instructions is repeated many times. Still further, instructions are fetched from some form of memory (RAM, ROM, cache, etc.), which causes unnecessary data movement and consumes bandwidth. Finally, sequencing between task iterations and between the different operations of a given task is done in software.
For example, some systems perform operations for which the configuration time is short relative to the time spent performing the operation. Such a configuration is useful for block-based processing and large workloads. However, this type of system does not perform efficiently with smaller workloads, because the time required to configure the system is large relative to such short operations: the processor spends a large amount of set-up time relative to the actual work for each set-up. A system that requires this much processor time to configure the co-processor therefore does not use processor time efficiently.
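The trade-off between set-up time and workload size can be made concrete with a small worked calculation. The sketch below is only an illustrative model: the function `utilization` and the cycle counts (200 set-up cycles, 1 cycle per data item) are hypothetical values chosen for the example and are not taken from any particular processor.

```python
def utilization(setup_cycles, cycles_per_item, items):
    """Fraction of total time spent on useful work rather than set-up."""
    work = cycles_per_item * items
    return work / (setup_cycles + work)


# Hypothetical figures: the same fixed set-up cost for two workload sizes.
large_block = utilization(setup_cycles=200, cycles_per_item=1, items=10_000)
small_block = utilization(setup_cycles=200, cycles_per_item=1, items=50)

print(f"large workload: {large_block:.2%} useful work")
print(f"small workload: {small_block:.2%} useful work")
```

Under these assumed numbers, the fixed set-up cost is almost fully amortized over the large block, while for the small block most of the time goes to configuration rather than to the operation itself.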
There are also systems that require very little processor time to configure, but these systems are limited to small workloads or simple tasks. They need little configuration time only because each operation being performed is very similar to the previous operation, so very little processor time is required to set up the system.
Therefore, what is needed is a system and method for handling operations that include larger workloads and repetitive tasks, while allowing for sequencing of data between task iterations or between different operations.