In both embedded systems and general-purpose computing, there is high demand for computing power. This demand will continue to grow as systems become more complex and as more and more problems are addressed with digital solutions.
One solution to satisfying such demand is the exploitation of instruction level parallelism (ILP) in, for example, very long instruction word (VLIW) processors, single instruction multiple data (SIMD) processors, superscalar processors, and their variants. These approaches are limited by the available parallelism in sequentially written programs. In general, instruction level parallelism has been found not to exceed a level of about six instructions per cycle.
Another solution to satisfying processing demand is to write parallel programs for homogeneous or heterogeneous parallel processors. Although practiced for many years, this approach has not achieved wide acceptance because the complexity of parallel programs makes development extremely costly. The high development cost severely limits the range of applications that may economically employ this approach. In addition, use of heterogeneous processors necessitates complete rewrites of the program for each processor configuration, and this type of architecture is typically limited by bandwidth restrictions between processors and memories.
Yet another solution for high-performance systems is pipelining several stages of a computation, an efficient approach that unfortunately lacks flexibility and, more importantly, scalability.
Independently, scheduling of processing for embedded systems using real-time operating systems (RTOS) has been found to require significant over-engineering of the hardware needed to support applications, due to both the overhead introduced by an RTOS and inefficient scheduling by the RTOS.
There is, therefore, a need in the art for an improved processing architecture supporting high processing and communication requirements. It would further be desirable for the architecture to provide a platform of modular components that may be assembled and scaled to meet diverse system requirements. The solution of the present invention involves running sequentially written programs in a manner that benefits from techniques developed for task level parallelism (TLP), with each task in turn benefiting from experience developed in instruction level parallelism, thereby exploiting both coarse grain and fine grain parallelism without the need to write parallel programs. Further, the present invention eliminates the need to use RTOS schedulers for task and resource scheduling, and can also organize heterogeneous parallel processing in a flexible and scalable way by dynamically combining parallel and pipelined execution.