Increasingly it appears that no single digital data processor can fit all applications. Application requirements differ dramatically. Each digital data processor manufacturer is beset by a host of competitors with competitive, if not superior, solutions for portions of the application space. Combating this requires an architecture that is scalable, customizable and programmable.
The single most important enabler of digital signal processor (DSP) architecture development has been shrinking integrated circuit geometries. Smaller geometries permit a single integrated circuit to include more circuits. These additional circuits could provide more computational units than were available in integrated circuits of a prior generation.
There are two paths to enable use of a greater number of circuits. The first path includes single control stream architectures on a single central processing unit (CPU). Such a single control stream CPU could provide circuits for greater exploitation of computational parallelism. Circuits of this type include very long instruction word (VLIW) architectures, where wide issue instructions control simultaneous operation of plural independent functional units. Another variation is single instruction multiple data (SIMD) architectures, where plural computational units perform the same operation on corresponding plural data instances. These architectures could take the form of additional datapaths including register files and corresponding functional units. These architectures could employ more complex functional units capable of greater computational complexity. These architectures could provide more functional units per datapath at the expense of increasing the number of data ports in the corresponding register files.
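The distinction between SIMD and VLIW operation described above can be illustrated with a minimal software sketch. It is a model only, not a description of any particular hardware. The function names `simd_apply` and `vliw_issue` and the tuple layout of an instruction slot are hypothetical and are introduced here only for illustration.

```python
import operator
from typing import Callable, List, Sequence, Tuple

Op = Callable[[int, int], int]


def simd_apply(op: Op, a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Model one SIMD instruction: the same operation on every lane."""
    if len(a) != len(b):
        raise ValueError("SIMD operands must have the same number of lanes")
    return [op(x, y) for x, y in zip(a, b)]


def vliw_issue(bundle: Sequence[Tuple[Op, int, int]]) -> List[int]:
    """Model one VLIW instruction: each slot drives its own functional unit.

    Slots within a bundle are assumed to be independent of one another,
    so they may use different operations and different operands.
    """
    return [op(x, y) for op, x, y in bundle]


# SIMD: one operation (add), four data instances.
lanes = simd_apply(operator.add, [1, 2, 3, 4], [10, 20, 30, 40])

# VLIW: three different operations issued together in one wide instruction.
slots = vliw_issue([
    (operator.add, 1, 2),
    (operator.mul, 3, 4),
    (operator.sub, 9, 5),
])
```

In this sketch the SIMD call yields four sums from a single operation, while the VLIW bundle yields one result per slot from three different operations.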
The Texas Instruments TMS320C6000 family of digital signal processors is an 8-way VLIW architecture divided into two symmetrical datapaths. Scaling from the original 2 datapaths to 4 or 8 datapaths provides a natural extension of this architecture. Such an extension by datapath replication may provide object code compatibility with the original architecture. The original compiler could be extended to search for and take advantage of additional parallelism.
Providing more computational capacity per functional unit could be done in many ways. The Texas Instruments TMS320C6000 family includes such a progression from the original 6000 series to the 6400 series and the 6700 floating point series. Additional computational capacity could be provided by: adding floating-point capability; enabling 32-bit multiplication as an extension from 16-bit multiplication; enabling complex number calculations; and merely making the functional units more similar, thus making all functional units more powerful. In the big picture, these are all believed to be mere tweaks.
These approaches are quickly running out of steam, primarily because of the limits of instruction level parallelism (ILP) available in a single control stream. It remains an open question how quickly they will reach their natural limits.
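The ILP limit can be made concrete with a small scheduling sketch. The sketch assumes every operation has unit latency and uses a simple greedy in-order list schedule. Both are simplifying assumptions and not properties of any actual processor. The function name `schedule_cycles` and the dependency dictionaries are hypothetical.

```python
from typing import Dict, List


def schedule_cycles(deps: Dict[str, List[str]], issue_width: int) -> int:
    """Count cycles to issue all operations with a greedy list schedule.

    deps maps each operation to the operations it depends on. Every
    operation is assumed to take one cycle, so a result is usable in the
    cycle after the operation issues.
    """
    if issue_width < 1:
        raise ValueError("issue_width must be at least 1")
    done = set()
    remaining = list(deps)  # preserves program order
    cycles = 0
    while remaining:
        # Ready means all inputs were produced in an earlier cycle.
        ready = [op for op in remaining if all(d in done for d in deps[op])]
        if not ready:
            raise ValueError("dependency cycle or unknown operation")
        done.update(ready[:issue_width])
        remaining = [op for op in remaining if op not in done]
        cycles += 1
    return cycles


# Eight operations forming a single dependency chain: no ILP available.
chain = {f"op{i}": ([f"op{i - 1}"] if i else []) for i in range(8)}

# Eight mutually independent operations: ILP of eight.
independent = {f"op{i}": [] for i in range(8)}

chain_narrow = schedule_cycles(chain, 1)
chain_wide = schedule_cycles(chain, 8)
indep_narrow = schedule_cycles(independent, 1)
indep_wide = schedule_cycles(independent, 8)
indep_wider = schedule_cycles(independent, 16)
```

Under these assumptions the dependency chain takes eight cycles at any issue width, and the independent operations gain nothing from widening the issue beyond eight. Additional functional units help only to the extent the single control stream contains independent work.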
The second path includes multiple control stream architectures. Multiple control stream architectures are of two types. The first type provides multiple program threads on the same central processing unit (CPU). Each thread is by definition independent and thus provides data processing tasks that can be performed independently in parallel hardware. This architecture provides good performance when the latency to access memory or registers is much greater than the compute latency. Such single CPU multi-threading has not been widely used. This technique is more specialized than the other approaches. Good compiler tools to properly match multi-threaded programs to particular applications are lacking.
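The dependence of single CPU multi-threading on the ratio of access latency to compute latency can be sketched with a simple analytical model. The model assumes each thread alternates between a fixed number of compute cycles and a fixed number of stall cycles, and that switching between threads costs nothing. These are idealizations adopted here for illustration. The function name `utilization` and its parameters are hypothetical.

```python
def utilization(threads: int, compute_cycles: int, stall_cycles: int) -> float:
    """Estimate the fraction of cycles a multi-threaded CPU does useful work.

    Each thread computes for compute_cycles, then waits stall_cycles for a
    memory or register access. While one thread waits, another may run.
    Thread switching is assumed to be free.
    """
    if threads < 1 or compute_cycles < 1 or stall_cycles < 0:
        raise ValueError("invalid thread count or cycle counts")
    period = compute_cycles + stall_cycles
    return min(1.0, threads * compute_cycles / period)


# Access latency nine times the compute latency.
one_thread = utilization(1, 10, 90)
four_threads = utilization(4, 10, 90)
ten_threads = utilization(10, 10, 90)

# Access latency small relative to compute latency: little left to hide.
short_stall = utilization(1, 90, 10)
```

In this model a single thread keeps the CPU busy only one tenth of the time when the access latency is nine times the compute latency, and ten threads are needed to reach full utilization. When the stall is short relative to the compute time, a single thread already achieves high utilization and additional threads add little.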
The second type provides multiple threads on different CPUs. This technique provides an aggressive approach to problems. Since each task and CPU are relatively independent, existing compiler tools can generally be used. Multiple control stream architectures offer the promise of breakthrough performance. They provide an avenue to exploit task level parallelism. The primary question with these techniques is how they can be programmed.
There is a broad spectrum of multiprocessor techniques including: data flow; symmetric multiprocessing; distributed multiprocessing; multi-threaded machines; shared memory; message passing; shared memory with message passing; topologies like systolic, ring, two dimensional mesh and three dimensional mesh; fine grain; and coarse grain. The particular hardware is important but the programming model is critical.
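The difference between two of the programming models listed above, shared memory and message passing, can be sketched as follows. Python threads stand in for separate processors, which is an assumption made only to keep the sketch self-contained. The function names `sum_shared_memory` and `sum_message_passing` are hypothetical.

```python
import queue
import threading
from typing import List


def sum_shared_memory(data: List[int], workers: int = 4) -> int:
    """Workers add partial sums into one shared variable guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def work(chunk: List[int]) -> None:
        nonlocal total
        partial = sum(chunk)
        with lock:  # the programmer must serialize access to shared state
            total += partial

    threads = [threading.Thread(target=work, args=(data[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def sum_message_passing(data: List[int], workers: int = 4) -> int:
    """Workers share no state and send partial sums as messages."""
    mailbox: "queue.Queue[int]" = queue.Queue()

    def work(chunk: List[int]) -> None:
        mailbox.put(sum(chunk))  # the only communication is the message

    threads = [threading.Thread(target=work, args=(data[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    # Receive exactly one message per worker, then combine the results.
    result = sum(mailbox.get() for _ in range(workers))
    for t in threads:
        t.join()
    return result


samples = list(range(1, 101))
shared_total = sum_shared_memory(samples)
message_total = sum_message_passing(samples)
```

Both functions compute the same result on comparable hardware resources. What differs is the burden placed on the programmer: explicit synchronization of shared state in the first case, and explicit communication in the second.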
There is a need in the art for solutions to this problem that are both digital signal processing (DSP) centric and capable of exploiting reduced semiconductor feature geometries for significant performance gain. Digital signal processing deals primarily with real-time processing in a manner not likely to be pursued by general purpose processors. There are many applications suitable to multiprocessors including most scientific computing and DSP. There are many multiprocessor architectures and many programming approaches for multiprocessors. The best solution is one that comprehends and is tuned for: the application; the architecture; and the programming.