Computing workloads in the emerging world of “high definition” digital multimedia (e.g. High-Definition Television (HDTV) and High-Definition-Digital Versatile Disc (HD-DVD)) more closely resemble workloads associated with scientific computing, or so called supercomputing, rather than general purpose personal computing workloads. Unlike traditional supercomputing applications, which are free to trade performance for super-size or super-cost structures, entertainment supercomputing in the rapidly growing digital consumer electronic industry imposes extreme constraints of both size, cost and power.
With rapid growth has come rapid change in market requirements and industry standards. The traditional approach of implementing highly specialized integrated circuits (ASICs) is no longer cost effective as the research and development required for each new application specific integrated circuit is less likely to be amortized over the ever shortening product life cycle. At the same time, ASIC designers are able to optimize efficiency and cost through judicious use of parallel processing and parallel data paths. An ASIC designer is free to look for explicit and latent parallelism in every nook and cranny of a specific application or algorithm, and then exploit that in circuits. With the growing need for flexibility, however, an embedded parallel computer is needed that finds the optimum balance between all of the available forms of parallelism, yet remains programmable.
Embedded computation requires more generality/flexibility than that offered by an ASIC, but less generality than that offered by a general purpose processor. Therefore, the instruction set architecture of an embedded computer can be optimized for an application domain, yet remain “general purpose” within that domain.
The current implementations of data parallel computing systems use only one instruction sequencer to send one instruction at a time to an array of processing elements. This results in significantly less than 100% processor utilization, typically closer to the 20%-60% range because many of the processing elements have no data to process or because they have the inappropriate internal state.