Advances in processor technology have allowed for significant increases in processing speed. However, in applications that make intensive use of off-chip memory, such as speech, signal, and image processing applications, the gain in raw processing speed is often lost because of relatively slow access times to the off-chip memories. This problem is further aggravated because memory technology has focused on increased device density. With increased device density, the maximum bandwidth of a system decreases, because fewer devices are needed for a given capacity and multiple-bus architectures are thereby defeated. For example, a graphics application requiring storage of a 480×240 sixteen-bit image has four times the bandwidth if eight 256K memory chips are used, rather than two of the more dense 1 megabyte chips.
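The arithmetic behind the four-to-one figure can be checked with a minimal sketch. It assumes, as the example above implies but does not state, that every memory chip has its own data path of equal width and speed, so that peak bandwidth scales with the number of chips accessed in parallel. The function names are hypothetical and serve only as illustration.

```python
# Illustrative sketch only. Assumption (not stated in the text): each memory
# chip has its own data path of equal width and speed, so peak bandwidth is
# proportional to the number of chips that can be accessed in parallel.

def image_bits(width, height, bits_per_pixel):
    """Storage needed for an uncompressed image, in bits."""
    return width * height * bits_per_pixel


def relative_bandwidth(chips_a, chips_b):
    """Peak bandwidth of configuration A relative to configuration B."""
    return chips_a / chips_b


# The 480x240 sixteen-bit image from the example above.
needed = image_bits(480, 240, 16)

# Eight lower-density chips versus two higher-density chips.
ratio = relative_bandwidth(8, 2)

print(f"image storage: {needed} bits")
print(f"bandwidth ratio (8 chips vs 2 chips): {ratio}")
```

Under that assumption the ratio depends only on the chip count, not on the capacity of each chip, which is why moving to denser devices reduces the available bandwidth.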
Several strategies have been proposed to overcome these difficulties. One such solution involves using an application specific integrated circuit ("ASIC") to offload time-intensive tasks from the host CPU to increase overall system throughput. This alternative, however, requires one ASIC for each function to be offloaded, and requires dedicated memory for each ASIC. Consequently, a higher overall system cost is involved, and the system throughput is increased only for those tasks that the ASIC was designed to handle, and not for tasks in general.
Another alternative involves the use of a co-processor. Such a solution allows tasks to be offloaded from a host CPU and allows system memory to be shared by both the host CPU and the co-processor. With this system, however, total system bandwidth is decreased because of arbitration between the host processor and the co-processor. Furthermore, well-developed software is required to make full use of the co-processor and to provide for its "seamless integration".
Another alternative involves the use of an application specific processor for offloading tasks from a host CPU. This alternative may require an expensive dedicated static RAM ("SRAM") for use by the application specific processor, and thus involves increased system cost. Furthermore, the SRAM is not available for other uses even when the attached application specific processor is idle, and well-developed software is needed for "seamless integration".
As another solution to these difficulties, significant research and effort have been directed towards multiprocessing systems for increasing throughput as the limits of decreasing processor cycle times are approached. However, difficulties in designing multiprocessing systems, developing communication protocols for such systems, and designing software support routines have deterred the proliferation of multiprocessing systems. Nonetheless, many applications in signal, speech, and image processing are structured and lend themselves to partitioning and parallel processing.
These problems present themselves in many environments. One area in which increased processor-to-memory bandwidth is particularly critical is graphics and image processing, since significant amounts of memory and associated data processing are required.
Thus, a need has arisen for a device and method allowing for execution of several self-contained graphics and imaging tasks in parallel within existing architectural frameworks. Furthermore, a need has arisen for improving processor-to-memory bandwidth in graphics and imaging applications without significant cost increases and without requiring customized, application-specific solutions for increasing system throughput.