Data processing is typically performed by splitting tasks between software and hardware. For example, certain tasks are complex but not computationally intensive, and therefore performing the tasks with software is sufficient. Other tasks are more computationally intensive, and therefore performing the tasks with hardware is more efficient. Parallel processing may be used to speed up the processing of computationally intensive tasks, for example in video processing, gaming, complex mathematical modeling, video conferencing, and/or other applications.
Various central processing units (CPUs) and hardware acceleration modules are typically connected together through an on-chip bus. Data transfer between the CPUs and/or the hardware acceleration modules requires relatively high bandwidth. For example, multi-channel video processors that are capable of processing multiple video streams have high bandwidth requirements. Increasing the number of buses or the bandwidth of the on-chip bus increases the bandwidth available for data transfer. However, this approach is typically not cost-effective.
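The bandwidth requirement of a multi-channel video processor can be estimated with simple arithmetic. The following is a minimal sketch only; the function names, the frame format (1920x1080, 12 bits per pixel, 30 frames per second), the stream count, and the number of bus transfers per frame are all hypothetical values chosen for illustration and do not come from the system described here.

```python
def stream_bandwidth(width, height, bits_per_pixel, fps):
    """Bytes per second needed to move one uncompressed stream across the bus once."""
    return width * height * bits_per_pixel * fps // 8


def total_bandwidth(num_streams, transfers_per_frame, width, height,
                    bits_per_pixel, fps):
    """Aggregate bus traffic when every frame crosses the bus several times.

    transfers_per_frame is an assumed count, e.g. one write by a CPU and
    one read by a hardware acceleration module gives 2.
    """
    per_stream = stream_bandwidth(width, height, bits_per_pixel, fps)
    return num_streams * transfers_per_frame * per_stream


if __name__ == "__main__":
    # Hypothetical workload: 4 streams, each frame written once and read once.
    one = stream_bandwidth(1920, 1080, 12, 30)
    total = total_bandwidth(4, 2, 1920, 1080, 12, 30)
    print(f"one stream: {one / 1e6:.1f} MB/s, all streams: {total / 1e6:.1f} MB/s")
```

Under these assumed figures the aggregate traffic grows linearly with both the number of streams and the number of times each frame crosses the bus, which is why adding channels quickly raises the required bus bandwidth.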
Referring now to FIG. 1, a data processing system architecture 10 is shown. The data processing system 10 includes processor modules such as CPUs 12-1 and 12-2, referred to collectively as CPUs 12, and hardware acceleration modules (i.e., hardware processing modules) 14-1 and 14-2, referred to collectively as hardware acceleration modules 14. The CPUs 12, the hardware acceleration modules 14, a dynamic random access memory (DRAM) module 20, and a DRAM controller 22 communicate over a communication bus 24. The CPUs 12, the hardware acceleration modules 14, the communication bus 24, the DRAM module 20, and/or the DRAM controller 22 may be included on a printed circuit board (PCB) or integrated in a system on a chip (SOC) 26. The CPUs 12, the hardware acceleration modules 14, and the DRAM controller 22 communicate with a memory module 28 via the communication bus 24. The processing performance of the data processing system 10 is limited by the capabilities of the communication bus 24. In other words, processing speed is limited by the speed and/or bandwidth of the communication bus 24.
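The bus limitation described above can be illustrated with a toy model in which all modules share one bus of fixed capacity. This is a sketch under stated assumptions: the `Module` class, the `achieved_rates` function, the demand figures, and the proportional-sharing policy are hypothetical, since the arbitration scheme of the communication bus 24 is not specified.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    demand: float  # bytes/s the module would transfer if the bus were unlimited


def achieved_rates(modules, bus_capacity):
    """Return the transfer rate each module obtains on a single shared bus.

    Assumed policy: when aggregate demand exceeds the bus capacity, every
    module is scaled back by the same factor (proportional sharing).
    """
    total = sum(m.demand for m in modules)
    scale = min(1.0, bus_capacity / total) if total > 0 else 1.0
    return {m.name: m.demand * scale for m in modules}


if __name__ == "__main__":
    # Hypothetical demands for the modules of FIG. 1, in MB/s.
    modules = [
        Module("CPU 12-1", 100.0),
        Module("CPU 12-2", 100.0),
        Module("HW 14-1", 400.0),
        Module("HW 14-2", 400.0),
    ]
    for name, rate in achieved_rates(modules, bus_capacity=500.0).items():
        print(f"{name}: {rate:.0f} MB/s")
```

In this model, adding faster CPUs or hardware acceleration modules raises demand but not the achieved rates once the shared bus is saturated, which matches the observation that processing speed is bounded by the bandwidth of the bus.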