As more applications continue to take advantage of the parallel processing capabilities of multi-processing systems and microprocessors, there is a growing need for greater memory bandwidth. Parallel applications can include graphics applications, financial applications, medical and biotechnological applications, or any other application that operates on large sets of data concurrently, for example via single-instruction multiple-data (SIMD) instructions. To some extent, more traditional, sequential central processing unit (CPU) workloads may also require or otherwise benefit from greater memory bandwidth and data bus sizes, depending on the size of the data structures on which they operate.
Graphics applications, for example, tend to perform texturing operations or other effects on many pixels of a polygon or polygons in parallel to render a three-dimensional (3D) graphics scene. The size of some textures, or of other large data structures, may create a need for high bandwidth between one or more processors and one or more memory storage areas (e.g., DRAM) so that this data can be retrieved and stored quickly. Some prior art techniques have attempted to provide greater memory bandwidth by increasing the number of pins or bus traces from one or more processors or processing cores to one or more memories. Increasing interconnect widths, such as the off-package bus width, to increase bandwidth can adversely affect system cost and can constrain the applicability of the system to more general-purpose computing systems.
In some prior art techniques, memory bandwidth can be increased by increasing the bandwidth of each data pin (i.e., by increasing its switching frequency) and/or by adding more data pins to the package. However, there are practical (e.g., economic) limits to increasing bandwidth through increasing bus width (e.g., by adding more pins) and/or increasing bus frequency.
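The relationship between pin count, per-pin switching rate, and peak bandwidth can be illustrated with a short calculation. The following is a minimal sketch only: the function name and the example figures (a 64-pin bus at 1.6 billion transfers per second per pin) are assumptions chosen for illustration and are not taken from any particular system. It assumes each data pin carries one bit per transfer.

```python
def peak_bandwidth_gb_per_s(data_pins: int, transfers_per_second_per_pin: float) -> float:
    """Peak bandwidth in GB/s, assuming each data pin carries one bit per transfer."""
    bits_per_second = data_pins * transfers_per_second_per_pin
    return bits_per_second / 8 / 1e9  # bits -> bytes -> gigabytes


# Hypothetical baseline: 64 data pins, 1.6e9 transfers per second per pin.
baseline = peak_bandwidth_gb_per_s(64, 1.6e9)

# Doubling the pin count (a wider bus) doubles the peak bandwidth.
wider = peak_bandwidth_gb_per_s(128, 1.6e9)

# Doubling the per-pin switching rate has the same effect.
faster = peak_bandwidth_gb_per_s(64, 3.2e9)

print(baseline, wider, faster)
```

Under these assumptions, either approach doubles the peak figure, which is why both run into the practical limits noted above: more pins raise package and board cost, and higher frequencies are harder to sustain.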
To further increase system bandwidth, some prior art techniques may use multiple processors with a corresponding memory allocated to each processor. This creates a pairing between each processor and its allocated memory, which are typically interconnected by a high bandwidth bus. Processor/memory pairs may then be connected to each other by another bus, which may require additional pins but may not have the bandwidth to support sharing of data fetched by each processor from its corresponding memory. Because of the difficulty of sharing information accessed by one processor from one memory with another processor in an expedient manner, applications may attempt to partition their work between the processor/memory pairs. Partitioning an application can present a significant burden to application developers, as they need to make sure they are storing and accessing data within the proper processor/memory pair to avoid significant latency. Placing constraints on applications, such as code/data partitioning, can increase application development costs, inhibit portability, and prevent these applications from being more successful in the marketplace.
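The partitioning burden described above can be sketched with a toy cost model. Everything in this example is hypothetical: the relative costs, the block-partitioning scheme, and the function names are illustrative assumptions, not a description of any particular prior art system. The sketch compares an access pattern that respects the processor/memory pairing with one that ignores it.

```python
LOCAL_COST = 1    # hypothetical relative cost of accessing the paired memory
REMOTE_COST = 10  # hypothetical relative cost of crossing the inter-pair bus


def home_pair(item_index: int, num_items: int, num_pairs: int) -> int:
    """Block partitioning: contiguous chunks of items are stored in each pair's memory."""
    chunk = -(-num_items // num_pairs)  # ceiling division
    return item_index // chunk


def total_cost(accesses, num_items: int, num_pairs: int) -> int:
    """Sum the cost of a list of (processor, item_index) accesses."""
    cost = 0
    for processor, item in accesses:
        is_local = home_pair(item, num_items, num_pairs) == processor
        cost += LOCAL_COST if is_local else REMOTE_COST
    return cost


num_items, num_pairs = 8, 2

# Partitioned: each processor touches only items stored in its own memory.
partitioned = [(home_pair(i, num_items, num_pairs), i) for i in range(num_items)]

# Unpartitioned: items are handed to processors round-robin, ignoring placement.
unpartitioned = [(i % num_pairs, i) for i in range(num_items)]

print(total_cost(partitioned, num_items, num_pairs))    # 8
print(total_cost(unpartitioned, num_items, num_pairs))  # 44
```

In this model the developer must know where each item lives to reach the lower cost, which is the constraint on code and data placement that the paragraph above identifies.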