The present application relates generally to an improved data processing apparatus and method and more specifically to an apparatus and method to improve multiple smaller direct memory access transfer efficiency for programming on a Cell Broadband Engine™ processor.
The Cell Broadband Engine™ architecture combines a general-purpose Power Architecture® core of modest performance with streamlined co-processing elements that greatly accelerate multimedia and vector processing applications, for example. The co-processing elements, referred to as synergistic processing elements (SPEs), greatly accelerate many forms of dedicated computation. The Cell Broadband Engine™ (Cell/B.E.™) processor is a single-chip multi-core architecture that provides low cost and high performance for a wide range of applications, from computer games to many compute-intensive applications in various application domains, such as seismic data processing, bioinformatics, graph processing, etc. CELL BROADBAND ENGINE and CELL/B.E. are trademarks of Sony Computer Entertainment, Inc., in the United States, other countries, or both. POWER ARCHITECTURE is a registered trademark of IBM Corporation in the United States, other countries, or both.
One feature provided by the Cell/B.E.™ architecture is the use of, for example, eight synergistic processing elements (SPEs), each of which uses its own local store (LS) rather than a cache. This requires Cell/B.E.™ programmers to explicitly use direct memory access (DMA) instructions to transfer data and/or results between the main shared memory and the local stores. This feature gives Cell/B.E.™ programmers more flexibility, but also more complexity.
Cell/B.E.™ programmers must therefore consider how efficiently the LSs are used, since this greatly affects an application's actual performance. Because each LS has limited space, it is common to perform frequent DMA transfers back and forth between main memory and the LS. Therefore, the strategy for performing DMA transfers may be critical to an application's performance.
One common technique currently used is called “double buffering.” The basic idea of double buffering is to hide the DMA transfer latency by switching the computation between two buffers, performing a DMA transfer on one buffer while performing computation on the other. For compute-bounded applications, this technique can effectively hide DMA transfer latency.
Nonetheless, many applications are memory-bounded, and in those cases the performance of the application is limited not by the computing power of each SPE but by the DMA transfer bandwidth. Thus, any time wasted on DMA transfers may negatively affect the application's performance. To get better performance and efficiency, a programmer may try to perform as few DMA transfers as possible with as large a transfer size as possible. However, due to the nature of many computing algorithms, one may not be able to avoid multiple smaller DMA transfers, leaving double buffering as the only technique for improving performance.