A multi-core computing system typically includes some combination of shared memory units, accessible by all cores, and/or local memory units, associated with individual cores. Most of the cores, although not necessarily all, access these memory units using direct memory access (DMA). Cores may access their local memory units directly, and/or some cores may have direct access to the shared memory units. Further, there may be a different path between the unshared memory units (e.g., hand-carried coherence).
In high-performance computing (HPC) applications, particularly mathematical libraries such as those for linear algebra and Fast Fourier Transforms (FFTs), automatic code generation techniques have been widely used. Such techniques typically utilize code generators that search a large parameter space to determine the set of parameters (e.g., the loop-unrolling factor, the block sizes/sub-problem sizes to use, etc.) that provides optimal performance for a given underlying platform.
One known optimization technique to reduce the parameter space searched by the code generator is to first determine the hardware parameters of the underlying architecture and then limit the search parameters based on these hardware parameters. As a specific example, once the cache size of a given platform is known, a matrix transpose code can limit the space of block sizes to transpose so that the loaded block resides in the cache. Unfortunately, however, these conventional techniques work offline and generate optimal code for fixed configurations. (The Fastest Fourier Transform in the West (FFTW), a C subroutine library, may work dynamically at run time, but that is only useful if the plan (i.e., outcome) is to be reused multiple times; otherwise it is more beneficial to store and reuse the plan rather than run it every time.) Moreover, these techniques do not take into account optimizations possible with regard to DMA operations (e.g., they do not search the DMA parameter space).
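The cache-based pruning described above can be illustrated with a minimal sketch. It assumes that only a single square block must fit in the cache and that candidate block sizes are powers of two; the function names, the `max_block` bound, and the cost function are hypothetical and are not taken from any particular code generator.

```python
def candidate_block_sizes(cache_bytes, elem_bytes, max_block=1024):
    """Return power-of-two block edge lengths b whose b x b block fits in the cache.

    Assumption: only the loaded block needs to be resident, so the
    constraint is b * b * elem_bytes <= cache_bytes.
    """
    sizes = []
    b = 1
    while b <= max_block:
        if b * b * elem_bytes <= cache_bytes:
            sizes.append(b)
        b *= 2
    return sizes


def pruned_search(cache_bytes, elem_bytes, cost_fn, max_block=1024):
    """Evaluate cost_fn only on block sizes that satisfy the cache constraint."""
    candidates = candidate_block_sizes(cache_bytes, elem_bytes, max_block)
    if not candidates:
        raise ValueError("no block size fits in the given cache")
    # cost_fn stands in for timing the generated code at each block size.
    return min(candidates, key=cost_fn)
```

For instance, with a 32 KiB cache and 8-byte elements, the search is limited to block edge lengths of at most 64, so the code generator evaluates seven candidates rather than every size up to `max_block`.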
DMA operations can have a significant impact on the performance of applications. Some of the issues involved include the following:
DMA operations tend to have high latencies, discouraging working iteratively on small blocks/vectors.
Performance of DMA lists is often not as good as that of contiguous DMA. Therefore, in certain cases, it is beneficial to perform contiguous DMA operations, even if that means fetching unwanted data.
Performance of DMA lists often degrades with decreasing size of each list operation, which discourages working on small blocks.
Interactions between DMA requests originating from different processing cores often have a degrading effect on the performance of the system (both locally and globally).
In single-ported local memory units, the DMA operations can undesirably interfere with the computation, thereby impacting the performance of the algorithmic task(s) being performed (for instance, the core could starve for instructions if DMA is given higher priority than the core in accessing the local memory unit).
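The trade-off between DMA lists and contiguous DMA noted above can be sketched with a toy linear cost model. This is an illustration under stated assumptions, not a model of any real DMA engine: the fixed setup latency, the per-element list overhead, the bandwidth, and all function names are hypothetical, and a real system would obtain these values by measurement.

```python
def contiguous_cost(total_bytes, setup_latency, bandwidth):
    """Estimated cost of one contiguous transfer (hypothetical linear model)."""
    return setup_latency + total_bytes / bandwidth


def list_cost(n_elements, element_bytes, setup_latency,
              per_element_overhead, bandwidth):
    """Estimated cost of a DMA list; each list element pays a fixed overhead."""
    per_element = per_element_overhead + element_bytes / bandwidth
    return setup_latency + n_elements * per_element


def prefer_contiguous(n_elements, element_bytes, stride_bytes,
                      setup_latency, per_element_overhead, bandwidth):
    """Return True if fetching the whole strided span (including unwanted
    data) is estimated to be cheaper than a DMA list of the wanted pieces."""
    # Bytes covered from the first wanted byte to the last wanted byte.
    span_bytes = (n_elements - 1) * stride_bytes + element_bytes
    return (contiguous_cost(span_bytes, setup_latency, bandwidth)
            < list_cost(n_elements, element_bytes, setup_latency,
                        per_element_overhead, bandwidth))
```

Under this model, small list elements with a short stride favor a single contiguous transfer despite the unwanted data, whereas large elements separated by a long stride favor the DMA list.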
Accordingly, there exists a need for techniques for evaluating the performance of algorithmic tasks that use DMA for data transfer that do not suffer from one or more of the limitations exhibited by conventional approaches.