Processor designs are moving toward multiple core architectures, in which more than one core (processor) is implemented on a single chip. Multiple core architectures provide increased computing power while requiring less space and less power. Multiple core architectures are particularly useful for pipelining instructions in applications that require high processing speeds, such as packet processing in networks, which may need to sustain data rates of up to 10 gigabits per second. The instructions may be pipelined, for example, into stages where each stage is supported by a different processor or processor core.
The performance of pipelined computations as a whole can be no faster than the slowest of the pipeline stages. For this reason, when pipelining instructions, compilers attempt to balance instructions among stages as evenly as possible. It is common for compilers to partition instructions between stages based upon the compute cycles required for executing instructions. This technique may be effective in some instances. However, when the instructions include a large number of memory accesses, the latency required for completing some memory accesses may produce additional undesired delay that is not accounted for by the compilers. For example, while the latency of two independent memory accesses may be overlapped with each other, instructions that depend on the completion of a particular memory access operation cannot be executed until the memory access is completed. Hence, instructions with dependencies on memory access operations cannot be overlapped with the latency of the memory access.
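The difference between counting compute cycles and accounting for memory latency can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from any particular compiler: the `Instr` record, the in-order single-issue timing model, and the 100-cycle memory latency are all hypothetical. Two stages are given the same number of compute cycles, but one contains two independent loads and the other contains a load whose address depends on an earlier load.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    cycles: int        # cycles needed to issue/execute the instruction
    latency: int = 0   # extra memory latency after issue (0 if not a memory access)
    deps: tuple = ()   # names of earlier instructions whose results are needed


def compute_cycles(stage):
    """Cost model that counts compute cycles only."""
    return sum(ins.cycles for ins in stage)


def latency_aware_cycles(stage):
    """Cost model for an assumed in-order, single-issue core.

    An instruction issues once the previous instruction has issued and all
    of its dependencies have produced their results. Independent memory
    accesses therefore overlap; dependent instructions wait.
    """
    ready = {}   # instruction name -> cycle at which its result is available
    clock = 0    # cycle at which the next instruction may issue
    for ins in stage:
        start = max([clock] + [ready.get(d, 0) for d in ins.deps])
        clock = start + ins.cycles
        ready[ins.name] = clock + ins.latency
    return max([clock] + list(ready.values()))


MEM = 100  # assumed memory latency in cycles (illustrative value only)

# Stage A: two independent loads, so their latencies overlap.
stage_a = [
    Instr("load_x", 1, MEM),
    Instr("load_y", 1, MEM),
    Instr("add", 1, deps=("load_x", "load_y")),
]

# Stage B: the second load needs the result of the first, so nothing overlaps.
stage_b = [
    Instr("load_p", 1, MEM),
    Instr("load_q", 1, MEM, deps=("load_p",)),
    Instr("use", 1, deps=("load_q",)),
]

# The pipeline can be no faster than its slowest stage.
bottleneck = max(latency_aware_cycles(stage_a), latency_aware_cycles(stage_b))
```

Under the compute-cycle model both stages cost 3 cycles and appear perfectly balanced. Under the latency-aware model stage A takes 103 cycles and stage B takes 203, so stage B limits the whole pipeline even though the partition looked even.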
Thus, what is needed is a method and apparatus for partitioning programs so that memory latency is balanced among pipeline stages.