The present application relates to high-performance multi-purpose computer processors, and in particular, to processors that include internal memory controllers that control accesses to external memory chips.
The number of processor cores in a multi-core processor has increased rapidly in recent years. The computing performance of multi-core processors is becoming increasingly limited by the bandwidth of their accesses to external memory chips. Let P denote the overall bandwidth (in GB/s) of the data and address pins of the chip, D the data bandwidth (in GB/s), and A the bandwidth of address information (in GB/s), so that P=D+A. Given a fixed total number of data and address pins (i.e., constant P) and a fixed clock rate, the D-to-A ratio is fixed in conventional processor architectures.
In some designs, data and address buses (and pins) are separated. If there is only a single memory controller, most pins are used for transferring data, and D is almost equal to P. For example, if there are 256 data pins connected to DDR3 chips (with consecutive data transmissions of burst length 8), each access transfers 256 bits x 8 = 2048 bits, so the ideal granularity of memory accesses is 256 bytes. Shorter memory accesses will not fully utilize the bandwidth. Most applications in cloud computing and transaction processing use 4 to 8 bytes (integers and floats) per operation. For a computing task with mostly 8-byte accesses, data utilization is only about 3% (8 bytes/256 bytes) for such a single-memory-controller configuration.
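The arithmetic in the preceding example can be checked with a minimal sketch. The function names are hypothetical and introduced only for illustration; the sketch assumes that every access occupies one full burst across all data pins, as in the example above.

```python
def access_granularity_bytes(data_pins: int, burst_length: int) -> int:
    """Bytes moved by one full burst across all data pins (8 bits per byte)."""
    return data_pins * burst_length // 8


def data_utilization(access_bytes: int, granularity_bytes: int) -> float:
    """Fraction of one burst that carries data the task actually requested."""
    return min(access_bytes, granularity_bytes) / granularity_bytes


# Values taken from the example: 256 data pins, DDR3 burst length 8,
# and a task dominated by 8-byte accesses.
granularity = access_granularity_bytes(data_pins=256, burst_length=8)
utilization = data_utilization(access_bytes=8, granularity_bytes=granularity)

print(granularity)   # 256 bytes per access
print(utilization)   # 0.03125, i.e. about 3%
```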
Adding memory controllers with independent address buses decreases the above-described memory-access granularity, but it requires more address pins and therefore reduces the data bandwidth D. The D-to-A ratio remains fixed. For example, 32 memory controllers are needed to reduce the 256-byte granularity to 8 bytes per memory channel. Addressing GB-scale memory requires up to 32 bits per address. The theoretical limit for D is therefore 67% of P (64 bits/(64 bits+32 bits)=67%). In reality, when the timing for addressing in typical DDR3 is considered, the actual data bandwidth D for computing tasks is below 50% of P.
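The figures in this example can be reproduced with a short sketch. The function names are hypothetical, and the model is deliberately simplified: it assumes each 8-byte (64-bit) access is accompanied by one 32-bit address and ignores DDR3 addressing timing, which is why it yields the 67% theoretical limit rather than the sub-50% figure seen in practice.

```python
def controllers_needed(total_granularity_bytes: int, target_bytes: int) -> int:
    """Number of independent channels needed to reach the target granularity."""
    return total_granularity_bytes // target_bytes


def data_fraction(data_bits: int, address_bits: int) -> float:
    """Upper bound on D/P when every access carries data_bits and address_bits."""
    return data_bits / (data_bits + address_bits)


# Values taken from the example: 256-byte granularity split into 8-byte channels,
# 64 data bits per access, 32 address bits per access.
controllers = controllers_needed(total_granularity_bytes=256, target_bytes=8)
limit = data_fraction(data_bits=64, address_bits=32)

print(controllers)      # 32
print(round(limit, 2))  # 0.67
```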
In another design that is commonly seen in low-end processors for embedded systems, the data and address buses are shared and reused alternately for data or addressing purposes over time. This type of processor has the benefit of a simple packaging configuration, with the total pin count close to the number needed for addressing only. However, since the D-to-A ratio is also fixed, this type of design has the same drawbacks as the previously described designs: it cannot provide both high addressing performance and high data performance.
Other techniques have attempted to enhance the utilization of bandwidth by exploiting data locality, but have achieved only limited improvements. These techniques assume fixed memory-access granularity and a constant D/A ratio. They depend on cache hierarchies to store prefetched or to-be-reloaded data. However, cache is not scalable in multi-core processors: more processor cores in a multi-core processor result in a smaller cache per processor core, which decreases the hit rate. Moreover, during the short lifespan of a cache line, it is unlikely that the cache line acquired by one processor core happens to be requested by another processor core. As a result, caching becomes less effective as the number of processor cores increases. In a General Purpose Graphics Processing Unit (GPGPU), every 32 threads are grouped into a warp to execute vector instructions in a Single-Instruction Multiple-Data (SIMD) fashion. This enables, within a single task, alternate accesses to different memory addresses among multiple threads, without explicit data exchanges among threads in the source code, thus simplifying programming. But this approach is only effective for large data blocks, each associated with contiguous addresses (for GPGPU, typically 32 bytes). In practice, many processing tasks have data blocks of less than 8 bytes with scattered memory addresses, which leads to very low data bus utilization. Consequently, GPGPU technology can rarely double the performance of graph computing (such as finding the shortest path between two points), even though it can speed up scientific computing by tens of times. In conclusion, if the D/A ratio is high at the processor-memory interface, data exchange within the multi-core processor can enhance data utilization and simplify coding, but it cannot solve the problem of low memory bandwidth utilization in the presence of scattered access patterns.
There is therefore an urgent need to improve the computing performance of multi-core processors across different types of computing applications.