The effective computational speed of a computer depends on both the speed of its central processing unit (“CPU”) and the speed at which data can be retrieved from external memory into the CPU. The phrase “memory wall” refers to the growing disparity between the computational speed and the memory retrieval speed of computers. This disparity arises in part from the limited communication bandwidth across chip boundaries. CPU speed has traditionally increased at a much greater rate than memory retrieval speed. Because of this trend, it is expected that memory latency may become an overwhelming bottleneck in computer performance.
The disparity is further increased because many computers now use multiple core architectures with heterogeneous computational units (e.g., graphics processing units). Because of spatial parallelism, the use of multiple cores enables continued improvement in peak floating-point operations per second (“flops”), which is a measure of computational speed. To mitigate the increasing disparity between computational speed and memory retrieval speed, multiple core architectures incorporate deep cache hierarchies to increase the likelihood that memory accesses by an application will be satisfied from a cache. Many memory-intensive applications, however, might not benefit from cache hierarchies because they have little spatial or temporal access locality. In addition, because caching techniques retrieve complete cache lines, memory bandwidth and power are wasted retrieving data that is never used. For example, if a cache line is 64 bytes and an application uses only 8 bytes of each cache line, the caching technique retrieves eight times as much data as the application actually uses. For such applications, the computational units also often sit idle waiting for the next cache line of data to be retrieved. For example, the popular PageRank algorithm represents a graph, in which nodes are web pages and edges are links between web pages, as a sparse matrix. The algorithm accesses random locations within that matrix and within a vector that has an entry for each web page. Although direct memory access (“DMA”) and scatter/gather hardware integrated with the CPU can help gather the needed data, all of the data still needs to be retrieved from memory. The memory wall thus has the potential to severely limit the ability to analyze expanding data volumes.
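As a minimal illustration of the access pattern described above, the following Python sketch performs one PageRank power-iteration step over a link graph stored in compressed sparse row (“CSR”) form. The variable names, the damping factor, and the toy three-page graph are hypothetical choices for this sketch, not part of any particular implementation; the point is that the lookups `rank[src]` are data-dependent, effectively random accesses that defeat caching.

```python
# Sketch of one PageRank power-iteration step over a CSR sparse matrix.
# All names (row_ptr, col_idx, out_degree, rank) are illustrative.

DAMPING = 0.85  # conventional PageRank damping factor

def pagerank_step(row_ptr, col_idx, out_degree, rank):
    """One iteration: new_rank[dst] = (1-d)/N + d * sum over in-links of dst."""
    n = len(rank)
    new_rank = [(1.0 - DAMPING) / n] * n
    for dst in range(n):                      # destination rows are scanned sequentially...
        for k in range(row_ptr[dst], row_ptr[dst + 1]):
            src = col_idx[k]                  # ...but these source indices are irregular.
            # rank[src] is an effectively random access: a cache would fetch
            # a full cache line here to use a single vector entry.
            new_rank[dst] += DAMPING * rank[src] / out_degree[src]
    return new_rank

# Toy 3-page graph: page 0 -> 1, page 1 -> 2, page 2 -> 0 and 2 -> 1.
row_ptr    = [0, 1, 3, 4]          # in-link ranges per destination page
col_idx    = [2, 0, 2, 1]          # source pages of those in-links
out_degree = [1, 1, 2]             # out-links per source page
rank = [1 / 3, 1 / 3, 1 / 3]
rank = pagerank_step(row_ptr, col_idx, out_degree, rank)
```

Because `col_idx` reflects the arbitrary link structure of the web, no reordering of the outer loop makes the inner accesses contiguous, which is why such workloads gain little from deep cache hierarchies.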
To help overcome the memory wall, designs have been proposed that integrate processing logic with memory. Some of these designs were based on integrating the processing logic into the fabrication process of dynamic random access memory (“DRAM”) cells. Because of the cost of integrating processing logic with DRAM cells, such designs have proved commercially unfeasible. More recently, other designs have been proposed that place the processing logic of a CPU in a separate logic layer of a 3D-memory package, so that computations of a host CPU can be offloaded to the 3D-memory package. The fabrication of logic layers containing the processing logic of a CPU, however, is both complex and expensive.