With the improvements in process technology and computer architecture, a higher number of processing elements is being packed into a single chip in the form of more cores in a multicore, more LUTs in an FPGA, or more special-purpose blocks in an ASIC. As a result of this increase in computational capacity, coupled with highly parallel processing, the pressure on the memory system is also increased. However, memory system performance does not scale as well as the processing elements and is becoming the major bottleneck in application-specific hardware performance as well as general-purpose computing system performance. As a result, there is a need for scalable memory systems that can provide lower access latency and higher bandwidth within a power budget, in order to keep up with the demands of increased parallelism.
There are various causes of the scalability problem of memories. First, larger memories suffer larger access latencies and are more power hungry, due to their larger wire capacitances. Second, having only one memory (or a small number of memories) limits the maximum bandwidth provided by the memory system and the achievable parallelism. Third, as the number of ports to a memory is increased, its access latencies and power consumption are also increased.
In the presence of multiple parallel memory accesses (as in the case of a large general-purpose multicore computer), these causes for non-scalability of memories can be addressed with the typical approach of having a single, unified main memory, as well as a number of cache memories between the memory and the hardware modules. These caches comprise only some entries of the memory, and therefore, are smaller. Their smaller size makes them faster and multiple caches can provide higher bandwidth than the memory. However, an entry in the memory can exist in multiple caches simultaneously and the contents of all copies of a memory location must be consistent. In other words, these caches must be kept coherent.
With existing technologies, coherence can be imposed by means of a coherence protocol that requires messages to be transferred between the caches. These messages are delivered on a network that connects all caches that need to be kept coherent. Some of the most frequently used cache coherence mechanisms are snoopy (FIG. 1) and directory-based (FIG. 2) cache coherence mechanisms [10].
In the snoopy cache coherence mechanism, a memory request (101) emanating from a load or store instruction executed within a hardware module is first looked up in the cache directly connected to the requesting module (102). If this cache generates a miss, it notifies the other caches (103-106) over a coherence network (107). The other caches, continuously snooping on the interconnection network, detect this notification and try to serve the request. If all caches in the coherence group (108) indicate a miss, then the request is delivered to the next level of memory through a memory switch (109). This switch connects the caches with the input ports of the next level (110) in the memory hierarchy.
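The lookup order described for FIG. 1 can be illustrated with a minimal sketch. The names `SnoopyCache` and `snoopy_load` are hypothetical and chosen only for illustration; the sketch models loads only and omits invalidation, write handling, and timing.

```python
from dataclasses import dataclass, field


@dataclass
class SnoopyCache:
    """One cache in the coherence group; `lines` maps address -> value."""
    lines: dict = field(default_factory=dict)

    def snoop(self, addr):
        # Answer a notification from a peer: the value on a hit, else None.
        return self.lines.get(addr)


def snoopy_load(requester, peers, next_level, addr):
    """Return (value, source) for a load issued to `requester`."""
    if addr in requester.lines:
        # Hit in the cache directly connected to the requesting module.
        return requester.lines[addr], "local"
    for peer in peers:
        # Miss: notify the other caches over the coherence network.
        value = peer.snoop(addr)
        if value is not None:
            requester.lines[addr] = value
            return value, "peer"
    # Every cache in the coherence group missed: go to the next level
    # of memory (modeled here as a plain dict) through the memory switch.
    value = next_level[addr]
    requester.lines[addr] = value
    return value, "next_level"
```

In this sketch a first load of an address is served by the next level, a repeated load by the local cache, and a load of an address held only by another cache is served by that peer.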
In the directory-based coherence mechanism, a miss on a cache (201) is directed to a coherence network (202) to be delivered to the corresponding directory (203). A directory entry for a cache block contains the list of caches in which that cache block exists. If the directory indicates that the block does not exist in any cache, a miss is generated. The miss is delivered from the directory to the next level memory (204) over a memory switch (205). If the cache block exists in some other cache, that cache forwards the block over the coherence network.
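The directory lookup of FIG. 2 can be sketched in the same illustrative style. The function `directory_load` and its data layout (a list of per-cache dicts and a dict from address to the set of sharer indices) are assumptions made for this example; stores, invalidations, and evictions are not modeled.

```python
def directory_load(caches, directory, next_level, requester, addr):
    """Serve a load for cache index `requester`; return (value, source).

    caches:     list of dicts, each mapping address -> value
    directory:  dict mapping address -> set of indices of caches holding it
    next_level: dict modeling the next level of memory
    """
    if addr in caches[requester]:
        return caches[requester][addr], "local"
    # Miss: the request travels over the coherence network to the directory.
    sharers = directory.setdefault(addr, set())
    if sharers:
        # Some other cache holds the block and forwards it.
        owner = next(iter(sharers))
        value, source = caches[owner][addr], "peer"
    else:
        # No cache holds the block: the directory sends the miss onward.
        value, source = next_level[addr], "next_level"
    caches[requester][addr] = value
    sharers.add(requester)  # record the new sharer in the directory entry
    return value, source
```

Unlike the snoopy sketch, only the directory entry is consulted on a miss, so no broadcast to every cache is needed.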
FIG. 1 and FIG. 2 show only two particular implementations of caches. These caches have only one input port per cache and have only one bank (they are one-way interleaved). In general, a cache can be shared by multiple hardware modules, each having its own port to the cache. (If two ports of a memory can be proven not to be accessed simultaneously, then these two ports can be merged into a single port. However, this is an exceptional case, and therefore, this technique does not solve the scalability problem of memory systems.) Furthermore, a cache can comprise multiple interleaved banks that can operate in parallel. A multi-bank interleaved cache shared by five modules (301) is depicted in FIG. 3. In this shared cache, ports (302) are connected to the banks (303) using a shared port-to-bank network (304). There is an internal memory switch (305) that transfers the bank misses to the external memory switch between the cache and the next level memory.
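As an illustration of how a port-to-bank network might steer requests, the sketch below assigns each request to a bank by low-order line-address interleaving. The interleaving function, the 32-byte line size, and the name `route_to_banks` are assumptions for this example; the text above does not specify them.

```python
from collections import defaultdict


def route_to_banks(requests, num_banks, line_bytes=32):
    """Group (port, address) requests by the bank that would serve them.

    Assumes consecutive cache lines are assigned to consecutive banks.
    """
    per_bank = defaultdict(list)
    for port, addr in requests:
        bank = (addr // line_bytes) % num_banks
        per_bank[bank].append((port, addr))
    return dict(per_bank)
```

Requests that land in different banks can proceed in parallel, while requests that land in the same bank must be serialized by the network.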
Although using distributed memory modules improves system performance, it still does not entirely solve the scalability problem. Specifically, the memories need to be kept coherent by means of high connectivity coherence networks. Such networks are not scalable and increasing the number of input and output ports of these networks greatly increases the access latencies.
Barua et al. [1] proposed a compiler-managed memory system for a specific processor architecture that consists of multiple memory modules, called tiles, which can operate independently. In this work, the compiler is exposed to the whole memory system and decides on the actual layout of data in these tiles. After static analysis of the program, the compiler finds sets of objects that can be accessed independently and places them in different tiles. Using the same technique, Babb et al. [2] showed how application-specific integrated circuits with distributed memories can be compiled from a sequential program specification.
Both of these works aim to find a partitioning of a single, large memory into multiple, smaller memories. This partitioning is computed only once by analyzing the whole program and remains fixed throughout the whole program. However, programs consist of multiple scopes (e.g., procedures, loops, begin-end blocks) and different program scopes can have different optimum partitionings. For example, a single memory can be split into multiple smaller memories only during the execution of a program region (such as a loop iteration), where the multiple small memories allow parallel simultaneous access with lower power and better latency. The method in this invention can partition memory at different program scopes and construct a multi-level memory partitioning whose shape can change dynamically at run-time.
Furthermore, the compilers in the aforementioned work either generate a distributed software program or an application-specific hardware circuit, neither of which has a requirement for an address to be translated between the software domain and the hardware accelerator domain. An accelerator hardware circuit, on the other hand, requires coherence not only within the software and within the hardware domains, but also across the two domains. For the case of hardware accelerators compiled from a software code fragment, the final effect of the hardware accelerator on memory must be functionally 100% compatible with the software code fragment that the accelerator replaces. Therefore, the single address space view of software must be preserved while accessing the memories in hardware. The method in this invention preserves the original view of the program address space. All extracted memories in hardware are accessed without changing the corresponding addresses in software.
Other works in the literature on memory partitioning focused on logical partitioning and targeted general-purpose computing systems. Coulson et al. [3] described partitioning of magnetic disk storage cache for efficient cache memory utilization under varying system demands. In [4], memory is partitioned into sections such that sections that are not likely to be used in the near future can be turned off, in order to avoid unnecessary refreshing. Wisler et al. [5] present partitioning of memory shared by multiple processors, so that each processor has exclusive access to an associated memory partition. A cache manager that dynamically partitions cache storage across processes, using a modified steepest descent method driven by cache performance, is given in [6]. Olarig et al. [7] present dynamic adjustment of private cache sizes in a cache system by moving cache segments from one private cache to another. All of these works describe methods to either modify cache sizes or reserve memory sections for specific processes, rather than modifying the actual underlying cache hierarchy. Blumrich [8] invented a method to dynamically partition a shared cache to obtain private caches when sharing is no longer required. Although this provides isolation between applications, these partitions are still part of the same physical module and suffer access latencies similar to those of the baseline shared cache. Moreover, these isolated caches are accessed with separate address spaces. Also, none of the above-mentioned techniques is based on compiler analysis.
Multiple memories, which are accessed with the original addresses of a sequential program as in our invention, are also related to the concept of multiple address spaces. There has been historical work on computer architectures with multiple address spaces, for overcoming address space size limitations and for achieving enhanced security. The IBM Enterprise System Architecture/370 [14] is one such example. However, unlike these prior works, the method of the present invention automatically creates a new program using multiple address spaces, starting from a sequential program running on a single address space, through compiler analysis. This second program using multiple address spaces can in turn be converted into a custom hardware accelerator functionally identical to the original sequential program. The multiple address spaces in the present invention help achieve enhanced parallelism and improved memory coherence hardware, and have a hierarchical organization. These features were not present in earlier works on multiple address spaces.
The sub-block placement technique for reducing cache traffic [16] introduced multiple valid bits in a cache block, similar to our third optimization (a technique which also adds multiple valid bits to a cache block, to be described in the preferred embodiment section). However, unlike prior work, the combination of dirty and valid bits in the caches described in the present invention (1) ensures that write misses never cause a block to be read from the next level cache and (2) simultaneously avoids the false sharing error.
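The two properties can be illustrated with a minimal sketch that assumes one valid bit and one dirty bit per byte of a cache block. This granularity, the class name `SubBlockLine`, and its methods are assumptions for illustration only; the actual mechanism is the one described in the preferred embodiment section. Reads of bytes that are not yet valid are not modeled.

```python
class SubBlockLine:
    """A cache block with per-byte valid and dirty bits (assumed granularity)."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.valid = [False] * size
        self.dirty = [False] * size

    def write(self, offset, value):
        # A write miss allocates the byte in place: nothing is fetched
        # from the next level, because only this byte becomes valid.
        self.data[offset] = value
        self.valid[offset] = True
        self.dirty[offset] = True

    def write_back(self, memory, base):
        # Only dirty bytes are copied out, so bytes of the same block
        # written through another cache are not overwritten.
        for i, is_dirty in enumerate(self.dirty):
            if is_dirty:
                memory[base + i] = self.data[i]
                self.dirty[i] = False
```

If two caches each write a different byte of the same block and then write back, both updates survive in the next level, which is the false sharing error being avoided in this sketch.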
The Address Resolution Buffer of the multiscalar architecture [17] checks if a second speculatively executed thread has loaded a memory location before a logically preceding first thread has stored into the same location. Our eleventh optimization (speculative separation of memories) also monitors overlapping accesses to one or more memories. However, the problem solved by the present invention is different: there is no first and second thread. The present invention checks at runtime whether the speculative assumption that the address spaces were disjoint was correct, and does so for more than two address spaces and also for hierarchically organized address spaces.
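The nature of this runtime check can be conveyed by a simplified sketch that compares the sets of addresses touched through each speculatively separated memory. The function `overlapping_spaces` and its input format are hypothetical; the sketch handles a flat set of address spaces only and does not model the hierarchical case or the hardware that records the accesses.

```python
from itertools import combinations


def overlapping_spaces(touched):
    """Report pairs of address spaces whose accessed addresses overlap.

    touched: dict mapping a memory name -> set of addresses accessed
             through that memory during the speculative region.
    Returns a list of (name_a, name_b, sorted_common_addresses).
    """
    conflicts = []
    for a, b in combinations(sorted(touched), 2):
        common = touched[a] & touched[b]
        if common:
            # The assumption that a and b were disjoint was wrong.
            conflicts.append((a, b, sorted(common)))
    return conflicts
```

An empty result means the disjointness assumption held for the region; any reported pair indicates a misspeculation that would have to be handled.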
Prior works in the literature on compilers have created methods for dependence analysis [12], which is used to determine if two load/store instructions can refer to the same location. Dependence analysis is not our invention, although dependence analysis is used as a component of our invention.