The popularity of computing systems continues to grow and the demand for improved processing architectures thus likewise continues to grow. Ever-increasing desires for improved computing performance/efficiency has led to various improved processor architectures. For example, multi-core processors are becoming more prevalent in the computing industry and are being used in various computing devices, such as servers, personal computers (PCs), laptop computers, personal digital assistants (PDAs), wireless telephones, and so on.
In the past, processors such as CPUs (central processing units) featured a single execution unit to process instructions of a program. More recently, computer systems are being developed with multiple processors in an attempt to improve the computing performance of the system. In some instances, multiple independent processors may be implemented in a system. In other instances, a multi-core architecture may be employed, in which multiple processor cores are amassed on a single integrated silicon die. Each of the multiple processors (e.g., processor cores) can simultaneously execute program instructions. This parallel operation of the multiple processors can improve performance of a variety of applications.
A multi-core CPU combines two or more independent cores into a single package comprised of a single piece silicon integrated circuit (IC), called a die. In some instances, a multi-core CPU may comprise two or more dies packaged together. A dual-core device contains two independent microprocessors and a quad-core device contains four microprocessors. Cores in a multi-core device may share a single coherent cache at the highest on-device cache level (e.g., L2 for the Intel® Core 2) or may have separate caches (e.g. current AMD® dual-core processors). The processors also share the same interconnect to the rest of the system. Each “core” may independently implement optimizations such as superscalar execution, pipelining, and multithreading. A system with N cores is typically most effective when it is presented with N or more threads concurrently.
One processor architecture that has been developed utilizes multiple processors (e.g., multiple cores), which are homogeneous. As discussed hereafter, the processors are homogeneous in that they are all implemented with the same fixed instruction sets (e.g., Intel's x86 instruction set, AMD's Opteron instruction set, etc.). Further, the homogeneous processors access memory in a common way, such as all of the processors being cache-line oriented such that they access a cache block (or “cache line”) of memory at a time, as discussed further below.
In general, a processor's instruction set refers to a list of all instructions, and all their variations, that the processor can execute. Such instructions may include, as examples, arithmetic instructions, such as ADD and SUBTRACT; logic instructions, such as AND, OR, and NOT; data instructions, such as MOVE, INPUT, OUTPUT, LOAD, and STORE; and control flow instructions, such as GOTO, if X then GOTO, CALL, and RETURN. Examples of well-known instruction sets include x86 (also known as IA-32), x86-64 (also known as AMD64 and Intel® 64), AMD's Opteron, VAX (Digital Equipment Corporation), IA-64 (Itanium), and PA-RISC (HP Precision Architecture).
Generally, the instruction set architecture is distinguished from the microarchitecture, which is the set of processor design techniques used to implement the instruction set. Computers with different microarchitectures can share a common instruction set. For example, the Intel® Pentium and the AMD® Athlon implement nearly identical versions of the x86 instruction set, but have radically different internal microarchitecture designs. In all these cases the instruction set (e.g., x86) is fixed by the manufacturer and directly hardware implemented, in a semiconductor technology, by the microarchitecture. Consequently, the instruction set is traditionally fixed for the lifetime of this implementation.
FIG. 1 shows a block-diagram representation of an exemplary prior art system 100 in which multiple homogeneous processors (or cores) are implemented. System 100 comprises two subsystems: 1) a main memory (physical memory) subsystem 101 and 2) a processing subsystem 102 (e.g., a multi-core die). System 100 includes a first microprocessor core 104A and a second microprocessor core 104B. In this example, microprocessor cores 104A and 104B are homogeneous in that they are each implemented to have the same, fixed instruction set, such as x86. In addition, each of the homogeneous microprocessor cores 104A and 104B access main memory 101 in a common way, such as via cache block accesses, as discussed hereafter. Further, in this example, cores 104A and 104B are implemented on a common die 102. Main memory 101 is communicatively connected to processing subsystem 102. Main memory 101 comprises a common physical address space that microprocessor cores 104A and 104B can each reference.
As shown further in FIG. 1, a cache 103 is also implemented on die 102. Cores 104A and 104B are each communicatively coupled to cache 103. As is well known, a cache generally is memory for storing a collection of data duplicating original values stored elsewhere (e.g., to main memory 101) or computed earlier, where the original data is expensive to fetch (due to longer access time) or to compute, compared to the cost of reading the cache. In other words, a cache 103 generally provides a temporary storage area where frequently accessed data can be stored for rapid access. Once the data is stored in cache 103, future use can be made by accessing the cached copy rather than re-fetching the original data from main memory 101, so that the average access time is shorter. In many systems, cache access times are approximately 50 times faster than similar accesses to main memory 101. Cache 103, therefore, helps expedite data access that the micro-cores 104A and 104B would otherwise have to fetch from main memory 101.
In many system architectures, each core 104A and 104B will have its own cache also, commonly called the “L1” cache, and cache 103 is commonly referred to as the “L2” cache. Unless expressly stated herein, cache 103 generally refers to any level of cache that may be implemented, and thus may encompass L1, L2, etc. Accordingly, while shown for ease of illustration as a single block that is accessed by both of cores 104A and 104B, cache 103 may include L1 cache that is implemented for each core.
In many system architectures, virtual addresses are utilized. In general, a virtual address is an address identifying a virtual (non-physical) entity. As is well-known in the art, virtual addresses may be utilized for accessing memory. Virtual memory is a mechanism that permits data that is located on a persistent storage medium (e.g., disk) to be referenced as if the data was located in physical memory. Translation tables, maintained by the operating system, are used to determine the location of the reference data (e.g., disk or main memory). Program instructions being executed by a processor may refer to a virtual memory address, which is translated into a physical address. To minimize the performance penalty of address translation, most modern CPUs include an on-chip Memory Management Unit (MMU), and maintain a table of recently used virtual-to-physical translations, called a Translation Look-aside Buffer (TLB). Addresses with entries in the TLB require no additional memory references (and therefore time) to translate. However, the TLB can only maintain a fixed number of mappings between virtual and physical addresses; when the needed translation is not resident in the TLB, action will have to be taken to load it in.
As an example, suppose a program's instruction stream that is being executed by a processor, say processor core 104A of FIG. 1, desires to load data from an address “Foo” into a first general-purpose register, GPR1. Such instruction may appear similar to “LD <Foo>, GRP1”. Foo, in this example, is a virtual address that the processor translates to a physical address, such as address “123456”. Thus, the actual physical address, which may be formatted according to a global physical memory address format, is used to access cache 103 and/or memory 101.
In operation, each of cores 104A and 104B reference main memory 101 by providing a physical memory address. The physical memory address (of data or “an operand” that is desired to be retrieved) is first inputted to cache 103. If the addressed data is not encached (i.e., not present in cache 103), the same physical address is presented to main memory 101 to retrieve the desired data.
In contemporary architectures, the processor cores 104A and 104B are cache-line (or “cache-block”) oriented, wherein a “cache block” is fetched from main memory 101 and loaded into cache 103. The terms cache line and cache block are used interchangeably herein. Rather than retrieving only the addressed data from main memory 101 for storage to cache 103, such cache-block oriented processors may retrieve a larger block of data for storage to cache 103. A cache block typically comprises a fixed-size amount of data that is independent of the actual size of the requested data. For example, in most implementations a cache block comprises 64 bytes of data that is fetched from main memory 101 and loaded into cache 103 independent of the actual size of the operand referenced by the requesting micro-core 104A/104B. Furthermore, the physical address of the cache block referenced and loaded is a block address. This means that all the cache block data is in sequentially contiguous physical memory. Table 1 below shows an example of a cache block.
TABLE 1Physical AddressOperandXXX(7)Operand 7XXX(6)Operand 6. . .. . .XXX(1)Operand 1XXX(0)Operand 0
In the above example of table 1, the “XXX” portion of the physical address is intended to refer generically to the corresponding identifier (e.g., numbers and/or letters) for identifying a given physical address. For instance, XXX(0) corresponds to the physical address for an Operand 0, while XXX(1) corresponds to the physical address for an Operand 1, and so on. In the example of table 1, in response to a micro-core 104A/104B requesting Operand 0 via its corresponding physical address XXX(0), a 64-byte block of data may be fetched from main memory 101 and loaded into cache 103, wherein such cache block of data includes not only Operand 0 but also Operands 1-7. Thus, depending on the fixed size of the cache block employed on a given system, whenever a core 104A/104B references one operand (e.g., a simple load), the memory system will bring in 4 to 8 to 16 (or more) operands into cache 103.
There are both advantages and disadvantages of this traditional cache-block oriented approach to memory access. One advantage is that if there is temporal (over time) and spatial (data locality) references to operands (e.g., operands 0-7 in the example of Table 1), then cache 103 reduces the memory access time. Typically, cache access times (and data bandwidth) are 50 times faster than similar access to main memory 101. For many applications, this is the memory access pattern.
However, if the memory access pattern of an application is not sequential and/or does not re-use data, inefficiencies arise which result in decreased performance. Consider the following FORTRAN loop that may be executed for a given application:
DO I=1, N, 4  A(i) = B(i) + C(i)END DOIn this loop, every fourth element is used. If a cache block maintains 8 operands, then only 2 of the 8 operands are used. Thus, 6/8 of the data loaded into cache 103 and 6/8 of the memory bandwidth is “wasted” in this example.
In some architectures, special-purpose processors that are often referred to as “accelerators” are also implemented to perform certain types of operations. For example, a processor executing a program may offload certain types of operations to an accelerator that is configured to perform those types of operations efficiently. Such hardware acceleration employs hardware to perform some function faster than is possible in software running on the normal (general-purpose) CPU. Hardware accelerators are generally designed for computationally intensive software code. Depending upon granularity, hardware acceleration can vary from a small functional unit to a large functional block like motion estimation in MPEG2. Examples of such hardware acceleration include blitting acceleration functionality in graphics processing units (GPUs) and instructions for complex operations in CPUs. Such accelerator processors generally have a fixed instruction set that differs from the instruction set of the general-purpose processor, and the accelerator processor's local memory does not maintain cache coherency with the general-purpose processor.
A graphics processing unit (GPU) is a well-known example of an accelerator. A GPU is a dedicated graphics rendering device commonly implemented for a personal computer, workstation, or game console. Modern GPUs are very efficient at manipulating and displaying computer graphics, and their highly parallel structure makes them more effective than typical CPUs for a range of complex algorithms. A GPU implements a number of graphics primitive operations in a way that makes running them much faster than drawing directly to the screen with the host CPU. The most common operations for early two-dimensional (2D) computer graphics include the BitBLT operation (combines several bitmap patterns using a RasterOp), usually in special hardware called a “blitter”, and operations for drawing rectangles, triangles, circles, and arcs. Modern GPUs also have support for three-dimensional (3D) computer graphics, and typically include digital video-related functions.
Thus, for instance, graphics operations of a program being executed by host processors 104A and 104B may be passed to a GPU. While the homogeneous host processors 104A and 104B maintain cache coherency with each other, as discussed above with FIG. 1, they do not maintain cache coherency with accelerator hardware of the GPU. In addition, the GPU accelerator does not share the same physical or virtual address space of processors 104A and 104B.
In multi-processor systems, such as exemplary system 100 of FIG. 1, main memory 101 can be configured to improve performance. One approach for managing memory in a desirable way is known in the art as memory interleaving. In general, memory interleaving schemes attempt to distribute memory accesses evenly across the memory so as to mitigate or avoid hot spots within the memory.
As an example, one approach to memory interleaving is to spread the main memory 101 across different memory controllers. An interleave scheme may be employed to distribute the memory substantially evenly across the available memory controllers. As memory is more evenly distributed, large contiguous arrays of data touch each of the memory controllers substantially the same amount. Therefore, by interleaving memory, the memory is more evenly distributed so as to mitigate or avoid hot spots. Hot spots can occur, for example, if a given memory controller is overloaded due to large, unevenly distributed amounts of contiguous data being locally associated with the given memory controller. Various interleaving schemes for use with homogeneous processors, such as cores 104A and 104B of FIG. 1, are well known in the art.
In most systems, memory 101 may hold both programs and data. Each has unique characteristics pertinent to memory performance. For example, when a program is being executed, memory traffic is typically characterized as a series of sequential reads. On the other hand, when a data structure is being accessed, memory traffic is usually characterized by a stride, i.e., the difference in address from a previous access. A stride may be random or fixed. For example, repeatedly accessing a data element in an array may result in a fixed stride of two. As is well-known in the art, a lot of algorithms have a power of 2 stride. Accordingly, without some memory interleave management scheme being employed, hot spots may be encountered within the memory in which a common portion of memory (e.g., a given bank of memory) is accessed much more often than other portions of memory.
As is well-known in the art, memory is often arranged into independently controllable arrays, often referred to as “memory banks.” Under the control of a memory controller, a bank can generally operate on one transaction at a time. The memory may be implemented by dynamic storage technology (such as “DRAMS”), or of static RAM technology. In a typical DRAM chip, some number (e.g., 4, 8, and possibly 16) of banks of memory may be present. A memory interleaving scheme may be desired to minimize one of the banks of memory from being a “hot spot” of the memory.
As discussed above, many compute devices, such as the Intel x86 or AMD x86 microprocessors, are cache-block oriented. Today, a cache block of 64 bytes in size is typical, but compute devices may be implemented with other cache block sizes. A cache block is typically contained all on a single hardware memory storage element, such as a single dual in-line memory module (DIMM). As discussed above, when the cache-block oriented compute device accesses that DIMM, it presents one address and is returned the entire cache-block (e.g., 64 bytes).
Some compute devices, such as certain accelerator compute devices, may not be cache-block oriented. That is, those non-cache-block oriented compute devices may access portions of memory (e.g., words) on a much smaller, finer granularity than is accessed by the cache-block oriented compute devices. For instance, while a typical cache-block oriented compute device may access a cache block of 64 bytes for a single memory access request, a non-cache-block oriented compute device may access a Word that is 8 bytes in size in a single memory access request. That is, the non-cache-block oriented compute device in this example may access a particular memory DIMM and only obtain 8 bytes from a particular address present in that DIMM.
As discussed above, traditional multi-processor systems have employed homogeneous compute devices (e.g., processor cores 104A and 104B of FIG. 1) that each access memory 101 in a common manner, such as via cache-block oriented accesses. While some systems may further include certain heterogeneous compute elements, such as accelerators (e.g., a GPU), the heterogeneous compute element does not share the same physical or virtual address space of the homogeneous compute elements. Accordingly, traditional memory interleave schemes have not attempted to address an interleave of memory accesses across heterogeneous compute elements, which may access memory in different ways, such as via cache-block and non-cache-block accesses.