The popularity of computing systems continues to grow, and the demand for improved processing architectures likewise continues to grow. The ever-increasing desire for improved computing performance and efficiency has led to various improved processor architectures. For example, multi-core processors are becoming more prevalent in the computing industry and are being used in various computing devices, such as servers, personal computers (PCs), laptop computers, personal digital assistants (PDAs), wireless telephones, and so on.
In the past, processors such as CPUs (central processing units) featured a single execution unit to process instructions of a program. More recently, computer systems are being developed with multiple processors in an attempt to improve the computing performance of the system. In some instances, multiple independent processors may be implemented in a system. In other instances, a multi-core architecture may be employed, in which multiple processor cores are amassed on a single integrated silicon die. Each of the multiple processors (e.g., processor cores) can simultaneously execute program instructions. This parallel operation of the multiple processors can improve performance of a variety of applications.
A multi-core CPU combines two or more independent cores into a single package comprising a single silicon integrated circuit (IC), called a die. In some instances, a multi-core CPU may comprise two or more dies packaged together. A dual-core device contains two independent microprocessors and a quad-core device contains four microprocessors. Cores in a multi-core device may share a single coherent cache at the highest on-device cache level (e.g., L2 for the Intel® Core 2) or may have separate caches (e.g., current AMD® dual-core processors). The processors also share the same interconnect to the rest of the system. Each “core” may independently implement optimizations such as superscalar execution, pipelining, and multithreading. A system with N cores is typically most effective when it is presented with N or more threads concurrently.
One processor architecture that has been developed utilizes multiple processors (e.g., multiple cores), which are homogeneous. As discussed hereafter, the processors are homogeneous in that they are all implemented with the same fixed instruction sets (e.g., Intel's x86 instruction set, AMD's Opteron instruction set, etc.). Further, the homogeneous processors access memory in a common way, such as all of the processors being cache-line oriented such that they access a cache block (or “cache line”) of memory at a time, as discussed further below.
In general, a processor's instruction set refers to a list of all instructions, and all their variations, that the processor can execute. Such instructions may include, as examples, arithmetic instructions, such as ADD and SUBTRACT; logic instructions, such as AND, OR, and NOT; data instructions, such as MOVE, INPUT, OUTPUT, LOAD, and STORE; and control flow instructions, such as GOTO, if X then GOTO, CALL, and RETURN. Examples of well-known instruction sets include x86 (also known as IA-32), x86-64 (also known as AMD64 and Intel® 64), AMD's Opteron, VAX (Digital Equipment Corporation), IA-64 (Itanium), and PA-RISC (HP Precision Architecture).
Generally, the instruction set architecture is distinguished from the microarchitecture, which is the set of processor design techniques used to implement the instruction set. Computers with different microarchitectures can share a common instruction set. For example, the Intel® Pentium and the AMD® Athlon implement nearly identical versions of the x86 instruction set, but have radically different internal microarchitecture designs. In all these cases the instruction set (e.g., x86) is fixed by the manufacturer and directly hardware implemented, in a semiconductor technology, by the microarchitecture. Consequently, the instruction set is traditionally fixed for the lifetime of this implementation.
FIG. 1 shows a block-diagram representation of an exemplary prior art system 100 in which multiple homogeneous processors (or cores) are implemented. System 100 comprises two subsystems: 1) a main memory (physical memory) subsystem 101 and 2) a processing subsystem 102 (e.g., a multi-core die). System 100 includes a first microprocessor core 104A and a second microprocessor core 104B. In this example, microprocessor cores 104A and 104B are homogeneous in that they are each implemented to have the same, fixed instruction set, such as x86. In addition, each of the homogeneous microprocessor cores 104A and 104B accesses main memory 101 in a common way, such as via cache block accesses, as discussed hereafter. Further, in this example, cores 104A and 104B are implemented on a common die 102. Main memory 101 is communicatively connected to processing subsystem 102. Main memory 101 comprises a common physical address space that microprocessor cores 104A and 104B can each reference.
As shown further in FIG. 1, a cache 103 is also implemented on die 102. Cores 104A and 104B are each communicatively coupled to cache 103. As is well known, a cache generally is memory for storing a collection of data duplicating original values stored elsewhere (e.g., to main memory 101) or computed earlier, where the original data is expensive to fetch (due to longer access time) or to compute, compared to the cost of reading the cache. In other words, a cache 103 generally provides a temporary storage area where frequently accessed data can be stored for rapid access. Once the data is stored in cache 103, future use can be made by accessing the cached copy rather than re-fetching the original data from main memory 101, so that the average access time is shorter. In many systems, cache access times are approximately 50 times faster than similar accesses to main memory 101. Cache 103, therefore, helps expedite data access that the micro-cores 104A and 104B would otherwise have to fetch from main memory 101.
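The effect of the cache on average access time can be illustrated with a simple weighted average. The following is only a rough sketch: the function name, the unit-less times, and the simplification that a miss costs exactly one main-memory access are assumptions made for illustration, with the 50-to-1 ratio taken from the approximate figure given above.

```python
def average_access_time(hit_rate, cache_time=1.0, memory_time=50.0):
    """Weighted average access time for a given cache hit rate.

    cache_time and memory_time are in arbitrary units; the default 1:50
    ratio mirrors the approximate figure in the text (an assumption here).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Hits are served by the cache; misses fall through to main memory.
    return hit_rate * cache_time + (1.0 - hit_rate) * memory_time


# With every access hitting, the average equals the cache time;
# with every access missing, it equals the main-memory time.
all_hits = average_access_time(1.0)
all_misses = average_access_time(0.0)
mostly_hits = average_access_time(0.9)
```

Under these assumptions, even a modest miss rate dominates the average, which is why the access pattern of an application matters so much to overall performance.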
In many system architectures, each core 104A and 104B will have its own cache also, commonly called the “L1” cache, and cache 103 is commonly referred to as the “L2” cache. Unless expressly stated herein, cache 103 generally refers to any level of cache that may be implemented, and thus may encompass L1, L2, etc. Accordingly, while shown for ease of illustration as a single block that is accessed by both of cores 104A and 104B, cache 103 may include L1 cache that is implemented for each core.
In many system architectures, virtual addresses are utilized. In general, a virtual address is an address identifying a virtual (non-physical) entity. As is well-known in the art, virtual addresses may be utilized for accessing memory. Virtual memory is a mechanism that permits data that is located on a persistent storage medium (e.g., disk) to be referenced as if the data were located in physical memory. Translation tables, maintained by the operating system, are used to determine the location of the referenced data (disk or main memory). Program instructions being executed by a processor may refer to a virtual memory address, which is translated into a physical address. To minimize the performance penalty of address translation, most modern CPUs include an on-chip Memory Management Unit (MMU), and maintain a table of recently used virtual-to-physical translations, called a Translation Look-aside Buffer (TLB). Addresses with entries in the TLB require no additional memory references (and therefore time) to translate. However, the TLB can only maintain a fixed number of mappings between virtual and physical addresses; when the needed translation is not resident in the TLB, action will have to be taken to load it in.
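The translation path described above can be sketched roughly as follows. This is a minimal illustration rather than a model of any real MMU: the 4096-byte page size, the four-entry capacity, the least-recently-used replacement, and the names TinyTLB and page_table are all assumptions introduced for the example. Pages that reside on disk are not modeled.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical)


class TinyTLB:
    """Toy TLB: a fixed-capacity cache of virtual-page to physical-frame mappings."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table  # stands in for the OS-maintained translation tables
        self.capacity = capacity      # a TLB holds only a fixed number of mappings
        self.entries = OrderedDict()  # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        """Return the physical address for a virtual address."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)  # mark as most recently used
        else:
            self.misses += 1
            frame = self.page_table[vpn]   # extra memory reference on a miss
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used entry
            self.entries[vpn] = frame
        return self.entries[vpn] * PAGE_SIZE + offset


page_table = {0: 7, 1: 3, 2: 9}                # hypothetical page -> frame mapping
tlb = TinyTLB(page_table)
first = tlb.translate(1 * PAGE_SIZE + 0x10)    # first touch of page 1: TLB miss
second = tlb.translate(1 * PAGE_SIZE + 0x20)   # same page again: TLB hit
```

The second access to the same page is served from the TLB without consulting the translation tables, which is the saving the paragraph above describes.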
As an example, suppose a program's instruction stream that is being executed by a processor, say processor core 104A of FIG. 1, desires to load data from an address “Foo” into a first general-purpose register, GPR1. Such instruction may appear similar to “LD <Foo>, GPR1”. Foo, in this example, is a virtual address that the processor translates to a physical address, such as address “123456”. Thus, the actual physical address, which may be formatted according to a global physical memory address format, is used to access cache 103 and/or memory 101.
In operation, each of cores 104A and 104B references main memory 101 by providing a physical memory address. The physical memory address (of data or “an operand” that is desired to be retrieved) is first presented to cache 103. If the addressed data is not encached (i.e., not present in cache 103), the same physical address is presented to main memory 101 to retrieve the desired data. Main memory 101 may be implemented in whole or in part via memory module(s), such as dual in-line memory modules (DIMMs), which may employ dynamic random access memory (DRAM) or other memory storage.
In contemporary architectures, the processor cores 104A and 104B are cache-line (or “cache-block”) oriented, wherein a “cache block” is fetched from main memory 101 and loaded into cache 103. The terms cache line and cache block are used interchangeably herein. Rather than retrieving only the addressed data from main memory 101 for storage to cache 103, such cache-block oriented processors may retrieve a larger block of data for storage to cache 103. A cache block typically comprises a fixed-size amount of data that is independent of the actual size of the requested data. For example, in most implementations a cache block comprises 64 bytes of data that is fetched from main memory 101 and loaded into cache 103 independent of the actual size of the operand referenced by the requesting micro-core 104A/104B. Furthermore, the physical address of the cache block referenced and loaded is a block address. This means that all the cache block data is in sequentially contiguous physical memory. Table 1 below shows an example of a cache block.
TABLE 1

Physical Address    Operand
XXX(7)              Operand 7
XXX(6)              Operand 6
. . .               . . .
XXX(1)              Operand 1
XXX(0)              Operand 0
In the above example of Table 1, the “XXX” portion of the physical address is intended to refer generically to the corresponding identifier (e.g., numbers and/or letters) for identifying a cache line address. For instance, XXX(0) corresponds to the physical address for an Operand 0, while XXX(1) corresponds to the physical address for an Operand 1, and so on. In the example of Table 1, in response to a micro-core 104A/104B requesting Operand 0 via its corresponding physical address XXX(0), a 64-byte block of data may be fetched from main memory 101 and loaded into cache 103, wherein such cache block of data includes not only Operand 0 but also Operands 1-7. Thus, depending on the fixed size of the cache block employed on a given system, whenever a core 104A/104B references one operand (e.g., a simple load), the memory system will bring 4, 8, 16, or more operands into cache 103.
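The relationship between an operand's physical address and the cache block that contains it can be sketched as below. This is an illustrative sketch only: the 64-byte block size follows the typical figure given above, while the 8-byte operand size, the function names, and the sample address are assumptions chosen so that a block holds Operands 0-7 as in Table 1.

```python
BLOCK_SIZE = 64    # bytes per cache block (typical figure from the text)
OPERAND_SIZE = 8   # assumed operand size in bytes, giving 8 operands per block


def block_address(paddr):
    """Physical address of the first byte of the cache block containing paddr."""
    return paddr - (paddr % BLOCK_SIZE)


def operand_slot(paddr):
    """Position (0-7) of the addressed operand within its cache block."""
    return (paddr % BLOCK_SIZE) // OPERAND_SIZE


def operands_fetched(paddr):
    """Addresses of every operand brought into the cache by one reference."""
    base = block_address(paddr)
    return [base + i * OPERAND_SIZE for i in range(BLOCK_SIZE // OPERAND_SIZE)]


requested = 0x1010                     # hypothetical physical address
base = block_address(requested)        # block starts at 0x1000
slot = operand_slot(requested)         # the request is for operand 2 of the block
fetched = operands_fetched(requested)  # eight contiguous operand addresses
```

A single reference to any one of the eight operand addresses causes all eight to be loaded, since they share one block address.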
There are both advantages and disadvantages of this traditional cache-block oriented approach to memory access. One advantage is that if references to operands (e.g., operands 0-7 in the example of Table 1) exhibit temporal locality (reuse over time) and spatial locality (use of nearby data), then cache 103 reduces the memory access time. Typically, cache access times (and data bandwidth) are 50 times faster than similar access to main memory 101. Many applications exhibit this memory access pattern.
However, if the memory access pattern of an application is not sequential and/or does not re-use data, inefficiencies arise which result in decreased performance. Consider the following FORTRAN loop that may be executed for a given application:

    DO I=1, N, 4
      A(i)=B(i)+C(i)
    END DO
In this loop, every fourth element is used. If a cache block maintains 8 operands, then only 2 of the 8 operands are used. Thus, 6/8 of the data loaded into cache 103 and 6/8 of the memory bandwidth is “wasted” in this example.
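The arithmetic behind this figure can be checked with a short sketch. The function below is hypothetical and makes simplifying assumptions: a single array that starts on a cache block boundary, operands of equal size, and no reuse of a block once the loop has moved past it.

```python
def fraction_of_fetched_data_used(n, stride, operands_per_block=8):
    """Share of the operands loaded into the cache that a strided loop uses.

    Models one array walked as in DO I=1, N, stride, using 0-based indices.
    """
    if n <= 0 or stride <= 0:
        raise ValueError("n and stride must be positive")
    indices = range(0, n, stride)
    # Every distinct block touched is fetched in full.
    blocks = {i // operands_per_block for i in indices}
    fetched = len(blocks) * operands_per_block
    return len(indices) / fetched


used = fraction_of_fetched_data_used(n=64, stride=4)   # 2 of every 8 operands
wasted = 1.0 - used                                     # the remaining 6 of 8
sequential = fraction_of_fetched_data_used(n=64, stride=1)
```

With a stride of 4 and 8 operands per block, the sketch reports that one quarter of the fetched data is used, matching the 6/8 waste stated above; a unit stride uses everything that is fetched.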
In multi-processor systems, such as exemplary system 100 of FIG. 1, main memory 101 can be configured to improve performance. FIG. 2 shows a block diagram illustrating a traditional implementation of main memory 101. As shown, memory module 202, which comprises memory (e.g., DRAMs) 203, is accessible via memory controller 201. That is, memory controller 201 controls access to memory module 202. Memory module 202 is commonly implemented as a DIMM (dual in-line memory module) that includes one or more DRAMs (dynamic random access memory) as memory 203. In general, a DIMM is a double SIMM (single in-line memory module). Like a SIMM, a DIMM contains one or several random access memory (RAM) chips on a small circuit board with pins that connect it to the computer motherboard.
Traditional DIMMs provide one data channel 205 and one address/control channel 204 per DIMM. In general, the address/control channel 204 specifies an address and a desired type of access (e.g., read or write), and the data channel 205 carries the corresponding data to/from the specified address for performing the desired type of access. Typically, a memory access operation requires several clock cycles to perform. For instance, address and control information may be provided on the address/control channel 204 over one or more clock cycles, and then the data is provided on the data channel 205 over later clock cycles. In a typical DIMM access scenario, a row select command is sent from memory controller 201 on the address/control channel 204 to the memory module 202, which indicates that an associated address is a row address in the memory cell matrix of the DRAM memory 203. In general, a data bit in DRAM is stored in a memory cell located by the intersection of a column address and a row address. A column access command (e.g., a column read or column write command) is sent from the memory controller 201 over the address/control channel 204 to validate the column address and indicate a type of access desired (e.g., either a read or write operation).
The row select command may be sent in a first clock cycle, then the column access command may be sent in a second clock cycle, and then some clock cycles later a burst of data may be supplied via the data channel 205. The burst of data may be supplied over several clock cycles. A single DIMM data channel 205 is typically a 64-bit (8-byte) wide channel, wherein each access comprises a “burst” length of 8, thus resulting in the data channel carrying 64 bytes for each access. The length of the “burst” may refer to a number of clock cycles or phases of a clock cycle when dual-data rate (DDR) is employed. For instance, a burst length of 8 may refer to 8 clock cycles, wherein 8 bytes of data is communicated on the data channel for a given access in each of the 8 clock cycles (resulting in the data channel carrying 64 bytes of data for the access). As another example, a burst length of 8 may refer to 8 phases of a clock (e.g., when DDR is employed), wherein 8 bytes of data is communicated on the data channel for a given access in each of the 8 phases (over 4 clock cycles), thus resulting in the data channel carrying 64 bytes of data for the access.
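The burst arithmetic above reduces to two small calculations, sketched here with hypothetical function names. The defaults of an 8-byte channel and a burst length of 8 are the typical values given in the text; treating DDR as two transfers per clock cycle follows the description of 8 phases over 4 clock cycles.

```python
def burst_bytes(channel_width_bytes=8, burst_length=8):
    """Bytes carried on the data channel for one memory access."""
    return channel_width_bytes * burst_length


def burst_clock_cycles(burst_length=8, ddr=True):
    """Clock cycles occupied by one burst.

    With DDR, data moves on both phases of the clock, so two transfers
    fit in each cycle; otherwise one transfer occurs per cycle.
    """
    transfers_per_cycle = 2 if ddr else 1
    # Round up so that an odd burst length still occupies a whole cycle.
    return -(-burst_length // transfers_per_cycle)


per_access = burst_bytes()                   # 8 bytes x 8 transfers
ddr_cycles = burst_clock_cycles(ddr=True)    # 8 phases over 4 clock cycles
sdr_cycles = burst_clock_cycles(ddr=False)   # 8 full clock cycles
```

Either way the access moves 64 bytes; only the number of clock cycles occupied by the burst differs.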
To improve data channel bandwidth, tiling is commonly employed in memory architectures. For instance, rather than waiting for completion of a burst of data for one access operation before supplying address/control signals for a next access operation, the instructions supplied via the address/control channel 204 may be used to attempt to maintain full bandwidth utilization of the data channel 205. FIG. 3 shows an example of one traditional tiling technique. FIG. 3 shows a clock cycle 301 of a reference clock signal, wherein the illustrated example shows 20 clock cycles numbered 1-20. A clock phase 302 is also shown, wherein for each clock cycle the clock has a low phase (“L”) and a high phase (“H”), as is well known. An address/control channel 303 is also shown, which corresponds to address/control channel 204 of FIG. 2. Also, in this example, a data channel 304 is shown, which corresponds to data channel 205 of FIG. 2.
The exemplary tiling technique of FIG. 3 allows for the address/control channel 303 to be used to maintain high bandwidth utilization of the data channel 304. In the illustrated example, a first memory access operation is requested, whereupon a row select command 306 is communicated from memory controller 201 to memory module 202 over address/control channel 303 during clock cycle 1. Then, during clock cycle 2, a column access command (e.g., column read or column write command) 307 for the first memory access operation is communicated from memory controller 201 to memory module 202 over address/control channel 303. After some delay, data channel 304 carries the data “burst” for the first memory access operation. For instance, beginning in the high phase of clock cycle 9 and ending in the low phase of clock cycle 13, data burst 308 carries the data for the first memory access operation. Traditionally, a single DIMM data channel, such as data channel 304, is typically a 64-bit (8-byte) wide channel where each memory access comprises a “burst” length of 8, thus resulting in the data channel carrying 64 bytes for each access. For instance, each of the 8 blocks of burst 308 (labeled 0/0/0-0/0/7) is typically an 8-byte block of data, thus resulting in burst 308 containing 64 bytes of data for the first memory access operation (read or write to/from the specified address).
A second memory access operation is requested in this example, whereupon a row select command 309 is communicated from memory controller 201 to memory module 202 over address/control channel 303 during clock cycle 5. Then, during clock cycle 6, a column access command 310 for the second memory access operation is communicated from memory controller 201 to memory module 202 over address/control channel 303. After some delay, data channel 304 carries the data “burst” for the second memory access operation. For instance, beginning in the high phase of clock cycle 13 and ending in the low phase of clock cycle 17, data burst 311 carries the data for the second memory access operation. As with the data burst 308 discussed above for the first memory access operation, data burst 311 typically has a length of 8 blocks (labeled 0/1/0-0/1/7) that are each an 8-byte block of data, thus resulting in burst 311 containing 64 bytes of data for the second memory access operation (read or write to/from the specified address).
As the example of FIG. 3 illustrates, rather than waiting for the data burst 308 for a first memory access operation to complete before providing the address/control information for the next memory access operation to be performed, the tiling technique uses the address/control channel 303 to effectively schedule the data bursts for different memory access operations back-to-back, thereby maintaining high bandwidth utilization on the data channel 304.
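The timing of the two bursts in this example can be reproduced with a small sketch. The code is illustrative only: the function names are hypothetical, the fixed delay of 7 clock cycles between the column access command and the first data phase is inferred from the cycle numbers given for FIG. 3, and the low phase of each clock cycle is assumed to precede the high phase.

```python
BURST_PHASES = 8        # burst length of 8 clock phases, per the text
DATA_DELAY_CYCLES = 7   # column command in cycle 2, data from cycle 9 (inferred)


def phase_index(cycle, phase):
    """Map (clock cycle, 'L' or 'H') to a running phase count; 'L' comes first."""
    return 2 * (cycle - 1) + (1 if phase == "H" else 0)


def phase_label(index):
    """Inverse of phase_index: running phase count back to (cycle, phase)."""
    return (index // 2 + 1, "H" if index % 2 else "L")


def burst_window(row_select_cycle):
    """First and last phase of the data burst for an access.

    The column access command is assumed to follow the row select command
    by one clock cycle, as in the example.
    """
    column_cycle = row_select_cycle + 1
    start = phase_index(column_cycle + DATA_DELAY_CYCLES, "H")
    end = start + BURST_PHASES - 1
    return phase_label(start), phase_label(end)


first_burst = burst_window(row_select_cycle=1)    # first memory access operation
second_burst = burst_window(row_select_cycle=5)   # second memory access operation
```

Issuing the second row select command four clock cycles after the first places the second burst in the phase immediately following the end of the first, so under these assumptions the data channel has no idle phase between the two accesses.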
As also illustrated in FIG. 3, traditionally the data channel 205 of a DIMM carries a 64-byte burst of data for each memory access operation requested. Some DIMMs can support 64-byte or 32-byte accesses. That is, some DIMMs may be configured into either a 64-byte access or a 32-byte access mode. Thus, memory bandwidth may be conserved to some extent for certain memory access operations by performing a 32-byte access of the DIMM, rather than a 64-byte access (if the operation only requires access of 32 or fewer bytes). However, the full burst of either 32-bytes or 64-bytes is utilized for a single memory access operation.
In certain implementations, a plurality of DIMMs may share an address/control channel, and each DIMM may provide a separate data channel, wherein tiling may be employed on the address/control channel to maintain high bandwidth utilization on both data channels of the DIMMs. However, in these implementations, each DIMM provides only a single data channel.
As is well-known in the art, memory is often arranged into independently controllable arrays, often referred to as “memory banks.” Under the control of a memory controller, a bank can generally operate on one transaction at a time. As mentioned above, the memory may be implemented by dynamic storage technology (such as DRAMs), or by static RAM technology. In a typical DRAM chip, some number (e.g., 4, 8, or possibly 16) of banks of memory may be present. A memory interleaving scheme may be desired to reduce the likelihood that one of the banks of memory becomes a “hot spot” of the memory.
In most systems, memory 101 may hold both programs and data. Each has unique characteristics pertinent to memory performance. For example, when a program is being executed, memory traffic is typically characterized as a series of sequential reads. On the other hand, when a data structure is being accessed, memory traffic is usually characterized by a stride, i.e., the difference in address from a previous access. A stride may be random or fixed. For example, accessing every other data element in an array may result in a fixed stride of two. As is well-known in the art, many algorithms have a power-of-2 stride. This power-of-2 stride gives rise to an increase in occurrences of bank conflicts because the power-of-2 stride ends up accessing the same bank repeatedly. Accordingly, without some memory interleave management scheme being employed, hot spots may be encountered within the memory in which a common portion of memory (e.g., a given bank of memory) is accessed much more often than other portions of memory.
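The tendency of a power-of-2 stride to concentrate accesses on one bank can be seen with a simple model. The sketch below assumes a plain modulo mapping of addresses to banks, 8 banks, and a 64-byte interleave granularity; these choices and the function names are hypothetical and do not describe the interleave scheme of any particular system.

```python
from collections import Counter


def bank_of(address, num_banks=8, interleave_bytes=64):
    """Bank selected by a simple modulo interleave (an assumed mapping)."""
    return (address // interleave_bytes) % num_banks


def bank_histogram(stride_bytes, accesses=64, num_banks=8, interleave_bytes=64):
    """Count how many of a run of fixed-stride accesses land in each bank."""
    return Counter(
        bank_of(i * stride_bytes, num_banks, interleave_bytes)
        for i in range(accesses)
    )


sequential = bank_histogram(stride_bytes=64)    # one interleave unit per access
power_of_two = bank_histogram(stride_bytes=512) # 64 bytes x 8 banks per access
```

Under this mapping, a stride equal to the interleave granularity spreads the accesses evenly over all banks, while a stride of 512 bytes sends every access to the same bank, producing the kind of hot spot described above.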
As discussed above, many compute devices, such as the Intel x86 or AMD x86 microprocessors, are cache-block oriented. Today, a cache block of 64 bytes in size is typical, but compute devices may be implemented with other cache block sizes. A cache block is typically contained all on a single hardware memory storage element, such as a single dual in-line memory module (DIMM). As discussed above, when the cache-block oriented compute device accesses that DIMM, it presents one address and is returned the entire cache-block (e.g., 64 bytes), as in the exemplary data bursts 308 and 311 discussed above with FIG. 3.
Some compute devices, such as certain accelerator compute devices, may not be cache-block oriented. That is, those non-cache-block oriented compute devices may access portions of memory (e.g., words) on a much smaller, finer granularity than is accessed by the cache-block oriented compute devices. For instance, while a typical cache-block oriented compute device may access a cache block of 64 bytes for a single memory access request, a non-cache-block oriented compute device may desire to access a Word that is 8 bytes in size in a single memory access request. That is, the non-cache-block oriented compute device in this example may desire to access a particular memory DIMM and only obtain 8 bytes from a particular address present in the DIMM.
As discussed above, traditional multi-processor systems have employed homogeneous compute devices (e.g., processor cores 104A and 104B of FIG. 1) that each access memory 101 in a common manner, such as via cache-block oriented accesses. While some systems may further include certain heterogeneous compute elements, such as accelerators (e.g., a GPU), the heterogeneous compute element does not share the same physical or virtual address space of the homogeneous compute elements. Accordingly, traditional memory interleave schemes have not attempted to address an interleave of memory accesses across heterogeneous compute elements, which may access memory in different ways, such as via cache-block and non-cache-block accesses.
U.S. Patent Application Publication No. 2007/0266206 to Kim et al. (hereinafter “Kim”) proposes a scatter-gather intelligent memory architecture. Kim mentions that to avoid wasting memory bandwidth, the scatter/gather engine supports both cache line size data accesses and smaller, sub-cache line accesses. However, Kim does not appear to describe its memory architecture in detail. One of ordinary skill in the art would thus suppose that Kim may be employing the above-mentioned traditional DIMMs, which enable either a full cache line (e.g., 64 bytes) or a sub-cache line (e.g., 32 bytes) access. However, as with the traditional DIMMs, only a single data channel per DIMM appears to be supported. Kim does not appear to provide any disclosure of a DIMM architecture that provides more than a single data channel per DIMM.