The popularity of computing systems continues to grow, and the demand for improved processing architectures likewise continues to grow. Ever-increasing desires for improved computing performance and efficiency have led to various improved processor architectures. For example, multi-core processors are becoming more prevalent in the computing industry and are being used in various computing devices, such as servers, personal computers (PCs), laptop computers, personal digital assistants (PDAs), wireless telephones, and so on.
In the past, processors such as CPUs (central processing units) featured a single execution unit to process instructions of a program. More recently, computer systems are being developed with multiple processors in an attempt to improve the computing performance of the system. In some instances, multiple independent processors may be implemented in a system. In other instances, a multi-core architecture may be employed, in which multiple processor cores are amassed on a single integrated silicon die. Each of the multiple processors (e.g., processor cores) can simultaneously execute program instructions. This parallel operation of the multiple processors can improve performance of a variety of applications.
A multi-core CPU combines two or more independent cores into a single package comprising a single-piece silicon integrated circuit (IC), called a die. In some instances, a multi-core CPU may comprise two or more dies packaged together. A dual-core device contains two independent microprocessors and a quad-core device contains four microprocessors. Cores in a multi-core device may share a single coherent cache at the highest on-device cache level (e.g., L2 for the Intel® Core 2) or may have separate caches (e.g., current AMD® dual-core processors). The processors also share the same interconnect to the rest of the system. Each “core” may independently implement optimizations such as superscalar execution, pipelining, and multithreading. A system with N cores is typically most effective when it is presented with N or more threads concurrently.
One processor architecture that has been developed utilizes multiple processors (e.g., multiple cores), which are homogeneous in that they are all implemented with the same fixed instruction set (e.g., Intel's x86 instruction set, AMD's Opteron instruction set, etc.). Further, the homogeneous processors may employ a cache memory coherency protocol, as discussed further below.
In general, an instruction set refers to a list of all instructions, and all their variations, that a processor can execute. Such instructions may include, as examples, arithmetic instructions, such as ADD and SUBTRACT; logic instructions, such as AND, OR, and NOT; data instructions, such as MOVE, INPUT, OUTPUT, LOAD, and STORE; and control flow instructions, such as GOTO, if X then GOTO, CALL, and RETURN. Examples of well-known instruction sets include x86 (also known as IA-32), x86-64 (also known as AMD64 and Intel® 64), AMD's Opteron, VAX (Digital Equipment Corporation), IA-64 (Itanium), and PA-RISC (HP Precision Architecture).
Generally, the instruction set architecture is distinguished from the microarchitecture, which is the set of processor design techniques used to implement the instruction set. Computers with different microarchitectures can share a common instruction set. For example, the Intel® Pentium and the AMD® Athlon implement nearly identical versions of the x86 instruction set, but have radically different internal microarchitecture designs. In all these cases the instruction set (e.g., x86) is fixed by the manufacturer and directly hardware implemented, in a semiconductor technology, by the microarchitecture. Consequently, the instruction set is fixed for the lifetime of this implementation.
Cache memory coherency is an issue that affects the design of computer systems in which two or more processors share a common area of memory. In general, processors often perform work by reading data from persistent storage (e.g., disk) into memory, performing some operation on that data, and then storing the result back to persistent storage. In a uniprocessor system, there is only one processor doing all the work, and therefore only one processor that can read or write the data values. Moreover, a simple uniprocessor can only perform one operation at a time, and thus when a value in storage is changed, all subsequent read operations will see the updated value. However, in multiprocessor systems (e.g., multi-core architectures) there are two or more processors working at the same time, and so the possibility arises that the processors will all attempt to process the same value at the same time. Provided none of the processors updates the value, they can share it indefinitely; but as soon as one updates the value, the others will be working on an out-of-date copy of the data. Accordingly, in such multiprocessor systems a scheme is generally required to notify all processors of changes to shared values, and such a scheme is commonly referred to as a “cache coherence protocol.” Various well-known protocols have been developed for maintaining cache coherency in multiprocessor systems, such as the MESI protocol, the MSI protocol, the MOSI protocol, and the MOESI protocol. Accordingly, such cache coherency generally refers to the integrity of data stored in local caches of the multiple processors.
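As an illustrative sketch only, the notification scheme described above may be modeled along the lines of a simplified MSI-style protocol for a single shared value. The class names (State, Cache, Bus) and the single-value memory are assumptions made for illustration and do not correspond to any particular hardware implementation.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"   # this cache holds the only up-to-date copy
    SHARED = "S"     # this cache holds a clean copy that others may also hold
    INVALID = "I"    # this cache holds no usable copy


class Cache:
    """Hypothetical per-processor cache holding one shared value."""

    def __init__(self):
        self.state = State.INVALID
        self.value = None


class Bus:
    """Hypothetical shared bus keeping the caches coherent (MSI-style)."""

    def __init__(self, num_caches, initial_value):
        self.memory = initial_value
        self.caches = [Cache() for _ in range(num_caches)]

    def read(self, cpu):
        cache = self.caches[cpu]
        if cache.state is State.INVALID:
            # If another cache holds a modified copy, flush it to memory first.
            for other in self.caches:
                if other.state is State.MODIFIED:
                    self.memory = other.value
                    other.state = State.SHARED
            cache.value = self.memory
            cache.state = State.SHARED
        return cache.value

    def write(self, cpu, value):
        # Invalidate every other copy so no processor keeps an out-of-date value.
        for index, other in enumerate(self.caches):
            if index != cpu:
                other.state = State.INVALID
        cache = self.caches[cpu]
        cache.value = value
        cache.state = State.MODIFIED
```

In this sketch, a write by one processor invalidates the copies held by the others, so a later read by another processor is forced to obtain the updated value rather than continue working on a stale copy.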
FIG. 1 shows an exemplary prior art system 100 in which multiple homogeneous processors (or cores) are implemented. System 100 comprises two subsystems: 1) a main memory (physical memory) subsystem 101 and 2) a processing subsystem 102 (e.g., a multi-core die). System 100 includes a first microprocessor core 104A and a second microprocessor core 104B. In this example, microprocessor cores 104A and 104B are homogeneous in that they are each implemented to have the same, fixed instruction set, such as x86. Further, in this example, cores 104A and 104B are implemented on a common die 102. Main memory 101 is communicatively connected to processing subsystem 102. Main memory 101 comprises a common physical address space that microprocessor cores 104A and 104B can each reference.
As further shown, a cache 103 is also implemented on die 102. Cores 104A and 104B are each communicatively coupled to cache 103. As is well known, a cache generally is memory for storing a collection of data duplicating original values stored elsewhere (e.g., in main memory 101) or computed earlier, where the original data is expensive to fetch (due to longer access time) or to compute, compared to the cost of reading the cache. In other words, a cache 103 generally provides a temporary storage area where frequently accessed data can be stored for rapid access. Once the data is stored in cache 103, future use can be made by accessing the cached copy rather than re-fetching the original data from main memory 101, so that the average access time is shorter. In many systems, cache accesses are approximately 50 times faster than similar accesses to main memory 101. Cache 103, therefore, helps expedite access to data that the micro-cores 104A and 104B would otherwise have to fetch from main memory 101.
In many system architectures, each core 104A and 104B will also have its own cache, commonly called the “L1” cache, and cache 103 is commonly referred to as the “L2” cache. Unless expressly stated herein, cache 103 generally refers to any level of cache that may be implemented, and thus may encompass L1, L2, etc. Accordingly, while shown for ease of illustration as a single block that is accessed by both of cores 104A and 104B, cache 103 may include L1 cache that is implemented for each core. Again, a cache coherency protocol may be employed to maintain the integrity of data stored in local caches of the multiple processor cores 104A/104B, as is well known.
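The effect of a cache on average access time may be sketched as follows. This is a simplified model under stated assumptions: the hit rates are hypothetical values chosen for illustration, a cache access is assigned a cost of one time unit, and a miss is assumed to cost the cache probe plus a main memory access that is 50 times slower, consistent with the approximate ratio noted above.

```python
def average_access_time(hit_rate, cache_time, memory_time):
    """Average cost per access: every access probes the cache, and misses
    additionally pay for a main memory access."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return cache_time + (1.0 - hit_rate) * memory_time


CACHE_TIME = 1.0     # assumed unit cost of one cache access
MEMORY_TIME = 50.0   # assumed cost of one main memory access (50x slower)

# Hypothetical hit rates, for illustration only.
for hit_rate in (0.0, 0.5, 0.9, 0.99):
    print(hit_rate, average_access_time(hit_rate, CACHE_TIME, MEMORY_TIME))
```

Under this model, the average access time approaches the cache access time as the hit rate approaches one, and exceeds the main memory access time when no accesses hit.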
In many architectures, virtual addresses are utilized. In general, a virtual address is an address identifying a virtual (non-physical) entity. As is well known in the art, virtual addresses may be utilized for accessing memory. Virtual memory is a mechanism that permits data that is located on a persistent storage medium (e.g., disk) to be referenced as if the data were located in physical memory. Translation tables, maintained by the operating system, are used to determine the location of the referenced data (e.g., disk or main memory). Program instructions being executed by a processor may refer to a virtual memory address, which is translated into a physical address. To minimize the performance penalty of address translation, most modern CPUs include an on-chip Memory Management Unit (MMU), and maintain a table of recently used virtual-to-physical translations, called a Translation Look-aside Buffer (TLB). Addresses with entries in the TLB require no additional memory references (and therefore no additional time) to translate. However, the TLB can only maintain a fixed number of mappings between virtual and physical addresses; when the needed translation is not resident in the TLB, action must be taken to load it.
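The translation path described above may be sketched as follows. This is a minimal illustration, not a description of any particular MMU: the 4096-byte page size, the least-recently-used replacement policy, the dictionary standing in for the operating system's translation tables, and the TLB class itself are all assumptions.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TLB:
    """Hypothetical fixed-capacity TLB (capacity >= 1) with LRU replacement."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table   # virtual page number -> physical frame number
        self.entries = OrderedDict()   # recently used subset of page_table
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        if vpn in self.entries:
            # Hit: no additional table reference is needed.
            self.hits += 1
            self.entries.move_to_end(vpn)
            frame = self.entries[vpn]
        else:
            # Miss: consult the translation table and load the mapping.
            self.misses += 1
            frame = self.page_table[vpn]
            self.entries[vpn] = frame
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return frame * PAGE_SIZE + offset
```

For example, with a capacity of two entries, translating addresses in three distinct pages evicts the mapping for the first page, so a later reference to that page misses again and must reload the translation.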
As an example, suppose a program's instruction stream that is being executed by a processor, say processor core 104A of FIG. 1, desires to load data from an address “Foo” into a first general-purpose register, GPR1. Such an instruction may appear similar to “LD <Foo>, GPR1”. Foo, in this example, is a virtual address that the processor translates to a physical address, such as address “123456”. Thus, the actual physical address, which may be formatted according to a global physical memory address format, is used to access cache 103 and/or memory 101.
Traditional implementations of cache 103 have proven to be extremely effective in many areas of computing because access patterns in many computer applications have locality of reference. There are several kinds of locality, including temporal locality (data that are accessed close together in time) and spatial locality (data that are located physically close to each other).
In operation, each of cores 104A and 104B references main memory 101 by providing a physical memory address. The physical memory address (of data or “an operand” that is desired to be retrieved) is first inputted to cache 103. If the addressed data is not encached (i.e., not present in cache 103), the same physical address is presented to main memory 101 to retrieve the desired data.
In contemporary architectures a cache block is fetched from main memory 101 and loaded into cache 103. That is, rather than retrieving only the addressed data from main memory 101 for storage to cache 103, a larger block of data may be retrieved for storage to cache 103. A cache block typically comprises a fixed-size amount of data that is independent of the actual size of the requested data. For example, in most implementations a cache block comprises 64 bytes of data that is fetched from main memory 101 and loaded into cache 103 independent of the actual size of the operand referenced by the requesting micro-core 104A/104B. Furthermore, the physical address of the cache block referenced and loaded is a block address. This means that all the cache block data is in sequentially contiguous physical memory. Table 1 below shows an example of a cache block.
TABLE 1
Physical Address    Operand
X, Y, Z (7)         Operand 7
X, Y, Z (6)         Operand 6
. . .               . . .
X, Y, Z (1)         Operand 1
X, Y, Z (0)         Operand 0
In the example of Table 1, in response to a micro-core 104A/104B requesting Operand 0 via its corresponding physical address X,Y,Z (0), a 64-byte block of data may be fetched from main memory 101 and loaded into cache 103, wherein such block of data includes not only Operand 0 but also Operands 1-7. Thus, depending on the fixed size of the cache block employed on a given system, whenever a core 104A/104B references one operand (e.g., a simple load), the memory system will bring 4, 8, or 16 operands into cache 103.
There are both advantages and disadvantages of this traditional approach. One advantage is that if the references to operands (e.g., operands 0-7 in the example of Table 1) exhibit temporal locality (reuse over time) and spatial locality (nearby data), then cache 103 reduces the memory access time. Typically, cache access times (and data bandwidth) are approximately 50 times faster than similar access to main memory 101. Many applications exhibit this memory access pattern.
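The block-granular fetch described above may be sketched as follows. The 64-byte block size is taken from the example in the text; the 8-byte operand size (giving eight operands per block), the unlimited cache capacity, and the names block_address and BlockCache are assumptions made for illustration.

```python
BLOCK_SIZE = 64    # bytes per cache block, as in the example above
OPERAND_SIZE = 8   # assumed operand size in bytes (8 operands per block)


def block_address(physical_address):
    """Address of the first byte of the block containing physical_address."""
    return physical_address - (physical_address % BLOCK_SIZE)


class BlockCache:
    """Hypothetical cache of unlimited capacity that tracks resident blocks."""

    def __init__(self):
        self.blocks = set()
        self.memory_fetches = 0

    def load(self, physical_address):
        """Return True on a hit; on a miss, fetch the whole enclosing block."""
        base = block_address(physical_address)
        if base in self.blocks:
            return True
        self.blocks.add(base)
        self.memory_fetches += 1
        return False
```

In this sketch, a load of Operand 0 at a block-aligned address misses and triggers one fetch from main memory, after which loads of Operands 1-7 in the same block hit without any further fetch.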
However, if the memory access pattern of an application is not sequential and/or does not re-use data, inefficiencies arise which result in decreased performance. Consider the following FORTRAN loop that may be executed for a given application:
      DO I = 1, N, 4
        A(I) = B(I) + C(I)
      END DO
In this loop, every fourth element is used. If a cache block maintains 8 operands, then only 2 of the 8 operands are used. Thus, 6/8 of the data loaded into cache 103 and 6/8 of the memory bandwidth is “wasted” in this example.
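The wasted fraction in this example may be checked with a short sketch. It assumes that each array begins on a cache block boundary, that a block holds 8 operands, and that n is at least 1; the function name block_utilization is hypothetical.

```python
def block_utilization(n, stride, operands_per_block=8):
    """Fraction of fetched operands actually used when every stride-th
    element of an n-element, block-aligned array is touched (n >= 1)."""
    touched = range(0, n, stride)   # zero-based indices of the elements used
    blocks = {index // operands_per_block for index in touched}
    fetched = len(blocks) * operands_per_block
    return len(touched) / fetched


used = block_utilization(1024, 4)
print(used, 1 - used)   # fraction used, fraction wasted
```

With a stride of 4 and 8 operands per block, the used fraction is 2/8 and the wasted fraction is 6/8, matching the figures above, whereas a stride of 1 uses every operand that is fetched.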
In some architectures, special-purpose processors that are often referred to as “accelerators” are also implemented to perform certain types of operations. For example, a processor executing a program may offload certain types of operations to an accelerator that is configured to perform those types of operations efficiently. Such hardware acceleration employs hardware to perform some function faster than is possible in software running on the normal (general-purpose) CPU. Hardware accelerators are generally designed for computationally intensive software code. Depending upon granularity, hardware acceleration can vary from a small functional unit to a large functional block like motion estimation in MPEG2. Examples of such hardware acceleration include blitting acceleration functionality in graphics processing units (GPUs) and instructions for complex operations in CPUs. Such accelerator processors generally have a fixed instruction set that differs from the instruction set of the general-purpose processor, and the accelerator processor's local memory does not maintain cache coherency with the general-purpose processor.
A graphics processing unit (GPU) is a well-known example of an accelerator. A GPU is a dedicated graphics rendering device commonly implemented for a personal computer, workstation, or game console. Modern GPUs are very efficient at manipulating and displaying computer graphics, and their highly parallel structure makes them more effective than typical CPUs for a range of complex algorithms. A GPU implements a number of graphics primitive operations in a way that makes running them much faster than drawing directly to the screen with the host CPU. The most common operations for early two-dimensional (2D) computer graphics include the BitBLT operation (which combines several bitmap patterns using a RasterOp), usually performed in special hardware called a “blitter”, and operations for drawing rectangles, triangles, circles, and arcs. Modern GPUs also have support for three-dimensional (3D) computer graphics, and typically include digital video-related functions.
Thus, for instance, graphics operations of a program being executed by host processors 104A and 104B may be passed to a GPU. While the homogeneous host processors 104A and 104B maintain cache coherency with each other, as discussed above with FIG. 1, they do not maintain cache coherency with accelerator hardware of the GPU. This means that the GPU's reads and writes to its local memory are NOT part of the hardware-based cache coherency mechanism used by processors 104A and 104B. This also means that the GPU does not share the same physical or virtual address space as processors 104A and 104B.
Additionally, various devices are known that are reconfigurable. Examples of such reconfigurable devices include field-programmable gate arrays (FPGAs). An FPGA is a well-known type of semiconductor device containing programmable logic components called “logic blocks”, and programmable interconnects. Logic blocks can be programmed to perform the function of basic logic gates such as AND and XOR, or more complex combinational functions such as decoders or simple mathematical functions. In most FPGAs, the logic blocks also include memory elements, which may be simple flip-flops or more complete blocks of memory. A hierarchy of programmable interconnects allows logic blocks to be interconnected as desired by a system designer. Logic blocks and interconnects can be programmed by the customer/designer, after the FPGA is manufactured, to implement any logical function, hence the name “field-programmable.”
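The notion of a logic block that is programmed after manufacture may be sketched as follows. This is an illustration only: it assumes a two-input block whose function is held in a four-entry truth table, which is one common way such blocks are realized but is not stated in the description above, and the LogicBlock class and table names are hypothetical.

```python
class LogicBlock:
    """Hypothetical two-input logic block whose function is set by a truth table."""

    def __init__(self):
        # Output bit for inputs (a, b) = 00, 01, 10, 11; unprogrammed block outputs 0.
        self.table = [0, 0, 0, 0]

    def program(self, table):
        if len(table) != 4 or any(bit not in (0, 1) for bit in table):
            raise ValueError("table must hold exactly four bits")
        self.table = list(table)

    def evaluate(self, a, b):
        return self.table[(a << 1) | b]


AND_TABLE = [0, 0, 0, 1]
XOR_TABLE = [0, 1, 1, 0]

block = LogicBlock()
block.program(AND_TABLE)   # the block now behaves as an AND gate
block.program(XOR_TABLE)   # the same block is reprogrammed as an XOR gate
```

The same block thus performs different logic functions depending solely on the table loaded into it, which is the sense in which the device is programmable in the field.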