1. Background
The popularity of computing systems continues to grow, and the demand for improved processing architectures likewise continues to grow. Ever-increasing desires for improved computing performance and efficiency have led to various improved processor architectures. For example, multi-core processors are becoming more prevalent in the computing industry and are being used in various computing devices, such as servers, personal computers (PCs), laptop computers, personal digital assistants (PDAs), wireless telephones, and so on.
In the past, processors such as CPUs (central processing units) featured a single execution unit to process instructions of a program. More recently, computer systems are being developed with multiple processors in an attempt to improve the computing performance of the system. In some instances, multiple independent processors may be implemented in a system. In other instances, a multi-core architecture may be employed, in which multiple processor cores are amassed on a single integrated silicon die. Each of the multiple processors (e.g., processor cores) can simultaneously execute program instructions. This parallel operation of the multiple processors can improve performance of a variety of applications.
A multi-core CPU combines two or more independent cores into a single package composed of a single-piece silicon integrated circuit (IC), called a die. In some instances, a multi-core CPU may comprise two or more dies packaged together. A dual-core device contains two independent microprocessors and a quad-core device contains four microprocessors. Cores in a multi-core device may share a single coherent cache at the highest on-device cache level (e.g., L2 for the Intel® Core 2) or may have separate caches (e.g., current AMD® dual-core processors). The processors also share the same interconnect to the rest of the system. Each “core” may independently implement optimizations such as superscalar execution, pipelining, and multithreading. A system with N cores is typically most effective when it is presented with N or more threads concurrently.
One processor architecture that has been developed utilizes multiple processors (e.g., multiple cores), which are homogeneous. The processors are homogeneous in that they are all implemented with the same fixed instruction sets (e.g., Intel's x86 instruction set, AMD's Opteron instruction set, etc.). Further, the homogeneous processors access memory in a common way, such as all of the processors being cache-line oriented such that they access a cache block (or “cache line”) of memory at a time.
In general, a processor's instruction set refers to a list of all instructions, and all their variations, that the processor can execute. Such instructions may include, as examples, arithmetic instructions, such as ADD and SUBTRACT; logic instructions, such as AND, OR, and NOT; data instructions, such as MOVE, INPUT, OUTPUT, LOAD, and STORE; and control flow instructions, such as GOTO, if X then GOTO, CALL, and RETURN. Examples of well-known instruction sets include x86 (also known as IA-32), x86-64 (also known as AMD64 and Intel® 64), AMD's Opteron, VAX (Digital Equipment Corporation), IA-64 (Itanium), and PA-RISC (HP Precision Architecture).
Generally, the instruction set architecture is distinguished from the microarchitecture, which is the set of processor design techniques used to implement the instruction set. Computers with different microarchitectures can share a common instruction set. For example, the Intel® Pentium and the AMD® Athlon implement nearly identical versions of the x86 instruction set, but have radically different internal microarchitecture designs. In all these cases the instruction set (e.g., x86) is fixed by the manufacturer and directly hardware implemented, in a semiconductor technology, by the microarchitecture. Consequently, the instruction set is traditionally fixed for the lifetime of this implementation.
FIG. 1 shows a block-diagram representation of an exemplary prior art system 100 in which multiple homogeneous processors (or cores) are implemented. System 100 comprises two subsystems: 1) a main memory (physical memory) subsystem 101 and 2) a processing subsystem 102 (e.g., a multi-core die). System 100 includes a first microprocessor core 104A and a second microprocessor core 104B. In this example, microprocessor cores 104A and 104B are homogeneous in that they are each implemented to have the same, fixed instruction set, such as x86. In addition, each of the homogeneous microprocessor cores 104A and 104B access main memory 101 in a common way, such as via cache block accesses, as discussed hereafter. Further, in this example, cores 104A and 104B are implemented on a common die 102. Main memory 101 is communicatively connected to processing subsystem 102. Main memory 101 comprises a common physical address space that microprocessor cores 104A and 104B can each reference.
As shown further in FIG. 1, a cache 103 is also implemented on die 102. Cores 104A and 104B are each communicatively coupled to cache 103. As is well known, a cache generally is memory for storing a collection of data duplicating original values stored elsewhere (e.g., in main memory 101) or computed earlier, where the original data is expensive to fetch (due to longer access time) or to compute, compared to the cost of reading the cache. In other words, a cache 103 generally provides a temporary storage area where frequently accessed data can be stored for rapid access. Once the data is stored in cache 103, future use can be made by accessing the cached copy rather than re-fetching the original data from main memory 101, so that the average access time is shorter. In many systems, cache accesses are approximately 50 times faster than similar accesses to main memory 101. Cache 103, therefore, helps expedite access to data that the microprocessor cores 104A and 104B would otherwise have to fetch from main memory 101.
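By way of illustration only, the effect of a cache on average access time may be sketched as follows. The function name, the single-level cache model (in which a miss simply costs one main-memory access), and the numeric values are assumptions made for this example; the 50-to-1 ratio merely mirrors the approximate figure mentioned above.

```python
def average_access_time(hit_rate, cache_time, memory_time):
    """Expected time per access for one cache level in front of main memory.

    Simplified, assumed model: a hit costs cache_time, a miss costs memory_time.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_time + (1.0 - hit_rate) * memory_time


if __name__ == "__main__":
    # Illustrative time units only: cache = 1 unit, main memory = 50 units.
    for hit_rate in (0.0, 0.5, 0.9, 0.99):
        print(hit_rate, average_access_time(hit_rate, 1.0, 50.0))
```

Under this simplified model, the average access time approaches the cache access time as the fraction of accesses served from the cache approaches one.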
In many system architectures, each core 104A and 104B will have its own cache also, commonly called the “L1” cache, and cache 103 is commonly referred to as the “L2” cache. Unless expressly stated herein, cache 103 generally refers to any level of cache that may be implemented, and thus may encompass L1, L2, etc. Accordingly, while shown for ease of illustration as a single block that is accessed by both of cores 104A and 104B, cache 103 may include L1 cache that is implemented for each core.
In many system architectures, virtual addresses are utilized. In general, a virtual address is an address identifying a virtual (non-physical) entity. As is well-known in the art, virtual addresses may be utilized for accessing memory. Virtual memory is a mechanism that permits data that is located on a persistent storage medium (e.g., disk) to be referenced as if the data was located in physical memory. Translation tables, maintained by the operating system, are used to determine the location of the reference data (e.g., disk or main memory). Program instructions being executed by a processor may refer to a virtual memory address, which is translated into a physical address. To minimize the performance penalty of address translation, most modern CPUs include an on-chip Memory Management Unit (MMU), and maintain a table of recently used virtual-to-physical translations, called a Translation Look-aside Buffer (TLB). Addresses with entries in the TLB require no additional memory references (and therefore time) to translate. However, the TLB can only maintain a fixed number of mappings between virtual and physical addresses; when the needed translation is not resident in the TLB, action will have to be taken to load it in.
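The translation path described above may be sketched, under stated assumptions, as follows. The class name `TLB`, the 4096-byte page size, the dictionary used as a page table, and the least-recently-used replacement policy are all hypothetical choices for this example; actual MMUs differ in page sizes, table formats, and replacement behavior.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TLB:
    """Toy translation look-aside buffer with a fixed number of entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # virtual page number -> physical frame number
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr, page_table):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            # Hit: no additional memory reference is needed.
            self.hits += 1
            self.entries.move_to_end(vpn)
        else:
            # Miss: consult the translation table (the extra memory reference).
            # A missing key would correspond to a page fault, not modeled here.
            self.misses += 1
            frame = page_table[vpn]
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used (assumed policy)
            self.entries[vpn] = frame
        return self.entries[vpn] * PAGE_SIZE + offset


if __name__ == "__main__":
    page_table = {0: 5, 1: 9, 2: 3}  # hypothetical virtual page -> physical frame mapping
    tlb = TLB(capacity=2)
    for vaddr in (0x10, 0x20, 4096, 8192, 0):
        print(hex(vaddr), "->", hex(tlb.translate(vaddr, page_table)))
    print("hits:", tlb.hits, "misses:", tlb.misses)
```

In this sketch, the final access to virtual page 0 misses again because the two-entry buffer has already evicted that mapping, which illustrates the fixed-capacity limitation noted above.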
In some architectures, special-purpose processors that are often referred to as “accelerators” are also implemented to perform certain types of operations. For example, a processor executing a program may offload certain types of operations to an accelerator that is configured to perform those types of operations efficiently. Such hardware acceleration employs hardware to perform some function faster than is possible in software running on the normal (general-purpose) CPU. Hardware accelerators are generally designed for computationally intensive software code. Depending upon granularity, hardware acceleration can vary from a small function unit to a large functional block like motion estimation in MPEG2. Examples of such hardware acceleration include blitting acceleration functionality in graphics processing units (GPUs) and instructions for complex operations in CPUs. Such accelerator processors generally have a fixed instruction set that differs from the instruction set of the general-purpose processor, and the accelerator processor's local memory does not maintain cache coherency with the general-purpose processor.
A graphics processing unit (GPU) is a well-known example of an accelerator. A GPU is a dedicated graphics rendering device commonly implemented for a personal computer, workstation, or game console. Modern GPUs are very efficient at manipulating and displaying computer graphics, and their highly parallel structure makes them more effective than typical CPUs for a range of complex algorithms. A GPU implements a number of graphics primitive operations in a way that makes running them much faster than drawing directly to the screen with the host CPU. The most common operations for early two-dimensional (2D) computer graphics include the BitBLT operation (combines several bitmap patterns using a RasterOp), usually in special hardware called a “blitter”, and operations for drawing rectangles, triangles, circles, and arcs. Modern GPUs also have support for three-dimensional (3D) computer graphics, and typically include digital video-related functions.
Thus, for instance, graphics operations of a program being executed by host processors 104A and 104B may be passed to a GPU. While the homogeneous host processors 104A and 104B maintain cache coherency with each other, as discussed above with FIG. 1, they do not maintain cache coherency with accelerator hardware of the GPU. In addition, the GPU accelerator does not share the same physical or virtual address space of processors 104A and 104B.
In multi-processor systems, such as exemplary system 100 of FIG. 1, one or more of the processors may be implemented as a vector processor. In general, vector processors are processors which provide high-level operations on vectors, that is, linear arrays of data. As one example, a typical vector operation might add two 64-entry, floating point vectors to obtain a single 64-entry vector. In effect, one vector instruction is generally equivalent to a loop with each iteration computing one of the 64 elements of the result, updating all the indices and branching back to the beginning. Vector operations are particularly useful for certain types of processing, such as image processing or processing of certain scientific or engineering applications where large amounts of data are to be processed in a generally repetitive manner. In a vector processor, the computation of each result is generally independent of the computation of previous results, thereby allowing a deep pipeline without generating data dependencies or conflicts. In essence, the absence of data dependencies is determined by the particular application to which the vector processor is applied, or by the compiler when a particular vector operation is specified. Traditional vector processors typically include a pipelined scalar unit together with a vector unit. In vector-register processors, the vector operations, except loads and stores, use the vector registers.
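The equivalence between a single vector instruction and a scalar loop, as described above, may be sketched as follows. The function name `vector_add` and the use of ordinary Python lists are assumptions for illustration; the sketch shows only the loop that one 64-entry vector add stands in for, not how any vector unit is implemented.

```python
VECTOR_LENGTH = 64  # matches the 64-entry example in the text


def vector_add(a, b):
    """Scalar-loop equivalent of one 64-entry floating point vector add."""
    if len(a) != VECTOR_LENGTH or len(b) != VECTOR_LENGTH:
        raise ValueError("both operands must have 64 entries")
    result = [0.0] * VECTOR_LENGTH
    # Each iteration computes one element, independently of every other
    # iteration; that independence is what permits deep pipelining.
    for i in range(VECTOR_LENGTH):
        result[i] = a[i] + b[i]
    return result


if __name__ == "__main__":
    a = [float(i) for i in range(VECTOR_LENGTH)]
    b = [2.0 * i for i in range(VECTOR_LENGTH)]
    print(vector_add(a, b)[:4])
```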
In most systems, memory 101 may hold both programs and data. Each has unique characteristics pertinent to memory performance. For example, when a program is being executed, memory traffic is typically characterized as a series of sequential reads. On the other hand, when a data structure is being accessed, memory traffic is usually characterized by a stride, i.e., the difference in address from a previous access. A stride may be random or fixed. For example, repeatedly accessing a data element in an array may result in a fixed stride of two. As is well-known in the art, many algorithms have a power-of-2 stride. Accordingly, without some memory interleave management scheme being employed, hot spots may be encountered within the memory in which a common portion of memory (e.g., a given bank of memory) is accessed much more often than other portions of memory.
As is well-known in the art, memory is often arranged into independently controllable arrays, often referred to as “memory banks.” Under the control of a memory controller, a bank can generally operate on one transaction at a time. The memory may be implemented by dynamic storage technology (such as “DRAMs”), or by static RAM technology. In a typical DRAM chip, some number (e.g., 4, 8, and possibly 16) of banks of memory may be present. A memory interleaving scheme may be desired to keep any one of the banks of memory from becoming a “hot spot” of the memory.
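The manner in which a power-of-2 stride can concentrate accesses on one bank may be sketched as follows. The simple modulo mapping of 64-byte blocks onto 8 banks, and the names `bank_of` and `bank_histogram`, are assumptions chosen for this example; they do not represent any particular memory controller or interleaving scheme.

```python
from collections import Counter


def bank_of(address, num_banks=8, block_bytes=64):
    """Assumed naive interleave: consecutive blocks map to consecutive banks."""
    return (address // block_bytes) % num_banks


def bank_histogram(stride_bytes, accesses=1024, num_banks=8, block_bytes=64):
    """Count how many of a run of fixed-stride accesses land on each bank."""
    counts = Counter(
        bank_of(i * stride_bytes, num_banks, block_bytes) for i in range(accesses)
    )
    return [counts.get(bank, 0) for bank in range(num_banks)]


if __name__ == "__main__":
    for stride in (64, 128, 512):
        print("stride", stride, "->", bank_histogram(stride))
```

With this assumed mapping, a 64-byte stride spreads accesses evenly across all banks, a 128-byte stride uses only every other bank, and a 512-byte stride directs every access to a single bank, which is the hot-spot condition described above.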
As discussed above, many compute devices, such as the Intel x86 or AMD x86 microprocessors, are cache-block oriented. Today, a cache block of 64 bytes in size is typical, but compute devices may be implemented with other cache block sizes. A cache block is typically contained all on a single hardware memory storage element, such as a single dual in-line memory module (DIMM). As discussed above, when the cache-block oriented compute device accesses that DIMM, it presents one address and is returned the entire cache-block (e.g., 64 bytes).
Some compute devices, such as certain accelerator compute devices, may not be cache-block oriented. That is, those non-cache-block oriented compute devices may access portions of memory (e.g., words) at a much finer granularity than is accessed by the cache-block oriented compute devices. For instance, while a typical cache-block oriented compute device may access a cache block of 64 bytes for a single memory access request, a non-cache-block oriented compute device may access a word that is 8 bytes in size in a single memory access request. That is, the non-cache-block oriented compute device in this example may access a particular memory DIMM and only obtain 8 bytes from a particular address present in that DIMM.
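The difference in access granularity may be sketched, under simplifying assumptions, as follows. The function name `bytes_transferred` is hypothetical, and the model assumes that each distinct aligned request-sized unit of memory is fetched exactly once; the 64-byte and 8-byte sizes are the example values given above.

```python
CACHE_BLOCK_BYTES = 64  # example cache-block size from the text
WORD_BYTES = 8          # example word size from the text


def bytes_transferred(num_words, stride_bytes, request_bytes):
    """Bytes fetched to read num_words 8-byte words spaced stride_bytes apart.

    Assumed model: memory is fetched in aligned units of request_bytes, and
    each distinct unit is fetched once.
    """
    touched = set()
    for i in range(num_words):
        addr = i * stride_bytes
        first = addr // request_bytes
        last = (addr + WORD_BYTES - 1) // request_bytes
        touched.update(range(first, last + 1))
    return len(touched) * request_bytes


if __name__ == "__main__":
    for stride in (8, 64):
        block = bytes_transferred(100, stride, CACHE_BLOCK_BYTES)
        word = bytes_transferred(100, stride, WORD_BYTES)
        print("stride", stride, "cache-block:", block, "word:", word)
```

In this model, sequential access moves nearly the same number of bytes at either granularity, whereas a 64-byte stride causes the cache-block oriented device to fetch eight times as many bytes as the words actually used.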
As discussed above, traditional multi-processor systems have employed homogeneous compute devices (e.g., processor cores 104A and 104B of FIG. 1) that each access memory 101 in a common manner, such as via cache-block oriented accesses. While some systems may further include certain heterogeneous compute elements, such as accelerators (e.g., a GPU), the heterogeneous compute element does not share the same physical or virtual address space of the homogeneous compute elements.
2. Related Art
More recently, some systems have been developed that include heterogeneous compute elements. For instance, the above-identified related U.S. patent applications (the disclosures of which are incorporated herein by reference) disclose various implementations of exemplary heterogeneous computing architectures. In certain implementations, the architecture comprises a multi-processor system having at least one host processor and one or more heterogeneous co-processors. Further, in certain implementations, at least one of the heterogeneous co-processors may be dynamically reconfigurable to possess any of various different instruction sets. The host processor(s) may comprise a fixed instruction set, such as the well-known x86 instruction set, while the co-processor(s) may comprise dynamically reconfigurable logic that enables the co-processor's instruction set to be dynamically reconfigured. In this manner, the host processor(s) and the dynamically reconfigurable co-processor(s) are heterogeneous processors because the dynamically reconfigurable co-processor(s) may be configured to have a different instruction set than that of the host processor(s).
According to certain embodiments, the co-processor(s) may be dynamically reconfigured with an instruction set for use in optimizing performance of a given executable. For instance, in certain embodiments, one of a plurality of predefined instruction set images may be loaded onto the co-processor(s) for use by the co-processor(s) in processing a portion of a given executable's instruction stream. Thus, certain instructions being processed for a given application may be off-loaded (or “dispatched”) from the host processor(s) to the heterogeneous co-processor(s) which may be configured to process the off-loaded instructions in a more efficient manner.
Thus, in certain implementations, the heterogeneous co-processor(s) comprise a different instruction set than the native instruction set of the host processor(s). Further, in certain embodiments, the instruction set of the heterogeneous co-processor(s) may be dynamically reconfigurable. As an example, in one implementation at least three (3) mutually-exclusive instruction sets may be pre-defined, any of which may be dynamically loaded to a dynamically-reconfigurable heterogeneous co-processor. As an illustrative example, a first pre-defined instruction set might be a vector instruction set designed particularly for processing 64-bit floating point operations as are commonly encountered in computer-aided simulations; a second pre-defined instruction set might be designed particularly for processing 32-bit floating point operations as are commonly encountered in signal and image processing applications; and a third pre-defined instruction set might be designed particularly for processing cryptography-related operations. While three illustrative pre-defined instruction sets are mentioned above, it should be recognized that embodiments of the present invention are not limited to those exemplary instruction sets. Rather, any number of instruction sets of any type may be pre-defined in a similar manner and may be employed on a given system in addition to or instead of one or more of the above-mentioned pre-defined instruction sets.
In certain implementations, the heterogeneous compute elements (e.g., host processor(s) and co-processor(s)) share a common physical and/or virtual address space of memory. As an example, a system may comprise one or more host processor(s) that are cache-block oriented, and the system may further comprise one or more compute elements (e.g., co-processor(s)) that are non-cache-block oriented. For instance, the cache-block oriented compute element(s) may access main memory in cache blocks of, say, 64 bytes per request, whereas the non-cache-block oriented compute element(s) may access main memory via smaller-sized requests (which may be referred to as “sub-cache-block” requests), such as 8 bytes per request.
One exemplary heterogeneous computing system that may include one or more cache-block oriented compute elements and one or more non-cache-block oriented compute elements is that disclosed in co-pending U.S. patent application Ser. No. 11/841,406 filed Aug. 20, 2007 titled “MULTI-PROCESSOR SYSTEM HAVING AT LEAST ONE PROCESSOR THAT COMPRISES A DYNAMICALLY RECONFIGURABLE INSTRUCTION SET”, the disclosure of which is incorporated herein by reference. For instance, in such a heterogeneous computing system, one or more host processors may be cache-block oriented, while one or more of the dynamically-reconfigurable co-processor(s) may be non-cache-block oriented, and the heterogeneous host processor(s) and co-processor(s) share access to the common main memory (and share a common physical and virtual address space of the memory).
Another exemplary heterogeneous computing system is that disclosed in co-pending U.S. patent application Ser. No. 11/969,792 filed Jan. 4, 2008 titled “MICROPROCESSOR ARCHITECTURE HAVING ALTERNATIVE MEMORY ACCESS PATHS” (hereinafter “the '792 application”), the disclosure of which is incorporated herein by reference. In particular, the '792 application discloses an exemplary heterogeneous compute system in which one or more compute elements (e.g., host processors) are cache-block oriented and one or more heterogeneous compute elements (e.g., co-processors) are sub-cache-block oriented to access data at a finer granularity than the cache block.
While the above-referenced related applications describe exemplary heterogeneous computing systems in which embodiments of the present invention may be implemented, the concepts presented herein are not limited in application to those exemplary heterogeneous computing systems but may likewise be employed in other systems/architectures.