The popularity of computing systems continues to grow and the demand for improved processing architectures thus likewise continues to grow. Ever-increasing desires for improved computing performance/efficiency have led to various improved processor architectures. For example, multi-core processors are becoming more prevalent in the computing industry and are being used in various computing devices, such as servers, personal computers (PCs), laptop computers, personal digital assistants (PDAs), wireless telephones, and so on.
In the past, processors such as CPUs (central processing units) featured a single execution unit to process instructions of a program. More recently, computer systems are being developed with multiple processors in an attempt to improve the computing performance of the system. In some instances, multiple independent processors may be implemented in a system. In other instances, a multi-core architecture may be employed, in which multiple processor cores are amassed on a single integrated silicon die. Each of the multiple processors (e.g., processor cores) can simultaneously execute program instructions. This parallel operation of the multiple processors can improve performance of a variety of applications.
A multi-core CPU combines two or more independent cores into a single package composed of a single piece of silicon integrated circuit (IC), called a die. In some instances, a multi-core CPU may comprise two or more dies packaged together. A dual-core device contains two independent microprocessors and a quad-core device contains four microprocessors. Cores in a multi-core device may share a single coherent cache at the highest on-device cache level (e.g., L2 for the Intel® Core 2) or may have separate caches (e.g., current AMD® dual-core processors). The processors also share the same interconnect to the rest of the system. Each “core” may independently implement optimizations such as superscalar execution, pipelining, and multithreading. A system with N cores is typically most effective when it is presented with N or more threads concurrently.
One processor architecture that has been developed utilizes multiple processors (e.g., multiple cores), which are homogeneous in that they are all implemented with the same fixed instruction sets. Further, the homogeneous processors may employ a cache memory coherency protocol, as discussed further below.
In general, an instruction set refers to a list of all instructions, and all their variations, that a processor can execute. Such instructions may include, as examples, arithmetic instructions, such as ADD and SUBTRACT; logic instructions, such as AND, OR, and NOT; data instructions, such as MOVE, INPUT, OUTPUT, LOAD, and STORE; and control flow instructions, such as GOTO, if X then GOTO, CALL, and RETURN. Examples of well-known instruction sets include x86 (also known as IA-32), x86-64 (also known as AMD64 and Intel® 64), VAX (Digital Equipment Corporation), IA-64 (Itanium), and PA-RISC (HP Precision Architecture).
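As a rough illustration of what a fixed instruction set means, the following sketch interprets a toy instruction set containing a few of the instruction kinds named above. The tuple encoding, the register names, and the `run` function are hypothetical and chosen only for illustration; they do not correspond to x86 or any other real instruction set.

```python
def run(program, memory):
    """Execute a list of (opcode, *operands) tuples on a toy register machine."""
    regs = {}
    pc = 0  # program counter: index of the next instruction
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "LOAD":        # LOAD reg, addr: memory -> register
            regs[args[0]] = memory[args[1]]
        elif op == "STORE":     # STORE reg, addr: register -> memory
            memory[args[1]] = regs[args[0]]
        elif op == "ADD":       # ADD dst, src: dst = dst + src
            regs[args[0]] += regs[args[1]]
        elif op == "AND":       # AND dst, src: dst = dst & src
            regs[args[0]] &= regs[args[1]]
        elif op == "GOTO":      # GOTO target: unconditional jump
            pc = args[0]
        elif op == "RETURN":    # stop execution
            break
        else:
            # The set is fixed: anything outside it cannot be executed.
            raise ValueError(f"unknown opcode: {op}")
    return regs


memory = [2, 3, 0]
program = [
    ("LOAD", "r1", 0),
    ("LOAD", "r2", 1),
    ("ADD", "r1", "r2"),
    ("STORE", "r1", 2),
    ("RETURN",),
]
run(program, memory)
```

In this sketch, a program that uses an opcode outside the implemented set is rejected, which mirrors how an executable is tied to the instruction set of the processor it targets.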
Generally, the instruction set architecture is distinguished from the microarchitecture, which is the set of processor design techniques used to implement the instruction set. Computers with different microarchitectures can share a common instruction set. For example, the Intel® Pentium and the AMD® Athlon implement nearly identical versions of the x86 instruction set, but have radically different internal microarchitecture designs. In all these cases the instruction set (e.g., x86) is fixed by the manufacturer and directly implemented, in a semiconductor technology, by the microarchitecture. Consequently, the instruction set is fixed for the lifetime of this implementation.
Memory coherency is an issue that affects the design of computer systems in which two or more processors share a common area of memory. In general, processors often perform work by reading data from persistent storage into memory, performing some operation on that data, and then storing the result back to persistent storage. In a uniprocessor system, there is only one processor doing all the work, and therefore only one processor that can read or write the data values. Moreover, a simple uniprocessor can only perform one operation at a time, and thus when a value in storage is changed, all subsequent read operations will see the updated value. However, in multiprocessor systems (e.g., multi-core architectures) there are two or more processors working at the same time, and so the possibility that the processors will all attempt to process the same value at the same time arises. Provided none of the processors updates the value, then they can share it indefinitely; but as soon as one updates the value, the others will be working on an out-of-date copy of the data. Accordingly, in such multiprocessor systems a scheme is generally required to notify all processors of changes to shared values, and such a scheme that is employed is commonly referred to as a “cache coherence protocol.” Various well-known protocols have been developed for maintaining cache coherency in multiprocessor systems, such as the MESI, MSI, MOSI, and MOESI protocols. Accordingly, such cache coherency generally refers to the integrity of data stored in local caches of the multiple processors.
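The notification scheme described above can be sketched, under simplifying assumptions, as a write-invalidate protocol in the spirit of MSI. The `Cache` and `Bus` classes below are hypothetical; each cache line holds a single value, and the states are "M" (modified) and "S" (shared), with an absent line treated as invalid. This is a minimal sketch, not a description of any particular hardware implementation.

```python
class Bus:
    """Shared interconnect that lets caches observe each other's requests."""

    def __init__(self, memory):
        self.memory = memory  # address -> value (main memory)
        self.caches = []

    def fetch(self, addr, requester):
        # If another cache holds a modified copy, write it back first and
        # downgrade that copy to shared, so the requester sees the latest value.
        for cache in self.caches:
            if cache is requester or addr not in cache.lines:
                continue
            state, value = cache.lines[addr]
            if state == "M":
                self.memory[addr] = value
                cache.lines[addr] = ("S", value)
        return self.memory[addr]

    def invalidate(self, addr, requester):
        # Drop every other cached copy before the requester writes.
        for cache in self.caches:
            if cache is not requester:
                cache.lines.pop(addr, None)


class Cache:
    def __init__(self, bus):
        self.lines = {}  # address -> (state, value)
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr in self.lines:              # hit in state M or S
            return self.lines[addr][1]
        value = self.bus.fetch(addr, self)  # miss: obtain the latest value
        self.lines[addr] = ("S", value)
        return value

    def write(self, addr, value):
        self.bus.invalidate(addr, self)     # others must not keep stale copies
        self.lines[addr] = ("M", value)


bus = Bus({0x10: 1})
cpu_a, cpu_b = Cache(bus), Cache(bus)
cpu_a.read(0x10)
cpu_b.read(0x10)       # both caches now share the value 1
cpu_a.write(0x10, 2)   # invalidates cpu_b's copy
latest = cpu_b.read(0x10)
```

After the write by `cpu_a`, the read by `cpu_b` misses in its own cache and obtains the updated value rather than the out-of-date copy it previously held.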
FIG. 1 shows an exemplary prior art system 10 in which multiple homogeneous processors (or cores) are implemented. System 10 includes a first microprocessor 11 and a second microprocessor 12 with an interconnecting bus 13 between them. In this example, microprocessors 11 and 12 are homogeneous in that they are each implemented to have the same, fixed instruction set, such as x86. A main memory 11-1 is connected to microprocessor 11, and a main memory 12-1 is connected to microprocessor 12. Main memories 11-1 and 12-1 are in the same physical address space so that microprocessors 11 and 12 can each reference either of the two main memories, 11-1 or 12-1. A cache coherence protocol is implemented across the busses in order to allow the microprocessors to get the latest value of the memory wherever it currently exists.
As an example, a global physical memory address format 14 is implemented, which has a value that includes a node number 14-1 and an offset 14-2. In this example, all elements labeled 11, including microprocessor 11, main memory 11-1, cache 11-2, and Translation Look-aside Buffer (TLB) 11-3, make up a first node 101, while all elements labeled 12, including microprocessor 12, main memory 12-1, cache 12-2, and TLB 12-3, make up a second node 102. In the global physical memory address format 14, node number 14-1 indicates which node the actual physical memory resides on, and the offset 14-2 is the offset into the actual physical memory on that node.
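One way to picture such an address format is as a node number packed into the upper bits of an integer and an offset in the lower bits. The sketch below assumes a 40-bit offset field; that width, and the helper names `encode_address` and `decode_address`, are assumptions made for illustration and are not specified by the format described above.

```python
OFFSET_BITS = 40  # assumed width of the offset field (not given in the text)
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def encode_address(node, offset):
    """Pack a node number and an offset into one global physical address."""
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset does not fit in the offset field")
    return (node << OFFSET_BITS) | offset


def decode_address(address):
    """Split a global physical address into (node number, offset)."""
    return address >> OFFSET_BITS, address & OFFSET_MASK


addr = encode_address(node=1, offset=0x123456)
node, offset = decode_address(addr)  # node 1, offset 0x123456
```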
In many architectures, virtual addresses are utilized. In general, a virtual address is an address identifying a virtual (non-physical) entity. As is well-known in the art, virtual addresses may be utilized for accessing memory. Virtual memory is a mechanism that permits data that is located on a persistent storage medium (e.g., disk) to be referenced as if the data was located in physical memory. Translation tables, maintained by the operating system, are used to determine the location of the reference data (e.g., disk or main memory). Program instructions being executed by a processor may refer to a virtual memory address, which is translated into a physical address. To minimize the performance penalty of address translation, most modern CPUs include an on-chip Memory Management Unit (MMU), and maintain a table of recently used virtual-to-physical translations, called a Translation Look-aside Buffer (TLB). Addresses with entries in the TLB require no additional memory references (and therefore time) to translate. However, the TLB can only maintain a fixed number of mappings between virtual and physical addresses; when the needed translation is not resident in the TLB, action will have to be taken to load it in.
As an example, suppose a program's instruction stream that is being executed by a processor, say processor 11 of FIG. 1, desires to load data from an address “Foo” into a first general-purpose register, GPR1. Such instruction may appear similar to “LD <Foo>, GPR1”. Foo, in this example, is a virtual address that the processor translates to a physical address, such as address “123456”. Thus, the actual physical address, which may be formatted according to the global physical memory address format 14, is placed on bus 11-4 for accessing main memory 11-1, for example. Cache coherency is maintained in that if processor 12 is also executing instructions that are attempting to access “Foo” (the physical address 123456) at the same time that processor 11 is accessing it, then the cache coherency scheme resolves this to allow the microprocessors to get the latest value of Foo.
As shown in the example of FIG. 1, a cache is contained within each of the individual microprocessors, shown as cache 11-2 within microprocessor 11 and cache 12-2 within microprocessor 12. Each microprocessor first attempts to access data out of its respective cache, and when it references data that is not in its cache, it looks at main memory 11-1 and 12-1 using the global physical memory address format 14. From the node number 14-1, the microprocessor decides if the physical address or physical memory is associated with the memory of that node or if it must traverse to a remote node in order to access the remote node's memory. For example, when microprocessor 11 attempts to access data, it first attempts to access the data in its cache 11-2. If the data is not in cache 11-2, then microprocessor 11 evaluates node number 14-1 of the global physical memory address format of such data. If node number 14-1 identifies node 101 (or main memory 11-1 of node 101), then microprocessor 11 determines that the data is contained in main memory 11-1, and if node number 14-1 identifies node 102 (or main memory 12-1 of node 102), then microprocessor 11 determines that the data is contained in main memory 12-1. The data may be read from the main memory of the identified node, and may be stored to microprocessor 11's cache 11-2 for ready subsequent accesses.
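The access sequence just described (cache first, then the local or remote main memory selected by the node number) might be sketched as follows. The 16-bit offset width, the dictionary-based memories, and the `load` function are hypothetical simplifications; cache coherency and eviction are deliberately left out.

```python
OFFSET_BITS = 16  # assumed width of the offset field, for illustration only
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def load(address, local_node, cache, memories):
    """Return (value, source) for a global physical address.

    cache:    dict mapping address -> value for this processor
    memories: dict mapping node number -> {offset: value}
    """
    if address in cache:                   # 1. try the processor's own cache
        return cache[address], "cache"
    node = address >> OFFSET_BITS          # 2. node number selects the memory
    offset = address & OFFSET_MASK
    value = memories[node][offset]         # 3. read local or remote memory
    cache[address] = value                 # 4. keep a copy for later accesses
    source = "local memory" if node == local_node else "remote memory"
    return value, source


memories = {0: {4: "a"}, 1: {4: "b"}}
cache = {}
remote_address = (1 << OFFSET_BITS) | 4
first = load(remote_address, local_node=0, cache=cache, memories=memories)
second = load(remote_address, local_node=0, cache=cache, memories=memories)
```

Here the first access is served from the remote node's memory and the second from the cache, matching the order of checks in the paragraph above.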
Also shown in FIG. 1 is that each microprocessor includes a TLB, such as TLB 11-3 of microprocessor 11 and TLB 12-3 of microprocessor 12. As mentioned above, and as is well-known in the art, the role of the TLB is to translate from a virtual address (e.g., “Foo” in the above example) to the physical address. Thus, TLBs 11-3 and 12-3 are responsible for performing virtual-to-physical address translation. Accordingly, when microprocessor 11 issues a load request, it presents the virtual address to TLB 11-3. TLB 11-3 looks up the virtual address in its table, and if found, then TLB 11-3 outputs a physical address that is used to actually access cache 11-2 or main memory 11-1. If the virtual address does not exist in TLB 11-3, then microprocessor 11 walks a series of tables that are located in main memory 11-1 to find a TLB entry to be placed in TLB 11-3 to complete the reference and all future references for that virtual address.
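The hit and miss paths of a TLB can be sketched as a small, fixed-capacity table in front of a page table. In the sketch below, the 4096-byte page size, the single-level dictionary standing in for the series of in-memory tables, and the least-recently-used replacement policy are all assumptions made for illustration; the `TLB` class is hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TLB:
    def __init__(self, page_table, capacity=2):
        self.page_table = page_table  # virtual page number -> physical frame
        self.capacity = capacity      # fixed number of cached translations
        self.entries = OrderedDict()  # ordered from least to most recently used
        self.misses = 0

    def translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page in self.entries:
            self.entries.move_to_end(page)       # hit: mark as recently used
        else:
            self.misses += 1
            frame = self.page_table[page]        # stand-in for the table walk
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False) # evict least recently used
            self.entries[page] = frame
        return self.entries[page] * PAGE_SIZE + offset


tlb = TLB({0: 7, 1: 3, 2: 9}, capacity=2)
physical = tlb.translate(5)   # miss: walks the table, then caches page 0
tlb.translate(10)             # hit: same page, no additional walk
```

A lookup for a page that is absent from the page table raises `KeyError` in this sketch, which loosely corresponds to the case that an operating system would handle as a page fault.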
Thus, in this exemplary architecture, multiple homogeneous processors (e.g., processors 11 and 12) that each have common, fixed instruction sets may be implemented with a cache coherency protocol. Such a multi-processor system that has multiple homogeneous processors provides many benefits. One significant benefit of this type of multi-processor system is that a compiler can generate one executable file (e.g., a single “a.out” executable, which is well known in the art as the UNIX definition of an executable image), which may have its instructions processed by the multiple homogeneous processors. This is important for programmer productivity. Another benefit of this exemplary multi-processor system is its elimination of the need of a programmer to manage the location of data. Without a cache coherency mechanism, the programmer must explicitly manage and move data for use by an executable, which lowers programmer productivity and lowers application performance. However, because the processors are homogeneous, they each have the same processing capabilities, and thus one processor is not better suited for performing certain types of operations more efficiently than the other.
In some architectures, special-purpose processors that are often referred to as “accelerators” are implemented to perform certain types of operations. For example, a processor executing a program may offload certain types of operations to an accelerator that is configured to perform those types of operations efficiently. Such hardware acceleration employs hardware to perform some function faster than is possible in software running on the normal (general-purpose) CPU. Hardware accelerators are generally designed for computationally intensive software code. Depending upon granularity, hardware acceleration can vary from a small functional unit to a large functional block like motion estimation in MPEG2. Examples of such hardware acceleration include blitting acceleration functionality in graphics processing units (GPUs) and instructions for complex operations in CPUs. Such accelerator processors generally have a fixed instruction set that differs from the instruction set of the general-purpose processor, and the accelerator processor's local memory 16-2 does not maintain cache coherency with the general-purpose processor.
A graphics processing unit (GPU) is a well-known example of an accelerator. A GPU is a dedicated graphics rendering device commonly implemented for a personal computer, workstation, or game console. Modern GPUs are very efficient at manipulating and displaying computer graphics, and their highly parallel structure makes them more effective than typical CPUs for a range of complex algorithms. A GPU implements a number of graphics primitive operations in a way that makes running them much faster than drawing directly to the screen with the host CPU. The most common operations for early two-dimensional (2D) computer graphics include the BitBLT operation (combines several bitmap patterns using a RasterOp), usually in special hardware called a “blitter”, and operations for drawing rectangles, triangles, circles, and arcs. Modern GPUs also have support for three-dimensional (3D) computer graphics, and typically include digital video-related functions.
FIG. 2 shows an exemplary prior art system architecture in which accelerator hardware 16, such as GPU 16-1, is implemented. The accelerator hardware 16 is capable of being called by host processors 11 and 12 via input/output (I/O) 15. Thus, for instance, graphics operations of a program being executed by host processors 11 and 12 may be passed to GPU 16-1 via I/O 15. While the homogeneous host processors 11 and 12 maintain cache coherency with each other, as discussed above with FIG. 1, they do not maintain cache coherency with accelerator hardware 16 (e.g., GPU 16-1). This means that GPU 16-1 reads and writes to its local memory are NOT part of the hardware-based cache coherency mechanism used by nodes 101 and 102. This also means that GPU 16-1 does not share the same physical or virtual address space of nodes 101 and 102.
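The practical consequence of a non-coherent accelerator with its own address space can be sketched as follows: the host must explicitly copy data into the accelerator's local memory and copy results back, and later host-side updates are not seen by the accelerator. The `Accelerator` class, its method names, and the `scale` operation are hypothetical and are not drawn from any real GPU programming interface.

```python
class Accelerator:
    """Toy accelerator with a private local memory and no coherency."""

    def __init__(self):
        self.local_memory = {}  # buffer name -> private copy of the data

    def copy_in(self, name, data):
        # Explicit transfer: the accelerator receives a snapshot of the data.
        self.local_memory[name] = list(data)

    def scale(self, name, factor):
        # Stand-in for an offloaded operation on local memory only.
        self.local_memory[name] = [x * factor for x in self.local_memory[name]]

    def copy_out(self, name):
        # Explicit transfer of results back to the host.
        return list(self.local_memory[name])


host_buffer = [1, 2, 3]
accelerator = Accelerator()
accelerator.copy_in("buf", host_buffer)
host_buffer[0] = 100                   # host update after the copy
accelerator.scale("buf", 2)
result = accelerator.copy_out("buf")   # computed from the stale snapshot
```

Because no coherency mechanism links the two memories, the change to `host_buffer[0]` never reaches the accelerator; the program itself would have to issue another `copy_in` to make it visible.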
Thus, in this exemplary architecture, heterogeneous processors that each have different, fixed instruction sets may be implemented. That is, general-purpose processor(s), such as processors 11 and 12, may be implemented having a first instruction set, and an accelerator processor, such as GPU 16-1, may be implemented having a different instruction set for performing certain types of operations efficiently. A cache coherency protocol is not used between the heterogeneous processors (e.g., between general-purpose processors 11, 12 and accelerator processor 16).
Accordingly, in some architectures a plurality of homogeneous processors are implemented that each have fixed, common instruction sets and which maintain cache coherency with each other. And, in some architectures a plurality of heterogeneous processors are implemented that have fixed, different instruction sets (e.g., in FIG. 2 host processors 11 and 12 each have a fixed first instruction set, such as x86, and accelerator 16 provides a heterogeneous processor with a different fixed instruction set), wherein cache coherency is not maintained across the heterogeneous processors.
Application-specific integrated circuits (ASICs) are known, which are commonly implemented as custom designs. An ASIC is an integrated circuit (IC) customized for a particular use, rather than intended for general-purpose use. For example, an ASIC may be implemented as a chip that is designed solely to run a cellular telephone. In contrast, the well-known 7400 series and 4000 series integrated circuits are logic building blocks that can be wired together for use in many different applications. As feature sizes have shrunk and design tools improved over the years, the maximum complexity (and hence functionality) possible in an ASIC has grown from 5,000 gates to over 100 million. Modern ASICs often include entire 32-bit processors, memory blocks including ROM, RAM, EEPROM, Flash and other large building blocks. Designers of digital ASICs generally use a hardware description language (HDL), such as Verilog or VHDL, to describe the functionality of ASICs. Once a design is completed and a mask set produced for a target chip, an ASIC is created. The configuration is created once. If a new configuration is needed, an entirely new design is needed. Thus, ASICs are not field-programmable.
Additionally, various devices are known that are reconfigurable. Examples of such reconfigurable devices include field-programmable gate arrays (FPGAs). A field-programmable gate array (FPGA) is a well-known type of semiconductor device containing programmable logic components called “logic blocks”, and programmable interconnects. Logic blocks can be programmed to perform the function of basic logic gates such as AND and XOR, or more complex combinational functions such as decoders or simple mathematical functions. In most FPGAs, the logic blocks also include memory elements, which may be simple flip-flops or more complete blocks of memories. A hierarchy of programmable interconnects allows logic blocks to be interconnected as desired by a system designer. Logic blocks and interconnects can be programmed by the customer/designer, after the FPGA is manufactured, to implement any logical function, hence the name “field-programmable.”
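A programmable logic block can be modeled, as a simplification, as a lookup table whose contents determine the logic function it performs; many FPGAs realize combinational logic this way, although the text above does not specify the mechanism. The `make_lut` helper and the half-adder wiring below are hypothetical illustrations.

```python
def make_lut(truth_table):
    """Return a logic block 'programmed' with the given truth table.

    truth_table holds one output bit per input combination, indexed by the
    inputs packed into an integer with the first input as the highest bit.
    """
    def lut(*inputs):
        index = 0
        for bit in inputs:
            index = (index << 1) | (bit & 1)
        return truth_table[index]
    return lut


# The same kind of block becomes a different gate depending on its contents.
AND2 = make_lut([0, 0, 0, 1])
XOR2 = make_lut([0, 1, 1, 0])


def half_adder(a, b):
    """'Interconnect' two blocks into a half adder: returns (sum, carry)."""
    return XOR2(a, b), AND2(a, b)


sum_bit, carry_bit = half_adder(1, 1)
```

Reprogramming in this sketch amounts to supplying a different truth table, with no change to the block itself, which is the property that distinguishes such devices from a fixed ASIC design.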
Further, various software compilers are known in the art for generating executable files that contain instructions to be processed by a processor. Traditional compilers generate instructions for a fixed instruction set of a micro-processor, such as for a micro-processor having an x86 or other fixed instruction set. The generated instructions are included in an executable image that can be executed by a micro-processor (or multiple homogeneous processors, such as those of FIG. 1) having the fixed instruction set (e.g., x86).