Over the last few decades the peak rate at which microprocessors can execute instructions has increased dramatically, increasing performance and decreasing the time required to perform useful applications. This increase has been due in large part to two separate trends: an increase in the rate at which the basic steps may be carried out (termed clock rate and measured in gigahertz, or GHz), and an increase in the average number of instructions from a single program that may be started in any one of these clock cycles (termed Instruction Level Parallelism or ILP, and measured either as Cycles per Instruction (CPI) or its reciprocal, Instructions per Cycle (IPC)). In most cases one cycle corresponds to one clock period (thus, N GHz corresponds to N billion cycles per second). Both of these factors have benefited from improvements in the semiconductor technology used to create logic; specifically, the individual transistors have become faster and smaller.
Unfortunately, in many real systems the actual delivered performance has not been growing at the same rate, primarily because these same improvements in semiconductor technology, when applied to the memory side of computers, have been used to increase the number of bits per unit area of silicon chip, not the speed of access. Thus, there has been much discussion of the emergence of the “memory wall.” Relatively speaking, it is taking more and more CPU cycles for an instruction to reach from the chip containing the logic of the Central Processing Unit (or CPU), where it is interpreted, to the chips containing the memory. In general, the round-trip time of such an access is referred to as the memory latency, and it is measured in clock cycles. Today, with CPU clocks in the 4+ GHz range, and memory up to 200 nanoseconds away, such latency may approach a thousand cycles. This wall is expected only to grow with future technology improvements.
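As a rough sanity check on those figures (both the 4 GHz clock and the 200 ns round trip are illustrative round numbers from the text, not measurements), the latency in cycles is simply the round-trip time multiplied by the clock rate:

```java
public class LatencyCycles {
    public static void main(String[] args) {
        double clockHz = 4e9;          // 4 GHz CPU clock (illustrative)
        double memLatencySec = 200e-9; // 200 ns round-trip memory latency (illustrative)
        // cycles lost to one memory access = latency (s) * clock rate (cycles/s)
        long cycles = (long) (clockHz * memLatencySec);
        System.out.println(cycles);    // 800 cycles -- approaching a thousand
    }
}
```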
A variety of techniques have been employed to increase total performance by increasing the number of activities being performed on behalf of a program by the CPU. A first example is deep pipelining, where the basic CPU logic is broken into a series of smaller circuits that can all be run concurrently on different instructions that enter the pipeline one at a time in program order. A second example is superscalar execution, where the CPU attempts to start multiple different instructions at the same time. Variations of this latter technique include in-order execution (where only instructions that are known to follow each other as seen by a programmer may be launched simultaneously), out-of-order execution (where some subset of “ready to run” instructions are chosen arbitrarily from a pool of potential instructions), and speculative execution (where instructions are started even when it is not clear that the program will want them executed).
In all of these cases, significant logic is expended to ensure that the results computed by executing all these instructions concurrently are the same as if the instructions had been executed “one at a time,” from start to finish, in program order (the order the programmer had in mind when the program code was written). Techniques such as bypassing, forwarding, or short-circuiting are used to move copies of data computed by some first instruction directly to the logic trying to execute some second instruction that needs such results, without waiting for the first instruction to complete and deposit its result in the target register specified by the instruction, from which the second instruction would logically retrieve it.
The above techniques become quite difficult when the result to be provided by some first instruction must come from memory. The completion of this first instruction must now be suspended for a potentially very long period, as must any second instruction that depends on its result, and, in turn, any instructions dependent on those. When there are many such memory reference instructions (as there are in most real programs), the CPU quickly runs out of resources to keep track of all the suspended instructions, and thus becomes incapable of starting additional ones. The CPU grinds to a halt while the memory references are processed.
To reduce the memory latency, and thus, reduce the time in which CPUs are running at less than peak rate due to memory operations, caches have been introduced to keep copies of different parts of memory in storage closer to the CPU (usually on the same chip as the CPU). Many machines today have two or even three levels of caches of different sizes and speeds to help with the illusion of memory being closer to the CPU. However, given the lengths of latencies when the required data is not in cache, even an occasional reference that misses the cache can severely impact performance.
Complicating such situations is the need to keep some order to the sequence in which memory is accessed, particularly when instructions that are to change memory (stores) are interleaved with those that simply wish to read memory (loads). Most computer designs define some memory consistency model that specifies the order in which memory operations are to be performed in relation to the program order of the instruction execution. This requires additional logic, and additional information, largely invisible to the programmer, to keep track of which memory instructions are pending, the addresses of the memory locations they are accessing, and the order of these instructions relative to program order. This logic must know how to determine when certain memory requests must be delayed until the completion of others, and when it is safe to allow operations to proceed. Such functionality is often found in load queues, store buffers, etc.
The temporary storage needed to hold temporary copies or keep track of these instruction dependencies is generally referred to as part of the programmer invisible machine state. This is in contrast to the programmer visible machine state, which consists of all the registers and program control information that persists from instruction to instruction and is visible to, or under the explicit control of, the programmer. Examples of the latter include the register file, the program counter, status registers, etc. In virtually all modern computer architectures, the invisible state significantly exceeds the visible state in size (number of bits of information).
Classical parallel processing tries to beat this memory wall in an orthogonal way by building systems with multiple CPUs and multiple memories, and writing programs that are explicitly broken into smaller subprograms that may be run independently and concurrently, usually on different CPUs. When each of these subprograms controls its own memory, CPU, and other resources, it is often referred to as a process, and the technique of running different communicating processes on different processors is called multi-processing.
Multi-processing generally has two major variants. In Shared Memory systems, even though no particular memory unit may be physically near some specific CPU, any program running on any processor may perform loads and stores to any memory anywhere in the system (obviously some references may take longer than others). In Distributed Memory systems, each CPU has its own memory (with the combination called a node), and memory references from a CPU are allowed only to that CPU's memory. To communicate with a different node, a program must explicitly manage some sort of communication via a message that is handled by some sort of node-to-node communication mechanism.
In many cases all the memory for a Shared Memory Multiprocessor (SMP) is physically on the same memory bus with all the CPUs. However, it is possible to build machines where the memory is physically distributed to the different CPUs, but logically accessible to all CPUs via interconnection networks that tie all such nodes together. Such systems are generally referred to as Distributed Shared Memory (DSM) systems.
When each subprogram owns only its own set of registers, and the memory and other resources are owned and managed at a higher level, it is generally referred to as a thread, and the technique of running multiple threads at the same time is called multi-threading. Modern parallel programming languages such as UPC make such threads visible to the programmer and allow the programmer explicit control over the allocation of different threads to different parts of the program's execution.
Multi-threading has been implemented in a variety of ways. Most simply, a single CPU is able to run a single thread uninterrupted for a while, and then stop, save the thread's visible state, especially registers, to memory, load the register values for some other thread into the CPU registers, and run that thread for a while. Doing so, however, requires that before a thread's registers are saved to memory, all activity associated with that thread must come to some sort of completion, where nothing in the CPU's invisible state is needed to restart the thread at a later point.
Keeping multiple sets of registers in the CPU, and simply selecting which one of them will control the CPU logic, may greatly reduce the cost of a thread switch. However, when one thread is running, any delays due to memory or dependencies will result in delaying not only the current thread, but also the time at which a different thread may be given control and allowed to execute.
With multiple sets of register files within the CPU, additional strategies may be used to reduce dead time in the CPU. For example, when one thread reaches a situation where no forward progress may be made for a while (such as all instructions from one thread are blocked waiting for a long memory reference to resolve itself), new instructions may be issued in support for some other thread.
Such techniques may greatly increase the amount of internal invisible machine state within the CPU, and usually require adding tags to each such item to identify the thread (signified by which physical register file) to which the state information belongs.
Simultaneous multi-threading (SMT) may take this process one step further by allowing instructions from not one, but multiple threads to be simultaneously in execution within a single CPU's logic. How and when different instructions from different threads are issued into the logic may be as simple as issuing a few instructions (often only one) from one thread and then switching to issuing instructions from another thread, before the first instructions have completed. It may be as complex as enhanced superscalar designs, where one or more instructions from different threads are started at exactly the same time.
In many of these designs, once a thread has had an instruction issued into a CPU on its behalf, no more instructions are allowed to be issued for that thread until the first one completes. With a sufficient number of threads available to the CPU, this means that the often significant costs associated with inter-instruction dependency checking, tracking, and associated invisible state need not be implemented. This results in a much cleaner and simpler design, and one that may often actually drive the hardware to higher levels of utilization.
One limitation of current multi-threaded architectures that explicitly allow multi-threaded programming is that there is usually some fixed maximum number of threads that the hardware may support, that each of these threads is bound to some specific node, and that the threads actually in use by the program are likewise fixed and managed explicitly by the program. In a real sense, each “physical” thread has a unique name (that which identifies the set of registers it owns) that is part of the programming model. Thus, reallocating the physical resources associated with a thread is an explicit high-level software function. This includes redirecting a particular hardware thread resource set to perform some other portion of an application.
Once multiple threads are available to support an application, a very common next requirement is that different threads supporting the same application need to exchange data among themselves and otherwise synchronize their behavior during the execution of the overall program.
Synchronizing their behavior may take several forms. Shared memory locations may be used as locks, semaphores, or monitors to restrict concurrent access to some critical section of code to some limited number of threads (usually just a single thread) at a time. Threads that find the critical section occupied must wait until one or more of the threads currently within the critical section exit. Barriers are a variant of this where no thread is allowed to pass a certain point in a program until some defined subset of other currently active threads (usually all of them) have also reached the same point.
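The lock and barrier constructs just described can be sketched with Java's standard java.util.concurrent library; the thread count and shared counter here are invented for illustration. The lock admits one thread at a time to the critical section, and the barrier holds every thread until all have arrived:

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.locks.ReentrantLock;

public class SyncDemo {
    static final int THREADS = 4;
    static final ReentrantLock lock = new ReentrantLock();          // guards the critical section
    static final CyclicBarrier barrier = new CyclicBarrier(THREADS); // all threads must reach it
    static int counter = 0;                                          // shared state, protected by lock

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            workers[i] = new Thread(() -> {
                // critical section: only one thread at a time may update counter
                lock.lock();
                try { counter++; } finally { lock.unlock(); }
                // barrier: no thread passes this point until all THREADS have arrived
                try { barrier.await(); } catch (Exception e) { throw new RuntimeException(e); }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        System.out.println(counter); // 4
    }
}
```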
Finally, there is very often a defined ordering to the threads and what data they process. Under producer-consumer programs, some thread is responsible for computing, or “producing,” all elements of some stream of data items, and some other thread is responsible for performing some further computation on these items, i.e. “consuming” them. Of course there may be a chain of such threads, with one thread being both the consumer of one stream of data and the producer of another. What is key here is the hand-off of data items from producer to consumer: a consumer should not start processing an item until it is assured that the item has arrived from the producer in its final form, and a producer wants assurance that every data item it generates will in fact be delivered to a consumer, usually in the order of production. Producer-consumer semantics is the name usually given to the mechanisms employed by the program to guarantee these constraints.
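A minimal sketch of producer-consumer semantics, here using Java's bounded BlockingQueue as the hand-off mechanism (the item count and queue capacity are arbitrary). The blocking put/take calls guarantee that the consumer only sees fully produced items, and the FIFO queue preserves production order:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumer {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(4); // bounded FIFO channel
        final int ITEMS = 5;

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < ITEMS; i++) queue.put(i); // blocks if the queue is full
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < ITEMS; i++) {
                    int item = queue.take(); // blocks until an item is available
                    System.out.println(item); // arrives in production order: 0,1,2,3,4
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        producer.start(); consumer.start();
        producer.join(); consumer.join();
    }
}
```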
In terms of languages, UPC has properties that begin to dovetail nicely with such architectures, albeit without explicit producer/consumer functionality. UPC explicitly supports multiple threads within at least a partially shared memory model. Each thread has an “affinity” to a particular region of shared memory, as determined by address, but can freely access other regions of shared memory. Objects may be mapped into this shared memory space so that it is known which objects are in affinity with which threads. In addition, each thread has access to a private, local, memory space that is inaccessible to the other threads.
The programming language Java has some different but again relevant properties. It has explicit support for the concept of threading, for relatively unlimited multi-threading, and for both shared and private working memory. However, being an object-oriented language, it hides the underlying machine's address space from the programmer. Instead, Java allows virtually any object class to have a thread definition attached to it. Thus, when an instance of such an object is created, the attached run method can be invoked, starting a thread that has access to the object's components. This thread runs until it decides to terminate itself, although there is a mechanism for other threads to post an “interrupt” to the object that the thread can explicitly test. Likewise, through appropriate method calls, external threads can gain access to the object's components, and thus interact with the object's resident thread. Finally, there are methods that allow one thread to wait for termination of another thread, and for a thread to select its priority for execution.
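The Java facilities described above (a run method attached to an object, an interrupt posted by another thread and tested explicitly, waiting for termination, and priority selection) can be sketched as follows; CounterTask and its field are hypothetical names for illustration:

```java
public class CounterTask implements Runnable {
    private long count = 0; // object component, visible to both the resident thread and outsiders

    @Override
    public void run() {
        // the resident thread runs until another thread posts an interrupt,
        // which this thread tests explicitly on each iteration
        while (!Thread.currentThread().isInterrupted()) {
            synchronized (this) { count++; }
        }
    }

    public synchronized long getCount() { return count; } // external access to the object's state

    public static void main(String[] args) throws InterruptedException {
        CounterTask task = new CounterTask();
        Thread t = new Thread(task);
        t.setPriority(Thread.NORM_PRIORITY); // a thread's execution priority can be selected
        t.start();
        Thread.sleep(10);  // let the resident thread run for a while
        t.interrupt();     // post an "interrupt" for the thread to test
        t.join();          // wait for termination of the other thread
        System.out.println(task.getCount() > 0); // the thread did make progress
    }
}
```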
While Java doesn't support explicit producer/consumer semantics, it does support synchronized methods, whereby once such a method has been invoked against a certain object, that object is “locked out” from being accessed via any synchronized method by another thread until the first method completes. The Java Virtual Machine provides one such lock/monitor for each object, along with instructions to acquire and release it.
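A minimal sketch of such a synchronized method guarding an object's state (the Account class and the amounts are invented for illustration). Because both threads must acquire the same per-object monitor before entering deposit(), their read-modify-write updates cannot interleave, so no increments are lost:

```java
public class Account {
    private int balance = 0;

    // synchronized: the JVM acquires this object's monitor on entry and
    // releases it on exit, so two threads cannot run deposit() concurrently
    public synchronized void deposit(int amount) {
        int b = balance;      // read-modify-write, safe only under the lock
        balance = b + amount;
    }

    public synchronized int getBalance() { return balance; }

    public static void main(String[] args) throws InterruptedException {
        Account acct = new Account();
        Runnable work = () -> { for (int i = 0; i < 100_000; i++) acct.deposit(1); };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(acct.getBalance()); // 200000 -- no lost updates
    }
}
```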
Modern semiconductor technology now allows multiple CPUs, each called here a core, to be placed on the same semiconductor die, sharing caches and other memory structures with each other. Such chips are called multi-core chips. A chip referred to as EXECUBE was arguably the first such design, with 8 independent and complete CPU cores on the same die, along with memory. Virtually all modern microprocessor vendors now offer such multi-core chips, albeit with on-chip caches rather than on-chip memory. There is no constraint as to what kind of CPU each core implements: pipelined, superscalar, or multi-threaded.
A separate semiconductor fabrication technique, termed Processing-In-Memory (PIM), Merged Memory and Logic Devices (MLD), or Intelligent RAM (IRAM), now allows such cores to be placed not on a logic die but on a high-density memory die, greatly reducing the latency when memory references made by the CPU are to the on-chip memory. Again, the EXECUBE chip was arguably the first to have done so with a DRAM memory technology.
This technique has not yet achieved widespread use because the amount of memory typically desired per microprocessor CPU today is much more than may be placed on a single silicon die. In addition, when the base semiconductor technology is high density DRAM, the transistors used for on-chip logic are often noticeably slower than those on a die made for high speed logic.
Thus, much of what is being done classically, both at the Instruction Set Architecture level (the ISA: a description of the instructions and program-visible data structures from which programs may be constructed, and what they perform when executed) and at the microarchitectural level (the major building blocks of a computer and how they implement the ISA), is increasing the amount of state information that must be maintained at the site of execution of a program thread. This increasing state has a chain-reaction effect on computer designs: more state is added to help reduce the apparent gap to memory, which buries the core logic that actually does something useful deeper and deeper into a chip and further and further from memory, which in turn requires still more state to overcome the effects of the increased distance. Virtually all of the predictive and caching schemes developed over the past few decades have had such an effect.
The raw latencies get even worse when one considers highly parallel systems where memory may literally be on the other side of the machine room.
Furthermore, these memory wall problems stem largely from a lack of re-examination of the underlying execution model: from the assumptions that there are at best a very limited number of function-unit blocks of logic that perform the basic processing, that these function units must be separated from memory, and that the purpose of the surrounding core logic is to transfer data to and from such units in the overall effort to keep them utilized.
This does not conform to the current state of technology. The average personal computer today has several thousand square millimeters of silicon, most of which is cheap memory, while at best a very few square millimeters of logic (the function units), buried in the single most expensive chip (in cost, power, area, complexity, etc.), are what modern designs try so heroically to use efficiently.