The recent advent of the multicore era, in which a single computing component contains two or more independent central processing units (called “cores”) that read and execute program instructions, has refocused much hardware and software development activity on optimizing data cache locality and concurrency. Application performance is no longer gated solely by the performance of the core arithmetic unit. Rather, data locality, memory bandwidth, and concurrent execution are becoming primary metrics for potential performance increases. While data caches and core density continue to grow alongside transistor density, memory bandwidth has failed to grow at comparable rates. As a result, applications are increasingly forced to remain data cache aware when executing concurrently in order to reduce pressure on main memory. Many applications and kernels access memory in non-unit stride or irregular patterns, making data cache locality difficult to achieve.
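By way of illustration only, the effect of a non-unit stride on data cache locality can be sketched with a short Python model. The 64-byte cache line, the 8-byte element size, and the function name `cache_lines_touched` are assumptions chosen for the example and do not describe any particular processor.

```python
def cache_lines_touched(num_accesses, stride_elems, elem_bytes=8, line_bytes=64):
    """Count the distinct cache lines touched by a strided sweep starting at address 0.

    Assumed parameters (hypothetical): 8-byte elements and 64-byte cache lines.
    """
    lines = set()
    for i in range(num_accesses):
        addr = i * stride_elems * elem_bytes  # byte address of the i-th access
        lines.add(addr // line_bytes)         # index of the cache line holding it
    return len(lines)


unit = cache_lines_touched(1024, stride_elems=1)     # eight consecutive elements share a line
strided = cache_lines_touched(1024, stride_elems=8)  # every access lands on a new line
print(unit, strided)  # prints: 128 1024
```

Under these assumptions, the same 1024 accesses pull eight times as many lines from memory when the stride equals the number of elements per line, which is one way the pressure on main memory described above arises.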
Traditional microprocessor designs are developed from the perspective of a given function pipe. As a result, the core is very myopic in focus: its operation considers only the instructions and the cache mechanisms the core has access to, and gives very little consideration to memory. In operation, when the core is presented with a long latency event (e.g., loading values from main memory, which is considerably slower than loading values from cache memory), a significant amount of processing time is consumed, and the function unit has very little ability either to predict the delay or to do anything to mitigate its effects. During the period in which the core is waiting for completion of the long latency event, the core is basically doing nothing, or “stalling”. Such stalling is costly because power utilization and processing performance are often important in processor-based system implementations. For example, power utilization is particularly important with respect to mobile applications, where the system continues to consume power, and thus drain the battery, during the aforementioned stalling. In the case of high performance computing, the aforementioned stalling wastes processor cycles, whereas high performance computing should make the most efficient use of every possible clock cycle within a given processor core.
Efficient application concurrency can dramatically affect overall performance. For example, application concurrency may be utilized in multicore processor platforms to provide increased performance. Concurrency is the notion of multiple things happening at the same time, thus in implementations of application concurrency a plurality of applications or instructions thereof are performed in parallel by one or more processing units of the processor platform. When implemented on a single processor, multithreading generally occurs by time-division multiplexing (e.g., “multitasking”), wherein the processing unit switches between different threads (a program section, such as the smallest sequence of programmed instructions that can be managed independently by an operating system scheduler, declared to be a thread for execution by a processing unit). This context switching (i.e., switching between the threads executed by a processing unit) generally happens frequently enough that the user perceives the threads or tasks as running at the same time. When implemented on multiple processors (e.g., processing units of a multi-core system), the threads or tasks will actually run at the same time, with each processor unit running a particular thread or task. Such application concurrency may have a dramatic effect on overall performance, especially when the target applications access memory in irregular or non-unit stride patterns.
Introducing concurrency to an application typically involves the creation of one or more additional threads. Unfortunately, writing threaded code is challenging. Threads are low-level constructs that typically must be managed manually. The optimal number of threads for an application can change dynamically based on the current system load and the underlying hardware. Thus, implementing a correct threading solution can become extremely difficult. Moreover, the synchronization mechanisms typically used with threads add complexity and risk to software designs without any guarantees of improved performance. Accordingly, implementing efficient application concurrency so as to improve overall performance presents challenges with respect to scheduling and controlling the multiplexing or use of multiple threads, the objective being to have each core performing processing functions on every clock cycle.
As memory hierarchies become increasingly complex and core density increasingly large, the pressures of effectively utilizing the operating system and software runtime scheduling mechanisms become exponentially more difficult to manage. Furthermore, few modern architectures and instruction sets provide adequate control over when and how thread and task concurrency is enforced. Application concurrency as implemented in the past largely relies on software runtime libraries and operating system scheduling paradigms.
For example, one common multithreading implementation implements periodic thread switching (referred to as simultaneous multithreading (SMT)) in which instructions from multiple threads are fetched, executed and retired on each cycle, sharing most of the resources in the core. Accordingly, the processing units of an SMT multithreading multicore platform implement periodic context switching (referred to herein as hardware interleaved context switching), whereby the processing units (e.g., cores of a multicore system) implementing multiple threads issue an instruction from a different thread every predetermined period (e.g., every cycle or every X cycles), as defined by the hardware configuration. That is, hardware interleaved context switching implements fixed time-based context switching where a predetermined, fixed time period is provided for executing each thread before switching to execution of a next thread. The periodicity of the hardware interleaved context switching is established a priori, without knowledge of the particular threads or the functionality being performed and without the ability of the system or a user (e.g., a programmer) to change the switching period. Although such periodic thread switching may result in the core performing processing functions during most if not all (e.g., in the case of context switching on every cycle) clock cycles, the processing efficiency is often less than optimum due, for example, to the context switching being independent of the particular operations being performed.
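By way of illustration only, hardware interleaved context switching can be sketched as a simplified Python model in which cycle c always belongs to thread c mod n. The function name `interleaved_schedule`, the representation of a thread as a list of instruction latencies, and the rule that a thread may not issue again until its previous instruction completes are assumptions of this sketch, not a description of any actual hardware.

```python
def interleaved_schedule(threads):
    """Simulate fixed cycle-by-cycle interleaving (a hypothetical, simplified model).

    threads: list of lists; each inner list holds the latency in cycles of each
             instruction of one thread.
    Returns (cycles elapsed until the last instruction issues, unused issue slots).
    """
    n = len(threads)
    pc = [0] * n        # index of the next instruction of each thread
    ready_at = [0] * n  # first cycle at which each thread may issue again
    cycle = 0
    idle = 0
    while any(pc[t] < len(threads[t]) for t in range(n)):
        t = cycle % n   # the slot owner is fixed by the cycle number alone
        if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
            ready_at[t] = cycle + threads[t][pc[t]]
            pc[t] += 1
        else:
            idle += 1   # owner is finished or still waiting: the slot is lost
        cycle += 1
    return cycle, idle


# Four threads of single-cycle instructions fill every slot.
print(interleaved_schedule([[1, 1], [1, 1], [1, 1], [1, 1]]))  # prints: (8, 0)
# One long-latency instruction leaves slots unused, because the rotation
# cannot skip the waiting thread or hand its slots to another thread.
print(interleaved_schedule([[1, 1], [8, 1]]))                  # prints: (10, 6)
```

In the second call, four instructions occupy ten cycles; the six lost slots arise because the switching schedule in this model is independent of what the threads are doing.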
Alternatively, other common multithreading implementations use event-based context switching (referred to as switch on event (SOE) multithreading), in which instructions from a single thread are fetched, executed and retired while particular events (e.g., long latency memory operation events) are used to initiate switching between the different threads. Accordingly, the processing units of an SOE multithreading multicore platform implement SOE context switching, whereby the processing units implementing multiple threads issue an instruction from a different thread upon certain predefined events, as defined in the operating system and/or hardware configuration. That is, SOE context switching implements event based switching where the events are predetermined, usually very simplistic, and defined in the hardware implementation. The particular events for which context switching is provided are determined a priori, without knowledge of the actual way in which the system is being utilized or how the particular functions are being implemented. Such event based thread switching can result in less than optimum processor efficiency for at least two reasons. First, there may be additional events, not among the predefined events, which result in long latencies in the context of the particular operation of the system. Second, the particular events for which event based thread switching is provided may not, in the context of the particular operation of the system, result in long latencies.
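By way of illustration only, SOE context switching with a fixed, predefined event set can be sketched as a simplified Python model. The function name `soe_schedule`, the instruction kinds `"alu"`, `"load_miss"` and `"div"`, the latencies, and the assumption that a context switch itself costs no cycles are all hypothetical choices made for the example.

```python
def soe_schedule(threads, switch_events):
    """Simulate switch-on-event scheduling (a hypothetical, simplified model).

    threads: list of lists of (kind, latency_in_cycles) instructions.
    switch_events: set of instruction kinds that trigger a context switch.
    Returns (total cycles, cycles in which the core issued nothing).
    """
    n = len(threads)
    pc = [0] * n        # index of the next instruction of each thread
    ready_at = [0] * n  # first cycle at which each thread may issue again
    cycle = wasted = current = 0
    while any(pc[t] < len(threads[t]) for t in range(n)):
        runnable = [t for t in range(n)
                    if pc[t] < len(threads[t]) and ready_at[t] <= cycle]
        if not runnable:
            wasted += 1  # every unfinished thread is waiting on an event
            cycle += 1
            continue
        if current not in runnable:
            # Move to the nearest runnable thread in round-robin order.
            current = min(runnable, key=lambda t: (t - current) % n)
        kind, latency = threads[current][pc[current]]
        pc[current] += 1
        if kind in switch_events:
            # Predefined event: park this thread and switch away.
            ready_at[current] = cycle + latency
            current = (current + 1) % n
            cycle += 1
        else:
            # Not a predefined event: the core waits out the full latency.
            wasted += latency - 1
            cycle += latency
    return cycle, wasted


a = [("alu", 1), ("load_miss", 20), ("alu", 1)]
b = [("alu", 1), ("div", 20), ("alu", 1)]
c = [("alu", 1)] * 20

print(soe_schedule([a, b, c], {"load_miss"}))         # prints: (45, 19)
print(soe_schedule([a, b, c], {"load_miss", "div"}))  # prints: (26, 0)
```

In the first call the long-latency `"div"` is not among the predefined events, so the core stalls for 19 cycles even though another thread has ready work. Because the event set in such a scheme is fixed in advance, a workload dominated by an unanticipated long-latency operation gains nothing from the switching mechanism.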
In an attempt to provide increased performance in a multicore environment where data locality, memory bandwidth, and concurrent execution are the primary metrics for potential performance increases, several platforms have utilized data cache-less processor architectures coupled with novel memory systems in order to explore or optimize performance for applications and kernels without sufficient memory locality. For example, the Cray XMT combines the interconnect technology found in the Cray XT3 series of supercomputers with the processor originally designed for the Tera MTA. The Cray XMT is a distributed shared memory (DSM) system in which the globally shared address space is uniformly scrambled at very fine granularity on the different node memories. Rather than utilizing large memory request payloads, the Cray XMT relies on small messages and fine-grain thread parallelism to hide latencies to main memory and prevent arithmetic unit stalls. Each multithreaded barrel processor consists of the core processor logic, a DDR memory controller, HyperTransport chip interconnect logic and a switch that interconnects the aforementioned components. The core processor logic consists of 128 hardware streams, each of which is only permitted to have a single instruction in the pipeline at any given time. However, the Cray XMT requires a unique programming environment and compiler that is specifically crafted to expose sufficient parallelism in order to efficiently utilize the underlying hardware. Moreover, each node integrates a custom processing unit that switches context among numerous hardware threads on a cycle-by-cycle basis, as defined by the hardware configuration thereof (i.e., hardware interleaved context switching). 
Although the Cray XMT implementation of hardware interleaved context switching may result in improved performance with respect to the processing of the various threads (i.e., the parallel processing), performance suffers as the threads of a particular application are brought together at the end of their processing. That is, as the plurality of threads come together and serial processing of the application is performed, the cycle-by-cycle thread switching causes the serial portion to execute extremely slowly (e.g., the Cray XMT supports 128 threads, and thus the aforementioned serial processing receives 1/128th of the clock cycles).
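By way of illustration only, the cost of the serial portion can be estimated with a short Python calculation that takes the 1/128 figure above at face value. The function name `serial_phase_share`, the instruction counts, and the idealization that the parallel phase fills every issue slot are assumptions of this sketch.

```python
def serial_phase_share(parallel_instr, serial_instr, hardware_streams=128):
    """Estimate cycles spent in the serial phase (idealized, hypothetical model).

    Returns (serial-phase cycles, fraction of total cycles spent in the serial phase).
    """
    # Parallel phase: assume every slot is filled, so one instruction issues per cycle.
    parallel_cycles = parallel_instr
    # Serial phase: the one active stream owns 1 of every `hardware_streams` slots.
    serial_cycles = serial_instr * hardware_streams
    total_cycles = parallel_cycles + serial_cycles
    return serial_cycles, serial_cycles / total_cycles


serial_cycles, share = serial_phase_share(990_000, 10_000)
print(serial_cycles, round(share, 3))  # prints: 1280000 0.564
```

Under these assumptions, a serial portion comprising only 1% of the instructions accounts for more than half of the total execution time.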
Another example of a multicore platform developed to provide increased performance is the IBM CYCLOPS-64 system. The IBM CYCLOPS-64 architecture is built upon the notion of building a multiprocessor-on-chip using a large number of lightweight cores. In particular, the CYCLOPS-64 design includes 75 processors, each with two thread units and two 32 KB SRAM memory banks. No data caches are included on chip. However, each SRAM bank can be configured as a small scratchpad memory. The remaining portions of SRAM are grouped together to form a global memory that is uniformly addressable from all thread units. Unlike the Cray XMT, however, the programming model for the CYCLOPS-64 does not require platform-specific semantics. Rather, commodity parallel programming constructs such as OpenMP, a parallel programming API for shared-memory parallel programming, are utilized for parallelizing applications for the platform. In a further difference from the Cray XMT, the CYCLOPS-64 architecture provides a SOE multithreading multicore platform in which the processing units thereof implement SOE context switching, as defined in the CYCLOPS-64 hardware configuration.
Still another example of a multicore platform developed to provide increased performance is the Sun UltraSPARC T2 system. The UltraSPARC T2 architecture shares many features with the above described systems. Each UltraSPARC T2 processor (codenamed Niagara2) contains eight SPARC cores. Each core supports the concurrent execution of up to eight threads. In this manner, the Niagara2 processor is a chip-multithreading (CMT) architecture. The Niagara2 architecture also includes four memory controllers, two 10 Gb Ethernet controllers and a x8 PCI-Express channel on chip. In contrast to the above described systems, the UltraSPARC T2 architecture includes explicit data caches. The eight SPARC cores support up to 64 concurrent threads sharing a 4 MB Level 2 cache, which is divided into eight banks of 512 KB each. Cache sharing between multiple co-scheduled threads, however, has been shown to be a potential cause of serious problems such as thread starvation. Cache sharing can be extremely unfair, for example, when a thread with a high miss rate and poor locality constantly causes evictions of other threads' data that will be required soon after. Moreover, as with the Cray XMT system, the SPARC cores in the UltraSPARC T2 architecture employ a form of interleaved multithreading whereby the processing units switch context among numerous hardware threads on a cycle-by-cycle basis, as defined by the hardware configuration thereof (i.e., hardware interleaved context switching).
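By way of illustration only, the unfairness of a shared cache can be sketched with a small Python model of a least-recently-used (LRU) cache shared by two threads. The LRU policy, the eight-line capacity, the access patterns, and the names `shared_lru_misses` and `interleave` are assumptions of this sketch and are not taken from the Niagara2 design.

```python
from collections import OrderedDict


def shared_lru_misses(accesses, capacity):
    """Count misses per thread in a shared, fully associative LRU cache (hypothetical model).

    accesses: iterable of (thread_id, line_address) pairs in program order.
    """
    cache = OrderedDict()  # line_address -> None, ordered from oldest to newest use
    misses = {}
    for tid, line in accesses:
        misses.setdefault(tid, 0)
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses[tid] += 1
            cache[line] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def interleave(rounds):
    """Thread A reuses 4 lines; thread B streams through new lines, 3 per round."""
    out, next_b = [], 1000
    for r in range(rounds):
        out.append(("A", r % 4))
        for _ in range(3):
            out.append(("B", next_b))
            next_b += 1
    return out


alone = shared_lru_misses([("A", r % 4) for r in range(100)], capacity=8)
shared = shared_lru_misses(interleave(100), capacity=8)
print(alone)   # prints: {'A': 4}
print(shared)  # prints: {'A': 100, 'B': 300}
```

Running alone, thread A misses only on its first touch of each of its four lines. When co-scheduled with the streaming thread B, every one of thread A's accesses misses, because thread B's lines evict thread A's data before that data is reused.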
Such architectures and their associated instruction sets fail to provide efficient support for fine-grain application concurrency. Moreover, the user is provided no control over the scheduling of the threads; instead, the scheduling mechanisms are established a priori in the hardware implementations. In particular, the thread scheduling implemented by the prior multithreading systems is part of the hardware implementation, and thus is fixed and not subject to subsequent modification or dynamic control.