Modern processors typically include multiple hardware threads, allowing for the concurrent execution of multiple software threads on a single processor. Due to silicon area and power constraints, it is not possible to have each hardware thread be completely independent from other threads. Each hardware thread shares resources with the other threads. For example, execution units (internal to the processor), and memory and IO subsystems (external to the processor), are resources typically shared by each hardware thread. In many programs, at times a thread must wait for an action to occur external to the processor before continuing its program flow. For example, a thread may need to wait for a memory location to be updated by another processor, as in a barrier operation. Typically, for highest speed, the waiting thread would poll the address residing in memory, waiting for the thread to update it. This polling action takes resources away from other competing threads on the processor. In this example, the load/store unit of the processor would be utilized by the polling thread, at the expense of the other threads that share it.
The performance cost is especially high if the polled variable is L1-cached (primary cache), since the frequency of the loop is highest. Similarly, the performance cost is high if, for example, a large number of L1-cached addresses are polled, and thus take L1 space from other threads.
Multiple hardware threads in processors may also apply to high performance computing (HPC) or supercomputer systems and architectures such as IBM® BLUE GENE® parallel computer system, and to a novel massively parallel supercomputer scalable, for example, to 100 petaflops. Massively parallel computing structures (also referred to as “supercomputers”) interconnect large numbers of compute nodes, generally, in the form of very regular structures, such as mesh, torus, and tree configurations. The conventional approach for the most cost/effective scalable computers has been to use standard processors configured in uni-processors or symmetric multiprocessor (SMP) configurations, wherein the SMPs are interconnected with a network to support message passing communications. Currently, these supercomputing machines exhibit computing performance achieving 1-3 petaflops.
There is therefore a need to increase application performance by reducing the performance loss of the application, for example, reducing the increased cost of software in a loop, for example, software may be blocked in a spin loop or similar blocking polling loop. Further, there is a need to reduce performance loss, i.e., consuming processor resources, caused by polling and the like to increase overall performance. It would also be desirable to provide a system and method for polling external conditions while minimizing consuming processor resources, and thus increasing overall performance.