In current parallel computing systems, software and network interrupts incur high overhead. For example, when a packet arrives, the network device raises an interrupt, which is fielded first by the operating system's first-level interrupt handler (“FLIH”). The FLIH then queries the device that caused the interrupt. Based on that device, the appropriate device interrupt handler, e.g., a second-level interrupt handler (“SLIH”), is called and takes whatever action is appropriate. In the case of a network interrupt, this action may include determining which user thread the packet arrival interrupt is associated with and making that user thread runnable so that it may absorb the incoming packet into the ongoing computation. The overhead of going through these steps and the associated context switches is very high.
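The handling path described above can be sketched as a small simulation. This is only an illustrative model, not an operating system implementation: the names NetworkDevice, UserThread, flih, network_slih, and demo are hypothetical, and the device-polling loop is an assumption about how the FLIH identifies the interrupting device. The trace simply makes visible how many steps sit between packet arrival and the user thread becoming runnable.

```python
from dataclasses import dataclass, field


@dataclass
class UserThread:
    """Hypothetical user thread that consumes incoming packets."""
    name: str
    runnable: bool = False
    inbox: list = field(default_factory=list)


class NetworkDevice:
    """Hypothetical network device with a queue of pending packets."""

    def __init__(self):
        self.pending = []              # (destination thread name, payload)
        self.interrupt_raised = False

    def packet_arrives(self, dest, payload):
        # Packet arrival causes the device to raise an interrupt.
        self.pending.append((dest, payload))
        self.interrupt_raised = True


def network_slih(device, threads, trace):
    """Second-level handler: match each packet to its thread and wake it."""
    trace.append("SLIH: drain device")
    while device.pending:
        dest, payload = device.pending.pop(0)
        thread = threads[dest]
        thread.inbox.append(payload)
        thread.runnable = True         # thread may now absorb the packet
        trace.append(f"SLIH: wake {dest}")
    device.interrupt_raised = False


def flih(devices, threads, trace):
    """First-level handler: query devices, then call the matching SLIH."""
    trace.append("FLIH: entered")
    for name, (device, handler) in devices.items():
        trace.append(f"FLIH: query {name}")
        if device.interrupt_raised:
            handler(device, threads, trace)
    trace.append("FLIH: return")


def demo():
    """Run one packet through the FLIH -> SLIH -> wake-up path."""
    nic = NetworkDevice()
    threads = {"worker0": UserThread("worker0")}
    trace = []
    nic.packet_arrives("worker0", b"data")
    flih({"nic": (nic, network_slih)}, threads, trace)
    return threads, trace
```

Calling demo() returns the thread table and the ordered trace of handler steps. In a real system each hop in that trace may also involve a context switch, which this sketch does not model.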
Another problem with current interrupt handling schemes involves interrupt targeting. The FLIH and SLIH run on whichever user thread happens to be active at the time on the CPU that fields the interrupt. Because the dispatcher of the FLIH cannot know which process (running on some CPU) will eventually process and consume the incoming packet, the FLIH runs on some random CPU on the node, is funneled to CPU 0 every time, or is rotated among the CPUs. Each of these choices can disrupt one of the applications running on the CPU on which the FLIH is dispatched. Since parallel applications are typically well synchronized, disrupting one task can degrade overall application performance.
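The three targeting choices named above (random CPU, always CPU 0, rotation) can be compared with a toy model. The function names make_policies and count_disruptions are hypothetical, and counting a "disruption" whenever the FLIH lands on a CPU other than the one that will consume the packet is a simplifying assumption made for illustration only.

```python
import itertools
import random


def make_policies(num_cpus, seed=0):
    """Return the three FLIH targeting policies as zero-argument callables."""
    rng = random.Random(seed)
    rotation = itertools.cycle(range(num_cpus))
    return {
        "random": lambda: rng.randrange(num_cpus),  # some random CPU
        "cpu0": lambda: 0,                          # funneled to CPU 0
        "rotate": lambda: next(rotation),           # rotated among CPUs
    }


def count_disruptions(policy, consumer_cpus):
    """Count interrupts whose FLIH ran on a CPU other than the consumer's.

    consumer_cpus lists, per interrupt, the CPU of the process that will
    eventually consume the packet. The policy never sees this value.
    """
    return sum(1 for consumer in consumer_cpus if policy() != consumer)
```

For example, with four CPUs and eight packets all destined for a process on CPU 3, the "cpu0" policy misses the consumer every time and the "rotate" policy hits it only on every fourth interrupt. None of the policies takes the consumer as an input, which is the targeting problem described above.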
One proposed solution to the above problems is to service multiple threads of execution in a single core. For example, various processing architectures support more than one hardware thread of execution. Currently, each hardware thread is treated as a virtual CPU. In other words, a system with n physical CPUs appears to have m*n virtual CPUs, where m is the number of hardware threads per physical CPU. Each virtual CPU can concurrently execute an instruction stream.
However, for parallel systems this is not the most effective use of hardware threads. For example, the application must be split into a larger number of separate tasks to take full advantage of the CPU. Most parallel systems do not scale linearly. Therefore, the gain from instruction-level overlap may be completely wiped out by inefficiencies in the parallelization of the problem. Also, parallel applications typically use a communication device, which may have to provide higher bandwidth to support more tasks. Additionally, large-scale parallel applications are typically written with synchronization and load balancing in mind and therefore become more sensitive to the scheduling of other work on the CPUs.
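The scaling argument above can be made concrete with a toy cost model. Everything here is an assumption chosen for illustration: the function run_time, the additive form of the model (a serial part, a parallel part sped up by the per-core overlap gain, and an overhead that grows with the number of tasks), and all numeric parameters. It is a sketch of how non-linear scaling might cancel the gain from hardware threads, not a measurement.

```python
def run_time(tasks, cores, core_gain, serial_fraction, per_task_overhead):
    """Normalized run time under a simple, assumed cost model.

    tasks:             number of parallel tasks the problem is split into
    cores:             number of physical CPUs
    core_gain:         throughput gain per core from instruction-level overlap
                       (1.0 means one hardware thread per core)
    serial_fraction:   share of the work that cannot be parallelized
    per_task_overhead: parallelization cost added per task
    """
    parallel = (1.0 - serial_fraction) / (cores * core_gain)
    return serial_fraction + parallel + per_task_overhead * tasks


def compare(n=16, m=2, core_gain=1.2, serial_fraction=0.02,
            per_task_overhead=0.001):
    """Compare n tasks on n CPUs with m*n tasks on m*n virtual CPUs."""
    one_thread = run_time(n, n, 1.0, serial_fraction, per_task_overhead)
    m_threads = run_time(m * n, n, core_gain, serial_fraction,
                         per_task_overhead)
    return one_thread, m_threads
```

With the default (assumed) parameters, the run with m*n tasks comes out slower than the run with n tasks, because the added per-task overhead exceeds the overlap gain. Setting per_task_overhead to zero reverses the result, which shows that the outcome depends entirely on how well the problem parallelizes.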