Operating system (OS) jitter refers to the interference experienced by an application due to the scheduling of daemon processes and the handling of asynchronous events such as interrupts. Existing approaches have shown that parallel applications on large clusters suffer considerable performance degradation due to OS jitter (for example, up to 100% degradation at 4096 processors). Several large-scale high performance computing (HPC) systems, such as Blue Gene/L and Cray XT4, avoid OS jitter by using a customized light-weight microkernel at the compute nodes. These customized kernels typically do not support general purpose multi-tasking and may not even support interrupts. However, these systems require applications to be modified or ported for their respective platforms.
Other existing systems make use of commodity OSes and still suffer from OS jitter. Such systems use various techniques to mitigate its effect. Existing techniques include synchronization of jitter across all nodes, which can yield moderate (close to 50%) to very high (close to 300%) performance improvements. Existing approaches also use simultaneous multi-threaded (SMT) and hyper-threaded processors to mitigate jitter, but these may have other performance implications.
With a growing interest in the use of commodity OSes for HPC systems, there is a much greater need to develop and evaluate various techniques for mitigating OS jitter. However, the effectiveness of any technique to mitigate jitter should advantageously be evaluated in a large cluster with thousands of nodes. One of the biggest hindrances in the development and evaluation of new techniques for handling jitter is that there are few large clusters running commodity OSes worldwide, and those that exist are often unavailable for experimental and validation purposes.
Emulating jitter on a large “jitter-free” platform, using either synthetic jitter or real traces from commodity OSes, has been proposed in existing approaches as a useful mechanism to study scalability behavior in the presence of jitter. Some such approaches use a single-node benchmark to measure jitter and then inject synthetic jitter of varying length and periodicity on a jitter-less platform such as Blue Gene/L to study its impact on the scalability of various collective operations. These approaches also compare the effect of synchronized and unsynchronized jitter on performance, but they rely on purely synthetic jitter rather than on traces collected from real Linux systems. Other existing approaches attempt to record real jitter traces and replay them to explore system performance.
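By way of illustration only, the comparison of synchronized and unsynchronized periodic jitter can be sketched with a toy model in Python. The model below is an assumption made for this example and is not taken from any of the existing approaches: each node alternates between a fixed amount of computation and a barrier, a jitter pulse of fixed duration preempts the node at the start of every period, and the function names `finish_time` and `run_bsp` as well as all parameter values are hypothetical. Times are integers (for example, microseconds) so that the arithmetic is exact.

```python
def finish_time(start, work, period, duration, phase):
    """Return when `work` units of compute, begun at `start`, complete.

    Assumed model: a jitter pulse occupies the node during
    [phase + k*period, phase + k*period + duration) for every integer k,
    and no compute progresses during a pulse. Requires duration < period.
    """
    t = start
    remaining = work
    while True:
        k = (t - phase) // period            # index of the latest pulse starting at or before t
        pulse_start = phase + k * period
        pulse_end = pulse_start + duration
        if t < pulse_end:                    # currently preempted: wait for the pulse to end
            t = pulse_end
        next_pulse = pulse_start + period
        run = min(remaining, next_pulse - t) # compute until done or until the next pulse
        t += run
        remaining -= run
        if remaining <= 0:
            return t


def run_bsp(num_iters, work, period, duration, phases):
    """Total time for `num_iters` compute-then-barrier steps.

    `phases` holds one jitter phase per node; the barrier releases all
    nodes only when the slowest node has finished its compute step.
    """
    t = 0
    for _ in range(num_iters):
        t = max(finish_time(t, work, period, duration, p) for p in phases)
    return t


if __name__ == "__main__":
    # Hypothetical parameters: 100 units of work per step, a 50-unit pulse every 1000 units.
    args = dict(num_iters=100, work=100, period=1000, duration=50)
    print("synchronized:  ", run_bsp(phases=[0, 0, 0, 0], **args))
    print("unsynchronized:", run_bsp(phases=[0, 250, 500, 750], **args))
```

Under these assumed parameters, the jitter-free run time would be 10000 units, the synchronized case completes in 10550 units, and the staggered case completes in 12500 units, because a pulse on any one node delays every node at the next barrier. The model ignores communication cost and uses evenly staggered phases in place of random ones, so it indicates the qualitative effect only.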
The existing approaches to predicting system performance noted above require an accurate methodology for precisely emulating jitter. However, existing approaches for introducing synthetic jitter suffer from several inaccuracies, such as those caused by the system overhead of introducing the jitter and by the limited resolution of timer (or sleep) calls.
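The inaccuracy attributable to timer or sleep resolution can be observed with a short measurement. The following Python sketch is illustrative only and does not reproduce any existing approach: it requests a delay of a given length either through a sleep call or through a busy-wait loop and reports how far the observed delay deviates from the request. The helper names `sleep_ns`, `busy_wait_ns`, and `overshoot_stats` and the chosen durations are assumptions for this example.

```python
import time


def sleep_ns(duration_ns):
    """Request a delay via time.sleep; return the delay actually observed, in ns."""
    start = time.perf_counter_ns()
    time.sleep(duration_ns / 1e9)
    return time.perf_counter_ns() - start


def busy_wait_ns(duration_ns):
    """Spin on the clock until at least duration_ns has elapsed; return the observed delay."""
    start = time.perf_counter_ns()
    while True:
        elapsed = time.perf_counter_ns() - start
        if elapsed >= duration_ns:
            return elapsed


def overshoot_stats(inject, duration_ns, trials=20):
    """Return (min, mean, max) of observed-minus-requested delay, in ns, over `trials` runs."""
    errors = [inject(duration_ns) - duration_ns for _ in range(trials)]
    return min(errors), sum(errors) / len(errors), max(errors)


if __name__ == "__main__":
    for requested in (50_000, 500_000):  # 50 us and 500 us, hypothetical jitter lengths
        for name, inject in (("sleep", sleep_ns), ("busy-wait", busy_wait_ns)):
            lo, mean, hi = overshoot_stats(inject, requested)
            print(f"{name:9s} requested={requested} ns  error min/mean/max = {lo}/{mean:.0f}/{hi} ns")
```

The reported errors depend on the platform, the OS timer configuration, and the system load, so no particular values should be expected. The busy-wait variant never returns early by construction but occupies the processor for the whole delay, which is itself a form of the system overhead noted above.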