Achieving the best possible performance is important in any parallel or clustered computing environment. Conventionally, a parallel network is tuned by trial and error: various network setups are tried until one judged to be optimal is found.
Clusters of commodity systems are becoming the dominant platform for high performance computing (HPC), currently making up more than half of the TOP 500 list of the world's fastest supercomputers. Scientists and engineers use clusters to split up an application into a number of cooperating elements, working in parallel on small chunks of the overall problem. These elements are distributed across the individual computers in a cluster, and communicate using, for example, the Message Passing Interface (MPI) Standard.
Within the HPC community, achieving good performance is acknowledged to be a difficult task. It requires expertise, time, and resources. It is particularly difficult to tune applications for clusters formed of commodity machines, as there are few suitable performance tools available; most of those that exist are aimed at standalone systems, not HPC clusters. An efficient parallel application scales almost linearly; when run on ten CPUs, it will run almost ten times faster than on one. Good scaling is difficult to achieve, even to a modest number of CPUs. Not only does an application fail to approach the peak advertised performance of its cluster, but its performance curve also quickly levels off, and frequently even drops, as the size of the cluster increases. In fact, it is so difficult to scale MPI application performance that managers of clusters at HPC facilities often limit their users to running applications on no more than 16 or 32 CPUs at a time, based on the anecdotal belief that adding processors will not improve performance and may even decrease it. Using larger numbers of CPUs yields so little benefit for many applications that the extra compute power is effectively wasted.
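The scaling behavior described above can be made concrete with two standard quantities: speedup (the single-CPU run time divided by the run time on N CPUs) and parallel efficiency (speedup divided by N). The following is a minimal sketch only; the function name scaling_table and the timing values are hypothetical and are not measurements from any real cluster.

```python
def scaling_table(timings):
    """Return (cpus, speedup, efficiency) rows from {cpu_count: wall_seconds}.

    Assumes the timings include a 1-CPU baseline run.
    """
    baseline = timings[1]
    rows = []
    for cpus in sorted(timings):
        speedup = baseline / timings[cpus]   # ideal (linear) value equals cpus
        efficiency = speedup / cpus          # ideal value is 1.0
        rows.append((cpus, speedup, efficiency))
    return rows


# Hypothetical wall-clock times in seconds, made up purely for illustration.
example_timings = {1: 100.0, 2: 52.0, 4: 30.0, 8: 25.0, 16: 27.0}

for cpus, speedup, efficiency in scaling_table(example_timings):
    print(f"{cpus:>3} CPUs  speedup {speedup:5.2f}  efficiency {efficiency:6.1%}")
```

With these made-up numbers the speedup gains shrink sharply beyond 4 CPUs and the speedup falls at 16 CPUs, which is the levelling-off and drop that the paragraph describes.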
A seasoned developer of parallel applications has a “toolbox” of techniques for tuning an MPI application to perform better. Typical approaches to finding performance problems include:
Time the application's performance at different cluster sizes, and plot the speedup it achieves at each size. For an untuned application, this curve will quickly flatten out, at about 4 CPUs.
Profile the serial portions of the code, to find and fix per-CPU bottlenecks. This is often sufficient to bring the cluster size where an application starts “losing steam” from 4 CPUs up to perhaps 16.
Instrument the application to measure the amount of time it spends computing and communicating. Compare the ratio of these two values at different cluster sizes. Find the communication hot spots as the cluster size grows, and fix them.
Each of these techniques requires a substantial amount of manual work—instrumenting the application; cataloging performance numbers; plotting charts; tweaking the application's behavior; and repeatedly trying again. It is difficult to apply them blindly; a parallel programmer has to develop a body of experience to know which methods to try, and which numbers are significant.
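To illustrate the kind of manual instrumentation involved in the third technique (measuring time spent computing versus communicating), the following is a minimal sketch under stated assumptions. The PhaseTimer class and its method names are hypothetical helpers, not part of MPI or any library, and the two timed bodies are stand-ins; in a real application they would wrap the actual numerical work and the actual message-passing calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase, e.g. 'compute' or 'communicate'."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock                  # injectable so the timer can be tested
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            self.totals[name] += self._clock() - start

    def comm_to_compute_ratio(self):
        compute = self.totals["compute"]
        if compute == 0:
            return float("inf")              # no compute time recorded
        return self.totals["communicate"] / compute


timer = PhaseTimer()
with timer.phase("compute"):
    sum(i * i for i in range(100_000))       # stand-in for local numerical work
with timer.phase("communicate"):
    time.sleep(0.01)                         # stand-in for a blocking message exchange
print(f"communicate/compute ratio: {timer.comm_to_compute_ratio():.2f}")
```

Recording this ratio for runs at several cluster sizes gives the comparison the text calls for: a ratio that grows with cluster size suggests that communication is becoming the limiting factor.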