Commodity workstations and PCs connected via high performance interconnection networks continue to improve upon the performance/cost ratio of traditional high performance systems. Such interconnection networks facilitate distributing complex applications for parallel execution by a number of relatively inexpensive workstations. Utilization of existing resources to minimize the cost of high performance computing has led to a number of unique network topologies. Such networks are employed, by way of example, in computational GRIDs. See, e.g., I. Foster, C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure (Morgan Kaufmann, 1999). GRIDs are composed of distributed and often heterogeneous computing resources and, along with clustered environments, have the potential to provide great performance benefits to distributed applications.
A challenge to programmers of systems incorporating new network topologies is the increased complexity of programming applications so that, when executed, they efficiently exploit the distributed computing capabilities of the underlying systems. Conventional static analysis of tasks and events that may occur simultaneously is not sufficient, because executing an application in a distributed environment requires awareness and consideration of the dynamic conditions of the system. Such dynamic conditions include the availability of system components and the relative computational and/or network performance of those components. In addition, some applications respond, during execution, to external events such as system faults and changes in the load of networks and computers.
In the past, analytical performance evaluation tools focused upon enabling computer system architects to design the underlying interconnection hardware and operating system without reference to the particular applications executed on the designed systems. There is a need for tools that assist programmers in creating applications that exploit the parallel processing capabilities of the distributed computing systems developed by such system architects or, alternatively, that allow system architects to design distributed processing systems that exploit the capability of particular tasks within software applications or programming architectures.
Past efforts to estimate performance of a communication/computer network can be grouped into the following approaches: analytical, statistical, and simulation. Each of these known approaches suffers from one or more shortcomings that limit the practical value of programming tools embodying such approaches.
Analytical approaches, such as queuing and characterization network models, have enabled system architects to understand general performance characteristics of particular network architectures. However, the analytical techniques are based upon approximated load conditions rather than actual load conditions generated by an executed application. Therefore, while useful for identifying bottlenecks in a system running generic workloads, they are relatively poor predictors of the actual delays experienced when an application, or group of applications, is executed.
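By way of illustration only, the following minimal sketch shows the kind of closed-form estimate a queuing model yields. It assumes the simplest textbook case, a single M/M/1 queue, which is not drawn from any particular prior tool; the function name and the example rates are hypothetical.

```python
def mm1_mean_delay(arrival_rate: float, service_rate: float) -> float:
    """Mean time a message spends in an M/M/1 queue (waiting plus service).

    Uses the standard result W = 1 / (mu - lambda), where lambda is the
    assumed average arrival rate and mu is the average service rate.
    """
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("rates must be non-negative (arrival) and positive (service)")
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


# Hypothetical link that serves 100 messages/s under an assumed average load.
light_load = mm1_mean_delay(arrival_rate=50.0, service_rate=100.0)
heavy_load = mm1_mean_delay(arrival_rate=90.0, service_rate=100.0)
```

The estimate depends only on an assumed average arrival rate. It does not reflect the bursts or message sequences that a specific application actually generates, which is the limitation noted above.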
Statistical regression approaches take a particular set of conditions, establish and measure a response characteristic, and then seek to project the measured results by interpolating/extrapolating the observed results to a particular configuration of interest. While relatively easy to create and fast to evaluate in comparison to other approaches, the resulting models are typically inaccurate when applied to a particular network configuration and load condition, because they do not consider the dynamic characteristics of the system, including contention between simultaneously active processes for network resources as well as background loading. Thus, accurate predictions of the performance of a system under a particular application load using statistical regression are generally limited to quiet networks, an improbable assumption in many of the situations of interest to today's programmers. In addition, such models do not provide any insight into the operation of the network.
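The regression approach can be sketched, under stated assumptions, as an ordinary least-squares fit of measured latency against message size, followed by extrapolation to an unmeasured size. The measurements below are made-up values standing in for timings taken on a quiet network, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Evaluate the fitted line at x (interpolation or extrapolation)."""
    intercept, slope = model
    return intercept + slope * x


# Made-up measurements: message size in bytes, latency in microseconds.
sizes = [1000, 2000, 4000, 8000]
latencies = [60.0, 70.0, 90.0, 130.0]

model = fit_line(sizes, latencies)
estimate = predict(model, 16000)  # extrapolated to an unmeasured size
```

The fitted model carries no term for contention or background load, so its estimate is the same regardless of how many other processes share the network.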
Simulation involves a detailed analysis (at various levels of detail) of commands executed by the program under test in the selected network environment. Simulation has the potential to provide a high level of accuracy relative to the other identified approaches. However, simulation involves stepping through each instruction and noting the system response, often at a physical or routing level. Such evaluations are potentially very time consuming, even with the assistance of computers. Yet another shortcoming of simulation tools is the difficulty encountered when tailoring execution to a specific workload or network configuration.