1. Field
Embodiments of the invention relate generally to parallel processing and more particularly to a technique for using a DMA reflect operation to unobtrusively query remote performance data.
2. Description of the Related Art
Powerful computers may be designed as highly parallel systems in which the processing activity of hundreds, if not thousands, of processors (CPUs) is coordinated to perform computing tasks. These systems are highly useful for a broad variety of applications, including financial modeling, hydrodynamics, quantum chemistry, astronomy, weather modeling and prediction, geological modeling, prime number factoring, and image processing (e.g., CGI animations and rendering), to name but a few examples.
International Business Machines (IBM) has developed one family of parallel computing systems under the name Blue Gene®. The various Blue Gene architectures provide a scalable, parallel computer system. For instance, the Blue Gene/P system may be configured with a maximum of 256 racks, housing 8,192 node cards and 884,736 PowerPC 450 processors. The Blue Gene/P architecture has been successful, and on Nov. 12, 2007, IBM announced that a Blue Gene/P system at Jülich Research Centre reached an operational speed of 167 Teraflops (167 trillion floating-point operations per second), making it the fastest computer in Europe at that time. Further, as of June 2008, the Blue Gene/P installation at Argonne National Laboratory achieved a speed of 450.3 Teraflops, making it then the third-fastest computer in the world.
The compute nodes in a parallel system typically communicate with one another over multiple communication networks. For example, the compute nodes of a Blue Gene/P system are interconnected using five specialized networks. The primary communication strategy for the Blue Gene/P system is message passing over a torus network (i.e., a set of point-to-point links between pairs of nodes). The torus network allows application programs developed for parallel processing systems to use high level interfaces such as Message Passing Interface (MPI) and Aggregate Remote Memory Copy Interface (ARMCI) to perform computing tasks and distribute data among a set of compute nodes. Of course, other message passing interfaces have been (and are being) developed. Other parallel architectures also use MPI and ARMCI for data communication between compute nodes connected via a variety of network topologies. Typically, MPI messages are encapsulated in a set of packets which are transmitted from a source node to a destination node over a communications network (e.g., the torus network of a Blue Gene system).
Additionally, the compute nodes may contain various FIFOs in their local memory. For instance, a node may contain an injection FIFO into which commands may be stored. The node may monitor the injection FIFO (e.g., by comparing a head pointer and a tail pointer) to determine when commands have been added to the FIFO. When the FIFO contains one or more commands, the node may then begin to execute those commands. The nodes may also contain other FIFOs, such as FIFOs containing data to be written to memory.