1. Field
The present embodiments relate to memory correctness checking as employed, for example, on massively parallel processors, supercomputers and other distributed computer systems. Memory correctness checking is a technology for auditing and tracking the allocation and initialization state of computer storage so that incorrect usage of storage and undesired program behavior can be detected. Typically, memory correctness checking is performed during the testing or debugging phase of program development.
2. Description of the Related Art
Examples of apparatus to which the present embodiments may be applied are schematically shown in FIGS. 1 and 2.
In FIG. 1, a conventional computer 1 forms one node of a distributed computer system. The computer 1 has a processor 2 for processing user data, and a network interface controller (NIC) 3 for interfacing with an external network 7, enabling the computer to communicate with other nodes in the distributed computer system. The computer 1 is connected to external storage 4 as well as having its own built-in storage 8. An input device 5 is used to give instructions to the computer, and an output device presents results in the form of a graphical display for example. As is well known, a user interacts with an operating system (OS) of the computer when inputting instructions, such as to execute a given program or process. Where such instructions result in a need for data to be sent to, or fetched from, another node in the system, the NIC handles the necessary data transfers. Supervision of the NIC is generally at the program or OS level, without directly involving the user.
FIG. 2 shows a network-on-chip processor 10 with a network 70 (indicated by the solid grid lines) linking discrete processing elements 20. Each processing element 20 may be multi-core and may have its own storage. In reality, there may be hundreds of processing elements rather than the sixteen depicted. Network interface controllers 30 are provided respectively for each of the processing elements 20 as well as for an external storage 40 which is shared amongst the processing elements. This network-on-chip processor 10 may be used as the processor in a conventional computer.
Remote Direct Memory Access (RDMA) is a technology allowing a conventional computer, as shown in FIG. 1, to use its network interface controller 3 to transmit information via the network to modify the storage at a second conventional computer. This technology is important in high performance computing, where the first and the second computers are part of a supercomputer, as it reduces the work placed on the processor 2 of the computer shown in FIG. 1. RDMA technology is also beneficial to the network-on-chip processor 10 of FIG. 2, as a processing element 20 is able to modify storage local to a second processing element in a way that minimizes the work placed on the second processing element.
RDMA relies on single-sided communication, also referred to as “third-party I/O” or “zero copy networking”. In single-sided communication, to send data, a source processor (under control of a program or process being executed by that processor) simply puts that data in the memory of a destination processor, and likewise a processor can read data from another processor's memory without interrupting the remote processor. Thus, the operating system of the remote processor is normally not aware that its memory has been read or written to. The writing or reading is handled by the processors' network interface controllers (or equivalent, e.g. network adapter) without any copying of data to or from data buffers in the operating system (hence, “zero copy”). This reduces latency and increases the speed of data transfer, which is beneficial in high performance computing.
Consequently, references in this specification to data being transferred from one processor to another should be understood to mean that the respective network interface controllers (or equivalent) transfer data, without necessarily involving the host processors themselves.
Conventional RDMA instructions include “rdma_put” and “rdma_get”. An “rdma_put” allows one node to write data directly to a memory at a remote node, which node must have granted suitable access rights to the first node in advance, and have a memory (or buffer) ready to receive the data. “rdma_get” allows one node to read data directly from the memory (or memory buffer) of a remote node, assuming again that the required privileges have already been granted.
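The put and get semantics described above can be sketched as a single-process simulation in C. This is only an illustration under stated assumptions: the names sim_node, sim_grant, sim_rdma_put and sim_rdma_get, the fixed window size and the bitmask representation of access rights are all hypothetical and do not correspond to any real RDMA instruction set or library. The sketch shows only that a transfer is refused unless the remote node has granted the corresponding right in advance and the target range lies inside the buffer it has made ready.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NODES 4
#define WINDOW_BYTES 32

/* Hypothetical model of a node: a buffer exposed for remote access plus
 * bitmasks recording which peer node ids may write to it or read from it. */
typedef struct {
    unsigned char window[WINDOW_BYTES];
    unsigned put_allowed; /* bit i set: node i may put into this window */
    unsigned get_allowed; /* bit i set: node i may get from this window */
} sim_node;

static sim_node nodes[MAX_NODES];

static int valid_node(int id) { return id >= 0 && id < MAX_NODES; }

static int range_ok(size_t offset, size_t len) {
    return offset <= WINDOW_BYTES && len <= WINDOW_BYTES - offset;
}

/* The owner grants (or revokes) rights for one peer, in advance of any transfer. */
void sim_grant(int owner, int peer, int allow_put, int allow_get) {
    assert(valid_node(owner) && valid_node(peer));
    unsigned bit = 1u << peer;
    nodes[owner].put_allowed = allow_put ? (nodes[owner].put_allowed | bit)
                                         : (nodes[owner].put_allowed & ~bit);
    nodes[owner].get_allowed = allow_get ? (nodes[owner].get_allowed | bit)
                                         : (nodes[owner].get_allowed & ~bit);
}

/* Write len bytes from src's local data into dst's window.
 * Returns 0 on success, -1 if the transfer is refused. */
int sim_rdma_put(int src, int dst, size_t offset, const void *data, size_t len) {
    if (!valid_node(src) || !valid_node(dst) || !range_ok(offset, len)) return -1;
    if (!(nodes[dst].put_allowed & (1u << src))) return -1;
    memcpy(nodes[dst].window + offset, data, len);
    return 0;
}

/* Read len bytes from dst's window into src's local buffer.
 * Returns 0 on success, -1 if the transfer is refused. */
int sim_rdma_get(int src, int dst, size_t offset, void *out, size_t len) {
    if (!valid_node(src) || !valid_node(dst) || !range_ok(offset, len)) return -1;
    if (!(nodes[dst].get_allowed & (1u << src))) return -1;
    memcpy(out, nodes[dst].window + offset, len);
    return 0;
}
```

In this sketch, node 1 would first call sim_grant(1, 0, 1, 1), after which node 0 may call sim_rdma_put(0, 1, ...) and sim_rdma_get(0, 1, ...). Note that neither function inspects whether the bytes involved have been allocated or initialized by the program, which is the gap the remainder of this description is concerned with.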
The Message Passing Interface (MPI) is the most widely accepted standard for communication between nodes (which may be conventional computers) of a massively parallel computer. MPI provides a message-passing library specification capable of being applied to a wide range of distributed computer systems including parallel computers, clusters and heterogeneous networks, and is not dependent on any specific language or compiler. MPI allows communication among processes which have separate address spaces. The basic version involves co-operative (two-sided) communication, in which data is explicitly sent by one process and received by another. A later version of the standard, MPI-2, includes support for single-sided communication which gives RDMA functionality but does not provide direct support for memory correctness checking.
Other standards, implemented generally at the software level, exist for communication between nodes of a parallel computing system. Among these are PVM (Parallel Virtual Machine), SHMEM (Shared Memory) and ARMCI (Aggregate Remote Memory Copy Interface).
Meanwhile, a number of tools exist that support memory correctness checking. IBM's Rational Purify performs binary instrumentation to track four states of memory as shown in FIG. 3. Here, it is assumed that a processor is executing a program, which program must “own” a given byte of memory in order to have access rights to it. As well as reading and writing of data, it is possible to free a byte of memory so as to make it available to another program (or processor). The function “malloc” (memory allocate) is used to allocate memory, and the function “free” is used to release it. The four possible memory states are:
(i) Neither allocated nor initialized (see the area labelled 91). This is so-called “Red memory” which is illegal to read, write or free since it is not owned by the program.
(ii) Allocated but not initialized (“Yellow memory”, labelled 92 in FIG. 3). This is memory which is owned by the program but which has not yet been initialized. It may be written to or freed, but not read.
(iii) Both allocated and initialized (“Green memory”, 93 in FIG. 3). This is memory which has been written to and thus has a value capable of being read. It is legal to read, write or free Green memory.
(iv) Freed and previously initialized (“Blue memory”, 94 in FIG. 3). An area of memory which has been initialized and used, but is now freed. That is, the memory is still initialized but no longer valid for access. It is therefore illegal to read, write or free Blue memory.
Two bits are used to track each byte of memory: the first bit records allocation status and the second bit records initialization status. Assuming one byte is made up of eight bits, it follows that one byte of application-employed memory results in a correctness checking overhead of two bits. Purify checks each memory operation attempted by a program against the state of the memory block involved, to check whether the operation is valid, and if it is not, reports an error. Purify does not have direct support for memory correctness checking of inter-computer communications (e.g. via the Message Passing Interface, MPI).
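The two-bit scheme and the four states listed above can be sketched in C as follows. This is a minimal illustration, not Purify's actual implementation: the function names, the fixed 64-byte arena addressed by offset, and the packing of four two-bit entries per shadow byte are all assumptions made for the example. In particular, the sketch assumes that freeing simply clears the allocation bit, so that freed Yellow memory returns to Red while freed Green memory becomes Blue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two shadow bits per application byte (assumed encoding):
 * bit 0 = allocated, bit 1 = initialized. */
enum { SH_ALLOC = 1, SH_INIT = 2 };

typedef enum { MEM_RED, MEM_YELLOW, MEM_GREEN, MEM_BLUE } mem_color;
typedef enum { OP_READ, OP_WRITE, OP_FREE } mem_op;

#define ARENA_BYTES 64
/* Four 2-bit entries are packed into each shadow byte: 2 bits of overhead
 * per 8 bits of application memory. All entries start as Red (00). */
static uint8_t shadow[ARENA_BYTES / 4];

unsigned shadow_get(size_t addr) {
    assert(addr < ARENA_BYTES);
    return (shadow[addr / 4] >> ((addr % 4) * 2)) & 3u;
}

void shadow_set(size_t addr, unsigned bits) {
    assert(addr < ARENA_BYTES);
    unsigned shift = (unsigned)(addr % 4) * 2;
    shadow[addr / 4] =
        (uint8_t)((shadow[addr / 4] & ~(3u << shift)) | ((bits & 3u) << shift));
}

mem_color color_of(size_t addr) {
    switch (shadow_get(addr)) {
    case 0:                  return MEM_RED;    /* not allocated, not initialized */
    case SH_ALLOC:           return MEM_YELLOW; /* allocated, not initialized */
    case SH_ALLOC | SH_INIT: return MEM_GREEN;  /* allocated and initialized */
    default:                 return MEM_BLUE;   /* initialized but freed */
    }
}

/* Allocation hands the byte to the program in the Yellow state. */
void mark_allocated(size_t addr) { shadow_set(addr, SH_ALLOC); }

/* Check one operation against the byte's state. Returns 1 and updates the
 * shadow bits if the operation is legal, or 0 (an error to report) if not. */
int check_and_apply(mem_op op, size_t addr) {
    mem_color c = color_of(addr);
    int owned = (c == MEM_YELLOW || c == MEM_GREEN);
    switch (op) {
    case OP_READ:
        return c == MEM_GREEN;              /* only initialized memory is readable */
    case OP_WRITE:
        if (!owned) return 0;
        shadow_set(addr, SH_ALLOC | SH_INIT);
        return 1;
    case OP_FREE:
        if (!owned) return 0;
        shadow_set(addr, shadow_get(addr) & ~(unsigned)SH_ALLOC);
        return 1;
    }
    return 0;
}
```

Under these assumptions, a byte that is allocated, written, read and freed passes through Yellow, Green and Blue in turn, and a second attempt to free it is rejected.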
Valgrind memcheck is a dynamic binary instrumentation tool which shadows each 8-bit byte of memory assigned to the user with eight bits to track memory value validity (initialization state) and one bit to track memory access validity (allocation state). Assuming one byte is made up of eight bits, it follows that one byte of application-employed memory results in a correctness checking overhead of one byte and one bit. As memory is allocated as a whole number of bytes, the single allocation state bit applies to all eight bits of an application-employed byte, so that each bit of application-employed memory is associated with two bits representing the memory correctness states. Valgrind memcheck uses the MPI profiling interface to provide wrappers to certain MPI functions so that memory checking can be performed when transfers are made between MPI processes.
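The nine-bits-per-byte shadowing described above can be sketched in C as follows. This illustrates the bookkeeping only and is not Valgrind's internal data structure: the function names, the fixed 64-byte arena addressed by offset, and the polarity of the validity bits (1 meaning "defined") are assumptions of the example. The point it shows is that validity is tracked per bit while access validity is tracked per byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARENA_BYTES 64

/* Eight validity bits per application byte, one per application bit
 * (assumed polarity: 1 = defined). */
static uint8_t vbits[ARENA_BYTES];
/* One access-validity bit per application byte, packed eight to a byte. */
static uint8_t abits[(ARENA_BYTES + 7) / 8];

int is_addressable(size_t addr) {
    assert(addr < ARENA_BYTES);
    return (abits[addr / 8] >> (addr % 8)) & 1;
}

/* Allocation or release of one byte; either way its contents become undefined. */
void set_addressable(size_t addr, int on) {
    assert(addr < ARENA_BYTES);
    uint8_t bit = (uint8_t)(1u << (addr % 8));
    if (on) abits[addr / 8] |= bit;
    else    abits[addr / 8] &= (uint8_t)~bit;
    vbits[addr] = 0;
}

/* A store to the bits selected by mask marks exactly those bits as defined.
 * Returns 0 if the byte may not be accessed at all. */
int store_bits(size_t addr, uint8_t mask) {
    if (!is_addressable(addr)) return 0;
    vbits[addr] |= mask;
    return 1;
}

/* A load of the bits selected by mask is valid only if the byte may be
 * accessed and every selected bit has been defined. */
int load_ok(size_t addr, uint8_t mask) {
    return is_addressable(addr) && (vbits[addr] & mask) == mask;
}

/* Shadow overhead in bits: 8 validity bits + 1 access bit per application byte. */
size_t shadow_bits_for(size_t app_bytes) { return app_bytes * 9; }
```

For example, after store_bits(3, 0x0F) on an allocated byte, load_ok(3, 0x0F) succeeds while load_ok(3, 0xF0) fails, a distinction that a two-bit-per-byte scheme cannot express. The cost is nine shadow bits per application byte rather than two.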
Parasoft's Insure++ is a source-code-level instrumentation tool for detecting C/C++ run-time memory errors. At present, however, there appears to be no documented support for MPI, or for RDMA operations of the kind suited to Partitioned Global Address Space languages.
There are currently no Remote Direct Memory Access (RDMA) instructions to support efficient and highly configurable memory correctness checking. One possibility for carrying out memory correctness checking is the use of the Valgrind/memcheck tool's MPI wrappers. These wrappers, however, do not currently support the MPI-2 single-sided communication functions, and therefore do not permit memory correctness checking in combination with RDMA.
Current program and memory correctness tools built into existing compilers are unable to account for communication, so the programmer must engage in labor-intensive (and consequently error-prone) debugging of memory correctness by writing the wrapper functions needed to make an existing tool work with MPI, and by printing out values individually.
Consequently, there is a need to combine remote direct memory access and memory correctness checking in a more efficient manner.