Parallel computer architectures generally provide multiple processors that can each be executing different tasks simultaneously. One such parallel computer architecture is referred to as a multithreaded architecture (MTA). The MTA supports not only multiple processors but also multiple streams executing simultaneously in each processor. The processors of an MTA computer are interconnected via an interconnection network. Each processor can communicate with every other processor through the interconnection network. FIG. 1 provides a high-level overview of an MTA computer. Each processor 101 is connected to the interconnection network and memory 102. Each processor contains a complete set of registers 101a for each stream. In addition, each processor also supports multiple protection domains 101b so that multiple user programs can be executing simultaneously within that processor.
Each MTA processor can execute multiple threads of execution simultaneously. Each thread of execution executes on one of the 128 streams supported by an MTA processor. Every clock time period, the processor selects a stream that is ready to execute and allows it to issue its next instruction. Instruction interpretation is pipelined by the processor, the network, and the memory. Thus, a new instruction from a different stream may be issued in each time period without interfering with other instructions that are in the pipeline. When an instruction finishes, the stream to which it belongs becomes ready to execute the next instruction. Each instruction may contain up to three operations (i.e., a memory reference operation, an arithmetic operation, and a control operation) that are executed simultaneously.
The state of a stream includes one 64-bit Stream Status Word (“SSW”), 32 64-bit General Registers (“R0-R31”), and eight 32-bit Target Registers (“T0-T7”). Each MTA processor has 128 sets of SSWs, of general registers, and of target registers. Thus, the state of each stream is immediately accessible by the processor without the need to reload registers when an instruction of a stream is to be executed.
The MTA uses program addresses that are 32 bits long. The lower half of an SSW contains the program counter (“PC”) for the stream. The upper half of the SSW contains various mode flags (e.g., floating point rounding, lookahead disable), a trap disable mask (e.g., data alignment and floating point overflow), and the four most recently generated condition codes. The 32 general registers are available for general-purpose computations. Register R0 is special, however, in that it always contains a 0. The loading of register R0 has no effect on its contents. The instruction set of the MTA processor uses the eight target is registers as branch targets. However, most control transfer operations only use the low 32 bits to determine a new program counter. One target register (T0) points to the trap handler, which may be an unprivileged routine. When the trap handler is invoked, the trapping stream starts executing instructions at the program location indicated by register T0. Trap handling is thus lightweight and independent of the operating system (“OS”) and other streams, allowing the processing of traps to occur without OS interaction.
Each MTA processor supports as many as 16 active protection domains that define the program memory, data memory, and number of streams allocated to the computations using that processor. The operating system typically executes in one of the domains, and one or more user programs can execute in the other domains. Each executing stream is assigned to a protection domain, but which domain (or which processor, for that matter) need not be known by the user program. Each task (i.e., an executing user program) may have one or more threads simultaneously executing on streams assigned to a protection domain in which the task is executing.
The MTA divides memory into program memory, which contains the instructions that form the program, and data memory, which contains the data of the program. The MTA uses a program mapping system and a data mapping system to map addresses used by the program to physical addresses in memory. The mapping systems use a program page map and a data segment map. The entries of the data segment map and program page map specify the location of the segment in physical memory along with the level of privilege needed to access the segment.
The number of streams available to a program is regulated by three quantities slim, scur, and sres associated with each protection domain. The current numbers of streams executing in the protection domain is indicated by scur; it is incremented when a stream is created and decremented when a stream quits. A create can only succeed when the incremented scur does not exceed sres, the number of streams reserved in the protection domain. The operations for creating, quitting, and reserving streams are unprivileged. Several streams can be reserved simultaneously. The stream limit slim is an operating system limit on the number of streams the protection domain can reserve.
When a stream executes a CREATE operation to create a new stream, the operation increments scur, initializes the SSW for the new stream based on the SSW of the creating stream and an offset in the CREATE operation, loads register (T0), and loads three registers of the new stream from general purpose registers of the creating stream. The MTA processor can then start executing the newly created stream. A QUIT operation terminates the stream that executes it and decrements both sres and scur. A QUIT_PRESERVE operation only decrements scur, which gives up a stream without surrendering its reservation.
The MTA supports four levels of privilege: user, supervisor, kernel, and IPL. The IPL level is the highest privilege level. All levels use the program page and data segment maps for address translation, and represent increasing levels of privilege. The data segment map entries define the minimum levels needed to read and write each segment, and the program page map entries define the exact level needed to execute from each page. Each stream in a protection domain may be executing at a different privileged level.
Two operations are provided to allow an executing stream to change its privilege level. A “LEVEL_ENTER lev” operation sets the current privilege level to the program page map level if the current level is equal to lev. The LEVEL_ENTER operation is located at every entry point that can accept a call from a different privilege level. A trap occurs if the current level is not equal to lev. The “LEVEL_RETURN lev” operation is used to return to the original privilege level. A trap occurs if lev is greater than the current privilege level.
An exception is an unexpected condition raised by an event that occurs in a user program, the operating system, or the hardware. These unexpected conditions include various floating point conditions (e.g., divide by zero), the execution of a privileged operation by a non-privileged stream, and the failure of a stream create operation. Each stream has an exception register. When an exception is detected, then a bit in the exception register corresponding to that exception is set. If a trap for that exception is enabled, then control is transferred to the trap handler whose address is stored in register T0. If the trap is currently disabled, then control is transferred to the trap handler when the trap is eventually enabled, assuming that the bit is still set in the exception register. The operating system can execute an operation to raise a domain signal exception in all streams of a protection domain. If the trap for the domain signal is enabled, then each stream will transfer control to its trap handler.
Each memory location in an MTA computer has four access state bits in addition to a 64-bit value. These access state bits allow the hardware to implement several useful modifications to the usual semantics of memory reference. These access state bits are two data trap bits, one full/empty bit, and one forward bit. The two data trap bits allow for application-specific lightweight traps, the forward bit implements invisible indirect addressing, and the full/empty bit is used for lightweight synchronization. The behavior of these access state bits can be overidden by a corresponding set of bits in the pointer value used to access the memory. The two data trap bits in the access state are independent of each other and are available for use, for example, by a language implementer. If a trap bit is set in a memory location, then an exception will be raised whenever that location is accessed if the trap bit is not disabled in the pointer. If the corresponding trap bit in the pointer is not disabled, then a trap will occur.
The forward bit implements a kind of “invisible indirection.” Unlike normal indirection, forwarding is controlled by both the pointer and the location pointed to. If the forward bit is set in the memory location and forwarding is not disabled in the pointer, the value found in the location is interpreted as a pointer to the target of the memory reference rather than the target itself. Dereferencing continues until either the pointer found in the memory location disables forwarding or the addressed location has its forward bit cleared.
The full/empty bit supports synchronization behavior of memory References. The synchronization behavior can be controlled by the full/empty control bits of a pointer or of a load or store operation. The four values for the full/empty control bits are shown below.
VALUEMODELOADSTORE0normalread regardlesswrite regardlessand set full1reservedreserved2futurewait for fullwait for fulland leave fulland leave full3syncwait for fullwait for emptyand set emptyand set fullWhen the access control mode (i.e., synchronization mode) is future, loads and stores wait for the full/empty bit of the memory location to be accessed to be set to full before the memory location can be accessed. When the access control mode is sync, loads are treated as “consume” operations and stores are treated as “produce” operations. A load waits for the full/empty bit to be set to full and then sets the full/empty bit to empty as it reads, and a store waits for the full/empty bit to be set to empty and then sets the full/empty bit to full as it writes. A forwarded location (i.e., its forward bit is set) that is not disabled (i.e., by the access control of a pointer) and that is empty (i.e., full/empty bit is set to empty) is treated as “unavailable” until its full/empty bit is set to full, irrespective of access control.
The full/empty bit may be used to implement arbitrary indivisible memory operations. The MTA also provides a single operation that supports extremely brief mutual exclusion during “integer add to memory.” The FETCH_ADD operation loads the value from a memory location, returns the loaded value as the result of the operation, and stores the sum of that value and another value back into the memory location.
Each protection domain has a retry limit that specifies how many times a memory access can fail in testing full/empty bit before a data blocked exception is raised. If the trap for the data blocked exception is enabled, then a trap occurs. The trap handler can determine whether to continue to retry the memory access or to perform some other action. If the trap is not enabled, then the next instruction after the instruction that caused the data blocked exception is executed.
A speculative load occurs typically when a compiler generates code to issue a load operation for a data value before it is known whether the data value will actually be accessed by the program. The use of speculative loads helps reduce the memory latency that would result if the load operation was only issued when it was known for sure whether the program actually was going to access the data value. Because a load is speculative in the sense that the data value may not actually be needed by the program, it is possible that a speculative load will load a data value that the program does not actually use. The following statements indicate program statement for which a compiler may generate a speculative load:
if i<Nx=buffer[i]endifThe following statement illustrate the speculative load that is placed before the “if” statement.
r=buffer[i]if i<Nx=rendifThe compiler has generated code to load the data value for buffer[i] into a general register “r” and placed it before the code generated for the “if” statement condition. The load of the data value could cause an exception, such as if the index i was so large that an invalid memory location was being accessed. However, the necessity of this exception is uncertain at that time since the invalid memory location will not be accessed by the original code unless the “if” statement condition is satisfied (i.e., i<N). Even if the “if” statement condition is satisfied, the exception would not have occurred until a later time. To prevent a speculative load from causing an exception to occur or occur too early, the MTA has a “poison” bit for each general register. Whenever a load occurs, the poison bit is set or cleared depending on whether an exception would have been raised. If the data in a general register is then used while the corresponding poison bit is set, then an exception is raised at the time of use. In the above example, the “r=buffer[i]” statement would not raise an exception, but would set the corresponding poison bit if the address is invalid. An exception, however, would be raised when the “x=r” statement is executed accessing that general register because its poison bit is set. The deferring of the exceptions and setting of the poison bits can be disabled by a speculative load flag in the SSW.
FIG. 2A illustrates the layout of the 64-bit exception register. The upper 32-bits contain the exception flags, and the lower 32 bits contain the poison bits. Bits 40-44 contain the flags for the user exceptions, which include a create stream exception, a privileged instruction exception, a data alignment exception, and a data blocked exception. A data blocked exception is raised when a data memory retry exception, a trap 0 exception, or a trap 1 exception is generated. The routine that is handling a data blocked exception is responsible for determining the cause of the data blocked exception. The exception register contains one poison bit for each of the 32 general registers. If the poison bit is set, then an attempt to access the content of the corresponding register will raise an exception.
FIG. 2B illustrates the layout of the 64-bit stream status word. The lower 32 bits contain the program counter, bits 32-39 contain mode bits, bits 40-51 contain a trap mask, and bits 52-63 contain the condition codes of the last four instructions executed. Bit 37 within the mode bits indicates whether speculative loads are enabled or disabled. Bit 48 within the trap mask indicates whether a trap on a user exception is enabled (corresponding to bits 40-44 of the exception register). Thus, traps for the user exceptions are enabled or disabled as a group.
FIG. 2C illustrates the layout of a word of memory, and in particular a pointer stored in a word of memory. Each word of memory contains a 64-bit value and a 4-bit access state. The 4-bit access state is described above. When the 64-bit value is used to point to a location in memory, it is referred to a “pointer.” The lower 48 bits of the pointer contains the address of the memory location to be accessed, and the upper 16 bits of the pointer contain access control bits. The access control bits indicate how to process the access state bits of the addressed memory location. One forward disable bit indicates whether forwarding is disabled, two full/empty control bits indicate the synchronization mode; and four trap 0 and 1 disable bits indicate whether traps are disabled for stores and loads, separately. If the forward disable bit is set, then no forwarding occurs regardless of the setting of the forward enable bit in the access state of the addressed memory location. If the trap 1 store disable bit is set, then a trap will not occur on a store operation, regardless of the setting of the trap 1 enable bit of the access state of the addressed memory location. The trap 1 load disable, trap 0 store disable, and trap 0 load disable bits operate in an analogous manner. Certain operations include a 5-bit access control operation field that supersedes the access control field of a pointer. The 5-bit access control field of an operation includes a forward disable bit, two full/empty control bits, a trap 1 disable bit, and a trap 0 disable bit. The bits effect the same behavior as described for the access control pointer field, except that each trap disable bit disables or enables traps on any access and does not distinguish load operations from store operations.
When a memory operation fails (e.g., synchronized access failure), an MTA processor saves the state of the operation. A trap handler can access that state. That memory operation can be redone by executing a redo operation (i.e., DATA_OP_REDO) passing the saved state as parameters of the operation. After the memory operation is redone (assuming it does not fail again), the trapping stream can continue its execution at the instruction after the trapping instruction.
The appendix contains the “Principles of Operation” of the MTA, which provides a more detailed description of the MTA.
While the use of a multithreaded architecture provides various benefits for the execution of computer programs, multithreaded architectures also add various complexities to the development and testing of application programs. Debugger programs, used to control execution of other executable code in order to identify errors and obtain information about the execution, are one type of application program which may face additional complexities in a multithreaded environment but may also benefit from capabilities of the environment.
For example, a common feature in debugger programs is the ability to set one or more breakpoints in the target code (i.e., the code to be debugged). When the executing target code encounters such a breakpoint, execution is halted and control of the target code execution is transferred to the debugger. On sequential machines (i.e., those with only one thread can execute at a time), breakpoints are often implemented by replacing an instruction in the target code with a breakpoint instruction to halt execution of the target code (e.g., a trap instruction or a jump to the debugger). At some point after execution of the target code has been halted by a breakpoint, a user of the debugger will indicate that execution of the target code should resume.
Upon receiving the indication to resume, the debugger first executes the replaced instruction in-line with the rest of the target code (i.e., at the memory location in which the code was originally loaded) before continuing execution. This in-line execution is accomplished by temporarily returning the replaced instruction to its original position in the target code, executing the next instruction of the target code (i.e., single stepping the target code) so that the replaced instruction is executed, re-replacing the replaced instruction with the breakpoint instruction so that future executions of this code will encounter the breakpoint, and then resuming execution of the target code at the instruction to be executed following the replaced instruction in the execution sequence. In-line execution allows the replaced instruction to be executed in the location in which it was originally loaded and in its original execution environment (e.g., using the current state of current register and stack values).
While the described breakpoint technique is appropriate for sequential machines, problems arise when this technique is used in a multithreaded environment. For example, when the replaced instruction is temporarily returned to the target code for the single-step in-line execution, other threads may execute the replaced instruction instead of the breakpoint. Thus, some threads may not break even though a valid breakpoint has been installed.
Debugger programs face other difficulties in providing desired capabilities regardless of whether execution occurs in sequential or multithreaded architectures. For example, in addition to setting breakpoints, debuggers often provide the capabilities to set watch points for monitoring when a value in a memory location changes, to evaluate user-supplied expressions in the current context of the target code, to single-step the evaluation of the target code, etc. To implement such capabilities, debuggers typically rely on OS support to obtain information about the state of the target code (e.g., values currently stored in memory locations) and to perform operations such as replacing a target code instruction with a breakpoint instruction. However, the level of support available can vary with different OSes, and the types of support may also vary with the type of target code. For example, debugging of an operating system typically requires a separate debugger (e.g., a kernel debugger) than for a user application program, and different debuggers may be needed for application programs written in different computer languages (e.g., Java, C++, or Fortran).
Another debugger difficulty arises if a breakpoint has been set on an instruction that is used by the debugger as well as by the target code (e.g., on a function in a shared library or on a commonly used function such as ‘print’). If the debugger executes the breakpoint, execution of the debugger may halt with no means to resume the execution. Thus, various steps must be taken to ensure that the debugger will not perform breakpoints. One memory-intensive approach that addresses this problem involves creating separate copies of any shared function so that the breakpoint set in the target code copy of the function will not be present in the debugger copy of the function.
Finally, when a debugger is not available to locate an error in target code or when only static dump state information (i.e., various information about the state of a computer system near the moment of system crash, such as a memory core dump or hardware scan file) for the target code is available, analysis of the static dump state information may be the only debugging recourse. Such analysis is typically performed manually by reviewing bit values, a time-consuming process which may reveal only limited information.