This disclosure relates generally to large-scale processing in a data processing system.
Applications executing at peta-scale, that is, performing more than one quadrillion operations per second, can comprise a million or more execution contexts or threads. This extremely large scale may present significant challenges to conventional debugging approaches.
A typical objective of any debugging technique applied to a massively parallel application is to locate a small number of failing processes, so that conventional debugging techniques can then be applied to examine those processes in detail. The volume of information available when debugging massively parallel applications has typically given rise to a set of approaches to the problem.
In one approach, a user inserts a logging capability of some kind into an application of interest. The logging capability may take the form of trace statements, the simplest form being print statements. In another approach, the application may trap and generate a core dump, enabling the user to examine the resulting core dump. In another approach, a conventional debugger, such as one obtained from a tools vendor, may be used. The debugger may allow real-time examination and control of the application. However, each debugger can typically view only a single process within a peta-scale application, and therefore typically provides no guidance to help the user determine which thread or process needs to be debugged. In the context of the previous approach, tools may group execution contexts and operate on a group of execution contexts using an interface similar to that of a conventional debugger. Although grouping reduces the amount of information presented to the user, it typically introduces a new difficulty: creating the groups of execution contexts in the first place.