The present invention relates to computer systems, and more particularly to monitoring and debugging of computer systems.
A computer system is a complex machine, and problem diagnostics and other monitoring and debugging operations of such a system can be complicated. FIG. 1 illustrates an exemplary computer system 110 having one or more processors 120, a memory 130, and ports 140. Each processor 120 executes computer programs stored in memory 130. All or part of memory 130 and ports 140 may (or may not) be integrated with one or more processors 120 into a single chip. Ports 140 can be connected to external devices 142 such as network links, keyboards, computer monitors, printers, etc. The ports may be wired or wireless.
Each processor 120 includes registers 150 which store data used by a computer program executed by the processor. Registers 150 also store state information for the computer program. The state information may include, for example, the program counter which stores a memory 130 address of a memory location containing the instruction being executed or to be executed by the processor. The state information may include flags indicating whether the result of the most recent arithmetic instruction was positive, negative, or zero. Other information may also be stored in the registers.
Multiple computer programs can be executed in parallel. When a processor 120 switches from one computer program to another, the processor's registers 150 are saved in memory 130, and the other program's values are loaded into the registers.
Each computer program is represented as one or more processes 154 (FIG. 1 shows processes 154.1, 154.2, 154.S). Each process 154 is associated with resources identified by data stored in memory 130. In particular, data 154M describes the memory area 158 allocated for the process in memory 130. Data 154F identifies files (e.g. disk files) opened by the process. Data 154R identifies the contents of registers 150: when a processor 120 interrupts execution of the process to execute another process, the processor's registers are stored as data 154R (for example, in the process's stack in the corresponding area 158); when the process execution is resumed by some processor 120, the data 154R is loaded into the processor's registers 150. Other resources may include locks used by the process to control access to other resources.
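The saving and restoring of registers described above can be illustrated with a toy model. This is only a sketch: the types registers_t and process_t, their fields, and the function names are hypothetical and do not correspond to any real processor or operating system; saved_regs merely plays the role of data 154R.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register file; a real processor has many more registers. */
typedef struct {
    uintptr_t program_counter;
    unsigned  flags;
    long      general[2];
} registers_t;

/* Hypothetical per-process record; saved_regs plays the role of data 154R. */
typedef struct {
    int         id;
    registers_t saved_regs;
} process_t;

/* Store the outgoing process's registers, then load the incoming ones. */
static void context_switch(registers_t *cpu, process_t *from, process_t *to)
{
    assert(from != to);
    from->saved_regs = *cpu;
    *cpu = to->saved_regs;
}

/* Interrupts toy process A, lets B change the registers, then resumes A.
   Returns 1 if A finds every register exactly as it left it. */
static int demo_context_switch(void)
{
    registers_t cpu = { .program_counter = 0x1000, .flags = 1u,
                        .general = { 7, 8 } };
    registers_t before = cpu;
    process_t a = { .id = 1 };
    process_t b = { .id = 2, .saved_regs = { .program_counter = 0x2000 } };

    context_switch(&cpu, &a, &b);   /* A is interrupted, B runs */
    cpu.program_counter += 4;       /* B advances and changes registers */
    cpu.general[0] = 99;
    context_switch(&cpu, &b, &a);   /* B is interrupted, A resumes */

    return cpu.program_counter == before.program_counter
        && cpu.flags == before.flags
        && cpu.general[0] == before.general[0]
        && cpu.general[1] == before.general[1];
}
```

Calling demo_context_switch() is expected to return 1, since process A's register values are restored from its saved copy regardless of what process B did in between.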
The processes are managed by the operating system, which itself is a computer program having one or more processes 154. In the example of FIG. 1, the process 154.S is an operating system process. Its memory area 158 includes process management module 156 with code and data for creating and terminating other processes 154, scheduling the other processes for execution on processors 120, maintaining process-related information including the process data 154M and 154F (which can be stored in module 156), and performing other process management tasks.
To monitor or debug the computer system, a computer developer may want to stop the computer system at any given time to examine the memory 130 and registers 150. However, in a production environment, the developer may want to monitor or debug the system 110 without stopping the system. Some computer systems allow the developer to get a snapshot of the memory area 158 occupied by any given process 154. For example, in some Unix-like operating systems, the developer may use a fork-and-kill method to generate a “core dump” file 160 for a process 154 without stopping the process.
Core dump 160 is a disk file created on a device 142, which may include a computer disk or some other storage. Core dump 160 contains the image of the memory area 158 and processor registers 150 for one process 154. The fork-and-kill method involves UNIX functions fork( ) and kill( ).
The fork( ) function, when called by any process 154, causes the operating system to create a new process (“child” process) identical to the calling process. FIG. 2 illustrates an example which initially included just two processes 154.1 and 154.2. Process 154.1 corresponds to memory area 158.1. The memory area of process 154.2 is not shown. Process 154.S includes process management module 156, which implements the fork( ) and kill( ) functions as shown at 180 and 190 respectively.
When process 154.1 calls fork( ), a child process 154.3 is created. Memory area 158.3 is allocated for the child process. Memory area 158.1 is copied to memory area 158.3, so that memory area 158.3 is identical to memory area 158.1 except possibly as needed to update the memory references in the memory area. (The memory copying may be delayed under the “copy-on-write” paradigm, but will be performed when the core file 160 is created by the child process 154.3 as described below.)
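The effect of the memory copy can be observed from user code. The following is a minimal sketch, assuming a POSIX system; the variable value and the function fork_copies_memory are hypothetical names chosen for illustration. The child overwrites its copy of a variable, and the parent then checks that its own copy is unchanged.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int value;   /* lives in the calling process's memory area */

/* Forks a child that overwrites its own copy of `value`.
   Returns 1 if the child saw 42 while the parent's copy stayed 1,
   0 if not, and -1 on error. */
static int fork_copies_memory(void)
{
    value = 1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        value = 42;      /* changes only the child's copy */
        _exit(value);    /* report the child's value via the exit status */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;

    return WEXITSTATUS(status) == 42 && value == 1;
}
```

Under copy-on-write, the physical copy of the affected page would typically be made only at the moment the child writes to value, but the result seen by both processes is the same as with an eager copy.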
The fork( ) implementation 180 also creates the data such as 154M, 154F, 154R for process 154.3 in suitable memory areas.
The fork( ) function can generally be used for many purposes unrelated to core dumps. For example, if computer system 110 is a network switch or router, then a new child process (such as 154.3) may be created by fork( ) for each incoming packet. The new process inherits the packet-processing code and data from the fork-calling process (such as 154.1), so only minimal initialization for the new process may be needed. When the new process finishes the packet processing, the new process may terminate using the exit( ) function call. The exit( ) function does not create a core dump.
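A per-packet child process of this kind might look like the sketch below, assuming a POSIX system. The functions process_packet and handle_packet_in_child are hypothetical, and the "processing" is a placeholder byte sum. The sketch uses _exit( ), which terminates the child normally as exit( ) does but skips the exit handlers and buffered-output flushing inherited from the parent.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical packet processing: sums the payload bytes and keeps
   the low 7 bits so that the result fits in an exit status. */
static int process_packet(const unsigned char *payload, size_t len)
{
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += payload[i];
    return (int)(sum & 0x7f);
}

/* Forks one child for the packet. The child inherits the payload and
   the processing code, does its work, and terminates normally.
   Returns the child's exit code, or -1 on error or if the child was
   terminated by a signal. */
static int handle_packet_in_child(const unsigned char *payload, size_t len)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(process_packet(payload, len));   /* normal exit, no core dump */

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For a three-byte payload containing 1, 2 and 3, handle_packet_in_child would return 6, and WIFEXITED reports a normal termination rather than a signal.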
In the example of FIG. 2, the child process 154.3 terminates with a kill( ) function call. This function, when executed by module 190, creates the core dump 160. The kill( ) function is called only by the child process 154.3, not by the parent process 154.1.
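The fork-and-kill sequence can be sketched as follows, assuming a POSIX system. The function name snapshot_via_fork_and_kill is hypothetical, and the choice of SIGABRT is an assumption: it is one of the signals whose default action is to terminate the process with a core dump. Whether a core file is actually written, and where, depends on system settings such as the core file size limit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that immediately sends itself SIGABRT. The parent
   keeps running and only collects the child's termination status.
   Returns the number of the signal that terminated the child,
   or -1 on error. */
static int snapshot_via_fork_and_kill(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        signal(SIGABRT, SIG_DFL);   /* make sure the default action applies */
        kill(getpid(), SIGABRT);    /* only the child is signalled */
        _exit(127);                 /* reached only if the signal was not delivered */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFSIGNALED(status))
        return -1;
    return WTERMSIG(status);
}
```

A return value equal to SIGABRT indicates that the child was terminated by the signal, while the calling process continued unaffected.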
This fork-and-kill method is limited, however, when applied to multithreaded processes. A multithreaded process includes multiple threads 210 (FIG. 3) which compete for processors 120. Each thread 210 is associated with the same memory, files, and possibly other resources as the corresponding process 154, but each thread 210 has its own copy of processor registers 154R. The operating system's process and thread management module 156 schedules individual threads 210, not processes 154, for execution on processors 120. In FIG. 3, process 154.1 has two threads 210.1, 210.2.
When a processor 120 is executing a thread 210 (say 210.1) and the thread calls fork( ), the operating system's fork( ) function 180 creates a new process, say 154.3, as in FIG. 2. However, only one thread 210 is created for the new process, and this thread is a copy of the calling thread 210.1. The other threads are not replicated, in order to simplify synchronization between the new process 154.3 and the threads of process 154.1. Therefore, when thread 210.1 of process 154.3 calls kill( ) in the fork-and-kill method, the core dump file 160 will contain the registers 154R for only the calling thread 210.1 of process 154.3 (which are the same or almost the same as the registers 154R of thread 210.1 of process 154.1). The registers 154R of the other threads of process 154.1 will be unavailable in the core dump.
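The absence of the other threads in the child process can be observed with a small experiment. This is a sketch assuming POSIX threads and C11 atomics; all names in it are hypothetical, and some toolchains require the -pthread option to build it. A worker thread keeps advancing a counter. After fork( ), the parent still sees the counter move, while in the child the counter stays frozen because no worker thread exists there.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static atomic_ulong ticks;   /* advanced only by the worker thread */
static atomic_int   stop;

static void *worker(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop))
        atomic_fetch_add(&ticks, 1);
    return NULL;
}

static void sleep_ms(long ms)
{
    struct timespec ts = { .tv_sec = ms / 1000,
                           .tv_nsec = (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Returns 1 if `ticks` changes during roughly 50 ms in this process. */
static int ticks_advance(void)
{
    unsigned long before = atomic_load(&ticks);
    sleep_ms(50);
    return atomic_load(&ticks) != before;
}

/* Returns 1 if the worker is observed running in the parent but not
   in the child, 0 otherwise, and -1 on error. */
static int fork_keeps_only_calling_thread(void)
{
    pthread_t tid;
    atomic_store(&stop, 0);
    unsigned long start = atomic_load(&ticks);
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;
    while (atomic_load(&ticks) == start)   /* wait until the worker runs */
        sleep_ms(1);

    pid_t pid = fork();                    /* called by this thread only */
    if (pid == 0)
        _exit(ticks_advance());            /* child: expected to report 0 */

    int parent_advances = (pid > 0) ? ticks_advance() : 0;
    int status = 0;
    pid_t waited = (pid > 0) ? waitpid(pid, &status, 0) : -1;

    atomic_store(&stop, 1);                /* shut the worker down */
    pthread_join(tid, NULL);

    if (pid < 0 || waited != pid || !WIFEXITED(status))
        return -1;
    return parent_advances == 1 && WEXITSTATUS(status) == 0;
}
```

Because the worker thread has no counterpart in the child, a core dump produced by the child cannot contain that thread's registers, which is the limitation described above.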