1. Field of the Invention
The present invention relates generally to multi-processor/multi-node networks, and particularly to a system and method for dynamically monitoring network packets communicated among the nodes, detecting errors and gathering debug data at the nodes in real-time.
2. Discussion of the Prior Art
Currently, in multi-processor/multi-node networked systems, one processor may experience a problem, but the data needed to debug that problem can be lost because the tools or personnel monitoring the system react slowly and because of the number of nodes involved in the network. The problem is further compounded if concurrent, timestamped data from many nodes/systems in many locations is required to debug the problem.
U.S. Pat. No. 5,119,377 describes a problem management system for a networked computer system whereby error detection code is placed within the software programs during program development. When an error or failure is detected at a node, a process is executed that captures only the data required to debug the software error. Data to be captured is defined statically before execution time.
U.S. Pat. No. 6,769,077 describes a remote kernel debugging system in which a host computer debugger remotely issues a command to stop execution of the core operating system of a target computer, and a snapshot of the physical memory of the target computer is extracted and stored by the host computer over a serial bus.
Other solutions to this problem include products such as local area network (LAN) “sniffers”, which monitor network packets in a passive manner. They collect network packets of a specific type by means of a filter definition. One drawback of such a solution is that data buffers can overflow very quickly, so that packets of interest are lost. Furthermore, such solutions do not trigger data collection based on problem occurrence. Another drawback is that any packet analysis requires post-processing of the data.
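By way of illustration only, the buffer-overflow drawback of a passive sniffer may be sketched as follows. The sketch is a simplified model and does not represent any particular product: the `PassiveSniffer` class, the packet fields (`seq`, `type`, `error`), and the buffer capacity are all hypothetical names and values assumed for the example.

```python
from collections import deque


class PassiveSniffer:
    """Hypothetical passive capture: stores packets that match a filter
    definition in a fixed-size buffer and never reacts to their content."""

    def __init__(self, packet_filter, capacity):
        self.packet_filter = packet_filter
        # A deque with maxlen silently discards the oldest entry when full.
        self.buffer = deque(maxlen=capacity)
        self.dropped = 0

    def observe(self, packet):
        if not self.packet_filter(packet):
            return  # packet type excluded by the filter definition
        if len(self.buffer) == self.buffer.maxlen:
            self.dropped += 1  # the oldest stored packet is about to be lost
        self.buffer.append(packet)


def simulate():
    # Assumed filter definition: capture only packets of type "tcp".
    sniffer = PassiveSniffer(lambda p: p["type"] == "tcp", capacity=3)
    # One packet carrying an error indication, followed by routine traffic.
    traffic = [
        {"seq": 0, "type": "udp", "error": False},
        {"seq": 1, "type": "tcp", "error": True},
    ] + [{"seq": n, "type": "tcp", "error": False} for n in range(2, 7)]
    for packet in traffic:
        sniffer.observe(packet)
    return sniffer


if __name__ == "__main__":
    s = simulate()
    print("dropped:", s.dropped)
    print("error packet retained:", any(p["error"] for p in s.buffer))
```

In this model the packet carrying the error indication is captured but is then displaced by later routine traffic, because nothing in the capture path is triggered by the occurrence of the problem itself; whatever remains in the buffer must still be analyzed after the fact.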
It would be highly desirable to provide a system and method for collecting information about a program failure occurring at a node of networked computers, where debug information is collected at the time of first error detection and collected dynamically from multiple systems at execution time.