1. Field of the Invention
This invention relates generally to systems for capturing event data needed to isolate and correct defects in digital systems and particularly to an integrated facility for the continuous programmable acquisition of trace data in a distributed multiprocessor system.
2. Description of the Related Art
Some modern digital systems use a distributed processing architecture to achieve high performance and continuous availability. Although performance and availability in such a distributed processing system are significantly improved, system complexity and debugging difficulty are also increased, especially in digital systems employing very-large-scale integration (VLSI) hardware. Traditional facilities such as In-Circuit Emulation (ICE) and external logic analyzers are not always practical for debugging modern distributed processing systems.
Debugging distributed processing systems with logic analyzers is not feasible because of the number of logic analyzers required to obtain from multiple processors the trace data needed to isolate a system problem. Even with a sufficient number of logic analyzers, some of the necessary signals may not be available at external pins in the distributed digital system. Distributed processing systems may be located remotely across buildings or cities. More importantly, even if the practitioner managed to attach many logic analyzers to the proper input/output (I/O) pins in the digital hardware, it is completely impractical to maintain tens, hundreds or even thousands of logic analyzers connected to every processor of a large distributed processing system at all times. Thus, the detection and analysis of a suspected problem devolves to a random procedure, where the practitioner connects several logic analyzers to some selected processors at some particular time, in the hope that the suspected problem can be coaxed to occur again in the instrumented processors and not elsewhere in the system. When this fails, the practitioner is obliged to move logic analyzers from one processor to another in a pseudo-random fashion, hoping to eventually stumble across the suspected design flaw. As distributed processing systems increase in complexity, the probability of discovering and correcting system design problems falls so that the debug schedule ultimately must expand without limit, which is economically unacceptable. This economic restriction leads to the arrival of digital systems at a user location with serious undiagnosed design flaws.
Separately, because of the volume, trace data generated by dozens of logic analyzers distributed across a digital system are often difficult to understand in context. A practitioner must examine data from one logic analyzer at a time and may not be able to simultaneously integrate data from many different logic analyzers. If data are stored for later analysis, the voluminous data entries must somehow be reassembled to properly depict the sequence of digital events throughout the distributed system. This serious problem has sharply limited debugging capability in distributed systems until now.
As is well-known in the art, digital design flaws are discovered and corrected in exponentially decreasing numbers over the various stages of design development. As the system design matures, diagnosis and cure of each new design flaw requires more time and effort. In a distributed digital system, these later "bugs" usually require analysis of large volumes of trace data assembled across many processors. Because of this, the practitioner may not be able to readily duplicate a system problem for which no data was acquired when first encountered. Conversely, acquisition and storage of debug data for all processors and all interfaces in a digital system all of the time is not feasible. Thus, the user is often obliged to assist as an unwilling partner in correcting these later bugs.
The increased density and performance of the new digital device technology also give rise to signal availability and speed problems. For instance, the event data needed for debugging may not be available at chip or card I/O pins suitable for connection to an external logic analyzer. This problem is exacerbated by increases in VLSI technology density. Similarly, increased logic speeds make it more difficult for a logic analyzer to keep up with device operating speed. Other hardware-related debugging problems include limited fan-out capacity in the VLSI chips connected to external logic analyzers and difficulties with event data synchronization across large numbers of logic analyzers operating at high speed. Finally, the usual ICE practices known in the art are often not feasible for debugging the latest systems where processor devices must be hard-soldered to a circuit card for performance and reliability purposes.
There is accordingly a clearly-felt need in the art for an improved debugging facility suitable for use in modern high-performance distributed digital processing systems. To appreciate the requirements for such a debugging system, note that debugging a distributed digital system includes (a) requirements for debugging software and firmware within each node of a distributed processing system, (b) requirements for debugging hardware within each node and (c) requirements for debugging both hardware and software underlying the internode data communications responsible for integrating the various node functions throughout the distributed digital system. Each of these separate issues requires a different debugging strategy. For instance, hardware design flaws are usually detected and corrected by "substitution", using ICE techniques known in the art, which are of little use in debugging system application software. Software is usually debugged using intrusive software analyzers, and internode message communication problems require analysis of global event data that are not available from within any particular node or combination of nodes.
The digital system art is replete with methods for resolving these three basic debugging issues, and most can be loosely classified as (1) in-circuit emulation (ICE) techniques, (2) dedicated hardware logic analyzer techniques, (3) intrusive software performance analyzer techniques using special interrupts and software "hooks", and (4) techniques using dedicated system hardware and software debugging elements.
Software debugging is often approached with some combination of external logic analyzer hardware and intrusive performance analyzer software. For instance, in U.S. Pat. No. 5,265,254, Blasciak et al. disclose a system of debugging software through the use of code markers inserted into spaces in the application source code during and after compilation. Blasciak et al. teach the addition of "intrusive" instructions or markers to the application software to produce simple, encoded memory references to memory or I/O locations that are always visible to an external logic analyzer as bus cycles but otherwise unused. While their technique is relatively unintrusive, their code markers are typically inserted at compile time or interactively during a debugging session and are not resident during normal system operation for capturing event data critical to unraveling an unexpected software glitch. Also, their technique requires external logic analyzer hardware, which is not feasible for large distributed systems.
In U.S. Pat. No. 5,274,811, Borg et al. disclose a method for quickly acquiring and using very long traces of mixed system and user memory references for debugging purposes by inserting intrusive code into the software undergoing debugging. Borg et al. store the results of their tracing operation until the application program execution can be interrupted to analyze the results of the tracing completed to date. By intermittently interrupting and analyzing, Borg et al. avoid the generation and storage of very long traces for later analysis and thereby avoid limitations on trace length. Thus, Borg et al. teach a useful solution to the general trace data length limitation known in the art and also avoid the external logic analyzer problem by using integrated hardware means for non-obtrusive generation of both software and hardware traces. However, they neither consider nor suggest methods for real-time debugging in a distributed system having many different processing nodes coupled together.
In Japanese patent JP 01-113841, the inventors describe a method for enhancing storage efficiency for trace data by discarding certain trace data that is unnecessary to the debugging procedure in a multi-tasking environment. Although the inventors consider means for accommodating the particular trace data duplication problems arising in a multi-tasking environment, they neither consider nor suggest methods for debugging in a distributed multiprocessor system.
Other practitioners have suggested improvements to various parts of the distributed data system debugging problem. For instance, V. A. Albaugh ("Combined Event Performance Trace For AIX", IBM Technical Disclosure Bulletin, Vol. 32, No. 10A, p. 101, March 1990) recommends a trace data collection mechanism consisting of a device driver, some trace recording routines and a process for reading the data and modifying the trace state. Albaugh uses intrusive software routines and a high-resolution timer for producing a multiplicity of time-stamped trace data entries, which are stored offline for later analysis, but neither considers nor suggests solutions to the larger distributed general multiprocessor debugging problem. R. B. Basham et al. ("Microcode Data Event Logging in a Global Variable Environment", IBM Technical Disclosure Bulletin, Vol. 35, No. 7, pp. 41-42, December, 1992) disclose a programmable microcode mechanism for tracing bit manipulation of any specified data area in a microprocessor control store. Basham et al. use intrusive software to define and identify data of interest and to log their occurrence for future analysis. The performance degradation imposed by their technique limits its usefulness to debugging microcode during the chip development cycle. M. G. Smith ("Real-Time, Trace-Driven Monitor for File System Performance", IBM Technical Disclosure Bulletin, Vol. 34, No. 5, pp. 392-394, October, 1991) discloses a program that monitors a computer file system and I/O system in real-time to report performance event data over an arbitrarily long measurement interval. Smith uses intrusive software to capture and store events at all levels of the file system and to produce a comprehensive set of file and memory access statistics. None of these practitioners consider or suggest broader debugging techniques for distributed digital systems.
W. C. Carlson et al. ("Storing Variable Length Data in a Circular Buffer", IBM Technical Disclosure Bulletin, Vol. 36, No. 3, pp. 491-493, March, 1993) disclose a method for storing variable-length program trace data in a circular buffer to minimize storage time when extraction time is unimportant. Similarly, R. E. Eveland et al. ("Technique for Storing Variable Length Data in a Circulating Buffer", IBM Technical Disclosure Bulletin, Vol. 26, No. 1, pp. 86-88, June, 1983) disclose a method for using a variable-length circular buffer to avoid segmentation of variable-length trace data entries. In Japanese patent JP 02-81141, the inventors disclose a technique for improving trace buffer effectiveness by using a trace control bit in the trace buffer pointer to ensure storage only of particular trace data in the trace buffer. These are all useful solutions to the trace data entry length problem in debugging software but do not in themselves suggest solutions to the broader distributed system debugging issues discussed above.
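For illustration only, the general idea of holding variable-length trace entries in a circular buffer can be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of Carlson et al., Eveland et al. or JP 02-81141: the class name TraceRing, the two-byte length prefix and the policy of discarding the oldest entries on overflow are all hypothetical choices made for the example. Unlike the Eveland et al. approach, this sketch lets an entry wrap across the end of the buffer.

```python
class TraceRing:
    """Fixed-size ring holding variable-length trace records (hypothetical sketch)."""

    HEADER = 2  # assumed: each record is prefixed by a 2-byte big-endian length

    def __init__(self, capacity: int) -> None:
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.head = 0  # offset of the oldest stored record
        self.used = 0  # number of bytes currently occupied

    def _write(self, offset: int, data: bytes) -> None:
        # Copy byte by byte so a record may wrap past the end of the buffer.
        for i, b in enumerate(data):
            self.buf[(offset + i) % self.capacity] = b

    def _read(self, offset: int, n: int) -> bytes:
        return bytes(self.buf[(offset + i) % self.capacity] for i in range(n))

    def _drop_oldest(self) -> None:
        size = int.from_bytes(self._read(self.head, self.HEADER), "big")
        total = self.HEADER + size
        self.head = (self.head + total) % self.capacity
        self.used -= total

    def append(self, record: bytes) -> None:
        total = self.HEADER + len(record)
        if total > self.capacity or len(record) >= 1 << (8 * self.HEADER):
            raise ValueError("record too large for this buffer")
        # Assumed overflow policy: discard oldest records until the new one fits.
        while self.capacity - self.used < total:
            self._drop_oldest()
        tail = (self.head + self.used) % self.capacity
        self._write(tail, len(record).to_bytes(self.HEADER, "big") + record)
        self.used += total

    def records(self) -> list:
        """Return the stored records, oldest first."""
        out = []
        offset, remaining = self.head, self.used
        while remaining > 0:
            size = int.from_bytes(self._read(offset, self.HEADER), "big")
            out.append(self._read((offset + self.HEADER) % self.capacity, size))
            offset = (offset + self.HEADER + size) % self.capacity
            remaining -= self.HEADER + size
        return out
```

In this sketch, storing an entry costs one length prefix plus a copy, while extraction walks the buffer from the oldest entry, which reflects the trade-off of fast storage at the expense of slower extraction described above.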
Some practitioners propose improvements to the in-circuit emulation (ICE) or "substitution" technique used to debug hardware. For instance, in U.S. Pat. No. 4,674,089, Poret et al. disclose an ICE circuit that includes capture logic that monitors the contents of the program address register, the internal data bus and various processor control lines and also includes trace data buffers for storing the captured event data. Their ICE circuitry is included on the same silicon chip with the microprocessor but is left unused after completion of the microprocessor hardware debugging procedure. In U.S. Pat. No. 4,782,461, Mick et al. disclose a useful technique for the logical grouping of facilities within a computer development system to provide breakpoint control, trace control and device emulators for the design, debugging and testing of computer systems. The Mick et al. system is essentially an in-circuit emulator for VLSI devices. Neither Mick et al. nor Poret et al. consider or suggest improvements for debugging distributed digital processors.
Some practitioners propose improved software performance analyzer techniques for debugging distributed multiprocessor systems. For instance, J. Garrison ("Distributed Trace; a Facility to Trace Data and Code Flows in a Requester/Server Environment", IBM Technical Disclosure Bulletin, Vol. 34, No. 4A, pp. 292-294, September, 1991) proposes a distributed trace (DT) facility for intrusively debugging concurrent processes in a processing network under the OS/2 operating system. Garrison limits these teachings to instruction-level tracing in a few target operating systems and neither considers nor suggests improved procedures for the debugging of distributed multinode systems.
Other practitioners describe software debugging techniques that rely on dedicated hardware and/or software facilities, often in conjunction with external logic analyzer hardware. For instance, in U.S. Pat. No. 4,879,646 Iwasaki et al. disclose a microprocessor chip design that includes a multistage pipeline structure dedicated to editing trace memory contents and tracing operations during system debugging. Iwasaki et al. essentially describe a dedicated on-chip hardware facility for tracing microprocessor instructions in advance so that the stored traces can be later analyzed to improve software debugging efficiency. They neither consider nor suggest solutions to the broader debugging requirements encountered in multinode distributed systems.
In U.S. Pat. No. 5,121,501, Baumgartner et al. disclose a method and apparatus for debugging software applications by inserting a limited number of software "hooks". They use a microprocessor system having a dedicated "output bus" for forwarding event data associated with the "hooks". Although Baumgartner et al. propose a useful technique for continuous production of high-volume performance trace data for an extended time, they require external logic analyzer hardware (a second processor) together with intrusive application software modifications to accomplish this result and neither consider nor suggest how their technique can be usefully adapted to debug a distributed multinode system.
In U.S. Pat. Nos. 4,845,615 and 5,103,394, Blasciak discloses a dedicated software performance analyzer facility for non-intrusively measuring six different software execution performance parameters. Blasciak measures memory activity in certain predetermined address ranges to produce performance data but neither considers nor suggests techniques for capturing the general range of event data necessary to effectively debug a distributed multinode processing system.
C. S. Graham et al. ("Integrated Debug Tool", IBM Technical Disclosure Bulletin, Vol. 32, No. 2, pp. 103-106, July, 1989) disclose a dedicated software-debugging kernel that permanently resides in the base microcode operating system to enhance debugging of hardware and software in a single processing system. However, Graham et al. consider only instruction level tracing in a single processor and do not suggest how their permanent kernel technique can be expanded to resolve the distributed system debugging issues described above.
There is clearly a need in the art for a trace data acquisition system that avoids the above-recited deficiencies. Such a system should provide a debugging capability that remains indefinitely with the product in the field to assist in resolving software and system integration design flaws that are encountered after factory release. The system should provide sufficient trace data to permit debugging of hardware and software as well as internodal integration and communications. These unresolved problems and deficiencies are clearly felt in the art and are solved by this invention in the manner described below.