Existing servers running UNIX brand or comparable operating systems (such as HP-UX) generally must remain running for extended periods, in some cases for months. However, such a server may, after prolonged use, respond slowly or become unresponsive. For example, owing to prolonged and heavy use, the server may have become so loaded that it cannot spawn even a single new process; even simple programs such as “ps” and “ls” may then fail with the error message “fork: no more processes”.
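By way of illustration only, the failure described above can be sketched from the point of view of a user process. The sketch assumes a POSIX-style system in which a failed fork reports EAGAIN (process table or per-user limit exhausted) or ENOMEM (insufficient memory). The helper names describe_fork_failure and try_spawn are hypothetical and are not part of any tool mentioned here.

```python
import errno
import os


def describe_fork_failure(exc):
    """Map an OSError raised by os.fork() to a short diagnosis (hypothetical helper)."""
    if exc.errno == errno.EAGAIN:
        # Assumption: this is the condition a shell reports as "fork: no more processes".
        return "process table or per-user process limit exhausted"
    if exc.errno == errno.ENOMEM:
        return "insufficient memory or swap to create a process"
    return "unexpected fork failure: " + os.strerror(exc.errno)


def try_spawn():
    """Attempt to create one trivial child process and report the outcome."""
    try:
        pid = os.fork()
    except OSError as exc:
        return describe_fork_failure(exc)
    if pid == 0:
        # Child: exit immediately without running any cleanup handlers.
        os._exit(0)
    # Parent: reap the child so that it does not linger as a zombie.
    os.waitpid(pid, 0)
    return "ok"
```

On a heavily loaded server a monitoring tool would take the failure branch of try_spawn, which is why such tools, being user processes themselves, may be unusable exactly when they are needed.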
In such situations, system administrators at user sites employ performance monitoring tools and “Live” Kernel debugging tools. However, these tools run essentially as user processes; because they too require system resources, they may themselves fail to run on a heavily loaded system. Also, kernel debuggers require the kernel to be booted with special flags, a further inconvenience.
Some operating systems include a system debugger that operates in the kernel space of their respective kernels and which can be invoked by entering a special key sequence at the system console. However, such system debuggers are often assembly level debuggers, so are limited to program debugging at the machine instruction level; they do not allow program debugging at the source level. In addition, they cannot kill offending processes or free up any resources from the system. Significantly, they require the kernel always to be booted with special debug options, such that these debuggers remain effective throughout the lifetime of the system. As a result, their use can significantly reduce the performance of the system at the lowest level, especially in interrupt and trap processing by the kernel, such effects arising from the memory requirements of running the debugger.
Alternatively, when a system hangs, the system administrator may induce a crash dump so that he or she can perform a post-crash (or ‘post-mortem’) analysis of the problem; this is time-consuming for the system administrator and incurs considerable downtime for users, particularly those of large enterprise servers. Such downtime is unwelcome for critical applications, because it reduces the availability of, for example, enterprise servers.
Similarly, following a ‘kernel panic’ (a software failure inside the kernel) in most implementations of UNIX, actual post-mortem analysis can begin only after a crash dump (which provides a snapshot of the physical memory state) has been written to the dump disk and subsequently saved to the file system. As mentioned above, this leads to considerable system downtime, which may not be acceptable to enterprise UNIX customers.
FIG. 1 is a schematic time-line 100 of the downtime in a crash dump based method of the background art. The system is booted up at time T=0. At T=t1 the system enters the hung state and the system administrator issues a forced crash. From T=t1 to T=t2 a file system buffer save is performed and the crash dump collection writes the dump to the dump device (i.e. the dump disk). At T=t2 the system is down and, from T=t2 to T=t3, firmware tests are performed by the system administrator. A system boot is performed at T=t3 and, from T=t3 to T=t4, the startup script (known in HP-UX as savecrash(1)) saves the dump to the file system. From T=t4 the system is again running normally.
The total downtime of the system is therefore ΔT=(t4−t1), the greater part of which is due to the time required for the crash dump collection to write the dump to the dump disk; the next greatest contribution to the downtime is the time required to copy the crash dump from the dump device to the file system. In most UNIX implementations, the crash dump readers cannot process the crash dump until it has actually been written onto the file system, because the dump on the dump disk device is not in a format that can be understood by the file system. This is essentially a two-step process, as servers cannot write a crash dump onto the file system directly while going down after the crash; the state of the file system itself may be inconsistent at that time, so using it may involve significant risk.
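The downtime accounting of FIG. 1 can be sketched as follows, purely for illustration. The class name CrashDumpTimeline and the numeric values in the usage note are hypothetical; only the instants t1 to t4 and the relation ΔT=(t4−t1) are taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashDumpTimeline:
    """Instants of the FIG. 1 time-line, in any consistent time unit."""
    t1: int  # system hangs and the forced crash is issued
    t2: int  # dump has been written to the dump device; system is down
    t3: int  # firmware tests are complete; system boot begins
    t4: int  # dump has been saved to the file system; normal running resumes

    def phases(self):
        """Return the duration of each downtime phase."""
        return {
            "write dump to dump device": self.t2 - self.t1,
            "firmware tests": self.t3 - self.t2,
            "boot and save dump to file system": self.t4 - self.t3,
        }

    def total_downtime(self):
        """Total downtime, i.e. t4 - t1."""
        return self.t4 - self.t1
```

For instance, with the invented values CrashDumpTimeline(t1=0, t2=40, t3=50, t4=75), expressed in minutes, the three phases last 40, 10 and 25 minutes respectively and the total downtime is 75 minutes; the phase durations always sum to the total.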
The main limitation of this approach is that the actual analysis of the problem can start only once the dump has been saved to the file system (such as with the aforementioned savecrash(1) utility) during the boot process that follows the crash. Also, users cannot use the system again until it has finished saving the dump to the file system. The time required to write the dump to the dump device itself depends on the size of the physical memory of the system and, as mentioned, this contributes the greatest amount of downtime. This approach is essentially the same for all kinds of crash dumps, whether after a software panic, a hardware failure or the detection of a hung system.
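The dependence of the dump time on physical memory size can be illustrated with a first-order estimate. This is a sketch under two stated assumptions that do not come from the description above: that the whole of physical memory is written, and that the dump device sustains a constant write throughput. The function name and the figures in the usage note are hypothetical.

```python
def estimate_dump_write_seconds(physical_memory_bytes, write_bytes_per_second):
    """Estimate the time to write a full memory dump to the dump device.

    Assumes the entire physical memory is dumped at a constant
    sustained throughput (both are simplifying assumptions).
    """
    if physical_memory_bytes < 0:
        raise ValueError("physical memory size must not be negative")
    if write_bytes_per_second <= 0:
        raise ValueError("write throughput must be positive")
    return physical_memory_bytes / write_bytes_per_second
```

Under these assumptions the estimate scales linearly with memory size: a hypothetical server with 64 GiB of physical memory and a dump device sustaining 100 MiB/s would need roughly 655 seconds for this phase alone, and doubling the memory would double that figure.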