Known implementations of multi-threaded applications collect trace data in a shared trace table to support first failure data capture (FFDC) for problem analysis and debugging. Some form of filtering is generally implemented, but trace data is still generated continuously for everything. Most traces can be filtered on input by setting classes or levels or by limiting the trace table size, filtered on output by extracting a subset of the data, or both. The problem with these methods is that either too little or too much data is collected. When a large number of trace points are enabled, the shared trace table wraps in a short period, losing crucial data. When too few trace points are enabled, there is insufficient documentation for first failure data capture. Another variation of the problem is evident in the event of one or more hung work units, where the private trace tables contain the full trace information indefinitely; the trace data will never be overwritten by other threads. These problems require users to recreate failing scenarios with additional traces enabled and potentially with special versions of programs containing additional traps or traces.
The growing power and workload managed by servers greatly increases the extent of this problem. As systems contain larger numbers of CPUs, the shared trace table causes performance degradation due to memory cache contention. The use of very large trace tables, or continuously off-loading the trace data to external media, simply defers the problem because eventually the size of the trace data must be limited. Extremely large amounts of trace data also create a management problem when transmitting data to a service center and when formatting and analyzing the trace information. Output filtering methods do not reduce the amount of data generated; they reduce only the data examined in the final step of analysis.
A new solution is required to enhance first failure data capture capability such that more trace data can be continuously generated while less data is written out into the shared trace table. This problem has been observed in many service provider applications where a long-running application accepts many work units from another layer in the same system or across the network. It has also been observed, in customer and system test problems, in various components of enterprise operating system environments that rely on clusters of servers. However, this problem is not limited to the above-mentioned applications, and it is not limited to IBM network applications. Other applications and vendors are similarly affected.
Therefore, the need exists for a method of creating and preserving maximum trace data for every work unit until the work unit is complete, while minimizing the trace data for successfully completed work units.
Further, the need exists for a method to reduce memory cache contention, which degrades server workload performance.
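The two needs stated above can be illustrated with a minimal sketch. It is not a described implementation: the names `SharedTraceTable`, `WorkUnitTrace`, `trace`, and `complete` are hypothetical, and the policy of writing a one-line summary on success is an assumption made only for illustration. Each work unit records every trace point in a private buffer and touches the shared table only once, at completion.

```python
from collections import deque


class SharedTraceTable:
    """Hypothetical fixed-size shared table; the oldest entries are lost on wrap."""

    def __init__(self, capacity):
        self.entries = deque(maxlen=capacity)

    def write(self, unit_id, records):
        for record in records:
            self.entries.append((unit_id, record))


class WorkUnitTrace:
    """Hypothetical private trace buffer owned by a single work unit."""

    def __init__(self, unit_id, shared):
        self.unit_id = unit_id
        self.shared = shared
        self.records = []

    def trace(self, message):
        # Full detail is kept privately; the shared table is not touched here.
        self.records.append(message)

    def complete(self, success):
        # Assumed policy: a one-line summary on success, full detail on failure.
        if success:
            summary = f"completed ({len(self.records)} trace points)"
            self.shared.write(self.unit_id, [summary])
        else:
            self.shared.write(self.unit_id, self.records)
        self.records = []


shared = SharedTraceTable(capacity=8)

ok = WorkUnitTrace("unit-1", shared)
for step in ("accept", "parse", "dispatch", "reply"):
    ok.trace(step)
ok.complete(success=True)

bad = WorkUnitTrace("unit-2", shared)
for step in ("accept", "parse", "dispatch failed"):
    bad.trace(step)
bad.complete(success=False)

for entry in shared.entries:
    print(entry)
```

In this sketch the successful work unit contributes a single entry to the shared table while the failed one contributes its complete history, and writes to the shared structure occur once per work unit rather than once per trace point.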