Operating systems (OS) are a key building block in the development of computing systems. In the several decades since personal computing became widespread, operating systems have grown substantially in complexity. The ability to multi-task and support concurrent processes gives even modest personal computers the appearance of simultaneously running a wide variety of programs, from word processors to Internet browsers.
In fact, though, virtually all microprocessor-based systems run one program at a time, using a scheduler to guarantee that each running program is given processor time in sufficient quantities to keep running. This task can become quite complex. Each process running on a computer can spawn individual tasks called threads. Some threads can spawn subordinate threads. It is common to have dozens, or even hundreds, of threads active at a given time. On the other hand, the computer may have a limited number of resources, such as disk storage or network input/output. Even though each resource can often support multiple threads, in many cases a thread may have to wait for access to a given resource until a different thread releases it.
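The time-sharing idea described above can be illustrated with a minimal round-robin sketch. It is only an illustration under simplifying assumptions: each "program" is modeled as a Python generator that yields once per unit of work, and the names `task` and `round_robin` are hypothetical, not taken from any real operating system scheduler.

```python
from collections import deque


def task(name, steps):
    """A toy 'program' that needs `steps` slices of processor time."""
    for i in range(steps):
        yield f"{name}:{i}"


def round_robin(tasks):
    """Run one slice of one task at a time until every task has finished."""
    queue = deque(tasks)
    trace = []
    while queue:
        current = queue.popleft()
        try:
            # Only this one task runs during this slice.
            trace.append(next(current))
        except StopIteration:
            continue  # task finished; do not requeue it
        queue.append(current)  # go to the back of the line for another slice
    return trace


if __name__ == "__main__":
    print(round_robin([task("A", 2), task("B", 3)]))
```

Although only one task advances per slice, the interleaved trace gives the appearance that both are running at the same time.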
A thread can lock a resource it is using, making it unavailable to other threads. A common situation arises when two or more threads each require a resource that is locked by another of the threads. When threads lock each other's resources in this way, a deadlock may occur. Typically, a timeout timer fires when inactivity is observed over a predetermined period and kills one or more of the involved threads. Unfortunately, most users are less patient than the timers and will intervene before the timeout expires with a reset or other drastic action. The timeout period can be shortened to preempt users' impatience, but at the risk of killing threads that are slow but not deadlocked.
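The circular wait and the timeout remedy can be sketched as follows. This is a simplified illustration, not the mechanism of any particular operating system: instead of an external timer killing a thread, each thread here gives up on its own by using the `timeout` argument of `threading.Lock.acquire`. The helper names `worker` and `run_demo` are hypothetical, and the two barriers exist only to make the demonstration deterministic.

```python
import threading


def worker(first, second, start, done, results, name, timeout):
    """Hold `first`, then try to take `second` with a timeout."""
    with first:
        start.wait()  # both threads now hold their first lock
        acquired = second.acquire(timeout=timeout)
        if acquired:
            second.release()
        results[name] = acquired
        done.wait()  # keep holding `first` until both attempts are over


def run_demo(timeout=0.2):
    """Two threads take two locks in opposite order, producing a circular wait."""
    lock_a, lock_b = threading.Lock(), threading.Lock()
    start, done = threading.Barrier(2), threading.Barrier(2)
    results = {}
    threads = [
        threading.Thread(
            target=worker,
            args=(lock_a, lock_b, start, done, results, "t1", timeout),
        ),
        threading.Thread(
            target=worker,
            args=(lock_b, lock_a, start, done, results, "t2", timeout),
        ),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_demo())
```

Without the timeout, both threads would block forever. With it, each attempt fails after the chosen period, which also shows the trade-off noted above: a short timeout ends the wait sooner but would equally abandon a lock that was merely slow to become free.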
Another way to address deadlocks is strict monitoring of every locking relationship. However, in modern high-clock-rate systems, locks can be placed and released in a matter of microseconds, and it is not unusual for hundreds of locks to exist at any moment. Strict monitoring may therefore require more processor resources than the threads being monitored, and the associated memory write times could slow processing to a crawl.
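One way to picture what strict monitoring entails is a wait-for graph that is searched for cycles. The sketch below is a minimal illustration under stated assumptions: the monitor is assumed to maintain two tables, `owners` (lock to owning thread) and `waiting` (thread to the single lock it is blocked on), and the function name `find_deadlock` is hypothetical.

```python
def find_deadlock(owners, waiting):
    """Return a list of threads forming a circular wait, or None.

    owners:  dict mapping lock id -> thread id currently holding it
    waiting: dict mapping thread id -> lock id it is blocked on
    """
    # Thread X waits for thread Y if X is blocked on a lock that Y holds.
    waits_for = {
        thread: owners[lock]
        for thread, lock in waiting.items()
        if lock in owners
    }
    for start in waits_for:
        seen = []
        current = start
        # Follow the chain of waits until it ends or revisits a thread.
        while current in waits_for and current not in seen:
            seen.append(current)
            current = waits_for[current]
        if current in seen:
            return seen[seen.index(current):]  # the threads on the cycle
    return None


if __name__ == "__main__":
    owners = {"L1": "T1", "L2": "T2"}
    waiting = {"T1": "L2", "T2": "L1"}
    print(find_deadlock(owners, waiting))
```

Both tables would have to be updated on every lock and unlock operation, which suggests why this bookkeeping can become costly when locks are taken and released within microseconds.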
The standard approaches to diagnosing and solving deadlocks are live debugging of the application process, or capturing detailed information about the processes involved from memory at the time of the failure for post-mortem analysis. Because a first thread's failure may be due to its dependency on a second thread's failure, finding the root source of a failure may be complicated. To find the root cause, the other thread or process responsible for the failure must be identified. However, the root cause is difficult to identify during post-mortem analysis because the information needed to trace the root cause thread is not included in the process memory dump. Furthermore, even if the root cause thread can be identified through additional debugging using the process memory dump, it may be impossible to debug further because information about the root cause process was not collected at the time of the failure.
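The dependency chain described above can be made concrete with a small sketch. It assumes, purely for illustration, that a record of which thread each blocked thread depends on is available as a dictionary; as noted, a conventional process memory dump does not include this information. The name `trace_root_cause` and the thread labels are hypothetical.

```python
def trace_root_cause(failed, waits_on):
    """Follow dependencies from a failed thread to the thread at the root.

    failed:   id of the thread whose failure was first observed
    waits_on: dict mapping thread id -> thread id it depends on
    Returns the chain of thread ids, ending at the root cause candidate.
    """
    chain = [failed]
    while chain[-1] in waits_on:
        nxt = waits_on[chain[-1]]
        if nxt in chain:
            break  # circular dependency; stop instead of looping forever
        chain.append(nxt)
    return chain


if __name__ == "__main__":
    # Hypothetical labels of the form "process:thread".
    waits_on = {"ui:1": "worker:7", "worker:7": "db:3"}
    print(trace_root_cause("ui:1", waits_on))
```

In this example the failure observed in `ui:1` leads through `worker:7` to `db:3`, which belongs to a different process. This is the situation in which a dump of the first process alone would not be enough for further debugging.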