In a multi-node system, tasks run concurrently across the nodes of the system. A task may wait for the completion of other local or remote tasks. Timers are often used to prevent a task from waiting forever for another task to complete.
According to one scheme, timers may be set individually and loosely. For example, a software developer who creates the software performing these tasks may set the timers based on an understanding of what the likely runtime environment will be. Likewise, a system administrator managing the multi-node system may set the timers based on an understanding of what the actual runtime environment is.
Because tasks may be interrelated in complex ways, the expiration of a timer in one task often affects other tasks. For example, a database access task may depend on an OS task, which in turn may depend on a disk I/O task. When a timer in the disk I/O task expires, the disk I/O task may experience a timeout error. In turn, the timeout error may be returned to the OS task and then to the database access task. Thus, a timeout error occurring in one task may have cascading negative effects on other tasks.
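The dependency chain in the example above can be illustrated with a minimal sketch. It is not drawn from any particular system: the names TaskTimeout, wait_with_timer, and run_chain, the simulated device delay, and all timer values are assumptions chosen only to show how a single expired timer in the disk I/O task may surface as an error in every task that depends on it.

```python
import concurrent.futures as cf
import time


class TaskTimeout(Exception):
    """Hypothetical error raised when a task's timer expires."""


def wait_with_timer(pool, fn, timeout_s, waiter):
    """Run fn on the pool and wait at most timeout_s seconds for it."""
    future = pool.submit(fn)
    try:
        # A TaskTimeout raised inside fn is re-raised here unchanged.
        return future.result(timeout=timeout_s)
    except cf.TimeoutError:
        raise TaskTimeout(f"{waiter}: timer of {timeout_s}s expired") from None


def run_chain(device_latency_s, disk_timer_s):
    """Database task -> OS task -> disk I/O task -> simulated device.

    Returns the list of tasks that observed a timeout error, in order.
    """
    events = []
    with cf.ThreadPoolExecutor(max_workers=4) as pool:

        def device():
            time.sleep(device_latency_s)  # simulated slow disk (assumption)
            return b"block"

        def disk_io_task():
            try:
                return wait_with_timer(pool, device, disk_timer_s, "disk I/O")
            except TaskTimeout:
                events.append("disk I/O task: timeout error")
                raise

        def os_task():
            try:
                # Generous timer: the OS task fails only because its callee did.
                return wait_with_timer(pool, disk_io_task, 5.0, "OS")
            except TaskTimeout:
                events.append("OS task: timeout error returned")
                raise

        def database_task():
            try:
                return wait_with_timer(pool, os_task, 5.0, "database access")
            except TaskTimeout:
                events.append("database access task: timeout error returned")
                raise

        try:
            database_task()
        except TaskTimeout:
            pass  # the whole chain has failed at this point
    return events


if __name__ == "__main__":
    for line in run_chain(device_latency_s=0.3, disk_timer_s=0.05):
        print(line)
```

In this sketch only the disk I/O timer is short, yet all three tasks record an error, which mirrors the cascading effect described above. With a disk timer longer than the device delay, the chain completes and no task records an error.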
In a loosely managed system, multiple inopportune timeout errors caused by a common problem may occur at substantially the same time. These near-simultaneous timeout errors may cause a part, or all, of a node to be deemed out of service, and may even bring down other nodes in the multi-node system.
Accordingly, techniques are needed to improve the management of timers in a multi-node system.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.