Larger computer systems generally include a number of discrete subsystems, known as cells, for performing tasks under control of an operating system. In these systems, multiple copies of an operating system may be running at the same time. Each copy of the operating system is referred to as an instance of the operating system. Each instance of the operating system has one or more cells allocated to it and performs tasks by directing those cells to perform them. The type and number of cells allocated to an instance of the operating system may vary over time with the type and number of tasks the instance has to perform.
A system may include a number of cells that are not allocated to any instance of the operating system at a given time. These cells, known as floating cells, remain unused until they are allocated to an instance of the operating system. The other cells, known as allocated or owned cells, are under the control of an instance of the operating system. Many of the allocated cells may be in constant use by their respective operating system instances. Some allocated cells, however, may go unused by the instance to which they are allocated for relatively long periods of time.
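The distinction between floating cells, allocated cells, and allocated cells that go unused can be illustrated with a minimal sketch. The names below (Cell, owner, idle_seconds, floating_cells, idle_allocated_cells) and the idle-time threshold are hypothetical and are introduced only for illustration; they are not drawn from any particular system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cell:
    """Hypothetical model of one cell in a multi-cell system."""
    cell_id: int
    # Name of the operating system instance that owns the cell.
    # None means the cell is floating (not allocated to any instance).
    owner: Optional[str] = None
    # Assumed bookkeeping value: seconds since the cell last performed a task.
    idle_seconds: float = 0.0

    @property
    def is_floating(self) -> bool:
        return self.owner is None


def floating_cells(cells: List[Cell]) -> List[Cell]:
    """Return the cells not allocated to any operating system instance."""
    return [c for c in cells if c.is_floating]


def idle_allocated_cells(cells: List[Cell], threshold_seconds: float) -> List[Cell]:
    """Return allocated cells unused by their owner for at least the threshold."""
    return [
        c for c in cells
        if not c.is_floating and c.idle_seconds >= threshold_seconds
    ]


# Example population: two instances ("os_a", "os_b") and one floating cell.
cells = [
    Cell(0, owner="os_a", idle_seconds=2.0),
    Cell(1, owner="os_a", idle_seconds=86_400.0),  # allocated but long unused
    Cell(2, owner="os_b", idle_seconds=0.5),
    Cell(3),                                       # floating
]
```

In this sketch, floating_cells(cells) selects cell 3, and idle_allocated_cells(cells, 3600.0) selects cell 1, which is allocated to an instance but has not been used for a relatively long period.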
The reliability of a computer system may depend on the reliability of the individual cells in the system. For example, a failing cell could cause other cells in the system to fail and could cause undesirable results during operation of the system. Because both floating and allocated cells may be unused for relatively long periods of time, failures associated with these cells may remain undetected for extended periods and may cause undesirable results when they do appear.
Accordingly, it would be desirable to detect cell failures in a computer system before the failures cause undesirable results during operation of the system.