1. Field of the Invention
The present invention relates to techniques for resetting components of a computer system and, more particularly, to techniques for resetting agents in a computer system without disrupting the operation of the computer system.
2. Related Art
All computer systems include a reset architecture of some kind. A computer system's reset architecture is responsible for resetting some or all of the components of the system to an initial state. A reset may be initiated, for example, when a computer system is booted up, in response to a user pressing a hardware reset button, or in response to an automated or user-invoked software reset instruction. If the computer system crashes, it may be necessary for the user to invoke a hard reset by pressing a hardware reset button, thereby causing the computer system's memory and other components to be re-initialized so that they become usable again. Frequently, many or all components within a computer system are reset by the same reset signal, thereby ensuring that the overall system starts up in a defined state. The reset architecture in a standalone desktop computer, for example, typically uses a single reset signal to initiate a reset of all necessary components.
More complex computer systems may include multiple autonomous devices, such as embedded microcontrollers, system processors, or sets of complex logic. Each of these devices—referred to herein as “agents”—may have a distinct reset source. The term “multi-agent system” is used herein to refer to any computer system that includes multiple agents.
One example of a multi-agent computer system is a partitionable server, also referred to as a “consolidation server” or a “multi-partition computer.” Referring to FIG. 1A, for example, a functional block diagram is shown of a prior art partitionable server 100. The partitionable server 100 is a single physical computer system that is logically subdivided into multiple partitions 104a–c, each of which is allocated a portion of the server's hardware and/or software resources. Each of the partitions 104a–c may execute its own operating system and software applications. For example, as shown in FIG. 1A, partitions 104a–c execute operating systems 114a–c, respectively.
More generally, each of the partitions 104a–c is intended to be functionally equivalent to, and therefore externally indistinguishable from, a distinct standalone computer. Partitionable servers are sometimes referred to as “consolidation servers” because they may be used to consolidate several physical servers into one physical server having multiple partitions, each of which performs the functions of the physical server that it replaces. A conventional desktop or laptop computer may be considered to be a special case of a multi-partition computer, in which the number of partitions is one.
In the particular example shown in FIG. 1A, the partitionable server 100 also includes a plurality of agents 108a–c. The partitions 104a–c run on main system power in a main power domain 102, while the agents 108a–c run on auxiliary power in an auxiliary power domain 106, meaning that the agents 108a–c can continue to receive power even when the main power domain 102 is not providing power. In the example illustrated in FIG. 1A, each of the agents 108a–c monitors and supports a corresponding one of the partitions 104a–c. Partitions 104a–c and agents 108a–c communicate with each other over communications links 112a–c, respectively.
Each of the agents 108a–c includes its own reset circuitry, and agents 108a–c are capable of being independently reset by reset signals transmitted on reset lines 110a–c, respectively. As a result, it is possible for some of the agents 108a–c to be in the process of resetting while corresponding ones of the partitions 104a–c are still running. In some cases, one of the agents 108a–c and a corresponding one of the partitions 104a–c may be communicating with each other when the agent goes into reset unexpectedly (e.g., as the result of a watchdog timer triggering a reset or a user forcing a reset).
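The arrangement described above may be illustrated, purely as a minimal sketch, by the following Python model. The class names, field names, and the build_server helper are hypothetical and are not part of the server 100; the sketch assumes only that each agent has its own reset line and that resetting an agent does not by itself change the state of the corresponding partition.

```python
from dataclasses import dataclass
from enum import Enum


class AgentState(Enum):
    READY = "ready"
    RESETTING = "resetting"


@dataclass
class Agent:
    """Hypothetical model of one agent in the auxiliary power domain."""
    name: str
    state: AgentState = AgentState.READY

    def assert_reset(self) -> None:
        # Models a reset signal arriving on this agent's own reset line.
        self.state = AgentState.RESETTING

    def complete_reset(self) -> None:
        # Models the agent returning to its initial, usable state.
        self.state = AgentState.READY


@dataclass
class Partition:
    """Hypothetical model of one partition in the main power domain."""
    name: str
    agent: Agent
    running: bool = True


def build_server() -> list:
    # One agent per partition, mirroring agents 108a-c and partitions 104a-c.
    agents = [Agent(f"108{suffix}") for suffix in "abc"]
    return [Partition(f"104{suffix}", agent)
            for suffix, agent in zip("abc", agents)]


partitions = build_server()

# Reset only the first agent; its partition and the other agents are untouched.
partitions[0].agent.assert_reset()
```

In this sketch the first partition is still marked as running while its agent is in the resetting state, which is the condition under which the two may be communicating when the reset occurs.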
In multi-partition computers, such as the partitionable server 100 shown in FIG. 1A, it is highly desirable that the partitions 104a–c be isolated and independent, so that a failure (such as an operating system crash) in one of the partitions 104a–c does not cause a failure in other ones of the partitions 104a–c. Achieving this goal can be challenging for the system designer in many ways. In particular, it can be challenging to design the system 100 so that the act of resetting one of the agents 108a–c does not require the corresponding one of the partitions 104a–c, or the entire server 100, to be reset.
In most cases, the unexpected reset of one of the agents 108a–c will not disrupt the operation of either the corresponding partition or other ones of the partitions 104a–c in the server 100. In fact, partitionable servers and other multi-agent systems are typically designed to handle such an event gracefully. In certain circumstances, however, the unexpected reset of one of the agents 108a–c may cause undesirable effects, such as causing the corresponding one of the partitions 104a–c, or even the entire server 100, to crash. Typically, after such a crash the server 100 can be brought back into an operational state only by powering down the entire server 100 and then powering it back up again. This is one example of a “hard reset.” A complete system crash and reboot is extremely undesirable, particularly in cases in which the server 100 is relied upon for constant connectivity by hundreds or even thousands of other computer systems and peripherals.
Consider, for purposes of example, the agent 108a and the corresponding partition 104a. One set of circumstances under which an unexpected reset of the agent 108a may cause the corresponding partition 104a (or the entire server 100) to crash is when the partition 104a is in a run state in which the operating system 114a executing on the partition 104a assumes that the agent 108a will always be available for communication over the communications link 112a. Agents that may be relied upon for such constant availability include, for example, input/output (I/O) controllers, hard disk drive controllers, local area network (LAN) controllers, manageability processors, crossbar circuitry, bus bridges, and circuits for monitoring and/or controlling components such as cooling fans. If the operating system 114a attempts to communicate with the agent 108a over the communications link 112a and the agent 108a does not respond (e.g., because the agent 108a is in the process of resetting), the operating system 114a may crash, thereby making the partition 104a inoperable until it is reset.
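The failure mode just described may be sketched as follows. This is an illustrative model only: the classes, the AgentUnavailable exception, and the "read-fan-speed" request are hypothetical names introduced for the example, and the sketch assumes an operating system that has no fallback path when the agent fails to respond.

```python
class AgentUnavailable(Exception):
    """Raised when the agent does not answer on the communications link."""


class Agent:
    """Hypothetical agent that cannot respond while it is resetting."""

    def __init__(self):
        self.resetting = False

    def handle(self, request):
        if self.resetting:
            raise AgentUnavailable("no response: agent is in reset")
        return f"ack:{request}"


class OperatingSystem:
    """Hypothetical OS that assumes its agent is always reachable."""

    def __init__(self, agent):
        self.agent = agent
        self.crashed = False

    def request(self, message):
        try:
            return self.agent.handle(message)
        except AgentUnavailable:
            # No fallback path exists, so the missing reply is fatal.
            self.crashed = True
            return None


agent_108a = Agent()
os_114a = OperatingSystem(agent_108a)

first = os_114a.request("read-fan-speed")   # agent ready, request answered
agent_108a.resetting = True                 # unexpected reset begins
second = os_114a.request("read-fan-speed")  # no reply, OS marks itself crashed
```

The first request succeeds because the agent is available. The second request is issued while the agent is in reset, receives no reply, and leaves the modeled operating system in a crashed state.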
Therefore, under such conditions it is unsafe to reset the agent 108a because doing so may cause the corresponding partition 104a to crash. Any run state of a partition in which resetting the corresponding agent is likely or certain to cause the partition to crash will be referred to herein as an “unsafe run state.” Any run state of a partition in which resetting the corresponding agent is neither likely nor certain to cause the partition to crash will be referred to herein as a “safe run state.”
When the partition 104a, for example, is in a safe run state, conventional techniques may be employed to reset the agent 108a because resetting the agent 108a will not cause the partition 104a or the other partitions 104b–c to crash. When the partition 104a is in an unsafe run state, however, a different reset scheme must be used to avoid the undesirable effects described above.
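The distinction between safe and unsafe run states may be expressed, as a minimal sketch, by a simple decision function. The RunState and ResetPlan enumerations and the plan_agent_reset function are hypothetical names introduced for this example. The sketch records only that an unsafe run state calls for a different reset scheme; it does not attempt to describe what that scheme is.

```python
from enum import Enum


class RunState(Enum):
    SAFE = "safe"      # resetting the agent is unlikely to crash the partition
    UNSAFE = "unsafe"  # resetting the agent is likely or certain to crash it


class ResetPlan(Enum):
    CONVENTIONAL = "conventional reset"
    ALTERNATIVE_REQUIRED = "different reset scheme required"


def plan_agent_reset(partition_state: RunState) -> ResetPlan:
    """Choose how an agent reset should proceed for a given partition run state."""
    if partition_state is RunState.SAFE:
        # Conventional techniques suffice; the partition will not crash.
        return ResetPlan.CONVENTIONAL
    # A conventional reset here risks crashing the partition or the server.
    return ResetPlan.ALTERNATIVE_REQUIRED
```

For example, plan_agent_reset(RunState.SAFE) selects the conventional reset, while plan_agent_reset(RunState.UNSAFE) indicates that a different scheme is required.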
What is needed, therefore, are improved techniques for resetting agents in computer systems.