1. Field of the Invention
This invention relates to a method and system for recovering devices on a computer network. More specifically, the invention provides such a method and system for an array of devices and processors having a hierarchical structure, such as non-uniform memory access (“NUMA”) class servers.
2. Background of the Invention
Computer processors and devices may be linked together through a computer network. By “device” is meant any device that adds capacity to the network, such as a disk storage device, an array of disk drives, or similar devices. Data is transferred between the processors and devices through an input/output (“I/O”) request. An I/O request is any software operation that transfers data to or from a processor or device.
In a computer network, the processors and devices are all identified to the network by a unique identifier or address. Unique identifiers or addresses are also used to define and identify divisions within the devices and processors such as memory locations, files, application programs, and users. A path is a route to or between address points or nodes within the organized network structure. By the term “node” is meant a connection point for data transmissions. A node may be a redistribution point or an end point for data transmissions. A node is generally programmed or engineered to recognize and process or forward transmissions to other nodes.
When the network connections are established, an operating system is loaded onto the processor or processors such that applications and devices may be run and controlled from the operating system. The operating system identifies the addresses of all devices, processors, and applications in the network. Devices, processors, and applications are all examples of nodes in the network. A system administrator may manually identify all nodes in the network, or alternatively, the operating system may issue standard commands to determine which nodes are available on the network.
As I/O requests are issued and processed between nodes, exception conditions can occur. By “exception” is meant a condition that causes a program or processor to branch to a different routine. Exception conditions are typically error conditions and can refer to either hardware or software conditions. An example of an exception condition is where a device issues an I/O request and never receives a response. After a given amount of time, the I/O request will “time out,” leading to the presumption that the I/O request was not processed. There are several possible reasons for this type of exception condition. The physical cable connection between the device and the processor or device to which the I/O request was directed might be severed, or the processor or device might be unable to process the number of I/O requests being issued.
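By way of illustration only, the time-out exception described above may be sketched as follows. The names `IOTimeout` and `issue_io_request`, the `send` callable, and the use of a worker thread are assumptions made for this sketch and are not taken from any particular operating system or standard.

```python
import queue
import threading


class IOTimeout(Exception):
    """Hypothetical exception: no response arrived within the allowed time."""


def issue_io_request(send, timeout_s):
    """Issue an I/O request via `send` and wait up to `timeout_s` seconds.

    `send` is an assumed callable that blocks until the target node responds
    and then returns the response.
    """
    responses = queue.Queue(maxsize=1)
    # Run the request in a daemon thread so a node that never answers
    # cannot block the caller indefinitely.
    worker = threading.Thread(target=lambda: responses.put(send()), daemon=True)
    worker.start()
    try:
        return responses.get(timeout=timeout_s)
    except queue.Empty:
        # The request "timed out"; it is presumed not to have been processed.
        raise IOTimeout(f"no response within {timeout_s} s") from None
```

In this sketch the caller learns only that no response arrived in time. It cannot distinguish a severed cable from an overloaded target, which is why a recovery operation is needed to re-establish the path.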
When an exception condition occurs, the network performs a recovery or revalidation operation. By the terms “recovery” or “revalidation” is meant an operation re-establishing the path to a node such that the node is properly identified to the system and I/O requests may be processed. As would be understood by one of ordinary skill in the art, recovery is generally performed through a standard series of software commands. One common standard for recovery commands is specified by the American National Standards Institute (“ANSI”).
Typically, when an exception condition occurs, every node on the computer network is recovered. Furthermore, I/O functions are suspended during the recovery operations. In relatively small computer networks, recovering every node does not normally have a significant effect on network function. However, as computer networks become larger and more complex, recovering every node on the network becomes prohibitively time consuming and significantly degrades network function and efficiency.
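The conventional approach described above may be illustrated by the following minimal sketch, which assumes a simple tree of nodes. The `Node` class, the `recover_everything` function, and the `revalidate` callback are hypothetical names introduced for illustration; the actual recovery commands are not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical network node identified by a unique address."""
    address: str
    children: list["Node"] = field(default_factory=list)


def all_nodes(root):
    """Yield the root and every node beneath it, depth first."""
    yield root
    for child in root.children:
        yield from all_nodes(child)


def recover_everything(root, revalidate):
    """Conventional recovery: revalidate every node, wherever the fault lies.

    `revalidate` is an assumed callback standing in for the recovery
    commands issued to a single node. Returns the number of nodes visited.
    """
    count = 0
    for node in all_nodes(root):
        revalidate(node)
        count += 1
    return count
```

Because every node is visited regardless of where the exception occurred, the work performed in this sketch grows with the total number of nodes in the network, while I/O remains suspended for the duration.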
Thus, in accordance with the method and system described herein, the problems of the prior art, including the inefficient, wholesale recovery of network devices, processors, and other nodes, are avoided, and numerous advantages are provided.