1. Technical Field
This invention relates to a method and system for servicing a computer system. More specifically, the invention relates to capturing the state of a node in a distributed computer system in response to an event.
2. Description of the Prior Art
In a distributed computer system with shared persistent storage, one or more server nodes are in communication with one or more client nodes. FIG. 1 is a block diagram (10) illustrating one example of a distributed computer system. As shown, there are two server nodes (12) and (14), three client nodes (16), (18), and (20), and a storage area network (5) that includes one or more storage devices (not shown). Each of the client nodes (16), (18), and (20) may access an object or multiple objects stored on the file data space (27) of the storage area network (5), but may not access the metadata space (25). To open the contents of an existing file object on the storage media of a storage device in the storage area network (5), a client node contacts a server node to obtain metadata and locks. Metadata supplies the client node with information about a file, such as its attributes and location on the storage devices. Locks supply the client node with the privileges it needs to open a file and read or write data. The server node performs a look-up of metadata information for the requested file within the metadata space (25) of the storage area network (5). The server node (12) or (14) then communicates the granted lock information and file metadata, including the location of the data blocks making up the file, to the requesting client node. Once the client node holds a distributed lock and knows the data block location(s), the client node can access the data for the file directly from a shared storage device attached to the storage area network (5).
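The open sequence described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the implementation of any particular system: the class names (`StorageAreaNetwork`, `ServerNode`, `ClientNode`, `FileMetadata`), the exclusive single-holder lock, the file name, and the block numbers are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FileMetadata:
    attributes: dict
    block_locations: list  # block numbers within the file data space


@dataclass
class StorageAreaNetwork:
    # Metadata space: consulted only by server nodes in this sketch.
    metadata_space: dict = field(default_factory=dict)
    # File data space: read directly by client nodes once they hold a lock.
    file_data_space: dict = field(default_factory=dict)


class ServerNode:
    def __init__(self, san):
        self.san = san
        self.locks = {}  # file name -> id of the client holding the lock

    def open_file(self, client_id, name):
        """Look up metadata and grant a lock; raise if another client holds it."""
        holder = self.locks.get(name)
        if holder is not None and holder != client_id:
            raise PermissionError(f"{name} is locked by {holder}")
        metadata = self.san.metadata_space[name]
        self.locks[name] = client_id
        return metadata


class ClientNode:
    def __init__(self, client_id, server, san):
        self.client_id = client_id
        self.server = server
        self.san = san

    def read_file(self, name):
        # Step 1: obtain metadata and a lock from the server node.
        metadata = self.server.open_file(self.client_id, name)
        # Step 2: read the data blocks directly from shared storage.
        blocks = (self.san.file_data_space[b] for b in metadata.block_locations)
        return b"".join(blocks)


# Hypothetical example data: one file stored in two out-of-order blocks.
san = StorageAreaNetwork(
    metadata_space={"report.txt": FileMetadata({"size": 11}, [7, 3])},
    file_data_space={3: b"world", 7: b"hello "},
)
server = ServerNode(san)
client = ClientNode("client-16", server, san)
data = client.read_file("report.txt")
```

The point of the sketch is the division of labor: the server node is the only party that touches the metadata space, and the file data itself never passes through the server node.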
Distributed computer systems have complex messaging protocols that operate among server nodes and clients. Messages may be passed among the server nodes and clients for various purposes, including servicing techniques. When an error occurs in the operation of one of the server nodes and/or clients, isolating the problem is critical to identifying a solution that mitigates the problem and/or prevents it from recurring. Traditional Unix systems have the ability to capture a logical image of the system for analysis and to write a file associated with the logical image to disk prior to a shut-down of the system. However, such a solution is limited to a single node and is not extensible to a distributed computer system. Extending the single node solution to a distributed system becomes complex because of the messaging techniques used among the server nodes and/or clients.
One prior art solution, U.S. Patent Publication 2004/0010778 to Kaler et al., embeds debug controls along with distributed application data in messages that are utilized by distributed applications during normal operations. Kaler et al. uses in-band message protocols for communication in the distributed computer system, wherein message operations are transported across the system via routers and/or gateways. However, a limitation associated with embedding debug controls in in-band message protocols is that the client and/or server nodes in the system cannot differentiate the urgency of a message based upon the channel of communication. When a state of the system needs to be captured, urgent communication among the server nodes and/or clients in the distributed system is critical.
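The limitation described above can be illustrated with a toy model. The sketch below does not represent the implementation of Kaler et al. or of any real system; the function names, the message strings, and the rule that the separate channel is drained first are assumptions made purely for illustration.

```python
from collections import deque


def drain_in_band(messages):
    """Single shared channel: messages are handled strictly in arrival order."""
    queue = deque(messages)
    handled = []
    while queue:
        handled.append(queue.popleft())
    return handled


def drain_with_separate_channel(data_messages, service_messages):
    """Two channels: the receiver knows a message is urgent from the channel
    it arrived on, so it handles the service channel first."""
    handled = list(service_messages)
    handled.extend(data_messages)
    return handled


# In-band: the capture request waits behind ordinary traffic.
in_band = drain_in_band(["read A", "write B", "CAPTURE STATE", "read C"])

# Separate channel: the capture request is identified by its channel alone.
separate = drain_with_separate_channel(
    ["read A", "write B", "read C"], ["CAPTURE STATE"]
)
```

In the in-band case the receiver must dequeue and inspect each message before it can learn that one of them is urgent, whereas in the two-channel case the channel itself conveys the urgency.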
Therefore, there is a need for a new dedicated messaging technique in a distributed computer system that enables efficient communication among the server nodes and/or clients. In addition, there is a need to create a logical image of a distributed computer system at the time an error occurs so that the image can be studied to determine the cause of the error.