A file server is a computer that provides file service relating to the organization of information on storage devices, such as disks. The file server or filer may be embodied as a storage system including a storage operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on the disks. Each “on-disk” file may be implemented as a set of disk blocks configured to store information, such as text, whereas each directory may be implemented as a specially-formatted file in which information about other files and directories is stored.
As used herein, the term storage operating system generally refers to the computer-executable code operable on a storage system that manages data access and client access requests and may implement file system semantics in implementations involving filers. In this sense, the Data ONTAP™ storage operating system, available from Network Appliance, Inc. of Sunnyvale, Calif., which implements a Write Anywhere File Layout (WAFL™) file system, is an example of such a storage operating system implemented as a microkernel within an overall protocol stack and associated disk storage. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications or debugging/servicing applications as described herein.
A filer's disk storage is typically implemented as one or more storage volumes that comprise physical storage disks, defining an overall logical arrangement of storage space. Available filer implementations can serve a large number of discrete volumes (150 or more, for example). A storage volume is “loaded” in the filer by copying the logical organization of the volume's files, data and directories into the filer's memory. Once a volume has been loaded in memory, the volume may be “mounted” by one or more users, applications, devices, etc. permitted to access its contents and navigate its namespace. As used herein, a volume is said to be “in use” when it is loaded in a filer's memory and at least one user, application, etc. has mounted the volume and modified its contents.
A filer may be configured to operate according to a client/server model of information delivery to thereby allow many clients to access files stored on a server, e.g., the filer. In this model, the client may comprise an application, such as a file-system protocol, executing on a computer that “connects” to the filer over a computer network, such as a point-to-point link, shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. Communications between the filer and its clients are typically embodied as packets sent over the computer network. Each client may request the services of the filer by issuing file-system protocol messages formatted in accordance with a conventional file-system protocol, such as the Common Internet File System (CIFS) or Network File System (NFS) protocol.
When the filer runs the storage operating system or other applications, a problem/error in the execution of programming code of the storage operating system or the other applications may occur at any given time. Of particular interest are three types of problem situations that may occur while executing applications on the filer: 1) when an unrecoverable exception occurs causing a “crash” and reboot of the filer, 2) when the performance of the filer is significantly reduced, and 3) when the filer is “wedged” and cannot receive and perform administrative commands. In the first situation, an exception in the operating system has occurred that renders the filer inoperative and the filer must be rebooted (the filer restarted and the operating system reloaded). In the second situation, the filer read/write operations for client requests have significantly slowed due to some operating system error (e.g., a memory leak where the filer's memory resources are being allocated but are not being released after usage). In the third situation, the filer is “wedged” when it may or may not be able to perform read/write operations, but cannot receive and perform administrative commands (command line instructions), i.e., is apparently unresponsive to all or certain types of commands or operations, in particular administrative commands.
In all three problem situations, a reboot of the filer can be performed. In the first situation, a reboot is automatically performed. In the second and third situations, a manual reboot can be performed (e.g., by issuing an administrative command or pressing the filer's reset button). During the reboot, the filer typically performs a reboot/shut-down procedure that includes a corefile routine that generates a corefile (core dump) that is stored to the filer's memory. The corefile comprises a static image of the memory content/data and state of the filer at the time the corefile routine is performed. The corefile can then be analyzed by a debugging program (debugger) operated by a programmer to determine the problem/error that occurred during the execution of the operating system or other application and to help develop programming code that will avoid the problem/error in the future.
The corefile routine creates a corefile comprising a corefile header and data that is copied from the filer's memory (referred to herein as “filer memory data”). The corefile header comprises corefile metadata (data describing the corefile and the filer) and a set of memory range descriptors that provide an address mapping table between the filer's memory addresses and the corefile addresses. Typically, only particular areas of the filer memory data are copied to the corefile. These areas are generally those areas of filer memory data that are accessible and important for debugging purposes.
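The address mapping provided by the memory range descriptors can be illustrated with a minimal sketch. The descriptor layout is not specified here, so the `MemoryRange` fields and the `memory_to_core_offset` function below are hypothetical names chosen for illustration, assuming each descriptor records a starting filer memory address, a length in bytes, and the corefile offset at which the copied data begins.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MemoryRange:
    """Hypothetical memory range descriptor (field names are assumptions)."""
    mem_start: int    # first filer memory address covered by this range
    length: int       # number of bytes copied into the corefile
    core_offset: int  # corefile offset where this range's data begins


def memory_to_core_offset(ranges: List[MemoryRange], address: int) -> Optional[int]:
    """Translate a filer memory address into a corefile offset.

    `ranges` must be sorted by `mem_start` and must not overlap.
    Returns None when the address was not copied into the corefile.
    """
    starts = [r.mem_start for r in ranges]
    # Index of the last range whose start is <= address.
    i = bisect_right(starts, address) - 1
    if i < 0:
        return None
    r = ranges[i]
    if address >= r.mem_start + r.length:
        return None  # address falls in a gap between copied ranges
    return r.core_offset + (address - r.mem_start)
```

Because only particular areas of filer memory are copied, a lookup in this sketch can legitimately fail; a debugger using such a table would have to treat a `None` result as "address not present in the corefile" rather than as an error in the table itself.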
Typically, the debugger resides on a client administering computer that is remote from the filer and receives the corefile through a network that connects the administering computer and filer. However, corefiles can be very large in size (due to the copied filer memory data) and require long upload times to the administering computer. As such, a Core Daemon Protocol (CDP) can be used to allow remote analysis of the corefile by the debugger without requiring uploading of the corefile to the administering computer where the debugger resides.
The Core Daemon Protocol is a simple file-access protocol specific to corefiles that allows for the remote reading and retrieving of parts of the corefile. As known in the art, the Core Daemon Protocol specifies a set of rules (e.g., data packet format, sequence of events to occur, etc.) for communication between the administering computer (that runs the debugger) and the filer for remotely accessing the corefile stored on the filer. The Core Daemon Protocol provides simple operations such as Open File Operation (Open(“corefile name”)), Read L bytes at address/offset A (Read(Offset, Length)), and Close File Operation (Close) to allow remote open, read, and close file operations to a specified file (“corefile name”). The corefile routine typically stores the corefile to a predetermined path/location on the filer (e.g., the filer memory's root directory) with a predetermined filename. The debugger/programmer will have knowledge of the predetermined path/location and filename so the debugger/programmer can locate and access the corefile.
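The three operations named above can be sketched as follows. This is only an illustration of the Open, Read(Offset, Length), and Close semantics against a local file; the actual packet format and event sequence of the Core Daemon Protocol are not given here, and the `CorefileSession` class and its method names are hypothetical.

```python
class CorefileSession:
    """Hypothetical server-side handler for Open / Read / Close requests."""

    def __init__(self) -> None:
        self._fh = None  # file handle of the currently open corefile, if any

    def open(self, corefile_name: str) -> None:
        """Open(corefile name): open the named corefile for reading."""
        if self._fh is not None:
            raise RuntimeError("a corefile is already open")
        self._fh = open(corefile_name, "rb")

    def read(self, offset: int, length: int) -> bytes:
        """Read(Offset, Length): return up to `length` bytes at `offset`.

        Fewer bytes are returned if the request runs past the end of the file.
        """
        if self._fh is None:
            raise RuntimeError("no corefile is open")
        if offset < 0 or length < 0:
            raise ValueError("offset and length must be non-negative")
        self._fh.seek(offset)
        return self._fh.read(length)

    def close(self) -> None:
        """Close: release the corefile; closing twice is harmless."""
        if self._fh is not None:
            self._fh.close()
            self._fh = None
```

In this sketch each read is independent and addressed by offset, which is what lets a debugger fetch only the parts of the corefile it needs instead of transferring the whole file.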
A core daemon (remote core analysis daemon) is a program used to implement the Core Daemon Protocol. The core daemon program works in conjunction with the debugger and accesses the corefile and responds to requests from the debugger in accordance with the Core Daemon Protocol. In particular, the debugger submits requests for data in the corefile to the core daemon program which retrieves the data from the corefile and sends the data to the debugger. As such, the debugger can receive and analyze data of the corefile without requiring uploading of the entire corefile to the remote administering computer. The core daemon program typically resides on a support console computer connected to the filer. The support console typically runs only specialized administrative applications (such as the core daemon program) for administering and servicing the filer. The support console is configured to access the filer's file system and files stored on the filer.
The two above-mentioned methods for debugging the filer's operating system or other applications (i.e., uploading the corefile to the administering computer or performing remote corefile analysis) are considered offline or non-live debugging since the generated corefile is a static image of the filer's memory at a previous time. Another method for debugging the filer is debugging the current condition of the filer as it is running (i.e., live debugging). Typically, in live debugging, a “crash” type error has not occurred, but rather, the filer's read/write operations have slowed significantly or the filer has “wedged.” In live debugging, a debugger is operated by a programmer on a remote administering computer. The debugger gains control of the filer's operating system and “lock steps” the operations of the filer (i.e., does not allow the filer to perform other operations while waiting for commands from the programmer). The filer is unlocked to perform other operations only when the programmer allows the filer to do so. In this way, while the programmer may be pondering a data result, and does not unlock the filer, the filer cannot perform other operations.
The current methods for debugging the filer have several disadvantages. The non-live debugging methods that require a corefile to be generated are disruptive in that they require a reboot procedure that disables all filer processes and shuts down the filer for a particular down-time period (until the filer is restarted and able to again perform operations). In addition, the creation of the corefile can significantly increase the down-time period as the corefile routine copies and stores large amounts of filer memory data to the corefile. Although there are several methods for reducing the size of the corefile, the corefile is still relatively large in size (e.g., typically around 6-12 GB for a 30 GB filer memory). Live debugging is also disruptive in that the operations of the filer are shut down by the debugger while waiting for commands from the programmer. Only when the filer is unlocked by the programmer can the filer perform read/write operations. As such, there is a need for a less-disruptive method for debugging a computer system, such as a filer.