In many types of computer networks, it is desirable to be able to perform certain management-related functions (e.g., configuration, diagnostics, debugging, and software upgrades) on a computer or other form of processing system from a remote location. One important management function is troubleshooting the processing system to prevent errors and/or to fix errors that have occurred.
One particular application in which this capability is desirable is a storage-oriented network, i.e., a network that includes one or more storage servers that store and retrieve data on behalf of one or more storage clients. A storage server runs an operating system that is susceptible to a number of fatal errors from which it cannot safely recover. One common error is a memory violation, in which the operating system attempts to read an invalid or non-permitted memory address. Hardware failures or other software failures may also occur. When the operating system detects an internal fatal error, it may initiate an action known as a kernel panic.
During a kernel panic, a snapshot of the system's memory may be dumped (a core dump) into a core file. A core file is a diagnostic aid used by support engineers to help diagnose and fix system problems. A core file is usually uploaded manually to a system support center (e.g., a support enterprise). However, core files may be large (e.g., 12 GB), and because they are typically transferred using secure methods, it often takes a significant amount of time (e.g., up to two days) to upload a complete core file to the support center. If a problem arises during a core upload that stops the transmission of the core file, the process may need to be restarted from the beginning, which adds to the overall transfer time.
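To put the example figures above in perspective, the following minimal sketch computes the sustained throughput implied by moving a 12 GB core file in two days. It assumes that 1 GB equals 10^9 bytes and that the transfer runs continuously for 48 hours; the function name `effective_throughput_mbps` is hypothetical and is used here only for illustration.

```python
GB = 10**9  # assumption: decimal gigabytes (10^9 bytes)


def effective_throughput_mbps(size_bytes, duration_seconds):
    """Return the average rate, in megabits per second, needed to move
    size_bytes in duration_seconds."""
    return size_bytes * 8 / duration_seconds / 1e6


# Example figures from the text: a 12 GB core file uploaded over two days.
rate = effective_throughput_mbps(12 * GB, 2 * 24 * 3600)
print(f"{rate:.2f} Mbit/s")  # prints "0.56 Mbit/s"
```

Under these assumptions, the upload averages well under one megabit per second, which suggests that the end-to-end rate, rather than the file size alone, accounts for the long transfer time.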
Getting a core file to a support center as quickly as possible, so that support engineers can begin diagnosing and fixing a problem, is often extremely important for continued operations. Traditionally, upon receiving notice of a panic, the support center must contact the customer at whose site the storage server is located and request that the customer retrieve and upload a core file from the storage server. This manual process introduces a significant delay, because a customer contact must be engaged before the upload can begin.
Additionally, the manual process requires the customer to upload the core file via FTP, HTTP, or HTTPS, which are simple protocols that do not offer resiliency. In the event that the customer loses connectivity, the core file upload must be manually reinitiated, with no guarantee that it can be resumed from the last uploaded portion of the core file. The large size of core files increases the probability that a transfer will be interrupted.
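The cost of restarting an interrupted upload, as opposed to resuming it, can be illustrated with a minimal sketch. The model below is hypothetical and does not describe the behavior of any particular protocol or product: it assumes that the connection drops at given byte offsets, that a non-resumable transfer restarts at offset zero, and that a resumable transfer restarts at the last fully acknowledged chunk. The names `total_bytes_sent`, `drop_offsets`, and `chunk_size`, as well as the drop points used in the example, are illustrative assumptions.

```python
GB = 10**9  # assumption: decimal gigabytes (10^9 bytes)


def total_bytes_sent(file_size, drop_offsets, chunk_size=None):
    """Return the total bytes transmitted to deliver a file of file_size bytes.

    drop_offsets lists the file offsets at which successive attempts fail.
    If chunk_size is None, every new attempt restarts from offset zero.
    Otherwise, a new attempt resumes from the last complete chunk boundary.
    """
    sent = 0
    start = 0  # offset at which the current attempt begins
    for drop in drop_offsets:
        if not start <= drop < file_size:
            raise ValueError("drop offset outside the current attempt")
        sent += drop - start  # bytes sent before the connection was lost
        if chunk_size is None:
            start = 0  # non-resumable: begin again from the first byte
        else:
            start = (drop // chunk_size) * chunk_size  # resume at chunk boundary
    sent += file_size - start  # final attempt runs to completion
    return sent


# Hypothetical scenario: a 12 GB core file, with drops at 6 GB and 9 GB.
size = 12 * GB
drops = [6 * GB, 9 * GB]
restart_total = total_bytes_sent(size, drops)                 # 27 GB in total
resume_total = total_bytes_sent(size, drops, chunk_size=GB)   # 12 GB in total
print(restart_total // GB, resume_total // GB)  # prints "27 12"
```

In this scenario, two interruptions more than double the data that a non-resumable transfer must send, whereas a resumable transfer sends each byte only once when the drops fall on chunk boundaries.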