1. Technical Field
The present application relates generally to an improved data processing system and method. More specifically, the present application is directed to a system and method for in-band problem log data collection between a host system and a storage system.
2. Description of Related Art
A storage area network (SAN) is a network of storage disks. In typical large enterprise implementations, a SAN connects multiple servers to a centralized pool of storage systems. Compared to managing hundreds of servers, each with its own disks, a SAN simplifies system administration. Because all of a company's storage is treated as a single resource, disk maintenance and routine backups are easier to schedule and control. SANs, as well as other network data processing systems, include multiple layers, such as applications, database applications, file systems, host server systems, network infrastructure, and storage systems.
In modern SAN environments, a failure can be very difficult to debug because such environments tend to be very complex. Typically, the component in the SAN environment that detects the failure and collects data to determine the root cause of the problem is a different component from the one that experiences the failure. As a result, the problem leading to the failure is often solved only after the failure is detected, and often by utilizing external instrumentation and customization on the components of the SAN environment. This usually requires a “retest” of the circumstances that led to the detected failure, i.e., recreating the conditions in the SAN environment that led to the failure so that data collection may be performed to determine the cause of the failure. This recreation may involve scanning log data structures for information relevant to a determined point-in-time of the failure and then correlating the information from the various logs in an attempt to reconstruct, after the fact, the state of the system at the time of the failure.
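Purely by way of illustration, the after-the-fact scanning and correlation of log data described above may be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular SAN product: the function name `correlate_logs`, the in-memory log layout, the component names, and the sample entries are all hypothetical.

```python
from datetime import datetime, timedelta


def correlate_logs(logs, failure_time, window):
    """Collect entries from every component log that fall within
    +/- `window` of `failure_time`, merged into one time-ordered list.

    `logs` maps a component name to a list of (timestamp, message) pairs.
    This layout is an assumption made for the sketch.
    """
    start, end = failure_time - window, failure_time + window
    matched = [
        (ts, component, message)
        for component, entries in logs.items()
        for ts, message in entries
        if start <= ts <= end
    ]
    # Sorting the tuples orders the merged entries by timestamp first.
    return sorted(matched)


# Hypothetical sample data: three components, one determined failure time.
t0 = datetime(2024, 1, 1, 12, 0, 0)
logs = {
    "host": [
        (t0 - timedelta(seconds=30), "I/O request issued"),
        (t0 + timedelta(seconds=2), "I/O timeout"),
    ],
    "switch": [(t0 - timedelta(seconds=1), "port link reset")],
    "storage": [(t0 - timedelta(minutes=10), "routine scrub started")],
}

timeline = correlate_logs(logs, failure_time=t0, window=timedelta(seconds=5))
for ts, component, message in timeline:
    print(ts.isoformat(), component, message)
```

Such a reconstruction can only be as complete as the entries that each component happened to retain, and it further assumes that the clocks of the components are comparable.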
Because failure detection and data collection are typically performed by a component separate from the one that experiences the failure, data collection from multiple components in the complex SAN environment can often miss critical information that may aid in debugging the failure, due to latency in communications and activation through slower interfaces. Thus, data from remote host systems, switches, or other storage devices at various layers of the SAN environment is frequently unavailable or is collected long after the error condition leading to the failure has passed. For example, some components of the SAN environment, e.g., host bus adapter (HBA) buffers, store very small amounts of data, e.g., only a few frames at a time. Such data may be quickly overwritten and lost when, due to latencies in the mechanisms used to detect failures and collect data, the failure is not detected and data collection is not performed until a significant amount of time after the failure occurs. As a result, and also because of the limited logging capability of known SAN environments, some information is lost in the process.
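The loss of data from a small buffer under delayed collection may be illustrated with the following minimal sketch. It models a buffer that holds only a few frames and discards the oldest frame when full. The class `FrameBuffer`, the function `frames_at_collection`, the capacity of four frames, and the frame counts are hypothetical values chosen for illustration and do not describe the behavior of any actual HBA.

```python
from collections import deque


class FrameBuffer:
    """A fixed-capacity buffer that overwrites its oldest frame when full."""

    def __init__(self, capacity):
        # deque with maxlen drops the oldest item on append once full.
        self._frames = deque(maxlen=capacity)

    def record(self, frame):
        self._frames.append(frame)

    def snapshot(self):
        return list(self._frames)


def frames_at_collection(capacity, failure_index, delay_frames):
    """Return the buffer contents when collection runs `delay_frames`
    frames after the frame numbered `failure_index` was recorded."""
    buf = FrameBuffer(capacity)
    for i in range(failure_index + delay_frames + 1):
        buf.record(i)
    return buf.snapshot()


# Hypothetical scenario: the failure coincides with frame 10.
immediate = frames_at_collection(capacity=4, failure_index=10, delay_frames=0)
late = frames_at_collection(capacity=4, failure_index=10, delay_frames=50)

print("collected at once:", immediate, "failure frame kept:", 10 in immediate)
print("collected late:   ", late, "failure frame kept:", 10 in late)
```

In this sketch the frame associated with the failure is still present when collection occurs immediately, but has been overwritten by the time the delayed collection runs.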