1. Technical Field
This invention relates to error tracing, and particularly to error tracing in environments having virtualization layers between host applications and devices.
2. Description of the Related Art
The problem of fault detection and isolation (tracking down a problem in a complex system to its root cause) is a very significant one. In some environments, there is simply a lack of any error reporting information, but in many enterprise-class environments, much effort is invested in raising and logging detected faults. In fault-tolerant systems, such information is critical to ensuring continued fault tolerance. In the absence of effective fault detection and repair mechanisms, a fault-tolerant system will simply mask a problem until a further fault causes failure.
When a problem does arise, its impact is frequently hard to predict. For instance, in a storage controller subsystem, there are many components in the path or “stack” from disk drive to host application. It is difficult to relate actual detected and logged errors to the effect seen by an application or a user host system.
When many errors occur at the same time, it is particularly difficult to determine which of those errors led to a particular application failing. The brute-force solution of fixing all reported errors might work, but a priority-based scheme, which first fixes those errors that impacted the application most important to the business, would be more cost-efficient and would be of significant value to a system user.
Any lack of traceability also reduces the confidence that the right error has been fixed to solve any particular problem encountered by the user or the application.
Today's systems, with Redundant Array of Inexpensive Disks (RAID) arrays, advanced functions such as Flash Copy, and caches, already add considerable confusion to a top-down analysis (tracing a fault from the application down to a component in the system). It takes significant time and knowledge to select the root-cause error that has caused the fault.
With the introduction of virtualization layers in many systems, the problem is growing. Not only does virtualization add another layer of indirection, but many virtualization schemes allow dynamic movement of data in the underlying real subsystems, making it even more difficult to perform accurate fault tracing.
It is known, for example, from the teaching of U.S. Pat. No. 5,974,544, to maintain logical defect lists at the RAID controller level in storage systems using redundant arrays of inexpensive disks. However, systems that use a plurality of such arrays together with other peripheral devices, especially when they form part of a storage area network (SAN), introduce layers of software having features such as virtualization, and these layers make it more difficult to trace errors from their external manifestations to their root causes.
There is thus a need for a method, system or computer program that will alleviate this problem, and it is to be preferred that the problem is alleviated at the least cost to the customer in money, in processing resource and in time.