Typical server systems need to be connected to a variety of adapters to provide connectivity and high speed access to data. One prevalent means of doing so is through a local bus, such as the PCI (Peripheral Component Interconnect) bus. The PCI bus supports multiple adapters on a single bus and provides high data transfer capabilities. Example adapters could provide network connectivity or high speed access to data: one adapter might provide Fibre Channel connectivity, while another might provide (IBM) ESCON connectivity. These connections might in turn provide connectivity to DASD storage, network switches or other devices known in the art. Connecting these adapters through a local bus (such as PCI in this example) permits flexibility in configuring a system, since a variety of adapter functions can be chosen to best fit the system requirements. These adapters are normally pluggable modules or cards, but some could be "hard wired" into the system or could be cable connected, as is known in the art.
Consider a configuration where multiple adapters providing network connectivity or high speed access to data (storage attachments such as Fibre Channel, ESCON and so on) are connected through the PCI bus to a processor with its local memory. These PCI-based adapters could be off-the-shelf firmware created by vendors for the mass market, or they could be custom built for the firmware configuration at hand.
Although PCI-based components are designed to conform to a standard, there are numerous occasions when incompatibilities between the different components in the system, or bugs in the code, can cause the PCI bus to hang. When this happens, data transfers across the bus come to a halt, since the bus is now inaccessible to all agents. Moreover, the only reliable way of resuming operation is to assert the PCI RST# (reset) signal, which has the undesirable side effect of clearing all the trapped error information in the PCI agents. Thus, there is little information to aid debugging, and little reliability can be achieved because there is no proper recovery from such a situation.
Typically, the processor program (software) partitions the memory map of the PCI system and allocates memory to each unit or adapter on the bus. Thus, the local data store (LDS) of the processor complex could have different regions. One region is a user area which is accessible to all units. A second region is for the exclusive use of a first unit. The memory map might include any combination of these regions. One important exclusive-use area in this LDS would be memory reserved for the exclusive use of the processor complex unit. This would be the area where the processor stores code to execute, possibly including code used to recover from an error condition.
There is a need to protect certain areas of memory so that only adapters/units on the bus that have been defined to have access authorization to the protected area are able to write to those areas. Without this, an errant adapter (through a microcode or hardware failure) could overwrite and corrupt key areas of memory, thus exposing the system to unpredictable results. For example, an errant LAN adapter on the bus could overwrite certain recovery routines in the area of memory where the processor complex has stored such routines. Subsequent recovery action in the system would then fail and produce unexpected results.
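The partitioned memory map and the write-authorization requirement described above can be illustrated with a minimal software sketch. It is only a model of the idea, not the mechanism of any particular system: the region names, address ranges, unit names and the `write_allowed` helper are all hypothetical, and the regions are assumed not to overlap.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Region:
    """One partition of the local data store (hypothetical layout)."""
    name: str
    start: int                          # first address of the region (inclusive)
    end: int                            # end address of the region (exclusive)
    writers: Optional[FrozenSet[str]]   # None means any unit may write


# Assumed example map: a shared user area, an area owned by one adapter,
# and an area reserved for the processor complex (code, recovery routines).
MEMORY_MAP: List[Region] = [
    Region("shared", 0x0000, 0x4000, None),
    Region("adapter_a_private", 0x4000, 0x6000, frozenset({"adapter_a"})),
    Region("processor_code", 0x6000, 0x8000, frozenset({"processor"})),
]


def write_allowed(unit: str, addr: int, length: int = 1,
                  memory_map: List[Region] = MEMORY_MAP) -> bool:
    """Return True only if every byte of the write falls in a mapped
    region that the given unit is authorized to write."""
    if length <= 0:
        return False
    end = addr + length
    covered = 0
    for region in memory_map:
        # Overlap between the write range and this region.
        lo = max(addr, region.start)
        hi = min(end, region.end)
        if lo >= hi:
            continue
        if region.writers is not None and unit not in region.writers:
            return False  # touches a region this unit may not write
        covered += hi - lo
    # Reject writes that fall partly or wholly outside the mapped regions.
    return covered == length
```

With this assumed map, an errant `lan_adapter` write to address `0x6100` (inside the processor's reserved area) is rejected, while the same unit may still write to the shared area at `0x1000`.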
As explained in the example above, and in some other instances, the configuration of a PCI address map contains regions of memory deemed protected, whose access is restricted to the processor alone. An errant access to this protected region of memory, due to either a hardware or a code problem, results in an aborted access on the PCI bus, which manifests itself as either a PCI Target Abort or a PCI Master Abort. Several other scenarios cause a similar response on the PCI bus, making it difficult to tell one scenario from another. Fault isolation under such conditions becomes extremely difficult, time consuming and costly, and may reduce customer satisfaction. This problem becomes even more serious with off-the-shelf components, be it on PCI or any other bus. The main reason is that much of the off-the-shelf hardware used by most manufacturers does not afford the flexibility to isolate errors with any appreciable granularity. Because detailed information about the error is not available, this results in generic error or FRU (field replaceable unit) calls, including replacement of complete I/O subsystems. Therefore, a new system is needed that can isolate errors and provide memory protection. The new system needs to eliminate or substantially reduce debug time by pinpointing the kind of error, and to make FRU calls easier to handle by pinpointing the source of the error.
The present application is being filed on the same day as related U.S. application Ser. No. 09/301,948 titled "Method and Apparatus for Bus Hang Detection and Identification of Errant Agent for Fail Safe Access to Trapped Error Information."