This invention relates generally to providing reliability and data integrity in computer systems with shared memory, and more particularly to bus protocols in which information is transferred over multiple cycles with idle cycles permitted.
Typical server systems need to be connected to a variety of adapters to provide connectivity and high speed access to data. One prevalent means of doing so is through a local bus, such as the PCI (Peripheral Component Interconnect) bus. The PCI bus supports multiple adapters on a single bus, and provides high data transfer capabilities. Example adapters could provide network connectivity or high speed access to data. One adapter might provide Fibre Channel connectivity; another might provide (IBM) ESCON connectivity. These connections might provide connectivity to DASD storage, network switches or other devices known in the art. The local bus (PCI in this example) connection of these adapters permits flexibility in configuring a system, since a variety of adapter functions can be chosen that best fit the system requirement. These adapters are normally pluggable modules or cards, but some could be "hard wired" into the system or could be cable connected, as is known in the art.
Consider a configuration where multiple adapters providing network connectivity or high speed access to data (storage attachments like Fibre Channel, ESCON and so on) are connected to a processor, with its local memory, through the PCI bus. These PCI-based adapters could be off-the-shelf firmware created by vendors for the mass market, or they could be custom built for the firmware configuration at hand.
Although PCI-based components are designed to conform to a standard, there are numerous occasions when incompatibilities between different components in the system, or bugs in the code, can cause the PCI bus to hang. When this happens, data transfers across the bus come to a halt, since the bus is now inaccessible to all agents. Moreover, the only reliable way of resuming operation is to assert the PCI RST# (reset) signal, which also has the undesirable effect of resetting all the trapped error information in the PCI agents. Thus, there is little information to aid in debug efforts, and little reliability can be achieved because there is no proper recovery from such a situation.
Typically, the processor program (software) partitions the memory map of the PCI system and allocates memory to each unit or adapter on the bus. Thus, the local data store (LDS) of the processor complex could have different regions. One region is a user area which is accessible to all units. A second region is for the exclusive use of a first unit. The memory map might include any combination of these regions. One important exclusive-use area in this LDS would be memory reserved for the exclusive use of the processor complex unit. This would be the area where the processor stores code to execute, possibly including code used to recover from an error condition.
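By way of illustration only, the partitioning described above might be modeled in software as follows. This is a minimal sketch, not the layout of any actual system: the region names, address ranges, unit names and the helper functions `find_region` and `may_access` are all hypothetical and are chosen solely to show a user area open to all units alongside exclusive-use areas.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Region:
    """One region of the local data store (LDS) memory map."""
    name: str
    start: int                         # first address in the region (inclusive)
    end: int                           # one past the last address (exclusive)
    owners: Optional[FrozenSet[str]]   # None means every unit may access

    def contains(self, addr: int) -> bool:
        return self.start <= addr < self.end

    def allows(self, unit: str) -> bool:
        return self.owners is None or unit in self.owners


# Hypothetical layout: addresses and unit names are assumptions.
LDS_MAP: List[Region] = [
    Region("processor_code", 0x0000, 0x4000, frozenset({"processor"})),
    Region("adapter1_private", 0x4000, 0x6000, frozenset({"adapter1"})),
    Region("user_area", 0x6000, 0x10000, None),
]


def find_region(addr: int) -> Optional[Region]:
    """Return the region containing addr, or None if addr is unmapped."""
    for region in LDS_MAP:
        if region.contains(addr):
            return region
    return None


def may_access(unit: str, addr: int) -> bool:
    """True if addr is mapped and the region admits the given unit."""
    region = find_region(addr)
    return region is not None and region.allows(unit)
```

Under these assumed values, `may_access("processor", 0x0100)` is true while `may_access("adapter1", 0x0100)` is false, reflecting an area reserved for the processor complex; any unit may access an address in the user area.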
There is a need to protect certain areas of memory so that only adapters/units on the bus that have been defined to have access authorization to the protected area are able to write to those areas. Without this protection, an errant adapter (through a microcode or hardware failure) could overwrite and corrupt key areas of memory, exposing the system to unpredictable results. For example, an errant LAN adapter on the bus could overwrite certain recovery routines in the area of memory where the processor complex has stored such routines. Subsequent recovery action in the system would fail and produce unexpected results.
As explained in the example above, and in some other instances, the configuration of a PCI address map contains regions of memory deemed protected, whose access is restricted to the processor alone. An errant access to this protected region of memory, whether due to a hardware or a code problem, results in an aborted access on the PCI bus, which manifests itself either as a PCI Target Abort or a PCI Master Abort. There are several other scenarios which cause a similar response on the PCI bus, thus making it difficult to tell one scenario from the other. Fault isolation under such conditions becomes extremely difficult, time consuming and costly, and may affect customer satisfaction. This problem becomes even more serious when off-the-shelf components are used, be it on PCI or any other bus. The main reason is that much of the off-the-shelf hardware commonly used by manufacturers does not afford the flexibility to isolate errors with any appreciable granularity. This results in generic error or FRU calls, including replacement of complete I/O subsystems, since detailed information about the error is not available. Therefore, a new system is needed that can isolate errors and provide memory protection. The new system needs to eliminate or substantially reduce debug time by pinpointing the kind of error, and to make FRU calls easier to handle by pinpointing the source of the error.
The present application is being filed on the same day as a related application, attorney docket PO9-99-002, titled "SYSTEM AND METHOD FOR SELECTIVELY RESTRICTING ACCESS TO MEMORY FOR BUS ATTACHED UNIT IDs."
It is an object of the present invention to describe a new, unique means of identifying a hung PCI bus.
It is another object of the present invention to trap relevant and sufficient error information in case of a hung bus.
It is yet another object of the present invention to provide fail-safe access to the processor to obtain status information, so that appropriate action can be taken when an error is detected, including, but not limited to, a FRU (Field Replaceable Unit) call, which would identify errant firmware to the IS operator for remedial action.
A method and apparatus are provided for detecting a bus hang, with identification and capture of errors, in a network computing environment having at least one bus. A first and a second unit are in processing communication with one another in the environment, and both units are capable of transferring data between one another. A status circuit is provided for monitoring the first and second units, together with a counting circuit that measures periods of bus inactivity during active bus transfer sequences. A compare circuit is in processing communication with the first and second units for comparing the measured counts with a threshold count provided by a threshold value circuit. Finally, an error detector mechanism that is responsive to the threshold circuit is provided, capable of detecting a bus hang condition, where the detector asserts an error indication when appropriate.
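By way of illustration only, the cooperation of the counting circuit, compare circuit and error detector summarized above can be modeled in software as follows. This is a behavioral sketch under stated assumptions, not the disclosed hardware: the class name `BusHangDetector`, the per-cycle inputs `transfer_active` and `data_moved`, the reset of the count whenever data moves, and the latching of the error indication are all hypothetical choices made for the example.

```python
class BusHangDetector:
    """Behavioral model: count idle cycles inside an active transfer."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold   # value supplied by the threshold circuit
        self.idle_count = 0          # state of the counting circuit
        self.error = False           # error indication (latched, by assumption)

    def clock(self, transfer_active: bool, data_moved: bool) -> bool:
        """Advance one bus cycle and return the error indication."""
        if self.error:
            return True              # stay asserted until the model is rebuilt
        if transfer_active and not data_moved:
            self.idle_count += 1     # inactivity during an active transfer
        else:
            self.idle_count = 0      # progress was made, or the bus is not in use
        if self.idle_count >= self.threshold:   # compare circuit
            self.error = True
        return self.error


# Usage: with a threshold of 3, two idle cycles followed by a data
# cycle do not trip the detector, but three consecutive idle cycles do.
detector = BusHangDetector(threshold=3)
pattern = [(True, True), (True, False), (True, False), (True, True),
           (True, False), (True, False), (True, False)]
results = [detector.clock(active, moved) for active, moved in pattern]
```

In this model only idle cycles that occur while a transfer is in progress are counted, so a bus that is simply unused never raises the indication; the threshold would in practice be chosen to exceed the longest legitimate run of idle cycles permitted by the bus protocol.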