1. Technical Field:
The present invention relates in general to a checkstop architecture for analyzing and debugging errors or failures of systems or sub-systems, and in particular to a hierarchical JTAG-based checkstop architecture for analyzing and debugging errors or failures in computer systems.
2. Description of the Related Art:
Analyzing and debugging errors and failures is often difficult in large, complex computer systems such as International Business Machines' (IBM's) RS6000 workstation. Such systems are so widely distributed, with numerous key chips, components, and sub-systems, that a failure or error that has occurred in one chip, component, or sub-system is not recognized by the other chips, components, or sub-systems in the computer system. Oftentimes, the entire computer system is not promptly halted when such a failure or error has occurred. Thus, the computer system continues to operate and execute even though an error or failure has occurred in at least one of the chips, components, or sub-systems. Also, such computer systems do not provide an easy way of identifying, locating, and debugging the errors or failures that have occurred, or the source of those errors or failures. Furthermore, such computer systems do not provide a way of preserving the state of the system at the time of failure or error, so a complete and accurate state of the entire computer system at that time is not available.
All key chips in such complex computer systems (e.g., the RS6000 workstation) include bi-directional checkstop logic. A checkstop is a fatal error that must be handled as quickly as possible. An example of such a fatal error is a parity error: a processor may detect a non-correctable parity error in a cache memory. Because a parity error has occurred, a checkstop is triggered so that the error can be handled immediately. Other IBM systems have used checkstop to freeze the state of each processor in a multiprocessor system. However, a checkstop architecture has not been used for an entire and overall computer system, particularly a complex computer system. Also, a checkstop tree architecture for an entire and overall computer system does not exist wherein the checkstop tree is able to be walked and used to efficiently isolate and identify an error or failure and its location.
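The parity example above can be illustrated with a minimal sketch. It assumes even parity over a single data word and models the checkstop as a Python exception; the names `even_parity_bit`, `Checkstop`, and `read_cache_line` are hypothetical and are not taken from any actual checkstop logic.

```python
def even_parity_bit(word: int) -> int:
    """Return the parity bit that makes the total count of 1 bits even."""
    return bin(word).count("1") % 2


class Checkstop(Exception):
    """Hypothetical stand-in for a fatal error that must halt the system."""


def read_cache_line(data: int, stored_parity: int) -> int:
    """Return the data if its parity matches the stored bit, else raise a checkstop."""
    if even_parity_bit(data) != stored_parity:
        # The error is treated as non-correctable: signal it immediately
        # rather than continuing to operate on bad data.
        raise Checkstop(f"parity error on data 0x{data:02x}")
    return data
```

In this sketch a single flipped bit changes the computed parity, so the read fails at once instead of returning corrupted data to the caller.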
Additionally, Joint Test Action Group (JTAG) architectures and features on chips are well known in the art. JTAG is separate and distinct from checkstops. The JTAG architectures and features provide accessibility to error registers on each chip. Access to these error registers allows for the implementation of various error/failure checking, verification, and debugging operations. Thus, the JTAG architectures and features provide secondary or ancillary backdoors into chips.
It is therefore advantageous and desirable to provide a checkstop architecture for an entire and overall computer system, particularly a complex computer system. It is also advantageous and desirable to provide a checkstop architecture for an entire and overall computer system wherein the computer system is promptly or immediately stopped or halted when such a failure(s) or error(s) has occurred within the computer system such as at a chip, component, or sub-system. It is further advantageous and desirable to provide a way of preserving the state of an entire computer system at failure or error so that a complete and accurate state of the entire computer system at the time of failure or error occurrence is still able to be provided. It is still further advantageous and desirable to provide an easy way of identifying, locating, and debugging the error(s) or failure(s) that has or have occurred within an overall computer system and the source of the error(s) or failure(s). It is still also advantageous and desirable to provide a checkstop architecture that utilizes a single-wire checkstop that provides a way for quickly stopping all chips in the system and a JTAG bus that provides a way for querying the error registers in determining which chip pulled checkstop first and what had occurred to cause the error.
It is therefore one object of the present invention to provide a checkstop architecture for an entire and overall computer system, particularly a complex computer system.
It is another object of the present invention to provide a checkstop architecture for an entire and overall computer system wherein the computer system is promptly or immediately stopped or halted when such a failure(s) or error(s) has occurred within the computer system such as at a chip, component, or sub-system.
It is a further object of the present invention to provide a way of preserving the state of an entire computer system at failure or error so that a complete and accurate state of the entire computer system at the time of failure or error occurrence is still able to be provided.
It is still another object of the present invention to provide an easy way of identifying, locating, and debugging the error(s) or failure(s) that has or have occurred within an overall computer system and the source of the error(s) or failure(s).
It is yet a further object of the present invention to provide a checkstop architecture that utilizes a single-wire checkstop that provides a way for quickly stopping all chips in the system and a JTAG bus that provides a way for querying the error registers in determining which chip pulled checkstop first and what had occurred to cause the error.
The foregoing objects are achieved as is now described. A checkstop architecture allows an entire computer system to be immediately halted when a failure or error has occurred at a chip, component, device, sub-system, etc. The present checkstop architecture provides a way of preserving and later providing the state of the computer system at failure or error. The checkstop architecture utilizes a single-wire checkstop that provides a way for quickly stopping all chips in the system and a JTAG bus that provides a way for querying the error registers in determining which chip pulled checkstop first and what had occurred to cause the error. The present system and method also utilize a service processor, various computer devices, and at least one central checkstop collection location. The occurrence of the checkstop at one of the computer devices is detected by its internal checkstop operation. The occurrence of the checkstop is driven to the at least one central checkstop collection location, all others of the computer devices, and the service processor. A single-wire checkstop provides a way for all chips of the entire computer system to be halted when the occurrence of the checkstop has been detected. Error registers of the chips are then queried via a separate JTAG bus to identify the chip which first pulled checkstop and what caused the error. The service processor captures the state of the entire computer system at the time of checkstop occurrence and determines the initial source of the checkstop by tracing back from the central checkstop collection chip.
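The flow summarized above can be sketched as a small software model. It is an illustrative simulation only, under several assumptions not stated in the text: the chips form a tree rooted at the collection chip, each chip's error register records only the first checkstop source it saw, and a JTAG query is modeled as a plain method call. The names `Chip`, `System`, `raise_checkstop`, `jtag_read`, and `trace_source` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Chip:
    name: str
    # Downstream chips whose checkstop output feeds this chip (assumed tree layout).
    children: List[str] = field(default_factory=list)
    halted: bool = False
    # Hypothetical error register: "internal", a child's name, or None if clean.
    first_source: Optional[str] = None
    error_code: Optional[str] = None


class System:
    def __init__(self, chips: List[Chip], root: str) -> None:
        self.chips: Dict[str, Chip] = {c.name: c for c in chips}
        self.root = root  # the central checkstop collection chip
        self.parent = {child: c.name for c in chips for child in c.children}

    def raise_checkstop(self, name: str, error_code: str) -> None:
        """A chip detects a fatal error internally and pulls checkstop."""
        chip = self.chips[name]
        chip.first_source = "internal"
        chip.error_code = error_code
        # Drive the checkstop up toward the collection chip; each chip on the
        # way latches which input fired first.
        child = name
        while child in self.parent:
            upstream = self.chips[self.parent[child]]
            if upstream.first_source is None:
                upstream.first_source = child
            child = upstream.name
        # Single-wire checkstop: every chip in the system halts.
        for c in self.chips.values():
            c.halted = True

    def jtag_read(self, name: str) -> Dict[str, Optional[str]]:
        """Model of a JTAG query of one chip's error register."""
        c = self.chips[name]
        return {"first_source": c.first_source, "error_code": c.error_code}

    def trace_source(self) -> Tuple[List[str], Optional[str]]:
        """Service-processor walk from the collection chip back to the source."""
        path = [self.root]
        current = self.root
        while True:
            reg = self.jtag_read(current)
            src = reg["first_source"]
            if src is None:
                return path, None  # no checkstop recorded
            if src == "internal":
                return path, reg["error_code"]
            path.append(src)
            current = src


# Example topology (hypothetical): collector <- bridge <- io0, collector <- cpu0
system = System(
    [Chip("collector", ["bridge", "cpu0"]), Chip("bridge", ["io0"]),
     Chip("io0"), Chip("cpu0")],
    root="collector",
)
system.raise_checkstop("io0", "parity")
path, cause = system.trace_source()
```

Here `path` lists the chips visited from the collection chip down to the originating chip, and `cause` is the error code read from that chip's register. Because every chip is halted before any register is read, the registers still reflect the state at the time of the checkstop.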
The above as well as additional objects, features, and advantages of the present invention will become apparent in the following detailed written description.