1. Field of the Invention
The present invention relates generally to a host computer that functions as a server for a distributed computing system. More particularly, the present invention relates to a diagnostic subsystem for the host computer that can be directly or remotely accessed by a system administrator, and which permits the system administrator to enter commands to the host system and to view operation of the host system. Still more particularly, the present invention relates to a diagnostic subsystem for a host server computer that monitors the host server for errors, and in response, takes appropriate action, including alarm generation, error diagnosis, and in some instances, corrective action to restrict or minimize the error condition to permit continued operation of the distributed computing system.
2. Background of the Invention
Distributed computing systems (or computer networks) are generally well known. Such systems permit communications between application programs loaded on numerous computer work stations in a network. There are numerous types of distributed computing systems, commonly classified by the geographical extent of their communication capabilities. Some of the terms used to classify the geographical breadth of distributed computing systems are, for example, local area networks (LAN's), metropolitan area networks (MAN's) and wide area networks (WAN's).
Computer network systems such as LAN's have become one of the most important devices for storing and sharing data in a business. Thus, computer networks are one of the most critical pieces of equipment in a business office. A failure in the computer network can cause business operations to grind to a halt. Computer networks typically comprise a plurality of personal computers and other data processing devices connected together for information exchange. At the heart of the computer network are one or more file servers. In most computer networks, file servers are responsible for administering and storing the documents generated by each of the personal computers (PC's) in the system. In addition to managing the network, file servers also preferably include the capability to monitor for faults in the computer network. If a fault or security breach is detected, the file server provides a warning of the fault and in certain instances may also provide diagnostic operations and may even implement corrective measures.
Because file servers are responsible for storing much of the data generated in the computer network, the file server typically has mass storage capabilities. The mass storage typically is implemented in a disk subsystem or disk array. To preserve the integrity of data stored in the disk array, it is common to provide redundant array of inexpensive disks (RAID) protection, to prevent loss of data in the event of server or network system error. Typically, SCSI hard disks are used to permit multiple hard drives to be chained to a single interface connector.
Network interface controller (NIC) cards commonly are used to couple the server to one or more computer networks. Both the SCSI hard disk and the network interface controller (NIC) connect to slots or connectors in the server system board. A server system board includes connectors for receiving external cards, such as the NIC cards. Other bus expansion slots may also be provided on the server system board, such as EISA (Extended Industry Standard Architecture) and PCI (Peripheral Component Interconnect) expansion slots for receiving peripheral devices designed for those particular bus standards.
Files, or data, are maintained by a host processing system within the server. Servers are designed to provide work stations with fast access to files stored by the server. Accordingly, file servers embody a host computer which responds to an operating system program (a popular operating system being, for example, WINDOWS NT.RTM. or NETWARE.RTM.) not only to orchestrate the files but also to maintain file security, file backup, or other file management features. One important aspect which flows from maintaining host functions within a server is the capability to manage the host from a remote site, and even to permit management of the host from sites remote from the network. Recently there has been a steady increase in the number of servers that are used in businesses. Increasingly, servers are placed at each location of a business, rather than using a centralized mainframe computer at a single location. Typically, a company has an individual or department responsible for administering all of the file servers. In many if not most instances, the administrator or administration department is physically located at a single site, such as the company's headquarters. Thus each of the servers must either be maintained and administered remotely or else personnel must be transported to remote offices to permit on-site management.
An example of a common server host system is illustrated in FIG. 1. As shown generally in FIG. 1, one or more host processors are coupled to a system bus. The host memory couples to the system bus via a memory controller unit. The system bus, in turn, also couples to a PCI bridge which interfaces the system bus to a PCI bus. Various PCI peripherals may reside on the PCI bus. A PCI to EISA bus bridge typically couples the PCI bus to an EISA bus. Various EISA peripherals may couple to the EISA bus. In addition, and as shown in FIG. 1, an APIC bus may couple directly between the host system processor and the PCI to EISA bus bridge. Other peripherals may also couple to the system bus. Network interface cards (NIC's), for connecting the server to the PC's, may couple to either the PCI bus or to the EISA bus.
Many operating systems permit access to the host from a remote site, often called a "virtual terminal." A virtual terminal, while not physically connected to the host, nonetheless allows remote control of certain operations of the host. Products such as Compaq Server Manager.RTM. and Compaq Insight Manager.RTM., obtainable from Compaq Computer Corp., have attempted to address some of the issues involved in managing a network of distributed servers from a single, remote site. These products permit an administrator to be notified of a remote server failure, to reset the server from the remote site, and to access certain information provided on the server console. Compaq's Insight Manager.RTM. permits remote maintenance of the file server as well as local and remote notification of errors. In addition, Insight Manager.RTM. permits the file server to be re-booted from a remote location or from any system on the network. Insight Manager.RTM. also provides control facilities including diagnostic capabilities to analyze the condition of the server system configuration and to update system firmware. Insight Manager.RTM. collects and monitors server data as well as data from each client in the network and allows the network manager to act on the data from a remote location or any work station on the network. In addition, Insight Manager.RTM. includes the capability to set user-defined thresholds which permit the server to monitor system parameters and to alert the network manager when an error occurs. Notification in the event of an alert or a failure may be delivered in several ways, including on-screen messages, a pager, e-mail, fax and SNMP.
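The notion of user-defined thresholds on monitored system parameters can be illustrated with a short sketch. The following is a minimal illustration only and does not represent how any Compaq product is implemented; the function name `check_thresholds` and the parameter names `cpu_temp_c` and `disk_used_pct` are hypothetical and chosen solely for the example.

```python
from typing import Dict, List, Tuple


def check_thresholds(
    readings: Dict[str, float],
    thresholds: Dict[str, float],
) -> List[Tuple[str, float, float]]:
    """Return (parameter, value, limit) for every reading above its limit.

    Parameters with no reading are skipped rather than treated as errors.
    """
    alerts = []
    for name, limit in thresholds.items():
        value = readings.get(name)
        if value is not None and value > limit:
            alerts.append((name, value, limit))
    return alerts


# Hypothetical parameter names and limits, for illustration only.
thresholds = {"cpu_temp_c": 65.0, "disk_used_pct": 90.0}
readings = {"cpu_temp_c": 71.0, "disk_used_pct": 40.0}

for name, value, limit in check_thresholds(readings, thresholds):
    print(f"ALERT: {name} = {value} exceeds threshold {limit}")
```

In such a scheme, each alert returned would then be handed to whichever notification path the administrator has selected.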
It is certainly beneficial to allow remote control of certain server functions, especially those needed to reset one or more servers within a network of servers. Downtime caused by server failure is probably the most costly time involved in running a distributed computer system. The causes of server failure, often termed a server host "crash," are numerous. Any number of malfunctions or design flaws associated with the server hardware, server operating system or application program running on a server may account for a server crash. If a server crashes, then file access is often lost and business records are inaccessible until the cause of failure is fixed.
A true benefit would result if an administrator located remotely from the server could do more than be alerted to, and then reset, a failed server. In particular, it would be advantageous for the administrator to determine the cause of server failure so that he/she can possibly prevent future failures before they occur. Prevention of failure is as important as, if not more important than, resetting a server that has crashed.
The cause of a failure is generally displayed on the server console at the time the server crashes. Moreover, irregularities in the server host hardware or operating system software can be detected upon reset (or "boot"). Those irregularities can lead to future failure if not attended to by the administrator. Accordingly, it would be beneficial to gain access to what is displayed on the server host console not only during server reset (or failure) but also leading up to server reset/failure. Information within the video screens (more particularly, the sequence of video screens) displayed on the server console during server failure or reset would help remotely located administrators determine (and hopefully fix) an existing server failure or potential failure.
The video screens resulting from a reset or failure of the server comprise a sequence of video screen changes displayed on the host server console by the operating system, system basic input output system ("BIOS"), server application program or other system software. In particular, capture of two screen change sequences is of particular interest to a server administrator. In order to fix an existing failure or prevent a future failure, it would be beneficial for the administrator to be given the sequence of screen changes prior to server failure as well as the sequence of screen changes following a reset. Examples of server failure screens displayed on the server console are the Microsoft Corp. Windows NT.RTM. "blue screen" and the Novell Corp. NETWARE.RTM. ABEND message, which appear on the server console when the respective operating system crashes. These screens provide information such as processor fault indicia, system software routine addresses, and pertinent system memory contents. Upon reset of the server, the power on self test ("POST") code, associated with the aforementioned operating systems, typically performs some system diagnostic functions and displays information regarding failures detected to the server console screen. It would be desirable to capture such sequences and replay them at a remote management site. It would also be desirable to have a remote console system which would permit diagnostic inquiries to be developed remotely which then could be transmitted to the host server system for execution. Thus, for example, tests can be run to detect open and short circuits in the system, which are the leading causes of failures. The results of these tests would then be relayed to the remote management site.
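The desirability of retaining the most recent sequence of screen changes for later replay can be illustrated with a short sketch. The following is a minimal illustration under stated assumptions, not a description of any particular implementation: screens are modeled as plain text strings, only screens that differ from the previous one are stored, and the class name `ScreenHistory` and its methods are hypothetical.

```python
from collections import deque
from typing import Deque, List, Optional


class ScreenHistory:
    """Keep the last `max_changes` distinct console screens (assumed model)."""

    def __init__(self, max_changes: int = 4) -> None:
        # A bounded deque discards the oldest entry once it is full.
        self._changes: Deque[str] = deque(maxlen=max_changes)
        self._last: Optional[str] = None

    def record(self, screen: str) -> bool:
        """Store the screen if it differs from the previous one."""
        if screen == self._last:
            return False
        self._changes.append(screen)
        self._last = screen
        return True

    def replay(self) -> List[str]:
        """Return the retained screens, oldest first."""
        return list(self._changes)


history = ScreenHistory(max_changes=2)
for screen in ["login prompt", "login prompt", "warning", "fault screen"]:
    history.record(screen)

for screen in history.replay():
    print(screen)
```

Bounding the history keeps storage fixed while still preserving the screens immediately preceding a failure, which are the ones of interest to the administrator.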
In addition to remotely monitoring the host server system, it would be advantageous to design a host server which is capable of surviving the majority of system faults. In that vein, it would also be advantageous if failed components could be detected and isolated from further interaction in system operation. It would further be advantageous if failed components in the system could be replaced dynamically with spare components without requiring a system shut down.
In systems with multiple bus masters, it is common to award mastership of the bus based upon priority values assigned to each master. Thus, when two masters request bus mastership at the same time, the master with the higher priority value will be granted mastership. Awarding mastership based only upon the preordained priority value of the bus master may cause the system to operate inefficiently. In addition, awarding mastership in this fashion ignores the type of transaction being executed, and also ignores the relative importance of the targets. It would be advantageous if additional criteria were examined for awarding bus mastership other than simply the relative priority of the masters to enhance system performance.
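The conventional priority-only scheme described above can be illustrated with a short sketch. The following is a minimal illustration, not a model of any specific bus arbiter; the master names and priority values are hypothetical and chosen solely for the example.

```python
from typing import Dict, Iterable, Optional


def grant_bus(requests: Iterable[str], priority: Dict[str, int]) -> Optional[str]:
    """Grant the bus to the requesting master with the highest priority value.

    Returns None when no master is requesting. Only the preassigned priority
    is consulted; the transaction type and the target are not considered.
    """
    requesting = list(requests)
    if not requesting:
        return None
    return max(requesting, key=lambda master: priority[master])


# Hypothetical masters and priority values, for illustration only.
PRIORITY = {"cpu": 3, "scsi": 2, "nic": 1}

print(grant_bus(["nic", "cpu"], PRIORITY))
print(grant_bus(["nic", "scsi"], PRIORITY))
```

Under such a scheme, a lower-priority master is never granted the bus while a higher-priority master is requesting, regardless of how urgent its transaction or how important its target may be.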