1. Field of the Invention
The present invention relates generally to the handling of data transport failures and fail-over capability for data transfers within a computer system such as a server. More particularly, the present invention is directed to a system and method that establishes a secondary link for fail-over data transfer in the event of primary link failure. Still more particularly, the system and method of the present invention are applicable to Input/Output Processors (IOPs) in an I2O (intelligent input/output) environment.
2. Description of Related Art
Computer systems have achieved wide usage in modern society. During operation, a computer system processes and stores data at a speed and at a level of accuracy many times that which can be performed manually. Successive generations of computer systems have permitted ever-increasing amounts of data to be processed at ever-increasing rates.
Computer systems are sometimes operated as stand-alone devices or connected together by way of network connections, typically together with a network server, to form a computer network. When networked together, files and other data stored or generated at one computer system can be readily transferred to another computer system.
A conventional computer system typically includes one or more CPUs (central processing units) capable of executing algorithms forming applications, and a computer main memory. Peripheral devices, whether embedded on a backplane of the computer system or constructed to be separate therefrom, also typically form portions of a conventional computer system. Computer peripheral devices include, for instance, video graphics adapters, LAN (local area network) interfaces, SCSI (small computer system interface) adapters, and mass storage devices, such as disk drive assemblies.
A computer system further typically includes data buses which permit the communication of data between portions of the computer system. For instance, a host bus, a memory bus, at least one high-speed bus, a local peripheral expansion bus, and one or more additional peripheral buses form portions of a typical computer system.
A peripheral bus is formed, for instance, of a SCSI bus, an EISA (extended industry standard architecture) bus, an ISA (industry standard architecture) bus, or a PCI (peripheral component interconnect) bus. The peripheral bus forms a communication path to and from a peripheral device connected thereto. The computer system CPU, or a plurality of CPUs in a multi-processor system, communicates with a computer peripheral device by way of a computer bus, such as one or more of the computer buses noted above. A computer peripheral, depending upon its data transfer speed requirements, is connected to an appropriate computer bus, typically by way of a bus bridge that detects required actions, arbitrates, and translates both data and addresses between the various buses.
A computer peripheral device forming a portion of a single computer system might well be supplied by a manufacturer other than the manufacturer of the computer CPU. If the computer system contains more than one peripheral device, the peripheral devices might also be supplied by different manufacturers. Furthermore, the computer system may be operable pursuant to any of several different operating systems. The number of combinations of computer peripheral devices and computer operating systems of which a computer system might be formed quickly becomes quite large.
Software drivers are typically required for each computer peripheral device to effectuate its operation. A software driver must be tailored to be operable together with the operating system pursuant to which the computer system is operable. A computer peripheral device must, therefore, have associated therewith a software driver to be operable together with any of the several operating systems pursuant to which the computer system might be operable. A multiplicity of software drivers might have to be created for a single computer peripheral to ensure that a computer peripheral device is operable together with any of the different operating systems.
The complexity resulting from such a requirement has led to the development of an I2O (intelligent input/output) standard specification. The I2O standard specification sets forth, inter alia, standards for an I/O device driver architecture that is independent of both a specific peripheral device being controlled and the operating system of the computer system at which the device driver is to be installed.
In the I2O standard specification, the portion of the driver that is responsible for managing the peripheral device is logically separated from the specific implementation details for the operating system which is to be installed. Because of this, the part of the driver that manages the peripheral device becomes portable across different computer and operating systems. The I2O standard specification also generalizes the nature of communication between the host computer system and peripheral hardware, thus providing processor and bus technology independence.
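The separation described above can be illustrated with a minimal sketch. It is not taken from the I2O standard specification: the class names, the dictionary message format, and the "read"/"write" operations are all hypothetical, chosen only to show one device-managing module serving two different operating-system-specific front ends through a neutral message interface.

```python
class DeviceModule:
    """Hypothetical device-managing half: knows the device, not the OS."""

    def __init__(self):
        self.blocks = {}  # stands in for the peripheral's storage

    def handle(self, message):
        # Messages are plain dictionaries (an assumed, OS-neutral format).
        op = message.get("op")
        if op == "write":
            self.blocks[message["block"]] = message["data"]
            return {"status": "ok"}
        if op == "read":
            return {"status": "ok", "data": self.blocks.get(message["block"], b"")}
        return {"status": "unsupported"}


class OsFrontEndA:
    """Hypothetical OS-specific half for one operating system."""

    def __init__(self, device):
        self.device = device

    def write_block(self, block, data):
        reply = self.device.handle({"op": "write", "block": block, "data": data})
        return reply["status"] == "ok"


class OsFrontEndB:
    """Hypothetical OS-specific half for a different operating system."""

    def __init__(self, device):
        self.device = device

    def get(self, block):
        reply = self.device.handle({"op": "read", "block": block})
        return reply["data"]
```

In this sketch only the two front ends would need to change when the operating system changes; `DeviceModule` is reused unmodified, which is the portability property the paragraph above attributes to the device-managing part of the driver.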
Construction of computer systems compliant with the I2O standard specification facilitates formation of a computer system having component portions supplied by different suppliers while also assuring that the different component portions of the computer system shall be operable when connected together. Upgrading an existing computer system to be I2O aware assures that subsequent upgrading of the computer system shall be able to be effectuated simply.
One difficulty inherent in all computer systems is the handling of a variety of system faults and their recovery, also referred to as fault tolerance. The identification, control and isolation of such faults are especially important in current devices which employ Error Checking and Correcting (ECC) memory, Redundant Arrays of Inexpensive Disks (RAID) and hot-swappable disk drives, and even hot-swappable power supplies. Within servers, fault tolerance techniques began with an initial focus on memory and physical storage subsystems, and now include various fail-over solutions. Current servers implement such fail-over solutions for storage subsystems and LAN-LAN routing.
It has been proposed that the I2O specification may be used to incorporate new levels of fail-over for I/O subsystems. One particular area of interest is the peer-to-peer and clustering capabilities of I2O. Peer-to-peer technology allows two I/O Processors (IOPs) to communicate with each other independently of the host CPUs and the media connecting the two IOPs. Clustering extends the peer-to-peer concept outside of the physical system (or unit) defined by the I2O specification.
Problems arise, however, when a failure occurs across the media connecting two such IOPs, such as disconnection of a communications cable (minor) or a bus lock-up (severe). In non-I2O systems, for example, each driver must have direct knowledge of every underlying transport and media. Because a myriad of transport and media types are available, each driver requires a great deal of complex coding to handle the various contingencies. In the current I2O specification, for example, a fault, upon detection by a transport device, is reported to each device driver or application software, collectively referred to hereinafter as downloadable driver modules or DDMs, using a particular data service pathway. The respective DDMs then automatically close the connection with the remote IOP and lose all of the resources previously allocated by that DDM on that remote IOP. Furthermore, the DDMs, upon transport failure, must tear down their respective operating environments and completely rebuild them. Even if a redundant link is found, the entire buffer allocation and DDM-to-DDM setup must begin anew.
There is, therefore, a need for a computer system and method which minimizes the error handling needs of a device driver or application software, particularly in the event of a primary transport failure.
The present invention is directed to a system and method for maintaining communications within a computer system after a data transport failure across a first link. Fail-over capability is attained by re-establishing communications across a secondary link using different transport mechanisms. For example, between two Input/Output Processors (IOPs) within a computer system, such as a server, a series of data transactions therebetween are queued until transaction completion. Upon detection of a failure condition between the IOPs across the first link, the IOPs engage fail-over mechanisms therein to preserve uncompleted data transactions until communications are re-established across the secondary link.
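The queue-and-replay behavior summarized above can be sketched as follows. This is a simplified illustration under stated assumptions, not the disclosed implementation: the `Link`, `LinkDown`, and `FailoverChannel` names, the integer transaction identifiers, and the in-memory simulation of link health are all hypothetical.

```python
from collections import OrderedDict


class LinkDown(Exception):
    """Raised by a hypothetical link when a send cannot be delivered."""


class Link:
    """Hypothetical transport link; `up` simulates the health of the medium."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.delivered = []

    def send(self, txn_id, payload):
        if not self.up:
            raise LinkDown(self.name)
        self.delivered.append((txn_id, payload))


class FailoverChannel:
    """Keeps every transaction queued until it is reported complete."""

    def __init__(self, primary, secondary):
        self.active = primary
        self.standby = secondary
        self.pending = OrderedDict()  # txn_id -> payload, in submission order
        self.next_id = 0

    def submit(self, payload):
        txn_id = self.next_id
        self.next_id += 1
        self.pending[txn_id] = payload  # queue before sending
        try:
            self.active.send(txn_id, payload)
        except LinkDown:
            self._fail_over()
        return txn_id

    def complete(self, txn_id):
        # Only a completed transaction leaves the queue.
        self.pending.pop(txn_id, None)

    def _fail_over(self):
        if self.standby is None:
            raise LinkDown("no secondary link available")
        # Switch links, then replay every uncompleted transaction in order.
        self.active, self.standby = self.standby, None
        for txn_id, payload in list(self.pending.items()):
            self.active.send(txn_id, payload)
```

In this sketch the caller of `submit` never sees the primary link failure; uncompleted transactions survive the switch because they remain in `pending` until `complete` is called, so no buffers or setup state need to be rebuilt by the caller.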
A more complete appreciation of the present invention and the scope thereof can be obtained from the accompanying drawings which are briefly summarized below, the following detailed description of the presently-preferred embodiments of the invention, and the appended claims.