As high-speed, high-performance communications become necessary for applications such as data warehousing, decision support, mail and messaging, and transaction processing, clustering technology has been adopted to provide availability and scalability for these applications. A cluster is a group of one or more host systems (e.g., computers, servers and workstations), input/output (I/O) units which contain one or more I/O controllers (e.g., SCSI adapters, network adapters, etc.), and switches that are linked together by an interconnection fabric to operate as a single data network delivering high performance, low latency, and high reliability. Clustering offers three primary benefits: scalability, availability, and manageability. Scalability is obtained by allowing servers and/or workstations to work together and by allowing additional services to be added for increased processing as needed. The cluster combines the processing power of all servers within the cluster to run a single logical application (such as a database server). Availability is obtained by allowing servers to “back each other up” in the case of failure. Likewise, manageability is obtained by allowing the cluster to be utilized as a single, unified computer resource; that is, the user sees the entire cluster (rather than any individual server) as the provider of services and applications.
Emerging network technologies for linking servers, workstations and network-connected storage devices within a cluster include InfiniBand™ and its predecessor, Next Generation I/O (NGIO), which have been recently developed by Intel Corp. and other companies. These technologies provide a standards-based I/O platform that uses a channel-oriented, switched fabric and separate I/O channels to meet the growing needs of I/O reliability, scalability and performance on commercial high-volume servers, as set forth in the “Next Generation Input/Output (NGIO) Specification,” NGIO Forum on Jul. 20, 1999 and the “InfiniBand™ Architecture Specification,” the InfiniBand™ Trade Association on Oct. 24, 2000.
One major challenge to implementing clusters based on NGIO/InfiniBand technology is to ensure that data messages traverse reliably between given ports of a data transmitter (source node) and a data receiver (destination node), via one or more given transmission (redundant) links of a switched fabric data network. To that end, service (work) requests from fabric-attached InfiniBand™ clients (e.g., remote systems) to a service provider (e.g., host system) must be properly acknowledged by the service provider (host system) within a certain amount of time. Otherwise, service requests or response messages can get lost in the switched fabric data network. In addition, considerable cluster bandwidth can be wasted if the fabric-attached InfiniBand™ clients generate unnecessary timeouts and retries for service requests, via data paths in the switched fabric data network.
Currently, the InfiniBand™ Architecture Specification set forth on Oct. 24, 2000 defines some basic mechanisms that allow InfiniBand™ clients to determine a response time from a service provider in the switched fabric data network before timing out and retrying service requests. However, these currently defined mechanisms are only intended to compute timeouts based on values that are programmed statically. They do not take into account delays that contribute to the amount of time an InfiniBand™ client has to wait before timing out and retrying a service request, nor dynamic variations that result from fabric congestion or temporary overload of specific service providers. In addition, the InfiniBand™ Architecture Specification provides no mechanism to obtain information regarding the current workload of a service provider and to generate dynamic feedback to InfiniBand™ clients about that workload in order to avoid premature timeouts and unnecessary retries. As a result, cluster bandwidth can be wasted generating unnecessary retries and responses to the retries.
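The cost of a statically programmed timeout can be illustrated with a minimal sketch. It is not taken from the InfiniBand™ Architecture Specification: the function name `retries_before_response`, the 500 µs timeout, the response times, and the retry limit are all hypothetical values chosen for illustration. The model assumes a client that resends its request each time the fixed timeout expires until the response to the first request arrives.

```python
import math


def retries_before_response(timeout_us: int, response_us: int,
                            max_retries: int = 7) -> int:
    """Count the retries a client sends while its first request is still being serviced.

    Hypothetical model: the client resends every `timeout_us` microseconds
    until the response arrives `response_us` microseconds after the first send.
    """
    if timeout_us <= 0:
        raise ValueError("timeout_us must be positive")
    # Number of timeouts that expire strictly before the response arrives.
    expired = math.ceil(response_us / timeout_us) - 1
    return max(0, min(max_retries, expired))


# Fixed timeout programmed once, regardless of provider load (assumed figure).
STATIC_TIMEOUT_US = 500

if __name__ == "__main__":
    # Assumed response times for three provider load levels.
    for load, response_us in [("idle", 200), ("busy", 1200), ("overloaded", 4000)]:
        n = retries_before_response(STATIC_TIMEOUT_US, response_us)
        print(f"{load}: response after {response_us} us, {n} unnecessary retries")
```

Under these assumptions the retry count grows with the provider's load even though no request was actually lost, and each retry and its response consume fabric bandwidth.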
Accordingly, there is a need for a mechanism that is friendlier to clients and less wasteful of cluster bandwidth, to prevent unnecessary timeouts and retries for service requests in a switched fabric data network.