Businesses are becoming increasingly reliant on computer networks for mission-critical applications. With the emergence of the Internet and the proliferation of global e-business applications, more and more organizations are implementing computing infrastructures designed specifically for reliable data access and system availability. Today, even applications such as e-mail have become critical for ongoing business operations.
Faced with increased customer and internal user expectations, organizations are striving to achieve the highest availability in their computing systems. Any downtime in a mission-critical application can severely impact business operations and cost valuable time, money, and resources. To ensure the highest level of system uptime, organizations are implementing, for example, reliable storage area networks capable of boosting the availability of data for all the users and applications that need it. These organizations typically represent the industries that demand the highest levels of system and data availability, for example, the utilities and telecommunications sector, brokerages and financial service institutions, and a wide variety of service providers.
Developing highly available networks involves identifying specific availability requirements and predicting what potential failures might cause outages. In designing these networks, designers must first understand and define their availability objectives—which can vary widely from one organization to another and even within segments of the same organization. In some environments, no disruption can be tolerated while other environments might be only minimally affected by short outages. As a result, availability is relative to the needs of an application and a function of the frequency of outages (caused by unplanned failures or scheduled maintenance) and the time to recover from such outages.
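The dependence of availability on outage frequency and recovery time can be illustrated with the conventional steady-state formula, availability = MTBF / (MTBF + MTTR). The following is a minimal sketch only; the formula is the standard reliability-engineering expression rather than one stated in this document, and the function name and the sample figures are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the long-run fraction of time in service.

    mtbf_hours: mean time between outages (planned or unplanned).
    mttr_hours: mean time to recover from an outage.
    Both parameters are illustrative assumptions, not values from the text.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours must be >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Same outage frequency, two different recovery times (hypothetical numbers).
fast_recovery = availability(mtbf_hours=1000.0, mttr_hours=0.01)
slow_recovery = availability(mtbf_hours=1000.0, mttr_hours=1.0)
print(f"fast recovery: {fast_recovery:.6f}")
print(f"slow recovery: {slow_recovery:.6f}")
```

With the outage frequency held fixed, shortening the recovery time raises availability, which is why an application that tolerates no disruption constrains both quantities at once.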
One of the challenges of building an optical network is meeting these availability objectives, given the long spans of optical fiber used, for example, in long haul networks. The typical approach is to construct multiple diversely routed spans of optical fiber. Despite these redundancy measures and the monitoring techniques used, there is no escaping the reality that the frequency of switch-to-protect events increases with increasing transport distance.
Optical networks are mature, robust transport mechanisms for general data applications. With careful attention to network architecture, optical protection switching mechanisms enable the construction of a network with no single point of failure.
However, these protection switches, though infrequent, involve a brief, predictable, but very real loss of data transmission continuity. In voice or general data applications this has generally been acceptable. In more recent data applications, such as high speed optical networks used with mission-critical applications, these brief, infrequent protection switching events may halt the application and possibly require lengthy data resynchronization before the application can be restarted.
Although connectionless packet transport networks are less sensitive to brief interruptions in transport continuity because of their sophisticated routing mechanisms, they remain a source of network failure. Connectionless transport can potentially have large, unavoidable variations in latency. The same applications that are sensitive to data transport continuity are also sensitive to latency variations.
In implementing these long haul high speed networks, network designers now consider network availability to be of primary importance, ahead of the costs associated with implementing and operating the network. For high volume networks, any downtime may mean the loss of millions of dollars. These availability concerns are now readily apparent in the performance levels required of service providers. Service Level Agreements (SLAs) specifying the “5 9s” (99.999%) level of performance are now commonplace and a standard performance criterion. Under the “5 9s” level of performance, service providers are permitted no more than approximately 5.26 minutes of downtime per year.
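The downtime allowance implied by an availability target follows from simple arithmetic: the unavailable fraction multiplied by the number of minutes in a year. The sketch below assumes a 365-day year; the function and constant names are hypothetical.

```python
# Minutes in a 365-day year (an assumption; a leap year has 527,040 minutes).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(availability_pct: float) -> float:
    """Maximum downtime per year, in minutes, for a given availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1.0 - availability_pct / 100.0) * MINUTES_PER_YEAR


# Compare three-, four-, and five-nines targets.
for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_budget_minutes(pct):.3f} minutes per year")
```

Each additional nine cuts the allowance by a factor of ten, so a single protection event that forces a lengthy application restart can consume the entire annual budget of a five-nines agreement.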
Achieving these very high levels of performance in a high speed network requires a combination of a low failure rate and a very short recovery time whenever a failure occurs. For the most part, current protection and disaster recovery schemes make use of physical redundancy and an array of robust software-based recovery mechanisms. Physical redundancy has traditionally been achieved by provisioning redundant backup subsystems having substantially the same network elements as the primary network. In effect, the primary network is mirrored in the backup subsystem. In the event of a network failure, network elements such as switches and routers provide alternate and diverse routes on a real-time or predetermined basis. In tandem, software-based recovery schemes complement physical redundancy by minimizing the impact of interrupted customer traffic. Recovery software enhances network availability by automating the recovery process so as to ensure the fastest failover possible. At times, failover may occur so quickly that it appears transparent to the customer.
There are several high availability strategies in use today. Among these strategies are protective and restorative schemes based on centralized or distributed execution mechanisms, the priority of data, the network layer in which a failure occurs, link or node failures, and real-time or pre-computed failure responses. In one protective strategy, backup resources are allocated on a one-for-one basis in advance of any network failure, regardless of the added expense or the inefficient use of available resources. In another protective strategy, available and previously unassigned resources are allocated and used on a real-time or substantially real-time basis only after a failure occurs, at the expense of recovery speed.
Dedicated and shared use of network resources are two protective schemes currently used in network management. In the dedicated protective strategy, certain network resources are dedicated as backup network elements for use upon the failure of the primary communications channel. Backup resources such as backup switches, routers, servers, controllers, interfaces, drives, and links are dedicated as backup to the primary network elements. In the early development of the networking industry, this strategy was referred to as a “hot standby” mode of operation. Upon the detection of a failure of a network element, its corresponding backup network elements were immediately placed in operation. As shown in FIG. 1, the primary network elements are substantially duplicated on the backup pathway. In the event of a failure, data being transmitted on the primary pathway is rerouted through the backup pathway. In this protective approach to network availability, the backup pathway remains idle but is immediately made available to data on the primary pathway. As is readily apparent, the provisioning of a fully redundant and diverse route adds considerable expense to the installation and operation of the high speed network. Moreover, the physical switching of pathways may result in a disruption long enough to bring down a system.
To minimize the costs associated with a dedicated protective strategy, a shared approach, as shown in FIG. 2, uses a backup pathway which is shared by several primary pathways in the event of a network failure. In this shared protective scheme, a single pathway provides backup transport for each of the primary pathways. This shared protective scheme is known as a 1:N configuration, where N is the number of primary pathways sharing the backup pathway. Shared protective configurations operate under the presumption that only one of the primary pathways may fail at any given time. This presumption, however, can only be justified statistically in circumstances where the primary pathways are diversely routed and a failure event at any point on the network is unlikely to cause a failure in another span or node served by the same backup pathway. These same protective strategies have been applied to newly developed high speed networks.
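The statistical presumption behind a 1:N configuration can be made concrete with a binomial calculation: the single backup pathway suffices only while at most one of the N primary pathways is down. The sketch below assumes that the primary pathways fail independently and share the same unavailability, which corresponds to the diversely routed case described above; the function name and parameters are hypothetical.

```python
def prob_backup_sufficient(n_primaries: int, unavailability: float) -> float:
    """Probability that at most one of N independent primary pathways is down.

    n_primaries: N in the 1:N configuration.
    unavailability: probability that any one primary pathway is down at a
        given instant (assumed identical and independent for every pathway).
    """
    if n_primaries < 1:
        raise ValueError("n_primaries must be at least 1")
    if not 0.0 <= unavailability <= 1.0:
        raise ValueError("unavailability must be between 0 and 1")
    up = 1.0 - unavailability
    none_down = up ** n_primaries
    exactly_one_down = n_primaries * unavailability * up ** (n_primaries - 1)
    return none_down + exactly_one_down


# Sharing one backup among more primaries weakens the presumption.
for n in (2, 4, 8, 16):
    print(n, prob_backup_sufficient(n, unavailability=0.001))
```

If the pathways are not diversely routed, failures are correlated, the independence assumption no longer holds, and this calculation overstates the protection that the shared backup provides.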
In the optical networking industry, storage area networks (SANs) have used these same protective strategies, with less than acceptable availability performance. A SAN is a network whose primary purpose is the transfer of data between and among computer systems and storage elements. A SAN consists of a communication infrastructure, which provides physical connections, and a management layer, which organizes the connections, storage elements, and computer systems so that data transfer is secure and data is highly available. A major advantage of SANs is the ability to provide any-to-any connectivity between the storage devices and remote computers. This means that multiple computer systems can share a storage device, allowing storage devices to be consolidated into one or a few centrally managed platforms. SANs employ Fibre Channel technology to provide data transfer speeds of 100 MB/s or better, which is significantly faster than today's SCSI. At these speeds, SANs are used to perform backup and recovery functions, such as data replication, clustering, and mirroring. However, these functions are quite sensitive to data disruption and may be susceptible to even the briefest of network failures.
To ensure that the functional advantages inherent in storage area networks and the like are realized, there is a need for a method and system of transport which is more than highly available or fault-tolerant. The present invention fulfills this need and obviates the deficiencies found in current availability schemes by providing a means of provisioning a continuously available transport network. With the present invention, there is no single point of failure. Only a simultaneous loss of an optical link or equipment in each of the diversely routed pathways would result in total network failure. If, however, the maximum span length is limited to that required by the expected level of network availability, such a simultaneous loss is statistically unlikely. There will not be any switch to protection caused by a network element level failure or a fiber cut. It is not necessary to provide card level protection in this architecture, hence there will be no card switch to protect either. More specifically, there will be no optical layer protection switches from any source. The optical layer protection stems from the multiple fixed pathways through the optical network.
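The statement that only a simultaneous loss on every diversely routed pathway causes total failure corresponds to the conventional parallel-availability calculation, in which the network is unavailable only when all pathways are unavailable at once. The sketch below is illustrative and assumes that pathway failures are independent; the function name and the sample availabilities are hypothetical and are not taken from this document.

```python
from math import prod
from typing import Sequence


def parallel_availability(path_availabilities: Sequence[float]) -> float:
    """Availability of a network that fails only when every pathway fails.

    path_availabilities: availability of each diversely routed pathway,
        each between 0 and 1, assumed statistically independent.
    """
    if not path_availabilities:
        raise ValueError("at least one pathway is required")
    if any(not 0.0 <= a <= 1.0 for a in path_availabilities):
        raise ValueError("each availability must be between 0 and 1")
    # The network is down only if all pathways are down simultaneously.
    all_down = prod(1.0 - a for a in path_availabilities)
    return 1.0 - all_down


# Two and three pathways, each with a hypothetical availability of 0.999.
print(parallel_availability([0.999, 0.999]))
print(parallel_availability([0.999, 0.999, 0.999]))
```

Because longer spans tend to lower the availability of each individual pathway, limiting the maximum span length keeps the per-pathway terms high enough for the combined figure to meet the target.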