Organizations and business enterprises typically have one or more core service applications that are vital to their operations. For example, many organizations rely on e-mail, contact management, calendaring, and electronic collaboration services provided by one or more service applications. In another example, a database and associated applications can provide the core operations used by the organization. These core services are critical to the normal operation of the organization. During periods of service interruption, referred to as service downtime, organizations may be forced to stop or substantially curtail their activities. Thus, service downtime can substantially increase an organization's costs and reduce its efficiency.
A number of different sources can cause service downtime. Critical services may depend on other critical or non-critical services to function, so a failure in one of those services can cause the critical service application to fail. For example, e-mail service applications often depend on directory services, such as Active Directory and its Global Catalog, to function. Additionally, service enhancement applications, such as spam message filters and anti-virus applications, can malfunction and disable a critical service application.
Another source of service downtime is administrative error. Service administrators might update critical service applications with poorly tested software updates, or patches, that cause the critical service application to fail. Additionally, some service applications require frequent updates to correct newly discovered security holes and critical flaws. Installing the many patches for these service applications in the wrong order can cause the service application to fail. Service administrators may also misconfigure service applications or issue erroneous or malicious commands, causing service downtime.
Application data is another source of service downtime. Databases used by critical service applications can fail. Additionally, service application data can be corrupted, either accidentally or intentionally, for example by computer viruses and worms. Either kind of failure can lead to service downtime.
Software and hardware issues can also lead to service downtime. Flaws in the critical service application and its underlying operating system, such as memory leaks and other software bugs, can cause the service application to fail. Additionally, the hardware supporting the service application can fail. For example, processors, power and cooling systems, circuit boards, network interfaces, and storage devices can malfunction, causing service downtime.
Reducing or eliminating service downtime for an organization's critical services can be expensive and complicated. Because of the large number of sources of service downtime, there is often no single solution that minimizes it. Adding redundancy to service applications, such as backup and clustering systems, is expensive, complicated to configure and maintain, or both, and often fails to prevent some types of service downtime. For example, if a defective software update is installed on one service application in a clustered system, the defect will be mirrored on all of the other service applications in the clustered system. As a result, all of the service applications in the system will fail and the service will be interrupted. Similarly, administrator errors will affect all of the service applications in a clustered system equally, again resulting in service downtime.
It is therefore desirable for a system to reduce service downtime from a variety of sources. It is further desirable that the system operate transparently so that the configuration and operation of the service application are unchanged from their original condition. It is also desirable that the system detect a service application failure or imminent failure and seamlessly take over the service so that service users cannot perceive any interruption in service during the period that the service application is not functioning, referred to as a “failover” period. It is desirable that the system detect when a failed service application is restored to normal operation, update the service application with data handled by the system during the service application downtime, and seamlessly return control of the service to the service application so that service users cannot perceive any interruption in service during this “failback” period. It is desirable that the system require minimal configuration and installation from service administrators. It is also desirable that the system be robust against failure, self-monitoring and self-repairing, and be capable of automatically updating itself when needed.
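The failover and failback behavior described above can be pictured as a small state machine. The following is only a minimal sketch under stated assumptions, not a description of any particular system: the class `StandbyController`, its `failure_threshold` parameter, and the lists `buffered` and `replayed` are hypothetical names introduced for illustration, and failure detection is reduced to counting consecutive failed health checks.

```python
from enum import Enum, auto


class Mode(Enum):
    PRIMARY = auto()   # the original service application handles requests
    FAILOVER = auto()  # the standby system handles requests


class StandbyController:
    """Hypothetical controller; all names here are illustrative assumptions."""

    def __init__(self, failure_threshold=3):
        # Assumption: a failure is declared after this many consecutive
        # failed health checks of the service application.
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.mode = Mode.PRIMARY
        self.buffered = []  # data handled by the standby during failover
        self.replayed = []  # stands in for the restored application's store

    def observe_health_check(self, healthy):
        """Record one health-check result and update the mode."""
        if self.mode is Mode.PRIMARY:
            if healthy:
                self.consecutive_failures = 0
            else:
                self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.mode = Mode.FAILOVER  # take over the service
        elif healthy:
            # Failback: update the restored application with the data
            # handled during downtime, then return control to it.
            self.replayed.extend(self.buffered)
            self.buffered.clear()
            self.consecutive_failures = 0
            self.mode = Mode.PRIMARY
        return self.mode

    def handle(self, item):
        """Route one request; report which side served it."""
        if self.mode is Mode.FAILOVER:
            self.buffered.append(item)
            return "standby"
        return "primary"
```

In this sketch, a caller would invoke `observe_health_check` periodically and `handle` for each request; requests served during the failover period are retained so that they can be replayed to the service application at failback.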
Additionally, it is desirable for the system to allow for services to be migrated to new service applications and/or hardware without service users perceiving any interruption in service. It is further desirable that the system be capable of acting in a stand-alone capacity as the sole service provider for an organization or in a back-up capacity as a redundant service provider for one or more service applications in the system. It is still further desirable that the system be capable of providing additional capabilities to the service, thereby improving the quality of the service data received or emitted by the service application. It is also desirable that the system provide administrative safeguards to prevent service administrators from misconfiguring service applications. It is also desirable that the system allow for efficient throughput of network traffic and seamless traffic snooping without complicated packet inspection schemes.