Large, complex software systems provide a variety of network services, including web, mail, naming, authentication, routing, file transfer, and collaboration services. However, the complexity of these systems makes perfect construction unrealizable. As a result, when latent software bugs are triggered, or exploited by an attacker to gain unauthorized privilege or to deny service, critical network services may go down. The impact of network-service downtime far exceeds that of a single workstation going down. An entire enterprise may go “off the net” if its DNS server is corrupted. Similarly, if its web server goes down, its corporate web presence may disappear. Finally, the impact on users of an unavailable mail server goes beyond frustration; it means loss of productivity and potentially loss of business. Likewise, compromised servers can serve as unwitting repositories for malicious software, rootkits, and illicit digital content. In addition, compromised servers often serve as high-bandwidth spam hosts, as participants in denial-of-service attacks, or simply as on-demand zombies in a botnet.
While intrusion detection and prevention technologies have become mainstream commercial products, a stubborn problem persists: the inevitability of errors. These errors may be false positives, when innocuous requests or system behaviors are misinterpreted as hostile, or false negatives, when successful intrusions evade detection. The former may waste considerable manpower in the investigation of nonexistent breaches and cause service interruptions. The latter may be even more dangerous: undetected, compromised servers may be used as a jump pad to penetrate internal corporate networks or for other nefarious activity.
Much of today's software is inherently vulnerable to attack and unreliable for critical applications. The problem stems from the complexity of software, which yields a bug density of roughly 6 bugs per thousand lines of code (KLOC) in best-case scenarios and 10-12 bugs/KLOC in average cases. Some of these bugs lead to unreliability when triggered; others can lead to privilege escalation for unauthorized users when exploited. Furthermore, the manufacturing lifecycle of software leaves open the possibility of insider sabotage, wherein the code released by the software vendor contains embedded Trojans or backdoors to be used later for nefarious purposes.
Current solutions to this problem seek to increase pressure on system manufacturers, such as major software vendors, to release defect-free software. Most related work has focused on defect or intrusion prevention, detection, and removal. Almost all software-based approaches, though, are subject to being compromised by attacks against the machine.
A significant body of work exists to protect servers against attack, to recover after attack, and to make servers fault tolerant. Here, we summarize the work most current and most relevant to our approach.
The first line of defense against software flaws and attacks is to build better software by finding and eliminating flaws. These techniques can be useful in reducing the exposure of software to attack and increasing its reliability, and they should be applied prior to deployment. However, they cannot guarantee the future behavior of the code. Another tactic is to filter the input a program receives, preventing attacks from exploiting vulnerabilities in the code [4]. While effective at stopping many known attacks, filters are unable to stop attacks of unknown type or attacks that resemble legitimate program input.
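As a rough illustration, an input filter of this kind might reject requests that exceed a length bound or contain characters common in injection payloads. The sketch below is a hypothetical example, not the filtering system of [4]; the name `filter_allows` and the specific rules are assumptions for illustration.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical input filter: rejects requests that exceed a length
 * bound or contain shell metacharacters common in injection payloads.
 * As noted above, such a filter stops many known attacks but cannot
 * stop attacks that resemble legitimate program input. */
#define MAX_FIELD_LEN 256

bool filter_allows(const char *input)
{
    if (strlen(input) > MAX_FIELD_LEN)
        return false;                    /* possible buffer-overflow attempt */
    if (strpbrk(input, ";|`$") != NULL)
        return false;                    /* metacharacters: possible injection */
    return true;
}
```

A filter like this is cheap and stateless, which is why the approach is popular, but its blind spot is exactly the one named above: any attack whose bytes look like a legitimate request passes through.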
In contrast to the preventative techniques discussed above, post-release techniques have been developed to detect and compensate for successful attacks. Recent work in program instrumentation has enabled programs to detect and recover from faults and attacks [5], [6], [7], [8]. These approaches offer the ability to catch program faults as they occur and then continue executing. In failure-oblivious computing [5], memory-dereferencing errors are caught by compiler-inserted runtime checks. Unlike prior techniques, such as safe-C compilers that throw an exception or terminate on unsafe memory accesses [9], failure-oblivious computing and other fault-masking approaches, such as [6], effectively hide the effect of faults by simply returning manufactured, but incorrect, results from dangerous fault conditions. While these techniques may tolerate the effect of a fault, they may no longer guarantee the session semantics, since they have altered the program's state in response to a bad input. In other words, the program may no longer operate correctly.
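The kind of check a failure-oblivious compiler inserts can be sketched by hand. The `safe_read` helper below is an assumed, hand-written stand-in for the actual compiler instrumentation of [5]: an out-of-bounds read does not fault but instead returns a manufactured value, so execution continues, possibly with incorrect results.

```c
#include <stddef.h>

/* Hand-written sketch of the runtime check a failure-oblivious
 * compiler would insert around each memory read (illustrative only).
 * An out-of-bounds access is masked: the program neither crashes nor
 * throws, it simply receives a manufactured value and keeps running. */
int safe_read(const int *buf, size_t len, size_t idx)
{
    if (idx >= len)
        return 0;        /* manufactured value masks the fault */
    return buf[idx];
}
```

The sketch makes the trade-off discussed above concrete: the crash is avoided, but the caller has no way to distinguish the manufactured 0 from a genuine 0 in the buffer, which is precisely how session semantics can silently break.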
In a similar vein, error virtualization is a technique that relocates program control flow to a known safe state, a so-called rescue point, and invokes the program's native error handling upon fault detection [8]. The benefit of this technique over failure-oblivious computing is that the program's native error-handling code is forcibly invoked on an otherwise unhandled fault condition. This at least ensures that the program remains in a consistent state, if not the correct state for the program input. The technique counts on the host program, to some degree, to build in sufficient error handling or rescue points to handle the manufactured values or function return codes used for error virtualization. Where failure-oblivious computing incurs significant overhead from dynamic memory checks, error virtualization is performance-efficient, but it requires significant testing to identify relevant fault states and rescue points that can handle potentially dangerous error conditions.
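The control-flow relocation can be sketched with `setjmp`/`longjmp`. The function names and the use of a NULL input to stand in for a detected fault are illustrative assumptions, not the mechanism of [8]: a rescue point is recorded, and when a fault is detected deeper in the call chain, control is forced back to it and the function returns one of its own native error codes.

```c
#include <setjmp.h>
#include <stddef.h>

/* Sketch of error virtualization (assumed structure, not the cited
 * system): record a rescue point with setjmp(); on fault detection,
 * unwind to it with longjmp() and return a native error code, leaving
 * the program in a consistent, if not correct, state. */
static jmp_buf rescue_point;

static void faulting_parser(const char *input)
{
    if (input == NULL)              /* stands in for a detected fault */
        longjmp(rescue_point, 1);   /* unwind to the rescue point     */
    /* ... normal parsing would continue here ... */
}

int handle_request(const char *input)
{
    if (setjmp(rescue_point) != 0)
        return -1;                  /* native error code: bad request */
    faulting_parser(input);
    return 0;                       /* success */
}
```

The sketch shows why the technique leans on the host program's own error handling: the -1 path only keeps the server consistent because the caller already knows how to handle a "bad request" return code.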
In a different but related technique, Rx periodically checkpoints program state, then monitors the program for faults [7]. If a fault is detected, the program is rolled back to a prior checkpoint and re-executed, this time in a different environment. If the program failed for environmental reasons, re-executing the program on the same input in a new environment may result in an acceptable execution. Otherwise, the re-execution may simply cause the program to fault again.
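The rollback-and-retry loop at the heart of Rx can be sketched abstractly. The `executor_fn` type and the integer environment knob below are assumptions standing in for Rx's perturbed execution environments (e.g., padded allocations, altered scheduling), not the actual system of [7].

```c
#include <stdbool.h>

/* Sketch of Rx-style recovery (assumed structure): a checkpointed
 * request is re-executed after a fault, each time under a perturbed
 * environment. The environment is abstracted here to an integer knob
 * passed to a hypothetical executor callback. */
typedef bool (*executor_fn)(const char *request, int env);

int run_with_rollback(executor_fn exec, const char *request, int max_tries)
{
    for (int env = 0; env < max_tries; env++) {
        /* each iteration models rolling back to the checkpoint and
         * re-executing the request under environment 'env' */
        if (exec(request, env))
            return env;    /* acceptable execution found */
    }
    return -1;             /* program faulted in every environment tried */
}
```

The sketch also captures the limitation noted above: if the fault is deterministic rather than environmental, every iteration fails and the loop exhausts its retry budget.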
The techniques described above require intimate knowledge of the application being protected. In other words, they require source code and are ideally used by the developer of the code rather than by the acquirer of the server system. The techniques typically change program context, which may result in some interruption in service as well as potentially incorrect states.
What is needed is an architecture and set of techniques that require neither source code access nor intimate knowledge of the application code. Additionally, there is a need for a system that provides security against attacks that compromise root or super-user privilege on a machine by observing, adapting, and acting to compensate for the adverse conditions the server experiences, ensuring continued trustworthy service of client requests in the face of software failures or attacks.