Application-level virtualization systems isolate an application from the underlying physical hardware for purposes such as protection (fault tolerance), mobility (application relocation through checkpoint and restart, as with IBM MetaCluster operating on Linux; MetaCluster is a trademark of IBM Corporation in certain countries), deterministic replay, or simply resource isolation, as in Linux (Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both) Vserver (Vserver is a trademark of Linus Torvalds in certain countries), Virtuozzo (Virtuozzo is a trademark of SWsoft in certain countries), and OpenVZ. All of these systems need to intercept existing system calls and change their original semantics.
One way to do this is to change the system call routines in the kernel so that they intercept calls and modify their semantics. Making the necessary changes inside the operating system is difficult, endangers the stability and security of the whole system, and is generally not well accepted by users or maintainers, because it increases kernel complexity and may compromise both the integrity of the system and the ability to support it.
Some methods exist to insert code into a program in order to analyze its behavior, for example by collecting analysis data. This technique of modifying a program so that it analyzes itself is known as an “instrumentation method”. An instrumentation method could be used to instrument the system calls, which could then be modified from user space. However, although the existing instrumentation methods perform well enough for debugging purposes, they cannot meet high performance requirements such as those of fault-tolerant systems.
The “ptrace” method for instrumenting executable code, as used by the Linux strace tool, requires an external controller process. At each system call occurrence the target process is stopped and the controller is notified by a signal; the controller then inspects the target and restarts it. This method is generic, but the resulting performance overhead is very high.
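The controller loop described above can be sketched in C as follows. This is a minimal illustration for Linux, not the implementation used by strace; the function name `count_syscall_stops` is hypothetical, and the sketch simply counts system call stops (entry and exit each count once) instead of inspecting arguments. It also suppresses any signals sent to the target, which a real tracer would forward.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv under ptrace and count system call stops.
 * Returns the number of stops (entry and exit both count), or -1 on error. */
long count_syscall_stops(char *const argv[])
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Target: request tracing, then stop until the controller is ready. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(127);
        raise(SIGSTOP);
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }

    int status;
    long stops = 0;

    /* Controller: wait for the initial SIGSTOP of the target. */
    if (waitpid(child, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;

    /* Mark system call stops with bit 0x80 so they can be told apart
     * from ordinary signal stops. */
    ptrace(PTRACE_SETOPTIONS, child, NULL, (void *)(long)PTRACE_O_TRACESYSGOOD);

    for (;;) {
        /* Restart the target until its next system call entry or exit. */
        if (ptrace(PTRACE_SYSCALL, child, NULL, NULL) < 0)
            return -1;
        if (waitpid(child, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80))
            stops++; /* a real controller would inspect registers here */
    }
    return stops;
}
```

Every counted stop involves at least two context switches between target and controller, which is where the overhead noted above comes from.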
The LD_PRELOAD method, also an instrumentation of executable code, uses dynamically linked symbol interposition to intercept and substitute system calls that exist in the form of dynamic symbols. This method is limited to dynamically linked executables, and is not applicable if a system call is inlined in the library (because there is then no associated symbol to interpose). Inlined system calls are increasingly common in recent Linux standard libraries, which makes this method less and less effective.
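As an illustration of symbol interposition, the sketch below defines its own `write` with the same signature as the C library wrapper. It assumes Linux with glibc in its default (GNU) compilation mode; the counter `intercepted_writes` is a hypothetical name, and the wrapper forwards to the kernel with `syscall()` rather than looking up the next definition with `dlsym()`, which is the more common variant.

```c
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical counter: number of write() calls seen by the interposer. */
unsigned long intercepted_writes = 0;

/* Same signature as the libc wrapper. When this object comes before libc
 * in the symbol search order, calls to write() made through the dynamic
 * symbol resolve here first. */
ssize_t write(int fd, const void *buf, size_t count)
{
    intercepted_writes++; /* added behaviour goes here */

    /* Forward the request to the kernel; syscall() sets errno on failure
     * in the same way as the original wrapper. */
    return (ssize_t)syscall(SYS_write, fd, buf, count);
}
```

Built as a shared object (for example with `cc -shared -fPIC`) and named in the `LD_PRELOAD` environment variable, such a library takes precedence over the standard one. Calls that the library issues internally or inlines never go through the `write` symbol and are therefore not seen.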
Machine code rewriting is another method of instrumenting executable code: the executable machine code is statically or dynamically rewritten, and wherever a system call is encountered, additional code can be inserted to provide added value. This method does not support self-modifying executable code, and its performance overhead can also be very significant. An example is the ATOM product that was available on Digital Equipment Corporation workstations. ATOM inserts code, at compile time, into the program to be analyzed.
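The first step of such a rewriter, locating the system call sites, can be sketched as follows. The example assumes x86-64, where the `syscall` instruction is encoded as the two bytes 0x0F 0x05; the function name `find_syscall_sites` is hypothetical. A plain byte scan is shown for brevity and can report false matches inside other instructions or data, so a real rewriter would disassemble the code instead.

```c
#include <stddef.h>

/* Hypothetical helper: scan a buffer of x86-64 machine code for the
 * two-byte `syscall` opcode (0x0F 0x05). Stores up to max_sites offsets
 * in sites[] and returns the total number of matches found. */
size_t find_syscall_sites(const unsigned char *code, size_t len,
                          size_t *sites, size_t max_sites)
{
    size_t found = 0;
    size_t i = 0;

    while (i + 1 < len) {
        if (code[i] == 0x0F && code[i + 1] == 0x05) {
            if (found < max_sites)
                sites[found] = i; /* offset where code could be patched */
            found++;
            i += 2; /* skip past the matched opcode */
        } else {
            i++;
        }
    }
    return found;
}
```

Each reported offset is a candidate location at which a rewriter would substitute a jump to the inserted code.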
There is thus a need for a new method of intercepting all types of system calls during the execution of a program and of modifying their behaviour from user space, because kernel code is executed in privileged mode and cannot be modified by the program, while avoiding a performance overhead that would be unacceptable for fault-tolerant systems.