1. Field of the Invention
The present invention relates to computer instruction codes. More specifically, the present invention relates to a method and an apparatus for automatically isolating native code from platform-independent code written in a safe computer programming language.
2. Related Art
A common trend among computer specialists is to use safe computer-programming languages and systems such as JAVA™ to implement computer programs. (JAVA is a trademark or registered trademark of Sun Microsystems, Inc., Palo Alto, Calif., in the United States and other countries.) Typically, programs written in these safe computer-programming languages are compiled to a platform-independent code for execution within a safe virtual machine on a variety of target computers. In this context, the term “safe” indicates a level of confidence that the program and runtime system will not interfere with other applications running on the target computer and will not adversely affect memory use.
The growing popularity of these safe computer-programming languages has not, however, obviated the need for using native code on the target computers. Native code is code that has been compiled into the native instruction set of a particular computer processor. While safe languages offer many benefits, including inherent code reliability, increased programmer productivity, and ease of code maintenance, it is quite often desirable to execute user-supplied native code. There are several reasons for accepting this impurity, such as higher performance, access to devices and programming interfaces for which there is no standard mapping from the platform-independent runtime system, and direct interaction with operating system services. Nevertheless, native code is unsafe and, as such, breaks the contract offered by the safe language.
As an example of how native code is accessed by safe code running in a safe environment, FIG. 1 illustrates platform-independent runtime environment 104 accessing native code library 106. Platform-independent runtime environment 104 is contained within process 102 and is typically executing a platform-independent program. Process 102 also includes native code library 106 and platform-independent native interface (PINI) 108. Platform-independent runtime environment 104 and any executing platform-independent programs access native code library 106 through PINI 108. The interaction through PINI 108 can have two forms: downcall 110 (when a platform-independent program calls a native sub-routine) and upcall 112 (when a native sub-routine needs to access data or invoke sub-routines of the platform-independent program). PINI 108 is the only access point to native code library 106 from platform-independent runtime environment 104. In operation, a platform-independent program running in platform-independent runtime environment 104 can make downcall 110 to a sub-routine within native code library 106. In turn, native code library 106 can make an upcall to platform-independent runtime environment 104 to access data and platform-independent sub-routines.
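The two directions of interaction through the PINI can be illustrated with a small in-process model. This is only a sketch: the names `Pini`, `downcall`, `get_field`, and `native_sum_fields` are hypothetical and do not correspond to any real native interface, and the "native" sub-routine is ordinary Python standing in for compiled code.

```python
class Pini:
    """Hypothetical stand-in for the platform-independent native interface."""

    def __init__(self, native_library):
        self._native = native_library  # maps names to native sub-routines
        self.log = []                  # records the direction of each call

    def downcall(self, name, obj):
        # Platform-independent code calls into a native sub-routine.
        self.log.append(("downcall", name))
        return self._native[name](self, obj)

    def get_field(self, obj, field):
        # Upcall: native code reads data owned by the runtime.
        self.log.append(("upcall", field))
        return obj[field]


def native_sum_fields(pini, obj):
    # Stand-in for a native sub-routine; it reaches runtime data only
    # through the PINI, never directly.
    return pini.get_field(obj, "x") + pini.get_field(obj, "y")


pini = Pini({"sum_fields": native_sum_fields})
result = pini.downcall("sum_fields", {"x": 2, "y": 3})
```

Here one downcall triggers two upcalls, and every crossing between the two kinds of code passes through the single interface object.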
Preventing native code from violating certain safety policies while it executes in the same address space as the platform-independent code has been the focus of several research projects. Descriptions of relevant research projects can be found in the following references: "Efficient software fault isolation" (Wahbe, R., Lucco, S., Anderson, T., and Graham, S., 14th ACM Symposium on Operating Systems Principles, Asheville, N.C., December 1993) describes augmenting native code with safety-enforcing software checks. "Safe Kernel Extensions without Runtime Checking" (Necula, G., and Lee, P., Proceedings of the Second Symposium on Operating Systems Design and Implementation, Seattle, Wash., 1996) describes statically analyzing native code and proving it to be memory safe. "TALx86: A Realistic Typed Assembly Language" (Morrisett, G., Crary, K., Glew, N., Grossman, D., Samuels, R., Smith, F., Walker, D., Weirich, S., and Zdancewic, S., Proceedings of ACM SIGPLAN Workshop on Compiler Support for System Software, Atlanta, Ga., May 1999) describes designing a low-level, statically typed target language for compiling native code.
While the methods used in these research projects have been successful to a point and are useful in some circumstances, their usefulness for addressing problems with an arbitrary native library is rather limited. Augmenting the native code with safety-enforcing software checks can incur a substantial performance penalty, which is difficult to accept when considered in conjunction with the fact that the native code is often used as a performance-boosting mechanism. Statically analyzing the native code and proving that it is safe requires the availability of the source code for the native code and the generation of formal proofs of correctness, which is difficult or impossible.
Most platform-independent systems contain a mix of native code: native code compiled from bytecode, native code that is part of the platform-independent virtual machine (PIVM) runtime and interpreter, native code that is part of the core libraries, and, optionally, user-specified native code. Most of this native code is logically part of the PIVM runtime; it is designed, implemented, and tested by the developers of the particular implementation of the PIVM and is totally under their control. User-specified native code, in contrast, has not been subjected to the same rigor and, therefore, is subject to a multitude of problems.
Native code is usually thought of as being written against two interfaces: the PINI, which is its sole interaction with the PIVM and platform-independent application, and the host operating system interfaces involving the usual libraries for input/output (I/O), threading, math, networking, and the like. The host operating system interface is also the interface against which the PIVM is written, and therein lies a problem. The PIVM has to make certain decisions regarding the use of the host operating system interface and of available resources. For example:
Signal handlers may need to be instantiated to handle exceptions that are part of the operation of the PIVM (e.g., to detect null pointer and other memory exceptions, to detect arithmetic exceptions, to detect an interrupt signal, etc.).
The PIVM must choose a memory management regime (involving such things as malloc/free and mmap/munmap) for its own purposes, including the allocation of thread stacks and red zones.
Platform-independent threads are typically mapped onto the underlying system's threading mechanism, and a convention is adopted to suspend and resume threads for garbage collection (GC), to assign threads to GC and compilation tasks, etc.
The PIVM must decide how to manage I/O (e.g., the use of blocking or non-blocking calls).
The core classes automatically take care of freeing some system resources (e.g., closing open file descriptors); this policy does not extend to the very same resources used exclusively by native code.
Few, if any, of these mechanisms are composable, in the sense that it is not possible to take two arbitrary native programs, which use the PINI and the host operating system interface, put them together into one process, and expect the resulting system to work correctly. So, in reality, the user-specified native code has to be written to a set of implicit interfaces that do not conflict with the way the PIVM uses system resources. These implicit conventions are rarely documented (because they are highly dependent on the implementation decisions within the PIVM, which are subject to frequent change, and are usually thought of as private to the PIVM), and do not have to be common across even the same vendor's PIVMs on the same platform, much less PIVMs on differing platforms and certainly not across different vendors' PIVMs. Furthermore, it is rare that legacy libraries will respect these conventions: the economics of amending these libraries to respect these conventions are prohibitive (e.g., source code to the libraries may not be available to either the vendor of a particular implementation of the PIVM or to the customer using the library). Hence, it may be impossible to use certain libraries from platform-independent applications, or the usability may change with new releases of the PIVM.
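Signal handling gives a concrete instance of this lack of composability: a process has one handler slot per signal, so whichever component installs its handler last silently displaces the other. The sketch below uses Python's standard `signal` module as a stand-in for the host operating system interface; `runtime_handler` and `native_handler` are hypothetical names, and the code assumes it runs on the main thread, which is where Python permits handlers to be installed.

```python
import signal


def runtime_handler(signum, frame):
    """Stand-in for a handler the PIVM installs for its own operation."""


def native_handler(signum, frame):
    """Stand-in for a handler installed by a user-supplied native library."""


original = signal.getsignal(signal.SIGINT)

# The runtime starts up and installs its handler.
signal.signal(signal.SIGINT, runtime_handler)

# A native library is loaded later and installs its own handler.
# signal.signal() returns the handler that was just displaced.
displaced = signal.signal(signal.SIGINT, native_handler)

# Only one handler per signal exists in the process; the runtime's is gone.
runtime_still_installed = signal.getsignal(signal.SIGINT) is runtime_handler
displaced_was_runtime = displaced is runtime_handler

# Restore the previous state (getsignal() may report None for handlers
# that were not installed from Python, so fall back to the default).
signal.signal(signal.SIGINT, original if original is not None else signal.SIG_DFL)
```

Nothing in the second `signal.signal()` call warns the runtime that its handler has been replaced, which is why such conflicts tend to surface only as later, unrelated-looking failures.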
These problems are exacerbated by so-called “PIVM embedding” as discussed in A Case for Embedding the JVM into Apps. (Morganthal, J., Internetweek, Jun. 22, 1998, Issue 720) in which the PIVM is treated as a library that can be linked into other applications. In this scenario, it cannot even be mandated that the PIVM be in some way “in charge”, because it may be subservient to another application. The issue here is that the PIVM becomes both the provider of functionality (as an embedded service) and the client of functionality (when calling native code) and is expected to control native code loaded by itself and at the same time not to interfere with the way the embedding application uses system resources, system interface, etc.
The very same problems are bound to plague emerging multitasking PIVMs. While the proposed approaches and techniques enabling multitasking in the PIVM vary, one theme is common to all of these efforts: the assumption that no user-supplied native code is run by any of the tasks. This is so because any undesirable operation caused by user-supplied native code (e.g., corrupting memory of other tasks or of the runtime, calling exit( ), or changing signal handlers previously set up by the runtime) can cause a crash of or otherwise jeopardize not only its own task but the whole PIVM, including all the other tasks. Unless a comprehensive approach is found to contain various aspects of the damage runaway native code can cause, native code may have to be banned from safe-language multitasking systems.
As a result, the reliability of systems based on this combination is less than desired. Composing complex applications from a PIVM and native libraries is therefore something of a hit-or-miss proposition.
The PIVM needs some system resources (file descriptors, memory, etc.) to perform its essential functions. It can manage these resources when they are required only in support of platform-independent code, because the PIVM, in its role effectively as an ersatz operating system (OS), mediates between the platform-independent application and the underlying resource provider, namely the underlying OS. However, when arbitrary native code coexists with the PIVM, the PIVM cannot expect to always find resources available. For example, native code could use up the remaining file descriptors, causing failure in the PIVM when it needs to access a file. This happens not only when a platform-independent application opens a file, but also in support of internal operations, such as error logging, class loading, managing memory, etc.
While it is possible to implement the PIVM to cope with resource starvation at arbitrary moments, this level of defensiveness requires the PIVM to pre-allocate all it needs for its essential operations, artificially inflating the application's usage of system resources. It is also extremely difficult to write and test the PIVM code that must deal with resource starvation.
When there is a problem in the interaction between the native code and the PIVM, debugging can be a nightmare. Simple bugs in native code can cause PIVM data structures to be corrupted, leading to random failures long after the problem has occurred. If these bugs have time-varying pathologies, they can manifest themselves in arbitrary places within the PIVM.
For the purposes of fault isolation, it would be desirable if native code bugs were clearly identifiable as such: this would at least save considerable effort. Using techniques enforcing safety at the level of binaries can lead to a clear verdict on whether a particular memory safety violation is caused by a user-supplied native library or by some other part of the runtime. Bugs resulting from conflicting use of system resources are much harder to find. Moreover, the road from detection to finding an actual cause can be a long one, especially when no (or no good) tools for mixed-mode debugging exist for a given hardware/OS/PIVM implementation platform.
What is needed is a method and an apparatus that allows a safe program to use sub-routines in a native code library without incurring the problems mentioned above.
One embodiment of the present invention provides a system that facilitates automated isolation of native code within a computer program that has been compiled to a platform-independent code. The system operates by receiving a library containing a native code sub-routine that provides a service to the computer program. The system analyzes the library to determine the symbol name for the native code sub-routine. For each native code sub-routine exported by the native library, the system generates a proxy sub-routine that forms a link to that native code sub-routine. This proxy sub-routine is placed into a new library using the original name of the native code sub-routine. The system runs the native code sub-routine in a first process, and executes the platform-independent code in a separate, second process. The system invokes the native code sub-routine in the first process by calling the proxy sub-routine from the platform-independent code in the second process.
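The steps in this embodiment can be sketched end to end in a few lines. The sketch rests on several assumptions that are not part of the description above: Python's `math` module stands in for the native library, a child Python interpreter stands in for the process hosting the native code, and a JSON-lines protocol over pipes stands in for the communication channel. The helper names `exported_names` and `make_proxy_library` are hypothetical.

```python
import json
import math
import subprocess
import sys
import types

# Code run in the separate process that hosts the "native" library.
# The JSON-lines protocol is an assumption made for this sketch.
SERVER = r"""
import json, math, sys
for line in sys.stdin:
    request = json.loads(line)
    result = getattr(math, request["name"])(*request["args"])
    print(json.dumps({"result": result}), flush=True)
"""


def exported_names(library):
    """Analyze the library and return the names of its callable symbols."""
    return [name for name in dir(library)
            if not name.startswith("_") and callable(getattr(library, name))]


def make_proxy_library(library):
    """Build a new library whose entries reuse the original symbol names."""
    child = subprocess.Popen(
        [sys.executable, "-c", SERVER],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )

    def make_proxy(name):
        def proxy(*args):
            # Copy the call arguments to the other process ...
            child.stdin.write(json.dumps({"name": name, "args": args}) + "\n")
            child.stdin.flush()
            # ... and hand the result value back to the caller.
            return json.loads(child.stdout.readline())["result"]
        proxy.__name__ = name
        return proxy

    proxies = {name: make_proxy(name) for name in exported_names(library)}
    return types.SimpleNamespace(_child=child, **proxies)


proxy_math = make_proxy_library(math)
root = proxy_math.sqrt(16.0)    # runs math.sqrt in the child process
power = proxy_math.pow(2, 10)   # runs math.pow in the child process

# Shut the child down by closing its input, then reap it.
proxy_math._child.stdin.close()
proxy_math._child.wait()
proxy_math._child.stdout.close()
```

Because each proxy carries the original symbol name, a caller written against `math.sqrt` can be pointed at `proxy_math.sqrt` without modification. The sketch omits error handling and supports only arguments and results that JSON can represent.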
In one embodiment of the present invention, the system provides a proxy platform-independent native interface (PINI) to the library containing the native code sub-routine. The system transparently transforms local PINI calls into calls to the proxy PINI. Transforming local PINI calls into calls to the proxy PINI preserves the original control flow such that a PINI upcall will be executed by the same thread that called the native sub-routine, and, conversely, subsequent downcalls to the native method will be guaranteed to be executed by the same thread of the process that is executing the native method.
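One way to preserve this control flow is for the proxy to keep servicing incoming upcalls on the calling thread while it waits for the downcall to return. The sketch below models that idea only; it is an assumption about how such a loop could look, not the mechanism of the embodiment. A Python thread and two queues stand in for the native process and its communication channel, and the names `downcall`, `native_side`, and `upcall_get_length` are hypothetical.

```python
import queue
import threading

to_native = queue.Queue()    # stand-in for the channel to the native process
to_runtime = queue.Queue()   # stand-in for the channel back to the runtime
upcall_threads = []          # records which thread ran each upcall


def native_side():
    """Simulated native process: serves one downcall, issues one upcall."""
    name, arg = to_native.get()
    to_runtime.put(("upcall", "get_length", arg))  # call into the proxy PINI
    length = to_native.get()                       # wait for the upcall result
    to_runtime.put(("return", length * 2))


def upcall_get_length(value):
    upcall_threads.append(threading.get_ident())
    return len(value)


def downcall(name, arg):
    """Proxy: send the call, then serve upcalls on this thread until return."""
    to_native.put((name, arg))
    while True:
        message = to_runtime.get()
        if message[0] == "return":
            return message[1]
        # An upcall arrived while waiting, so the calling thread executes it.
        to_native.put(upcall_get_length(message[2]))


worker = threading.Thread(target=native_side)
worker.start()
caller_thread = threading.get_ident()
result = downcall("double_length", "abcd")
worker.join()
```

The sketch handles a single calling thread; with several concurrent callers, each would need its own channel so that an upcall is routed back to the thread that issued the matching downcall.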
In one embodiment of the present invention, analyzing the library to determine the defined symbol name includes analyzing the library to determine call arguments for the defined symbol name.
In one embodiment of the present invention, analyzing the library to determine call arguments for the defined symbol name is accomplished at runtime by analyzing the current call frame.
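As a loose analogy for recovering call arguments from the current call frame, the following sketch reads a caller's arguments out of its Python stack frame with the standard `inspect` module. The embodiment concerns native call frames; Python frames are used here only as a stand-in, and `capture_call` and `draw_line` are hypothetical names.

```python
import inspect


def capture_call():
    """Return the caller's name and argument values read from its frame."""
    frame = inspect.currentframe().f_back
    info = inspect.getargvalues(frame)
    return frame.f_code.co_name, [info.locals[name] for name in info.args]


def draw_line(x0, y0, x1, y1):
    # A generic proxy body: it never names its own arguments explicitly.
    return capture_call()


captured = draw_line(0, 0, 3, 4)
```

The same `capture_call` helper works unchanged for a proxy of any signature, which is the property that makes frame inspection attractive when argument lists are not known ahead of time.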
In one embodiment of the present invention, the system copies call arguments from the proxy sub-routine to a call to the native code sub-routine.
In one embodiment of the present invention, the system returns a result value from the native code sub-routine to the proxy sub-routine.
In one embodiment of the present invention, operations in the first process are isolated from memory and other system resources belonging to the second process so that an error in the first process does not corrupt memory belonging to the second process.
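The effect of this isolation can be observed with ordinary operating system processes. In the sketch below, which is illustrative only, a child Python interpreter stands in for the process running faulty native code and deliberately aborts; the variable names are hypothetical.

```python
import subprocess
import sys

# Stand-in for faulty native code: it aborts the process it runs in.
FAULTY = "import os; os.abort()"

runtime_state = {"heap": [1, 2, 3]}  # data owned by the runtime's process

# Run the faulty code in its own process and wait for it to terminate.
child = subprocess.run([sys.executable, "-c", FAULTY],
                       stderr=subprocess.DEVNULL)

native_failed = child.returncode != 0          # the failure is attributable
runtime_intact = runtime_state == {"heap": [1, 2, 3]}
```

The calling process keeps running after the child aborts, and the nonzero exit status identifies the failure as originating in the process that hosted the native code.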
In one embodiment of the present invention, the proxy sub-routine and the native code sub-routine communicate through inter-process communication.
In one embodiment of the present invention, forming the link to the native code sub-routine includes translating a data element from a first address width in the computer program to a second address width in the native code sub-routine.
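One conceivable way to bridge two address widths is a handle table kept by the proxy, so that a 64-bit address on one side is represented by a 32-bit handle on the other. The description above does not specify a translation scheme, so the following is purely an assumption for illustration; `HandleTable`, `narrow`, and `widen` are hypothetical names.

```python
import struct


class HandleTable:
    """Maps 64-bit addresses to 32-bit handles and back (hypothetical)."""

    MAX_32 = 0xFFFFFFFF

    def __init__(self):
        self._to_handle = {}
        self._to_address = {}

    def narrow(self, address):
        """Return a 32-bit handle for a 64-bit address; null maps to null."""
        if address == 0:
            return 0
        if address not in self._to_handle:
            handle = len(self._to_handle) + 1  # 0 is reserved for null
            if handle > self.MAX_32:
                raise OverflowError("out of 32-bit handles")
            self._to_handle[address] = handle
            self._to_address[handle] = address
        return self._to_handle[address]

    def widen(self, handle):
        """Return the 64-bit address that a 32-bit handle stands for."""
        return 0 if handle == 0 else self._to_address[handle]


table = HandleTable()
address = 0x7FFF00001234ABCD          # does not fit in 32 bits
handle = table.narrow(address)

wire_32 = struct.pack("<I", handle)   # 4 bytes on the 32-bit side
wire_64 = struct.pack("<Q", address)  # 8 bytes on the 64-bit side
round_trip = table.widen(struct.unpack("<I", wire_32)[0])
```

A table of this kind avoids truncating addresses, at the cost of keeping per-address state in the proxy for as long as the handle may be used.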