Engineers who build machines made of atoms (rather than of software) rely on locality of causality to make machines mostly safe in the presence of failure or attacks: cars have a firewall between the engine and the driver; houses have walls and a lockable door between the inside and the outside. However, computer hardware engineers have worked very hard to eliminate all locality of causality within a computer: that is, on a modern computer, within any given process, any instruction can access any data in the entire address space of the process. Hardware engineers did this because giving the software engineers freedom to use any instruction to access any data makes it very easy to write programs that do what you really want; however having this much freedom also makes it very easy to write programs that do what you really do not want. Although software engineers separate programs into modules (code that exclusively maintains the invariants of its data), they lack appropriate fine-grain hardware primitives with which to efficiently implement enforcement of this separation. This state of affairs contributes to the problem that “machines made of software” (programs) tend to be much less reliable than machines made of atoms.
1.1 Software Correctness Generally
The punishing exactitude and overwhelming complexity of computer programs make the task of writing correct software almost impossible. Further, the stakes are high: we need only cite the title of a 2002 NIST study: “Software Errors Cost U.S. Economy $59.5 Billion Annually: NIST Assesses Technical Needs of Industry to Improve Software-Testing.” Due to software bugs (a) organized crime controls millions of computers, (b) large infrastructural projects are delayed or fail, and (c) people even die. The problem is that one can never do enough testing to ensure program correctness—something else is badly wanted.
Programmers follow certain disciplines designed to reduce mistakes, a common one being “modularity”—a software embodiment of locality of causality mentioned above: programs are separated into parts called “modules” where each module has its own data together with code to manage it. Further, to ensure correctness, the module's code is written in such a way as to maintain certain “data invariants”: properties of the module data which are always true. Some modules manage multiple instances of their data's state, each instance sometimes called an “object” and the module the “class” of the object. While this modularity discipline works well, current computer hardware systems do not protect a module within a program from any possibly errant or malicious behavior of other modules that may violate the module's boundaries; see FIGS. 1 and 2 for examples of one module attacking the data of another. Therefore all modules are vulnerable to the threat of a single mistake, or a deliberate attack, from any one module: the correctness of the whole program is extremely brittle.
1.1.1 Static Analysis
Even when all of the authors of a program are cooperating, even basic partial correctness properties of a program are hard to ensure. There is a sub-field of Computer Science called “static analysis” which amounts to using an analysis program to read in a target program as its input and then attempt to ensure that no matter what input is given to the target program when it runs, certain kinds of errors cannot occur. (This kind of analysis is called “static” because it only examines the program, in contrast to a “dynamic” analysis which examines the program running with a particular input.)
Static analysis of program is quite difficult. It can be made much easier if the hardware upon which the programs run restricts the potential behavior of the program. Putting code into modules and constraining code to operate only on data of its own module is one way to do that. Constraining code that has access to the super powers of kernel mode (see below) is another.
1.2 Modern Computers Generally
Modern microprocessors are organized in a fairly standard way. A very readable and thorough reference on this topic is Randal E. Bryant and David R. O'Hallaron “Computer Systems: A Programmer's Perspective” Prentice Hall 2003; for brevity we refer later to this reference as “BO-2003”. At a high level of abstraction, a single-core microprocessor consists of a central processing unit 031, a random access memory 035, and peripherals.
The “central processing unit” (CPU) performs one of a fixed set of actions one after another according to the instructions of a program, much as a very reliable, tireless, obedient, and utterly unimaginative person might theoretically follow a detailed set of instructions. The CPU has a small amount of scratch space called “registers”; typically there are on the order of 100 or fewer registers to a CPU. Further, some registers (“CPU status registers”) hold special CPU “status” bits or fields which indicate the current mode of the CPU (such as user vs kernel mode; see below) or other properties of the current state of execution. Some registers are exposed to the program; others are for internal use and can be used for purposes such as saving internal temporaries for the duration of an instruction.
The “random access memory” (RAM) is a passive device which maps (1) an “address” to (2) a “datum” stored in a cell at that address, much as cubbyholes on a wall map each cubbyhole's number to the cubbyhole's contents. (While RAM size is also fixed, it is typically on the order of a billion (1 Gigabyte) cells.) The CPU may either:    (1) “load”: read information to a register from a memory cell at a given address, or    (2) “store”: write information from a register to a memory cell at a given address.
The computer's CPU/RAM core is also connected to “peripherals”: external devices enabling interaction with the outside world, such as disk drives, displays, keyboards, mice, etc. To allow a program to interact with these devices, the hardware has either (1) special instructions for sending data to or receiving data from them, or (2) “memory-mapped I/O”: special RAM cells repurposed by the hardware such that writing or reading from these cells interacts with the device (rather than storing the data, as RAM cells would usually do).
A computer is typically designed to move several bits around together in a block, often called a “word”. A computer is characterized by the number of bits in its word, its “word size”, much as an engine is characterized by the total volume of its cylinders. Typical modern computers have 32-bit or 64-bit words and are therefore referred to as “32-bit machines” or “64-bit machines” respectively. For specificity we speak below of a 32-bit machine but nothing prevents the same ideas from application to machines of other word sizes, such as 64-bits.
1.2.1 Software
Information stored in RAM cells can be interpreted as either “data” or as “program”, as follows. There is one special CPU register called the “program counter” (PC) which contains an index into RAM where the next instruction to be followed by the CPU is held. The operation of the computer typically works as follows to “execute” a program:    (1) load the contents of the RAM cell pointed to by the PC,    (2) interpret that data as an “instruction” and follow that instruction,    (3) increment the PC (unless the instruction set it to a new value),    (4) repeat.
“Instructions” to the CPU are typically of one of the following kinds: (a) a data “access”, which is either a “read” (or “load”) of data from RAM into a CPU register, or a “write” (or “store”) of data from a CPU register into RAM, (b) a logical, fixed-point-arithmetic, or floating-point-arithmetic operation on two registers, or (c) a “jump/branch” which sets the PC to a new value, sometimes only if a certain register has a certain value. Such collections of instructions are called “programs” as opposed to the “data” on which they operate. Instructions plus data is called “software”, bits, as opposed to the “hardware”, a machine made of actual atoms, which interpret the software.
Writing and maintaining programs at the low abstraction level of these very small steps tends to be tedious, error prone, and mind-numbing. Therefore, programs are typically written in higher-level “programming languages” providing more useful constructs with which to construct programs. One of the most useful constructs is the “function”: a re-usable sub-program; a function has an “interface” specifying the format and meaning of data “argument(s)” passed as input and “return value(s)” obtained as output (note that hardware instructions and operations can also take “argument(s)”). Programs written in these higher-level languages are translated into executable machine instructions by a special program called a “compiler”. Even after a function is compiled however, the resulting low-level instructions often occur in a contiguous block or can be made to occur in a contiguous block by the compiler, if desired.
1.2.2 Multi-Processing and the Kernel
Special software called the “kernel” runs in a special CPU mode called “kernel mode” which gives it extra powers over normal “user mode”: (1) some data can only be accessed in kernel mode, sometimes indicated by annotating that data with a SUP (supervisor) bit (see FIG. 5, discussed further below), (2) some instructions must be run in kernel mode or have more powers when run in kernel mode, (3) I/O (input/output) devices typically must be accessed in kernel mode (either through special instructions or by a technique called “memory-mapped I/O” where accesses to certain memory addresses are intercepted and interpreted by hardware as access instructions to particular devices); if an instruction attempts to violate these constraints, the CPU faults.
A “multi-processing” computer can run more than one program at once, where each instance of a running program is called a “process”. The kernel uses its powers to manage processes, such as putting them to “sleep” when a resource is requested and “waking” them up again when that resource is available.
Much like a city government, the kernel (mayor) coordinates with further special “software libraries” and “utility programs” (public servants) to: (a) provide commonly-needed but often messy utility services for the processes (citizens), such as interfacing to a particular kind of disk drive, and (b) protect the processes from each other (more on this below). Taken together the kernel and these utility libraries and programs are called the “operating system” (OS) (the city government in our metaphor). Users ask for services using a special hardware instruction called a “system call” or “kernel crossing”.
Whereas the kernel, just like a government, is the only agent with the power to take certain special actions, the kernel can take actions at the request of user processes if it determines that the user is allowed to take the action. That is, the hardware will allow certain operations only when in kernel mode, however these operations may be “wrapped” with a system call to allow the user to request the kernel to do the operation; one example is intermediating between user programs and hardware peripherals.
Further it is important to note that, just as in real life, asking the government to do something for you is slow; that is, for a user program to do a system call/kernel crossing is much slower (usually at least an order of magnitude slower) than for a user function to simply call another user function. Therefore reducing the number of kernel calls in a program is an important efficiency concern.
1.2.3 Operating System
The kernel is sometimes called “the executive”. This usage continues our metaphor of the computer as a city: just as not all of the services provided by a city are provided directly by the office of the executive, not all of the services that users expect to have provided run in kernel mode. The entirety of the services provided to the user is called the “operating system”; it consists of (1) the kernel executive plus (2) other “trusted system software/programs” which provide services but do not run in kernel mode.
One of these trusted system programs is called the runtime “linker-loader”: it loads programs into memory from the disk and prepares them to run. Modern formats for encoding executable programs include facilities for annotating parts of programs with meta-data; some of these annotations can be interpreted as instructions to the linker-loader. For simple programs, there may not be much to do, but some programs link in other modules at runtime, and so this process can be complex in that case. One service that the Java linker-loader provides is running a “verifier” that checks if the code that is about to be run adheres to certain rules which allow that code to be safely run even if that code is untrusted.
1.3 Memory Management Generally
Globals: A program that needs only a fixed amount of memory during its run can allocate all of that state in one place at the start; such state is called “global” state (it is globally associated with the whole program) and is the first of three separate parts into which a process's memory is organized.
Stack: A particular function of a program needs its own local/temporary memory, called its “frame”. A “caller” function, ƒ, may invoke a “callee” function, g, to solve a sub-problem; during the execution of g, the execution of ƒ is suspended. The frame of memory for the execution of g is allocated immediately below (typically) that of ƒ, and when g is done, the memory space that was g's frame may be re-used by a later call. That is, the frames “push” on and “pop” off, like a stack of plates, and so this second part of memory is called the “stack”. Note that since each function call has its own frame, a function ƒ may even call itself and the operation of the two instances of ƒ do not mutually interfere. Some special registers help the program stay located within the stack: (a) sometimes a “frame-pointer” is used to point to the top of the temporaries of the current frame (typically where the arguments stop and the temporaries start), (b) usually a “stack-pointer” is used to point to the “top” of the stack, that is to the next free word (or last used word) on the stack. By convention, the stack usually grows down in memory, leading to the potentially confusing invariant that the “top” of the stack has the “lowest” address and in particular is lower than the “top” of the frame.
Heap: Sometimes a program requires long term “data-structures” (that need to last longer than a stack frame) also do not fit into the fixed-sized global state. There is an area of memory managed by a system “memory allocator” library to which a program can make a request to have a specific amount of contiguous addresses or “space” reserved or “allocated” for a particular use. The library finds some available unused space and returns its initial address called a “pointer to” the space. Once in use for a specific purpose the space is typically called an “object”. When an object is no longer needed it can be “deleted” or “freed” for re-use by making a different call to the same memory allocator library. This third part of memory where such objects are allocated and freed has no simple organizational structure and is called the “heap”. When accessing the data pointed to by an object by a pointer, we are “accessing (reading or writing) through” the pointer.
1.3.1 Virtual Memory
A problem arises in that there is sometimes not enough physical memory 035 to store all of the data of all of the running processes. The usual solution is a scheme called “virtual memory”. Quoting [BO-2003, Section 10.1 “Physical and Virtual Addressing”] (Note that any and all editing is in square brackets; emphasis of non-square-bracket text is in the original):                [M]odern processors designed for general-purpose computing use a form of addressing known as virtual addressing. (See [FIG. 3] [which is a copy of [BO-2003, FIG. 10.2]).                    With virtual addressing, the CPU accesses main memory by generating a virtual address (VA), which is converted to the appropriate physical address [043] before being sent to the memory. The task of converting a virtual address to a physical one is known as address translation . . . . Dedicated hardware on the CPU chip called the memory management unit (MMU) translates virtual addresses on the fly, using a look-up table stored in main memory whose contents are managed by the operating system.                        
1.3.2 The Memory Hierarchy
Thus the MMU 033, in cooperation with the operating system, stores some of the data from virtual RAM on physical RAM 035 and the rest on an external disk drive 046. Any process requesting access to data that is actually on disk is paused, the data is brought in (often requiring other data to be sent out), and then the process re-started. To support this feature, memory is grouped into “pages” that are moved in and out as a whole. Pages may be of different sizes, but in current practice 4-kilobytes is typical and for specificity we speak of this as the page size, though other sizes will work. The external device that stores the pages that are not in RAM is called the “swap” device.
We can see at this point that there are many kinds of memory, some with fast access and small capacity, some with slow access and large capacity, and combinations in between. These kinds of memory are arranged in “layers”, the fast/small layers used when possible and the slow/large layers used when necessary, as follows. (1) Most CPU instructions use CPU registers, access to which is very fast. (2) When the registers are full, the program resorts to using RAM, which is slower, but much larger. RAM actually (usually) has at least two layers: (2.1) small amounts of fast memory where frequently-used RAM address/data pairs are stored called the “cache”, and (2.2) normal RAM. Moving data between the cache and RAM is (usually) handled by the hardware. (3) As described above, when RAM is full, the operating system resorts to using a swap disk, which has huge capacity but is far slower still. (4) Some people may still back up their disks to tape. This whole system is called the “memory hierarchy”.
1.3.3 Page Tables and Page Meta-Data
The MMU and/or OS clearly must track which virtual pages map to which physical pages or disk blocks. That is, for each page of data, “meta-data”, which is data about data, is kept. Quoting [BO-2003, Section 10.3.2 “Page Tables”]:                Figure [FIG. 4] [which is a copy of [BO-2003, FIG. 10.4]] shows the basic organization of a page table. A page table is an array of page table entries (PTEs). Each page in the virtual address space has a PTE at a fixed offset in the page table. For our purposes, we will assume that each PTE consists of a valid bit and an n-bit address field. The valid bit indicates whether the virtual page is currently cached in DRAM. If the valid bit is set, the address field indicates the start of the corresponding physical page in DRAM where the virtual page is cached. If the valid bit is not set, then a null address indicates that the virtual page has not yet been allocated. Otherwise, the address points to the start of the virtual page on disk.                    The example in [FIG. 4] shows a page table for a system with 8 virtual pages and 4 physical pages. Two virtual pages (VP 1, VP2, VP4, and VP7) are currently cached in DRAM. Two pages (VP 0 and VP 5) have not yet been allocated, and the rest (VP 3 and VP 6) have been allocated but are not currently cached.                        
1.3.4 Process Address Spaces
Another problem arises in that if all of these application processes use the same RAM it is difficult for them to cooperate in such a way as to not write on each other's data. The virtual-memory solution is for the operating system and hardware to present an illusion (or abstraction) that each process is the only process running on the computer and has all of RAM to itself; this abstracted RAM is the process's “(virtual) address space”. Quoting [BO-2003, Section 10.4 “VM as a Tool for Memory Management”]:                To this point, we have assumed a single page table 050 that maps a single virtual address space to the physical address space. In fact, operating systems provide a separate page table, and thus a separate virtual address space, for each process.        
Note however that sometimes multiple “lightweight processes” or “threads” are run in the same address space even on a machine that also runs processes in separate address spaces. One common design is that the kernel/operating system also manages these threads and another design is that user-mode (not kernel) “thread manager” software within a process manages them.
1.3.5 Memory Protection
Virtual memory thus prevents processes from accidentally or deliberately overwriting each other's data or that of the operating system itself. This protection aspect of virtual memory has become quite important. Quoting [BO-2003, Section 10.5 “VM as a Tool for Memory Protection”]:                Any modern computer system must provide the means for the operating system to control access to the memory system. A user process should not be allowed to modify its read-only text section [that is, its executable program code]. Nor should it be allowed to read or modify any of the code and data structures in the kernel. It should not be allowed to read or write the private memory of other processes, and it should not be allowed to modify any virtual pages that are shared with other processes, unless all parties explicitly allow it (via calls to explicit interprocess communication system calls).                    As we have seen, providing separate virtual address spaces makes it easy to isolate the private memories of different processes. But the address translation mechanism can be extended in a natural way to provide even finer access control. Since the address translation hardware reads a PTE each time the CPU generates an address, it is straightforward to control access to the contents of a virtual page by adding some additional permission bits to the PTE. Figure [FIG. 5] [which is a copy of [BO-2003, FIG. 10.11]] shows the general idea.            In this example [FIG. 5], we have added three permission bits to each PTE. The SUP 051 bit indicates whether processes must be running in kernel (supervisor) mode to access the page. Processes running in kernel mode can access pages for which SUP is 0. The READ 052 and WRITE 053 bits control read and write access to the page.            If an instruction violates these permissions, then the CPU triggers a general protection fault that transfers control to an exception handler in the kernel. Unix shells typically report this exception as a “segmentation fault”.                        
As you can see, prior art systems usually partition pages into (a) “text” (or executable program code) and (b) “data”. After the program has been loaded into memory, text pages are marked to be executable and read-only by setting the permissions bits in the page table; similarly data pages are usually marked to be non-executable and read-write, though read-only data is possible.
1.3.6 No-Execute (NX) Bit
An operating system with support for the NX bit may mark certain areas of memory as non-executable. The processor will then refuse to execute any code residing in these areas of memory. The general technique, known as executable space protection, is used to prevent certain types of malicious software from taking over computers by inserting their code into another program's data storage area and running their own code from within this section. Intel markets the feature as the XD bit, for eXecute Disable. AMD uses the name Enhanced Virus Protection. The ARM architecture refers to the feature as XN for eXecute Never; it was introduced in ARM v6. While such a system does annotate an entire area of memory as non-executable, it does not allow (1) modification of the operation of only specific kinds of instructions, nor (2) modification of the operation of instructions while they continue to operate.
1.4 Control Flow Generally
Recall that the program counter register points to an address on a text page. At each memory cycle, the data pointed to by the program counter is loaded into the CPU and interpreted as an instruction to be executed by the CPU. The default behavior is that after the instruction is executed, the program counter is incremented by the size of the instruction and the process repeats. Recall further that some instructions can change the program counter so that the “control flow” of execution is transferred to another part of the program. See BO-2003, Section 3.6 “Control”.
The “control transfer” caused by such a “control flow” instruction may be conditional upon the value of another register. If such a change is unconditional it is usually called a “jump” and if conditional, it is usually called a “branch”. Another kind of control transfer is the function/procedure call, detailed in the next section. A function call is initiated by a “call” instruction (in the “caller” function) and terminated by a “return” instruction (in the “callee” function).
How the control flow instruction should modify the program counter in the event of a transfer is specified by a target argument to the control flow instruction. This target argument to a jump, branch, or call instruction can be a constant embedded into the text page; in this case the control transfer is called “static”; this static argument may be specified as an absolute address or simply as an offset to the current address. Alternatively, the argument to a jump, branch, or call instruction can specify a register containing a value to be used as the new program counter; in this case the control transfer is called “dynamic”. (Note: we mean the target argument to the hardware instruction; software function calls also take function arguments; that is, a call instruction initiating a function call has a target argument while the function it initiates separately takes function arguments.)
The return instruction is a bit different. The previous matching call instruction (which made the function call from which the return instruction is returning) pushed the return address to which the call should return onto the stack; the return instruction finds that argument on the stack and uses that as its target argument.
1.4.1 Function/Procedure Calls
Programmers find it convenient for a function to be able to (1) suspend its execution in order to call another function to compute some sub-part of the execution and then (2) resume the first function where it left off upon the return from the sub-function. Note that even these sub-functions may be suspended in order to call further sub-sub-functions, and so on. At the point of call, the function suspended is called the “caller” and the sub-function called is called the “callee”. See BO-2003, Section 3.7 “Procedures”.
Recall that each function needs its own space for its variables that is preserved across calls to sub-functions; this space is called a (stack) frame. The suspend-resume nature of function calls makes it very natural to stack these frames so that those of suspended functions remain present in memory as new frames are added (traditionally below in memory) and removed as sub-functions call and return. That is, as sub-functions are called, the top of the stack of frames grows down (traditionally) and as sub-functions return, the top returns upward to where it had been.
Calls to sub-functions often take arguments at their point of call; for example, a function to compute the area of a circle may take the radius as an argument. Arguments are usually passed either in registers or by putting them on the stack where they appear as variables to the called sub-function. After the arguments are ready, the caller function transfers control to the callee function. Transferring control and pushing the return address is so common that there is often a special instruction for doing both called “call”. An additional argument is passed to the callee which is the value of the program counter to which control should return when the callee completes: the “return address”. Returning from a function by jumping to the return address stored on the stack is also often implemented by a special instruction called “return”.
Note that, similarly to a jump or branch, how a call instruction should compute the new program counter to which it transfers control may be specified either (1) statically by an value fixed in the program text or (2) dynamically by the value of data in a register. The difference between static calls and dynamic calls is important as dynamic calls can be quite difficult for a programmer to constrain.
The caller and callee functions must both adhere exactly to a protocol/contract saying how to exchange this information and cooperate in several ways. The first concern is simply how to exchange information; when a caller function calls a callee function, much data is interchanged between them: (1) the arguments to the callee, (2) the address to which the callee should return when done, (3) the return value from the callee. There are at least two other concerns: (a) “caller-save” vs “callee-save”: who is responsible for saving and restoring registers that the callee may wish to use that already contain values important to the caller; (b) “caller-cleanup” vs “callee-cleanup”: who cleans up the temporaries left on the stack by the callee and resets the stack pointer to where it was before the call. The protocol states how these concerns are handled is called the “calling convention”; note that it has both hardware and software aspects.
Another way to implement a function call from a caller to a callee is for the body of the callee to be wholly inserted into the body of the caller at the point of call (also adjusting the renaming of variables), obviating the need for the usual kind of call/return control transfer or for the creation of a stack frame; this technique is called “function inlining” and is often performed by the compiler, but could be performed by a system linker-loader.
1.4.2 Stack Meta-Data
The stack of frames does not only hold temporary values used by the user program; each frame also holds additional data about the stack data. Recall that data about data is called meta-data, so we call this “stack meta-data”. Some meta-data typically stored on the stack:    (1) The “return address”: often the call instruction pushes onto the stack the address to which the subsequent return instruction should return control.    (2) The “frame-pointer”: some systems maintain a register pointing at the top of the temporaries of the current frame called the “frame-pointer”. Some software pushes the frame-pointer on the stack at some point during the calling sequence and restores it from the stack at some point during the return sequence.
1.4.3 Jump Tables
A jump table is an array of function pointers: the table maps an index to a function. The usual usage is that a function has an index indicating a function to be called, it uses a jump table to map the index to a function pointer and then uses a dynamic call instruction to call that function. If an entire page is devoted to holding this array then virtual memory protections can be used to protect the table in various ways, such as by making it read-only.
1.4.4 Exceptional Control Flow
Situations arise that may not have a well-defined behavior, such as dividing by zero or dereferencing a null pointer. A mechanism is provided for handling these, called “trapping to an exception handler”, as follows. Each exception kind has a number. A special exception table base register points at a special jump table mapping each exception kind number to a function called the “exception handler” for that exception kind. When an exception occurs, the hardware maps the exception number through the exception table to get the address of the exception handler and then calls that handler. When this call returns, control is returned either (a) to the instruction at or (b) the instruction after the one that trapped to the exception, depending on the exception kind. See BO-2003, Section 8.1 “Exceptions”, Section 8.2 “Processes”, Section 8.3 “System calls and Error Handling”, and Section 8.5 “Signals”.
1.4.5 Relationship Between Control Flow and the User/Kernel Boundary
Recall that the CPU has a special mode called kernel mode where instructions have the power do to anything on the machine. Only certain software runs in kernel mode; this software is called “the kernel”. Most of the time programs are running in user mode which constrains them so that they cannot take actions that could harm other programs, such as writing in the virtual address space of other programs or performing input or output. These actions are “dangerous”: they can only be performed in kernel mode. Thus, if a user mode program wants to take one of these actions, it must make a call to the kernel to ask for the action to be taken on its behalf. See BO-2003, Section 8.2.3 “User and Kernel Modes”.
The kernel data and functions live in a special area of memory that cannot even be called into when in user mode. This restriction is to protect kernel data and to ensure that kernel functions cannot be attacked by being called anywhere other than at the top of a “public” kernel function. Therefore another method must be provided for user code to request services from the kernel.
The common method is to provide “system call instruction” which generates an “system call” exception. The purpose of this instruction is not to handle an exceptional circumstance, but simply to halt the processing of the user function, switch into kernel mode, and call to the top of the system call exception handler. Before making the system call the user program puts the system call number and other arguments in registers or on the stack where they can be found. Note that the kernel has the power to simply switch itself back into user mode when desired. This transfer of control from a user mode program to the kernel is called a “system call” or a “kernel crossing”. (Note that the data of the kernel is within the address space of every process but is marked as accessible only when the CPU is in kernel mode.)
The complexity of this system call mechanism makes system calls an order of magnitude (or more) slower than function calls.
1.4.6 Scheduling
Multiple “processes” can pretend to run at the same time, each in their own virtual address space. In reality, the processes usually take turns using the hardware as there are usually more software processes than hardware CPUs to go around. The kernel uses its powers to manage processes, such as putting them to “sleep” when a resource is requested and “waking” them up again when that resource is available. The part of the kernel that manages which processes run and which sleep waiting for their turn is called the “(process) scheduler”.
It is possible to run multiple “threads of control” within the same address space; these are called “threads” and, similar to processes, have a “thread scheduler”. Since multiple threads are all within the same address space, there must be multiple separate stacks of their function temporaries as well; however, threads usually all share one heap.
1.5 Background—Prior Art
For ease of readability, the system disclosed herein is hereinafter referred to as “Hard Object”.
While reviewing the prior art pertinent to the present Hard Object work, for convenience both similarities and contrasts between the prior art and the Hard Object system are discussed together. Please see “List of Non-Patent Reference Keys”, below, for the meanings of reference keys in square brackets.
1.5.1 Prior Work on Hard Object
The first article on Hard Object [WMHK-2009]. Another evaluation done of Hard Object as a class project [KCWWK-2009].
1.5.2 Intel x86 Segmented Addressing
As mentioned above, many architectures support a means of managing permissions on text and data as organized into pages. The Intel x86 architecture is one such. Quoting [I-2005]:                The concept of privilege for pages is implemented by assigning each page to one of two levels: Supervisor level (U/S=0)—for the operating system and other systems software and related data. User level (U/S=1)—for applications procedures and data . . . . When the processor is executing at supervisor level, all pages are addressable, but, when the processor is executing at user level, only pages that belong to the user level are addressable.        
Virtual memory protection allows operating system and user programs to interact without danger to the operating system. However two different user modules within the same program, and therefore the same virtual address space, are not protected from one another. In contrast the Hard Object system disclosed herein can isolate two modules even if they are in the same address space.
The Intel x86 architecture, [G-2005], also supports a means of managing permissions on text addresses and data addresses as organized into “segments” which manage the association of permissions and privilege levels to both text and data addresses. Quoting [I-2005]:                The concept of privilege is implemented by assigning a value from zero to three to key objects recognized by the processor. This value is called the privilege level. The value zero represents the greatest privilege, the value three represents the least privilege . . . . [T]hese levels of privilege can be interpreted as rings of protection. The center is for the segments containing the most critical software, usually the kernel of the operating system. Outer rings are for the segments of less critical software . . . . The processor automatically evaluates access to a data segment by comparing privilege levels . . . . [A] procedure can only access data that is at the same or less privileged level.        
Note that in this prior art Intel system, there are only four such privilege levels. Further, this restriction to a small number, such as four, is pervasive throughout the design—for example, each privilege level has its own stack—and so generalizing the design by increasing the number of privilege levels seems infeasible. Therefore it seems that this small number of privilege levels may constitute the maximum number of “protection domains” into which the set of modules may be partitioned (however also see a different way of using segments hypothesized below). In contrast a Hard Object system can easily have an arbitrary number of domains.
The levels of these prior art Intel domains are ordered and therefore apparently they cannot be made mutually exclusive, thus members of a domain with stronger privilege will always have access to the data of a domain with weaker privilege; in contrast the Hard Object system disclosed herein can partition domains in a mutually-exclusive way.
In most systems in the event of a function call, arguments are passed from caller to callee on the stack, but in the Intel system when functions call across privilege levels the function arguments must be copied from the stack of one privilege level to the stack of the other. In contrast, due to the Hard Object stack protection mechanism, a call across a protection domain in a Hard Object system requires no such copying.
In the above-cited Intel system, instructions that manage the segment permissions can only be executed in kernel mode; in contrast Hard Object allows any module to transfer “ownership” of memory addresses to another module without the need to run a privileged instruction or make a system call—where “ownership” is a concept introduced below to indicate the right of code to access memory addresses and/or also the right to transfer this right to other code.
In the above Intel system, segments of memory can be marked with permissions (or the absence of permission) such as “read-only” or “executable”; however there are major design differences between Intel segments and Hard Object owner module-ids. An Intel segment is associated with the current CPU state and refers to a range of addresses that may be accessed; therefore when a protection boundary is crossed, instructions must execute to change the segment registers that are the embodiment of this CPU state. In contrast a Hard Object owner module-id is associated with an address itself and selects a subset of program text that may access this address; this owner module-id is checked whenever an instruction accesses an address and therefore in a Hard Object system when a protection boundary is crossed by the program counter no kernel calls need be made.
1.5.3 Intel Itanium Protection Keys
The Intel Itanium architecture [Intel-Itanium-2010] provides a mechanism for protecting memory called “protection keys” [Intel-Itanium-2010, volume 2] that has two components:    (1) Each Page Table Entry is annotated with a field called a “protection key”    (2) The CPU is provided with 16 additional “protection key registers”.
When protection keys are being used, when the CPU attempts to access a page, the hardware checks if the value of the protection key field of the PTE of a page matches the value any of the protection key registers; if not, the access faults; if so they access may be allowed as further mediated by other bits annotating the protection key registers.
Quoting [Intel-Itanium-2010, Section 4.1.3]:                Protection Keys provide a method to restrict permission by tagging each virtual page with a unique protection domain identifier.        
Quoting [Intel-Itanium-2010, Section 5.1.2.1]:                it is the responsibility of the OS to use protection keys and the protection key registers (PKRs) to enforce protection.        
Considering the above quotes, it seems that it must be the case that changing a protection key register requires a system call; unfortunately we cannot find a direct quote in the documentation that states this explicitly, but without this requirement protection keys would not “enforce protection” [emphasis added] of data.
Therefore the way in which software seems to be required to use Intel Itanium protection keys differs considerably from the way software can use Hard Object owner module-ids, as follows. A system call currently costs at least a order of magnitude more than a function call; in contrast, in a Hard Object system, when the control flow moves from the code of one module to that of another, the cost incurred is no more than that of a function call, as, due to the Hard Object hardware mechanism, the data pages that may be accessed change automatically with the movement of the program counter.
Most software tends to exhibit a property where most of the computation time is spent in an “inner loop”; therefore introducing a delay in that inner loop can easily change the performance of the software by an order of magnitude. Should an inner loop of a program cross a module boundary (1) a Hard Object system would still be performant, whereas (2) a system attempting modularity separation using Intel Itanium protection keys could easily lose an order of magnitude in performance due to the cost within the inner loop of the system calls or fault handling required to change either (a) the protection key registers or (b) the protection keys on the data pages being accessed.
1.5.4 Mondriaan Memory Protection
Of particular interest, Mondriaan Memory Protection, [WCA-2002; WA-2003; W-2004], and U.S. Pat. No. 7,287,140 Asanovic et al, attaches meta-data to addresses at the word-granularity using a special hardware “permissions tables”.
Protection Domains
In the Mondriaan design there is a concept of “protection domains”. Each domain has its own “permissions table” (plural permissions tables) which attaches “permission values” meta-data to memory addresses. At any particular time, a single protection domain is active, as indicated by the current value of the Protection Domain ID register. Note that the active permissions table must be swapped out on cross-domain calls. This is a heavyweight activity compared to a traditional function call. The Mondriaan scheme does not provide any specific efficient means to perform this swapping. Quoting [WA-2003]:                We believe CPU designers will be motivated to accelerate cross-domain calls to enable the benefits of protected execution.        
In contrast, Hard Object meta-data refers to specific module-ids and, indirectly, their associated subsets of instruction and data addresses. The program counter changes naturally at a function call as part of the calling process and thus little extra work is required when the call also crosses a protection domain boundary. Said another way, the Mondriaan Memory Protection mechanism requires considerably more state to be changed (in the form of a change from one table to the other, with the potential flushing of corresponding caching structures) as a result of a protection boundary change than does Hard Object.
Stack Protection Mechanisms
The Mondriaan design discloses a method of stack data protection using a “frame base” register and a “stack limit” register [WA-2003, Section 3.3]. The Hard Object design does something possibly similar with slightly different names (“caller-protect” and “stack-limit”). However the Mondriaan mechanism for performing a function call across domains requires the use of a heavyweight mechanism they call “call gates” to pass information from one protection domain to another; it seems that in the Mondriaan design, data cannot even be simply passed on the stack as is traditional and fast in both prior art systems and the Hard Object system. Quoting [WCA-2002, Section 3.8]:                Parameters are passed in registers. More elaborate data structures are passed using a very simplified form of marshalling which consists of the caller traversing the data structure and granting appropriate permission to the provider domain . . . . If two domains call each other frequently, they can copy arguments into buffers which are properly exported.        
In contrast, a cross-domain function call in Hard Object system requires no such call-gate mechanism and allows very fast traditional use of the stack to (1) pass data as arguments and (2) return data as return values on the stack even when the two functions are in mutually untrusting modules.
Ownership and Managing Permissions
The Mondriaan design anticipates the Hard Object rule of allowing only an owner to have the ability to set the owner of the address to another module (be careful reading their articles as they actually they use the word “own” to mean more than one thing; I cite the meaning closest to that of Hard Object). [WCA-2002, Section 3.1]: “Every allocated region of memory is owned by a protection domain, and this association is maintained by the supervisor.” [WA2003]: “Only the owner of a memory region may revoke permissions, or grant ownership to another domain.” Note however that the Mondriaan design requires these actions taken by an owner be done using a kernel crossing: [WA-2003] “The MMP supervisor software can enforce additional memory usage policies because all calls for permissions manipulation are made via the supervisor.” However, in contrast Hard Object does not require a kernel crossing to change the owner of some addresses, as a user-mode hardware instruction is provided for this purpose.
1.5.5 Nozue et al.
Of particular interest, [OSSNMS-1992] and U.S. Pat. No. 5,890,189 Nozue, et al. (which is a continuation of U.S. Pat. No. 5,627,987 Nozue, et al.) propose both a “capabilities” system and an “access control lists” (ACLs) system for protecting data pages.
Protection Regions Using Hardware Text Ranges
The Nozue ACLs system associates data pages and text pages that can read and write them, similar to Hard Object, as well as providing other functionality. While the Nozue design seems to contain hardware features that would provide to software the same functionality as the Hard Object owner module-id functionality—though not the Hard Object user-mode ownership transfer feature nor the user-mode integrity bit—the Nozue design contains more hardware complexity than would be needed by software designed for Hard Object hardware. For example, the Nozue design calls for a PTE to contain three access control entries and a pointer to further entries, allowing the construction of an arbitrarily-large linked list of entries which must be read at each access check. In contrast, in the Hard Object design only requires a constant amount of state to be annotated onto a PTE and further all Hard Object checks read only a constant amount of data during the check.
Ownership and Managing Permissions
In the Nozue system it seems that setting the ACLs on a page requires a call into the kernel. In current microprocessor architectures and operating systems, kernel calls are expensive (however they do further suggest a change to a Single Address Space Operating System where kernel calls might be cheaper). In contrast the Hard Object method of transferring address ownership uses a single user-mode hardware instruction (the set-owner-module-id instruction).
The Nozue design does not seem to provide any equivalent of the Hard Object integrity bit.
Stack Protection Mechanisms
The Nozue system also does not seem to provide any method for protecting the stack frame of a function in one module from the code in another module or at least not in a way that would also allow for the traditional contiguous software stack organization (where, for example, function arguments and return values can be passed on the stack); in contrast Hard Object provides a hardware mechanism for protecting the stack frame of a suspended function from an attack by the currently executing function.
1.5.6 Google's Native Client
Google's Native Client [Google-NaCl-2009] is an Open Source software project which attempts to provide isolation between different programs using purely software mechanisms. As such it differs considerably from Hard Object which offers hardware mechanisms. The Native Client project addresses some of the same software problems as Hard Object does, such as the problem of constraining dynamic control flow transfer. They constrain dynamic control transfer by requiring software to mask off the low bits of the target address of a dynamic control transfer so that the transfer can only target an address that is a multiple of, say, 32. They then ensure that instructions on such locations are executable only if a given location is a legitimate target of a dynamic control transfer. In contract, Hard Object solves this problem using a hardware mechanism that indicates which locations are legal targets of dynamic control transfers.
1.5.7 Others
U.S. Pat. No. 4,408,274 Wheatley, et al. is a hardware capabilities system which associates capabilities to a process; Hard Object works the other way, associating data addresses and code that may operate on it. A similar contrast occurs with U.S. Pat. No. 5,892,944 Fukumoto, et al. which seems to attach their rights to threads; again, Hard Object attaches rights to addresses, not threads. In U.S. Pat. No. 6,542,919 Wendorf, et al. and U.S. Pat. No. 5,845,129 Wendorf, et al. a method is disclosed where a memory page is associated with a group of threads; again, in contrast a Hard Object system associates rights to addresses, not threads. U.S. Pat. No. 4,442,484 Childs, Jr., et al. uses privilege levels per task to protect software objects; in contrast, Hard Object requires no need of privilege levels and does not decide access at the whole-task granularity, but instead in a different way by distinguishing rights by module-ids associating instruction and data address.
U.S. Pat. No. 6,941,473 Etoh, et al. provides hardware support for detecting stack smashing; in contrast, Hard Object protects the heap as well as the stack. U.S. Pat. No. 4,701,846 Ikeda, et al. provides hardware support for separation of the heap and the stack; in contrast, Hard Object goes further and separates the heap in a fine-grain way.
U.S. Pat. No. 5,075,842 Lai and U.S. Pat. No. 5,157,777 Lai, et al. provide hardware support for marking some data as special meta-data. U.S. Pat. No. 5,075,845 Lai, et al. and U.S. Pat. No. 5,075,848 Lai, et al. provide pointers to objects stored next to permissions meta-data. In contrast, Hard Object puts its meta-data into the page table, leaving the program's virtual address space uncluttered.
U.S. Pat. No. 4,525,780 Bratt, et al. provides each software object with a 128-bit identifier; in contrast Hard Object requires no special identifiers for software objects (beyond their usual address) and objects are not even a “first class” concept in the hardware, only modules are. U.S. Pat. No. 4,434,464 Suzuki, et al. seems to associate program regions with memory regions and then seems to change access permissions on transition through a jump table when crossing module boundaries; however they require indirection through a jump table rather than allowing direct function calls and they do not seem to supply a method for protecting stack data requiring calling only trusted modules or using separate stacks; in contrast, Hard Object allows direct function calls and protects the stack temporaries of the caller from untrusted callees. Similarly, [WS-1992] proposes associating to memory pages an Access Identifier (AID) and to processes Protection Identifiers (PID) where the PIDs of a process associate protections to a page with a matching AID; in contrast Hard Object requires no such PIDs/AIDs and associates data addresses and instruction addresses, not data addresses and threads/processes.
iWatcher and AccMon, [ZQLZT-2004; ZQLZT-2004b; ZLFLQZMT-2004], check many kinds of memory accesses in a best-effort way that is different from the Hard Object system.
U.S. Pat. No. 7,134,050 Wenzel isolates the objects of each module from other modules such that the objects of a module can only be operated on only by the program text of the same module; however, modules may only communicate through a special message subsystem: “The illustrated embodiments result in a fault containment sub-environment, or set of interfaces, that surround the module instances, deliver messages, schedule execution of the module instance when a message is delivered, and manage memory key (de)activation when each instance is called.” In contrast, the present Hard Object work requires no special message subsystem: modules communicate by normal function calls and no special scheduling mechanism is required.
[EKO-1995] disclose user-readable page table entries: “The page table should be visible (read-only) at application level.” User-readable and writable page table entries seem to be disclosed by [HP-1998] (the emphasis is mine):                64-bit system space refers to the portion of the entire 64-bit virtual address range that is higher than that which contains PT space . . . system space is further divided into the S0, S1, and S2 spaces . . . . Addresses within system space can be created and deleted only from code that is executing in kernel mode. However, page protection for system space pages can be set tip to allow any less privileged access mode read and/or write access . . . . The global page table, also known as the GPT, and the PFN database reside in the lowest-addressed portion of S2 space. By moving the GPT and PFN database to S2 space, the size of these areas is no longer constrained to a small portion of S0/S1 space. This allows OpenVMS to support much larger physical memories and much larger global sections.        
The Exokernel paper, [EKO-1995], on page 4 tantalizingly refers without citation to another hardware design where there is a concept of memory addresses being owned:                Some Silicon Graphics frame buffer hardware associates an ownership tag with each pixel . . . . The application can access the frame buffer hardware directly, because the hardware checks the ownership tag when the I/O takes place.        