Virtualization has become prevalent for numerous reasons. Machine virtualization has been used to increase utilization of hardware resources, improve security, isolate code, facilitate shifting of workloads among machines, enable incompatible operating systems to execute on a same machine, partition a single machine between tenants, and for other reasons. Machine virtualization involves a virtualization layer (e.g., a hypervisor) presenting the hardware of a machine as virtual machines (VMs). Each VM typically has its own virtualized hardware such as a virtual disk drive, virtual processors, virtualized memory, etc. Each VM will usually have a guest operating system installed thereon; the guest operating system operates as though it were executing directly on the host machine's hardware and the virtualization layer is transparent to the guest operating system.
Machine virtualization has advantages and disadvantages. One disadvantage is excessive resource overhead. Each VM requires storage. Sharing processing time among VMs requires many expensive context switches. Handling privileged instructions can also incur context switching overhead. Each VM has an entire operating system which can require significant storage. Each VM requires its own memory space. The virtualization layer can itself have a large footprint and often uses processor time just to manage resource sharing. Furthermore, virtual machines take significant time to create, provision, and start executing. Although migration of a VM between hosts is practical and commonly used, migration requires significant time and network bandwidth.
The shortcomings of machine virtualization have led to a resurgence in container virtualization. Container virtualization involves forming isolation environments (containers) from objects of the host operating system: processes, files, memory, etc. A container engine acts as an abstraction layer between a container and the operating system resources. File system objects, namespaces, registry or configuration data, and the like are logically mapped between the operating system and the container. A container might, for instance, appear to have its own file system, when in fact files in a container namespace are mapped by the container engine to files in the operating system's namespace. A container engine might also regulate how many compute resources are available to containers. For instance, processor time, memory, filesystem size, and other quantifiable resources might be proportionally rationed among containers. A container might also have binaries, libraries, and other objects upon which guest software running in a container might depend. Thus, if the host operating system's kernel is sufficiently compatible with a container engine, the container might provide objects such as libraries that enable the container's guest software to effectively execute in a different version of the host operating system. Containers tend to start faster than VMs, require less storage, migrate faster, and incur less processing overhead for context switching and processor sharing.
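The namespace mapping and proportional rationing described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular container engine; the function names `map_to_host` and `ration`, the example root directory, and the integer-weight scheme are all assumptions introduced here for illustration.

```python
import posixpath
from typing import Dict


def map_to_host(container_root: str, container_path: str) -> str:
    """Translate a path as seen inside a container to a path in the host namespace.

    Hypothetical helper: a real engine would typically rely on kernel
    facilities rather than string manipulation.
    """
    # Anchor the path at "/" and collapse "." and ".." so the result
    # cannot climb above the container's own root.
    normalized = posixpath.normpath(posixpath.join("/", container_path))
    # Re-root the normalized path under the container's directory on the host.
    return posixpath.normpath(
        posixpath.join(container_root, normalized.lstrip("/"))
    )


def ration(total: int, weights: Dict[str, int]) -> Dict[str, int]:
    """Split a quantifiable resource among containers in proportion to weights."""
    weight_sum = sum(weights.values())
    # Integer division keeps shares whole; any remainder is left unallocated.
    return {name: total * w // weight_sum for name, w in weights.items()}


# Hypothetical host directory backing one container's file system.
ROOT = "/var/lib/engine/c1/rootfs"

host_path = map_to_host(ROOT, "/etc/hosts")
escaped = map_to_host(ROOT, "../../etc/passwd")
shares = ration(1000, {"web": 1, "db": 3})
```

In this sketch a path such as `../../etc/passwd` still resolves to a location under the container's root, which mirrors the idea that the container only appears to have its own file system while the engine controls the mapping.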
Security has been a concern for all types of secure/isolated guest runtime environments (GREs), whether VMs, containers, or otherwise. An objective of GREs is to allow applications of different provenance to share the same host computer. Naturally, this has raised security concerns and prompted protective measures. Containers have been considered less secure than VMs because containers usually run under the purview of a same operating system kernel and share a same memory space. Regardless of the type of GRE, most security efforts have focused on protecting the host from threats originating from within a GRE executing on the host. The thought has been that if the host is protected from malicious activity that might originate from within a GRE, the integrity and security facilities of the host can be relied on to maintain walls between the GREs on the host. In other words, each GRE on a host has been protected by protecting the host environment; as long as the host is not compromised, the GRE layer on the host has been assumed to sufficiently secure the GREs. This can be seen in the Docker Engine container implementation. The Docker Engine uses the seccomp facility to limit which system calls can be called from within a container, thus making it more difficult for a container to access or alter objects outside the container.
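The system-call limiting just described can be pictured with a small Python sketch that builds an allowlist in the JSON layout used by Docker seccomp profiles (`defaultAction`, `syscalls`, `names`, `action`). The helper names `build_profile` and `action_for` and the three-call allowlist are assumptions made for illustration; `action_for` is only a model of the lookup, since actual enforcement is performed by the kernel, and real profiles contain many more system calls and fields.

```python
import json
from typing import Dict, List


def build_profile(allowed: List[str]) -> Dict:
    """Build a deny-by-default profile that permits only the named system calls."""
    return {
        # Any system call not listed below fails with an error.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": sorted(allowed), "action": "SCMP_ACT_ALLOW"},
        ],
    }


def action_for(profile: Dict, syscall: str) -> str:
    """Model which action the profile assigns to a system call (illustrative only)."""
    for rule in profile["syscalls"]:
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]


# Hypothetical, deliberately tiny allowlist.
profile = build_profile(["read", "write", "exit_group"])
profile_json = json.dumps(profile, indent=2)

read_action = action_for(profile, "read")
mount_action = action_for(profile, "mount")
```

A profile file of this kind can be supplied when starting a container with `docker run --security-opt seccomp=<path>`. Note that the filter constrains what guest code may ask of the host; it does not constrain what the host may do to the container.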
This host-centric security approach has failed to adequately secure GREs. Because the host environment usually has a higher security level (e.g., kernel-mode) than the GREs themselves (e.g., user-mode), GREs are inherently vulnerable to the host environment. Even an uncompromised host environment has the potential to alter the content or behavior of a GRE. What is needed are new ways of securing GREs that focus on internally protecting GREs. New techniques that help secure GREs by limiting what can be done within a GRE are described below. In some cases, even a compromised host environment may have limited ability to in turn compromise or corrupt the GREs that it is hosting and the guest software of the GREs.