The invention relates to a customizable file-type aware cache mechanism.
When a virtual machine (VM) is created on a host within a cloud computing environment, the corresponding image file usually has to be created very quickly, which is a challenge.
Virtual machines are also known as virtual hosts, while a host is also called a server. Hosts are computer systems comprising at least one central processing unit (CPU); they may comprise a local disk too, but this is not mandatory. They may be connected to a network system where they can use a shared file system on at least one network disk via an input/output (IO) infrastructure.
Typical cloud offerings provide a set of predefined configurations. These configurations are associated with a certain file image of a virtual disk, also called virtual machine image. Creating such an image based on a predefined installation requires either running a complete installation procedure or copying and customizing an existing image.
Another approach to achieve this is to use the so-called snapshot or backing-file feature available for some image file formats: A common base image is used read-only (RO). A new image is created which references the base image. Every write operation is now done to the new image while the base image remains unchanged. This approach reduces the creation time from 10 minutes down to a few seconds. Another benefit of this approach is the reduced disk and cache usage as many operations on the base image are done on the very same file.
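The backing-file approach can be illustrated by a minimal sketch. The class names `BaseImage` and `OverlayImage` and the block contents are hypothetical and chosen only for illustration; real image file formats implement this at the level of the on-disk format, not as in-memory objects.

```python
class BaseImage:
    """Read-only base image, modelled as an immutable tuple of blocks."""

    def __init__(self, blocks):
        self._blocks = tuple(blocks)

    def read(self, index):
        return self._blocks[index]


class OverlayImage:
    """New image that references a base image and stores only changed blocks."""

    def __init__(self, base):
        self.base = base
        self.changed = {}  # block index -> data written to this image

    def write(self, index, data):
        # Every write goes to the new image; the base image stays unchanged.
        self.changed[index] = data

    def read(self, index):
        # Prefer blocks written to this image; otherwise fall back to the base.
        if index in self.changed:
            return self.changed[index]
        return self.base.read(index)


# Two virtual machines share one base image (hypothetical block contents).
base = BaseImage([b"boot", b"kern", b"libs", b"data"])
vm_a = OverlayImage(base)
vm_b = OverlayImage(base)
vm_a.write(3, b"AAAA")
```

In this sketch, creating a new image amounts to creating an empty mapping, which is why no installation or copy step is needed, and unmodified blocks of all images are read from the very same base object.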
To allow failovers and independence of images from their host, shared file systems are usually used in multi-server environments. Such a system does not scale very well: if, for example, a single host can run ten virtual machines in parallel, ten such hosts already result in one hundred virtual machines accessing the same shared file system at the same time. The access pattern of one hundred virtual machines running in parallel is equivalent to random access, causing regular non-flash disks to seek back and forth all the time.
Therefore, the configuration has to reduce IO operations as much as possible. Although flash devices do not suffer from the seek time penalty of conventional hard disks, it is still desirable to avoid disk accesses where possible, e.g. to extend flash chip lifetimes. An easy solution to this problem is to use large caches and to consolidate write operations.
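Consolidation of write operations can be sketched as follows. This is only one possible reading of the term, assuming a buffer that merges repeated writes to the same block and flushes them in block order; the class name `CoalescingWriteBuffer` and the dictionary standing in for a disk are hypothetical.

```python
class CoalescingWriteBuffer:
    """Collects writes in memory and flushes each dirty block only once."""

    def __init__(self, backend):
        self.backend = backend  # dict standing in for a disk (assumption)
        self.pending = {}       # block index -> most recent data
        self.disk_writes = 0    # number of writes that reached the backend

    def write(self, index, data):
        # A later write to the same block replaces the earlier pending one.
        self.pending[index] = data

    def flush(self):
        # Write in ascending block order to avoid seeking back and forth.
        for index in sorted(self.pending):
            self.backend[index] = self.pending[index]
            self.disk_writes += 1
        self.pending.clear()


disk = {}
buf = CoalescingWriteBuffer(disk)
for index, data in [(7, b"a"), (2, b"b"), (7, b"c"), (2, b"d"), (7, b"e")]:
    buf.write(index, data)
buf.flush()
```

Here five write requests touch only two distinct blocks, so the flush issues two writes to the backend instead of five.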
The heuristic an operating system (OS) uses to determine which data to keep in cache depends on many factors. Nevertheless, misuse of a virtual machine that results in heavy input/output (IO) operations might break the environment, as caches might then be used for other tasks.
For an environment with one or several base images it is desirable to keep as much data in cache as possible or even to customize which data to keep in cache.
Another issue arises when running in an environment where several images are provided to a customer. Although all images are quite similar and only vary in a small subset of files within the disk image, the images are still seen by the server OS as distinct files.
State-of-the-art approaches that use hashes to minimize data duplication in memory do not scale very well and only work for small memory sizes, as the search overhead grows massively with the cache size. With cache sizes of several gigabytes, these approaches become useless.
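For illustration, hash-based deduplication can be sketched as a content-addressed cache. This is a generic sketch under stated assumptions, not the method of any particular prior-art system; the class name `DedupCache`, the image names and the choice of SHA-256 are hypothetical.

```python
import hashlib


class DedupCache:
    """Keeps one copy per distinct block content, keyed by its hash digest."""

    def __init__(self):
        self.by_digest = {}  # digest -> block data (one copy per content)
        self.location = {}   # (image name, block index) -> digest

    def insert(self, image, index, data):
        # Every cached block has to be hashed before it can be deduplicated.
        digest = hashlib.sha256(data).hexdigest()
        self.by_digest.setdefault(digest, data)
        self.location[(image, index)] = digest

    def read(self, image, index):
        return self.by_digest[self.location[(image, index)]]


cache = DedupCache()
cache.insert("image_a", 0, b"shared block")
cache.insert("image_b", 0, b"shared block")  # same content, distinct file
cache.insert("image_b", 1, b"only in b")
```

Identical blocks from distinct image files occupy a single cache entry, but every block must be hashed on insertion and the bookkeeping grows with the number of cached blocks.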
Several state-of-the-art mechanisms exist for limiting the amount of data in a cache in general, as well as for finding and removing duplicates in particular. The solutions known in the prior art either use caches with a heuristic which cannot be configured or just copy every data block or file which is accessed.
US2011/0148895 A1 describes how to start an image and clone snapshots which have a pre-filled cache. This approach reduces the cache pages to be stored. US2011/0148895 A1 discloses caching by determining file blocks to be cached based on the validity and performance of a cache entry. A cache image including only cache entries with valid durations of at least a configured deployment date for an image is prepared via an application server for the image. The image is deployed to at least one other application server as a virtual machine with the cache image including only the cache entries with the valid durations of at least the configured deployment date for the image.