One or more aspects of the present invention relate to building an image and, in particular, to reducing the time taken to build, upload and download images in a computing environment.
Docker is an open-source project that allows users to package software applications into a software container (see www.docker.com for more information). These containers can be deployed onto any machine that runs Docker and are abstracted from the host hardware and operating system. Although the concept is similar to that of a virtual machine, Docker does not use virtual machines, but does use virtualization.
A Docker image is a read-only template. For example, an image could contain an Ubuntu operating system with Apache and a web application installed. Images are used to create Docker containers. Docker provides a simple way to build new images, and Docker images that other people have already created can be downloaded. Docker images are the build component of Docker. Docker registries hold images. These are public or private stores to which images can be uploaded and from which images can be downloaded. The public Docker registry, called Docker Hub, provides access to a huge collection of existing images. Docker registries are the distribution component of Docker. Docker containers are similar to a directory. A Docker container holds everything that is needed for an application to run. Each container is created from a Docker image. Docker containers can be run, started, stopped, moved, and deleted. Each container is an isolated application platform. Docker containers are the run component of Docker.
To deploy a simple application using Docker, a user is expected to create a directory and, inside the directory, create a Dockerfile (a text file with the name “Dockerfile”). In the Dockerfile, the user describes what they want to include in the software container that they would like to run. The user then runs a “docker build” command, which will examine the directory specified and the Dockerfile contained therein and build a Docker image. The user then runs a “docker run” command, which will create a Docker container that can run anything that has been specified in the Dockerfile.
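By way of illustration only, the deployment steps described above can be sketched in Python as follows. The Dockerfile contents, the directory name and the alias "hello:latest" are hypothetical, and the sketch merely assembles the "docker build" and "docker run" command lines rather than executing them, so it can be read without a Docker installation.

```python
import tempfile
from pathlib import Path
from typing import List

# Hypothetical Dockerfile contents; the base image and command are illustrative only.
DOCKERFILE_TEXT = 'FROM ubuntu:14.04\nCMD ["echo", "hello from a container"]\n'


def prepare_image_directory(directory: Path) -> Path:
    """Create the directory and write a text file named 'Dockerfile' inside it."""
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "Dockerfile").write_text(DOCKERFILE_TEXT)
    return directory


def build_command(directory: Path, alias: str) -> List[str]:
    """Command line that builds an image from the directory and gives it an alias."""
    return ["docker", "build", "-t", alias, str(directory)]


def run_command(alias: str) -> List[str]:
    """Command line that creates and runs a container from the named image."""
    return ["docker", "run", "--rm", alias]


if __name__ == "__main__":
    image_dir = prepare_image_directory(Path(tempfile.mkdtemp()) / "hello")
    # The commands are only printed; they could be passed to subprocess.run.
    print(" ".join(build_command(image_dir, "hello:latest")))
    print(" ".join(run_command("hello:latest")))
```

On a machine on which Docker is installed, the two printed command lines could be run in the order shown, the build first and the run second.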
Docker images are essentially blueprints and can be considered as models that describe what a Docker container should look like, at least when the container is first started. A Docker container is a running instantiation of an image, essentially a working version of an image. A Docker system will know and will be able to display the image on which a container is based.
In the deployment process mentioned above, a user can use a Dockerfile to describe what the user wants the image to contain; however, there are different ways of achieving this effect. For example, in a Dockerfile, the first meaningful component will be a “FROM” line. This line describes what the new image will be built on top of. When a user is creating a new image, they need to specify a base image from which the new image will extend. The base image (the image from which the new image is extending) can be referred to as a parent image and the new image (the image that the user is writing and the image that is extending from the parent) can be referred to as a child image.
In a Docker system, if a user wishes to build a new image, the image must have a parent and that parent must exist and be available on the local system when the child is being built. On the same principle, another way to use the Dockerfile to describe what to run is to add/copy files into the image. Any file that is available in the image directory that contains the Dockerfile can be added to the new Docker image, provided the Dockerfile explicitly states to include the file. A user can also run commands when building the image, for example to install external dependencies from the internet.
For simple Docker images, most users will be expected to extend from an operating system image such as Ubuntu or CentOS. For more advanced cases, users will create entire trees of Docker images where a user can define an image that extends from another user-defined image and that image extends from a base image. An image can have any number of children, though every image has at most one parent. Similarly, images can span any number of generations: an image can have a parent, that parent could have a parent, that image could have a parent and so on until eventually reaching a base image.
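The parent and child relationships described above can be modelled, purely as a sketch, by a mapping from each image to its single parent. The image names below are hypothetical, and a base image is represented here by a parent of None; neither is part of Docker itself.

```python
from typing import Dict, List, Optional

# Hypothetical hierarchy: each image maps to its one parent (None marks a base image).
PARENT_OF: Dict[str, Optional[str]] = {
    "ubuntu": None,
    "team-base": "ubuntu",
    "web-server": "team-base",
    "database": "team-base",
}


def ancestry(image: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain of parents from the image up to its base image."""
    chain = []
    current = parent_of[image]
    while current is not None:
        chain.append(current)
        current = parent_of[current]
    return chain


def children(image: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return every image whose parent is the given image, in sorted order."""
    return sorted(child for child, parent in parent_of.items() if parent == image)
```

In this model, "web-server" spans two generations ("team-base" and then "ubuntu"), while "team-base" has two children.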
In larger architectures, development teams could be responsible for a large number of images. Under such usage, images can be expected to form a hierarchy or tree of images, as shown in FIG. 1. The tree 10 is made up of images 12, with each arrow 14 indicating a parent-child relationship. When a user builds an image, the Docker system will generate a UUID (Universally Unique IDentifier) and assign the UUID to the newly generated image. These UUIDs are 64 characters long and consist of hexadecimal digits, using the numbers 0 to 9 and the letters a to f, and are case insensitive. As the name implies, the UUID that the Docker system provides will be different even if the image that was built was exactly the same as another image. When a user starts a container, the user needs to inform the Docker system from which image the user wishes to build the container, and one way to do this is to use the image UUID.
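The identifier format described above can be checked with a short sketch such as the following. The function names are hypothetical, and make_stand_in_id is only a stand-in that produces a random identifier of the same shape; it is not the way in which Docker generates identifiers.

```python
import re
import secrets

# 64 hexadecimal digits; both letter cases are accepted.
IMAGE_ID_PATTERN = re.compile(r"[0-9a-fA-F]{64}")


def is_image_id(value: str) -> bool:
    """Return True if the whole string is 64 hexadecimal digits."""
    return IMAGE_ID_PATTERN.fullmatch(value) is not None


def same_image_id(left: str, right: str) -> bool:
    """Compare two identifiers without regard to letter case."""
    return left.lower() == right.lower()


def make_stand_in_id() -> str:
    """Stand-in generator (an assumption): 32 random bytes as 64 hex digits."""
    return secrets.token_hex(32)
```

Two calls to make_stand_in_id return different values, which mirrors the behaviour described above in which two builds of identical content receive different identifiers.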
Images can also have aliases in the form of name and tag combinations that are separated by a colon. For example, an alias could be “ubuntu:14.04”, where the name of the image is “ubuntu”, referring to the operating system, and the tag is “14.04”, referring to the specific version of Ubuntu in the image. Users can expect a single name across many images, but each will have a different tag to identify a specific version, the time the image was built or a feature-set. These aliases are user-defined, but typically concisely describe the image's contents. In the Docker system, an image alias must always resolve to one specific image using the UUID.
Aliases are more user-friendly ways of interacting with Docker images. A user can start a container by passing in an image alias instead of the image UUID and similarly, most Docker commands will accept image aliases. For example, a user can use an alias to a specific image and create a container of that image. The “FROM” specification in a Dockerfile can refer to the parent image by an alias. If a tag is not specified in an alias, the tag will be set to “latest”. For example, a user that makes an image alias of “hello” would actually be creating an alias of “hello:latest”, although the Docker system will accept “hello” as an alias and resolve the alias to the correct image UUID. An image can have any number of aliases. If an image has zero aliases, then a user will always need to use the UUID to refer to the specific image. Otherwise, any of the aliases can be used and they will all point to that specific image. Aliases can be assigned at any time, as long as the image to which the alias points exists.
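The alias behaviour described above can be sketched as follows. The alias table and the identifiers in it are hypothetical, and the sketch assumes simple aliases of the form "name" or "name:tag" only; it does not attempt to handle any other form of reference.

```python
from typing import Dict


def normalize_alias(alias: str) -> str:
    """Fill in the tag 'latest' when an alias does not specify a tag."""
    name, _, tag = alias.partition(":")
    return f"{name}:{tag}" if tag else f"{name}:latest"


def resolve(alias: str, alias_table: Dict[str, str]) -> str:
    """Resolve an alias to the one image identifier to which it points."""
    return alias_table[normalize_alias(alias)]


# Hypothetical table: two aliases point to one image, a third to another image.
ALIAS_TABLE: Dict[str, str] = {
    "hello:latest": "a" * 64,
    "hello:build-42": "a" * 64,
    "ubuntu:14.04": "b" * 64,
}
```

With this table, resolve("hello", ALIAS_TABLE) and resolve("hello:latest", ALIAS_TABLE) return the same identifier, and an alias that is absent from the table raises a KeyError.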
Images can be shared using a Docker registry. Such a registry is a web-based repository where images can be uploaded and downloaded. In order to share an image between two distinct machines, one machine must upload the image to the registry and another machine must download the image from the same registry. The concept of aliases discussed above also extends to registries in that an image stored in a registry can be identified by an alias and hence can be uploaded and downloaded with a user-friendly name. Similarly, a registry exposes a way to give an existing image that was previously uploaded to the registry a new alias. As images are written to a storage device of some kind, there will be a physical limit to the number of images a registry can hold.
Downloading of an image also requires the downloading of the image's parent and that image's parent and so on until a user downloads a base image, which will usually be an image of an operating system. Similarly, uploading an image to the registry will also mean uploading the images from which the image extends. When downloading images from a registry, Docker will skip the downloading of images already present on the local system. For example, if a user downloads a specific image and then tries to download the same image again at a later date, then the second download operation would be skipped. This also applies to image parents. If a user downloads a new image, but already has the parent of the image stored locally, the user will only download the new image, since Docker will only download images that are not present locally. Similarly, uploading an image to a registry will only upload an image that is not already present on the registry. This download/upload skipping works based on the image UUID. If a user gives an image a specific alias and then tries to download an image of the same alias, the user might download an entirely new image, and the parents, if the image UUIDs do not match.
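The skipping behaviour described above can be illustrated with the following sketch, which decides what to transfer by comparing identifiers with those already held locally. The mapping from an image to its parent and the short identifiers used in the example are hypothetical simplifications.

```python
from typing import Dict, List, Optional, Set


def images_to_download(
    image_id: str,
    parent_of: Dict[str, Optional[str]],
    local_ids: Set[str],
) -> List[str]:
    """Return the image and its ancestors that are absent locally, base image first."""
    needed = []
    current: Optional[str] = image_id
    while current is not None:
        if current not in local_ids:  # identifiers already present are skipped
            needed.append(current)
        current = parent_of.get(current)
    return list(reversed(needed))


if __name__ == "__main__":
    # Hypothetical chain: "c" extends "b", which extends the base image "a".
    chain = {"c": "b", "b": "a", "a": None}
    print(images_to_download("c", chain, {"a"}))
```

The same comparison applies in the opposite direction for uploads, with the set of identifiers held by the registry taking the place of the local set.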
Image based systems such as Docker can be used in a computing environment, which comprises an architecture or a number of computers that run(s) Docker containers after downloading the relevant images. For example, an environment could run three containers, including a web server, a database back-end and a monitor that will keep testing the web server and, by extension, the database that the web server relies upon. There may be several parallel environments with, for example, one for each developer, and any number of the environments could be deployed at the same time. Although the easiest way to share images is via a registry, having a registry per environment may be too costly and a single set of images may be deployed to multiple environments. For simplicity, the environment is given a build label and downloads the relevant images by name and tag, where the tag is the build label. Images are available at the registry at the time they are deployed in an environment.
The simplest approach is to have a build process that simply builds all the images in parent-first order and uploads each to the shared registry, with the tag set to the current build time. FIG. 2 illustrates this process. A unique identifier is set for the build so that when images are shared on the registry, an environment can download the correct images by this identifier. This identifier is referred to as the build label and is set in step S2.1 of the process detailed in FIG. 2. At step S2.2 there is provided a list of images to be built. In order to build a Docker image, as discussed above, the parent image must be present. Therefore, at step S2.3 the images to be built are sorted so that parents are built before children, although the order of a single image's children is irrelevant.
After an ordered list of images to be built has been generated, a first image to be built is selected at step S2.4 and the image is built at step S2.5 and uploaded to the shared registry at step S2.6. In step S2.6 the tag of each image is set to the current build label so that the environment can download the correct images. This process is repeated through the checking steps S2.7 and S2.8 until all images have been built. Once the process has run out of images to build, then the built images can be deployed to the environment, i.e. all of the images are ready to be downloaded.
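A minimal sketch of the process of FIG. 2 is given below, under the assumption that the hierarchy is a tree as described above. The function names and the build and upload callbacks are hypothetical placeholders for whatever actually builds an image and transfers it to the shared registry.

```python
from typing import Callable, Dict, List, Optional


def parents_first(images: List[str], parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Step S2.3: order the images so that each parent precedes its children."""
    ordered: List[str] = []
    placed = set()

    def place(image: str) -> None:
        if image in placed:
            return
        parent = parent_of.get(image)
        if parent in images:  # a parent outside the list is assumed to be present already
            place(parent)
        placed.add(image)
        ordered.append(image)

    for image in images:
        place(image)
    return ordered


def build_all(
    build_label: str,                     # step S2.1: identifier for this build
    images: List[str],                    # step S2.2: list of images to be built
    parent_of: Dict[str, Optional[str]],
    build: Callable[[str], None],
    upload: Callable[[str], None],
) -> None:
    """Steps S2.4 to S2.8: build each image and upload it tagged with the build label."""
    for image in parents_first(images, parent_of):
        build(image)                      # step S2.5
        upload(f"{image}:{build_label}")  # step S2.6
```

An environment given the same build label can then request each image by name and tag, for example "web-server:build-42" for a hypothetical label of "build-42".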
In developer teams with a large number of team members and a large enough image hierarchy, having each developer build and transfer their own images leads to unnecessary duplication and stress on the shared registry. Each developer would be storing their entire hierarchy of images on the registry every time they build. Any developer could choose to build only the relevant images, but for a large enough hierarchy, selecting what needs to be re-built would become tedious and time-consuming. This could also lead to deployment-side issues, as the script that downloads the images would need to handle the case in which only some of the images are to be downloaded.
Unfortunately, each new Docker image that is created will generate a UUID, even with the same directory structure on the same machine at different times or on two machines at the same time. Services such as Docker do not notice that two identical images have been created and/or uploaded to the registry. Docker uses a local cache for building images on a machine. Upon building a new image, if the image that would be built already exists (for example since the image was built earlier), Docker will simply use the locally cached image instead. However, this will only be available to the local system; two distinct machines could not exploit the same cache easily and unnecessary duplication will occur.
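The duplication described above can be illustrated with a small simulation. The Registry class and the build function are hypothetical models of the behaviour described in this section and do not correspond to any Docker interface; the random identifier is a stand-in for the identifier assigned at build time.

```python
import secrets
from typing import Dict, Tuple


class Registry:
    """Toy model of a registry that skips uploads by identifier only."""

    def __init__(self) -> None:
        self.stored: Dict[str, bytes] = {}

    def upload(self, image_id: str, content: bytes) -> bool:
        """Store the image unless its identifier is already present."""
        if image_id in self.stored:
            return False
        self.stored[image_id] = content
        return True


def build(content: bytes) -> Tuple[str, bytes]:
    """Model a build: identical content still receives a fresh identifier."""
    return secrets.token_hex(32), content


if __name__ == "__main__":
    registry = Registry()
    same_content = b"identical directory structure"
    for _machine in ("machine-a", "machine-b"):
        image_id, content = build(same_content)
        registry.upload(image_id, content)
    # Both copies are kept because the identifiers differ.
    print(len(registry.stored))
```

Because the comparison is made on the identifier rather than on the content, the simulated registry ends up holding two copies of the same image.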
Furthermore, if a developer wished to exploit the Docker local cache, the developer would need to keep all of the images on their local system, which may cause issues if the image hierarchy is large enough and the machine's local storage disk size is not capable of holding many versions of the images. In the interests of available disk space, many build processes will wipe any existing artifacts (including images) before anything is built.