Building ARM containers on any x86 machine, even Docker Hub

Back in 2013, we ported Docker to ARM. Shortly afterwards, we wanted to use that port to offer ARM builds to our users. However, ARM server hardware was difficult to find at the time, so we started looking for an emulated solution.

Enter QEMU. QEMU is a wonderful project aimed at emulating other CPU architectures. It has two modes of emulation: system emulation and user emulation. In system emulation the emulated system behaves like a VM, running its own kernel on emulated hardware. For our container use case, having to spin up a VM for every container doesn’t sound very appealing. Fortunately, user mode emulation is a much better fit. In that mode, QEMU runs the binary code of a foreign architecture as a host process, and at the same time translates any guest system calls to host system calls.

To see this mode in action, let’s compile a hello world program for ARM and run it:

petrosagg@rachmaninoff ~ % GOARCH=arm go build ./hello.go
petrosagg@rachmaninoff ~ % qemu-arm ./hello
Hello, world!

First steps

So how can this be used to emulate a whole container? Since a container can only access its own private filesystem, the first step is getting the emulator into the container. This simply means COPYing the executable into the image:

However, building the above image produces a not-so-descriptive error:

Step 3 : RUN /usr/bin/qemu-arm /bin/echo Hello from ARM container
 ---> Running in 9262e39b9ca3
no such file or directory
[8] System error: no such file or directory

The reason for this error is that qemu-arm is a dynamically linked x86 binary which requires a lot of x86 shared libraries that don’t exist in the image. The loader tries to find those files, fails, and reports no such file or directory. Indeed:

petrosagg@rachmaninoff % ldd qemu-arm
    linux-vdso.so.1 (0x00007ffe73fcc000)
    libgthread-2.0.so.0 => /usr/lib/libgthread-2.0.so.0 (0x00007f9acac12000)
    libglib-2.0.so.0 => /usr/lib/libglib-2.0.so.0 (0x00007f9aca904000)
    libz.so.1 => /usr/lib/libz.so.1 (0x00007f9aca6ee000)
    librt.so.1 => /usr/lib/librt.so.1 (0x00007f9aca4e6000)
    libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f9aca164000)
    libm.so.6 => /usr/lib/libm.so.6 (0x00007f9ac9e66000)
    libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f9ac9c50000)
    libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007f9ac9a33000)
    libc.so.6 => /usr/lib/libc.so.6 (0x00007f9ac968f000)
    libpcre.so.1 => /usr/lib/libpcre.so.1 (0x00007f9ac941f000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f9acae14000)

The easiest way to fix this is to use a statically linked version of QEMU, which has no dependencies on its environment. Creating one is a matter of grabbing a copy of the source code, running ./configure with --static and then running make. After this, the statically linked executable will be in qemu/arm-linux-user/qemu-arm.

git clone git://git.qemu.org/qemu.git
cd qemu
./configure --target-list=arm-linux-user --static
make

With the newly created binary the build is now successful:

Step 3 : RUN /usr/bin/qemu-arm /bin/echo Hello from ARM container
 ---> Running in 3966c442f619
Hello from ARM container
 ---> a2b50782c6f8

These are not the binaries you’re looking for

While the previous example worked, it’s still far from done. Let’s try a slightly more complex Dockerfile. For example, the same echo, this time invoked through a shell.

This gives a new error:

Step 3 : RUN /usr/bin/qemu-arm-static /bin/sh -c /bin/echo Hello from ARM container
 ---> Running in 92e10b7eb1f2
/bin/sh: 1: /bin/echo: Exec format error

This error happens when trying to run an ARM binary on x86. But wait! The whole thing is prefixed by the QEMU emulator. What is going on here? The difference between the previous Dockerfile and this one is that this one starts a child process. On Linux, child processes are started by forking and then doing the execve() system call from the child process. Since QEMU merely translates system calls from the guest process to the host kernel, when the emulated /bin/sh calls execve("/bin/echo", ..), QEMU will happily pass this on to the kernel, but the kernel has no idea what to do with this file since /bin/echo is an ARM binary!
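To make the fork-then-exec pattern concrete, here is a minimal C sketch of how a parent process might start a child. The helper name spawn_and_wait and the exit codes 126 and 127 are choices made for this illustration (they follow common shell convention); nothing here is taken from Docker or QEMU.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Hypothetical helper for illustration: start `path` as a child process and
 * return its exit status, or -1 if the child could not be created or did
 * not exit normally.
 */
static int spawn_and_wait(const char *path, char *const argv[])
{
    assert(path != NULL && argv != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: ask the kernel to replace this process image with `path`. */
        execve(path, argv, environ);
        /*
         * Only reached if execve() failed. ENOEXEC is the errno behind
         * "Exec format error"; report it as 126, anything else as 127.
         */
        _exit(errno == ENOEXEC ? 126 : 127);
    }

    /* Parent: wait for the child and hand back its exit status. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Under user mode emulation, the execve() in the child is exactly the call that QEMU forwards to the host kernel unchanged, which is why an ARM target fails with ENOEXEC on an x86 host.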

To fix this issue, the kernel needs to know what to do when requested to run ARM ELF binaries. This is done with binfmt_misc, a kernel feature that can be compiled either as a module or built in. With it, you can associate an interpreter, like qemu-arm-static, with a binary pattern. If a file matches the pattern, binfmt_misc will run it using the specified interpreter. Let’s load this module and set all ARM binaries to be run using /usr/bin/qemu-arm-static.

mount binfmt_misc -t binfmt_misc /proc/sys/fs/binfmt_misc
echo ':arm:M::\x7fELF\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x28\x00:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff:/usr/bin/qemu-arm-static:' > /proc/sys/fs/binfmt_misc/register
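The registered rule is a colon-separated record: a name (arm), the type M for magic-number matching, an empty offset, the magic bytes, a mask, and the interpreter path. The matching step can be sketched in C as follows. This is an illustrative approximation, not the kernel’s actual code, and the function name matches_arm_rule is made up for the example; the magic and mask bytes are copied from the registration command.

```c
#include <assert.h>
#include <stddef.h>

#define RULE_LEN 20

/* Magic and mask bytes copied from the registration command. */
static const unsigned char arm_magic[RULE_LEN] = {
    0x7f, 'E',  'L',  'F',  0x01, 0x01, 0x01, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02, 0x00, 0x28, 0x00
};
static const unsigned char arm_mask[RULE_LEN] = {
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00, 0xff, 0xff,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xfe, 0xff, 0xff, 0xff
};

/*
 * Hypothetical helper: returns 1 if the first `len` bytes of a file match
 * the rule, 0 otherwise. A byte matches when it agrees with the magic byte
 * in every bit position where the mask has a 1.
 */
static int matches_arm_rule(const unsigned char *header, size_t len)
{
    assert(header != NULL);

    if (len < RULE_LEN)
        return 0;

    for (size_t i = 0; i < RULE_LEN; i++) {
        if ((header[i] ^ arm_magic[i]) & arm_mask[i])
            return 0;
    }
    return 1;
}
```

Reading the bytes against the ELF header layout: the 0x00 mask byte makes the rule ignore the OS ABI field, the 0xfe mask byte lets both e_type 2 (executable) and e_type 3 (shared object) match, and the trailing 0x28 0x00 is the little-endian machine number 40, which is ARM.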

The build of the forking Dockerfile now works like a charm:

Step 3 : RUN /usr/bin/qemu-arm-static /bin/sh -c /bin/echo Hello from ARM container
 ---> Running in 75efe5a4be03
Hello from ARM container
 ---> fc276c0321f3

As you may have noticed, up until now I was using the exec form for the RUN statements. This was required because the alternative shell form always invokes a default shell (/bin/sh) and passes whatever is after the RUN statement as a parameter. This means that even if you wrote RUN /usr/bin/qemu-arm-static /bin/echo foobar it would end up running /bin/sh -c '/usr/bin/qemu-arm-static /bin/echo foobar', which would bypass the emulator and give the Exec format error. With the binfmt_misc module loaded this is no longer required and you don’t even have to prefix every command with qemu-arm-static. When Docker attempts to run /bin/sh, the kernel will automatically detect it is an ARM executable and invoke QEMU!

In fact, all our base images have the QEMU binary included, so if you have binfmt_misc set up you can just do this:

Dropping the kernel dependency

This has been great so far. With a correctly configured kernel you can run ARM Docker containers transparently. But what happens if you want to run ARM builds on some other system, like a hosted CI service? Or use the automated builds of Docker Hub? You can’t expect to have access to configure the kernel, and so far the only option was to host a custom CI system with a modified kernel.

Let’s think about how binfmt_misc works for a moment. Some process calls execve() with the path of an ARM executable. The kernel tries to load it with its standard binary format handlers, i.e. native ELF and shebang scripts (yes, this is handled in the kernel), and if they fail it tries to load it with binfmt_misc. Then binfmt_misc matches the executable’s signature with the one registered to run with /usr/bin/qemu-arm-static and creates a new exec request to the kernel, this time requesting to run the interpreter and passing the original ARM executable as a parameter. Can this be done without specific kernel support and configuration?

Enter QEMU, for real this time. I mentioned previously that QEMU emulates the foreign architecture and translates system calls and signals. This means that when the guest process makes a system call, what really happens is a function call in QEMU which then does the real system call after some processing. Specifically, do_syscall() in qemu/linux-user/syscall.c handles all guest system calls.

What if you intercepted all the translations of all execve() calls and did something similar to what binfmt_misc does? Let’s define a qemu_execve() function and then replace the original execve() call with it:

The above function is a very simple version of binfmt_misc and has the same signature as the normal execve() so it can just replace calls to the real execve(). Let’s see what it does in more detail.

return get_errno(execve("/usr/bin/qemu-arm-static", new_argp, envp));

The first thing to observe is that when it does the real execve() call, it runs /usr/bin/qemu-arm-static unconditionally. It also uses the original environment (envp) unmodified, but uses a slightly modified argument vector.

This is how the new argument vector is created, with 3 more slots than the original one. Then, the original arguments are copied in the new argument vector, offset by 3 slots. At this point, new_argp[0], new_argp[1] and new_argp[2] are undefined and new_argp[3] has the original argv[0] of the guest process.

new_argp[0] = strdup("/usr/bin/qemu-arm-static");
new_argp[1] = strdup("-0");
new_argp[2] = argv[0];
new_argp[3] = filename;

These are the arguments passed to the emulator so that it can correctly emulate the guest process. new_argp[0] is the argv[0] of the emulator, which is just the path to it.

Next, the original argv[0] for the guest process is preserved using the -0 parameter of QEMU, followed by the argv[0] we want. This is very important for some binaries, with busybox being a prime example: it decides which applet to run based on the name it was invoked with.

Finally, new_argp[3] is the original filename, which is the path to the ARM executable. After that, new_argp[4] will contain the guest process’ argv[1], and so on.
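Putting the pieces above together, the argument-vector construction might look like the following sketch. It is a reconstruction from the description in this section, not the actual patch: the helper name build_emulator_argv is hypothetical, and string literals are used where the snippets above call strdup().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define EMULATOR_PATH "/usr/bin/qemu-arm-static"

/*
 * Hypothetical helper: build the argument vector used to re-launch the
 * emulator for `filename`. The caller owns the returned array (free() it);
 * the strings themselves are borrowed, not copied. Returns NULL if the
 * allocation fails.
 */
static char **build_emulator_argv(const char *filename, char *const argv[])
{
    assert(filename != NULL && argv != NULL && argv[0] != NULL);

    /* Count the guest's original arguments. */
    size_t argc = 0;
    while (argv[argc] != NULL)
        argc++;

    /* Three extra slots, plus the terminating NULL (zeroed by calloc). */
    char **new_argp = calloc(argc + 3 + 1, sizeof(char *));
    if (new_argp == NULL)
        return NULL;

    /* Copy the original arguments, offset by three slots. */
    for (size_t i = 0; i < argc; i++)
        new_argp[i + 3] = argv[i];

    new_argp[0] = (char *)EMULATOR_PATH;  /* argv[0] of the emulator itself */
    new_argp[1] = (char *)"-0";           /* QEMU flag: next arg is guest argv[0] */
    new_argp[2] = argv[0];                /* the argv[0] the guest asked for */
    new_argp[3] = (char *)filename;       /* path of the ARM executable */
    return new_argp;
}
```

For a guest call like execve("/bin/sh", {"sh", "-c", "echo hi"}, envp), this yields /usr/bin/qemu-arm-static -0 sh /bin/sh -c "echo hi", which is the vector handed to the real execve().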

So what can you do with this? Basically, if an ARM process starts under the modified QEMU emulator there is no way for it to escape! Any execve() call will result in a re-instantiation of the emulator through the qemu_execve() function. In reality, there is a bit more that you have to take care of in the handler. Specifically, shebang scripts need to be handled in QEMU, before reaching the kernel. See the full code for all the details.
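As a rough idea of what handling shebang scripts involves, the sketch below parses a script’s first line the way Linux treats #! lines: the interpreter path comes first, and everything after it is passed as a single optional argument. The function name parse_shebang and its buffer-based interface are assumptions made for this example and are not taken from the QEMU patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: parse the first line of `line` as a "#!" line.
 * On success returns 1, stores the interpreter path in `interp` and the
 * optional single argument (possibly empty) in `arg`. Returns 0 if the
 * text is not a usable shebang line or does not fit in the buffers.
 */
static int parse_shebang(const char *line, char *interp, size_t interp_len,
                         char *arg, size_t arg_len)
{
    assert(line != NULL && interp != NULL && arg != NULL);
    assert(interp_len > 0 && arg_len > 0);

    if (line[0] != '#' || line[1] != '!')
        return 0;

    /* Only the first line matters; trim trailing blanks from it. */
    const char *p = line + 2;
    const char *stop = line + strcspn(line, "\n");
    while (stop > p && (stop[-1] == ' ' || stop[-1] == '\t'))
        stop--;

    /* Skip blanks between "#!" and the interpreter path. */
    while (p < stop && (*p == ' ' || *p == '\t'))
        p++;
    if (p == stop)
        return 0;

    /* The interpreter path runs up to the next blank. */
    const char *q = p;
    while (q < stop && *q != ' ' && *q != '\t')
        q++;
    if ((size_t)(q - p) >= interp_len)
        return 0;
    memcpy(interp, p, (size_t)(q - p));
    interp[q - p] = '\0';

    /* Whatever follows is one argument, inner spaces included. */
    while (q < stop && (*q == ' ' || *q == '\t'))
        q++;
    if ((size_t)(stop - q) >= arg_len)
        return 0;
    memcpy(arg, q, (size_t)(stop - q));
    arg[stop - q] = '\0';
    return 1;
}
```

A handler along these lines could then treat the interpreter as the file to emulate, passing the optional argument and the script path after it, which mirrors what the kernel does for native scripts.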

At this point you can use this to write ARM Dockerfiles that will work on any Docker host, as long as the initial process is run under QEMU. This requirement will make your Dockerfiles look like this:

Improving the syntax

While the above works fine, it’s cumbersome to write everything in exec notation and on top of that prefix all the commands with the emulator path. Let’s see if the syntax can be improved a bit.

As mentioned earlier, Docker converts RUN foobar to the following command line: /bin/sh -c 'foobar'. In an ARM image, however, /bin/sh will be an ARM binary and will cause the now well-known Exec format error. What if you could replace /bin/sh with something that the host system can run natively, which then calls the emulator and runs the original /bin/sh?

It turns out that if you rename /bin/sh to /bin/sh.real and then put the following contents in /bin/sh and /usr/bin/sh-shim an interesting chain reaction happens:

Let’s walk through the steps the system goes through when Docker runs /bin/sh -c 'foobar':

1. The kernel receives the execve("/bin/sh", ..) from Docker
2. This file starts with #!, so the kernel parses the first line to get the interpreter
3. The kernel runs /usr/bin/qemu-arm-static /bin/sh.real /bin/sh -c 'foobar'
4. QEMU starts emulating /bin/sh.real, which is an ARM binary, with /bin/sh -c 'foobar' as parameters
5. /bin/sh.real reads its first parameter, /bin/sh, and starts interpreting it, ignoring the first line starting with #!
6. /bin/sh.real runs cp /bin/sh.real /bin/sh which temporarily restores /bin/sh to its original contents
7. /bin/sh.real runs exec /bin/sh "$@" which gets expanded to exec /bin/sh -c 'foobar'
8. QEMU intercepts the execve("/bin/sh") and runs /usr/bin/qemu-arm-static -0 /bin/sh /bin/sh -c 'foobar' instead
9. /bin/sh runs foobar
10. QEMU intercepts the execve("foobar") and runs /usr/bin/qemu-arm-static -0 foobar foobar instead
11. QEMU starts emulating foobar
12. After foobar exits, cp /usr/bin/sh-shim /bin/sh restores the shim

Whoa, that was a lot of steps, but in the end it did what it should! It ran foobar under the emulator using just RUN foobar in the Dockerfile. Using the method above, your Dockerfiles will now look like this:

Much better.

The reason cross-build-end is needed is to rename /bin/sh.real back to /bin/sh. You can find the full source code for the two cross-build scripts here.

Creating an automated ARM build on Docker Hub

Using all the above you now have a way of writing Dockerfiles that build ARM images and can run anywhere.

As an example, I have created a GitHub repo that builds Python 2.7 from source. Afterwards I followed the normal procedure to create an automated build on Docker Hub.

I hope you enjoyed this hack, happy Christmas hacking!

If you have questions or just want to say hi, you can hang out with us in the balena forums.