Back in 2013, we ported Docker to ARM. Shortly afterwards, we wanted to use that port to offer ARM builds to our users. However, ARM server hardware was difficult to find at the time, so we started looking for an emulated solution.
Enter QEMU. QEMU is a wonderful project aimed at emulating other CPU architectures. It has two modes of emulation: system emulation and user emulation. In system emulation, the emulated system behaves like a VM, with its own emulated kernel. For our container use case, having to spin up a VM for every container doesn’t sound very appealing. Fortunately, user mode emulation is a much better fit. In that mode, QEMU runs the binary code of a foreign architecture as a host process, and at the same time translates any guest system calls to host system calls.
To see this mode in action, let’s compile a hello world program for ARM and run it:
petrosagg@rachmaninoff ~ % GOARCH=arm go build ./hello.go
petrosagg@rachmaninoff ~ % qemu-arm ./hello
Hello, world!
First steps
So how can this be used to emulate a whole container? Since a container can only access its own private filesystem, the first step is getting the emulator into the container. This simply means COPY’ing the executable into the image:
However, building the above image produces a not so descriptive error:
Step 3 : RUN /usr/bin/qemu-arm /bin/echo Hello from ARM container
---> Running in 9262e39b9ca3
no such file or directory
[8] System error: no such file or directory
The reason for this error is that qemu-arm is a dynamically linked x86 binary which requires a number of x86 shared libraries that don’t exist in the image. The loader tries to find those files but fails, and so reports no such file or directory. Indeed:
petrosagg@rachmaninoff % ldd qemu-arm
linux-vdso.so.1 (0x00007ffe73fcc000)
libgthread-2.0.so.0 => /usr/lib/libgthread-2.0.so.0 (0x00007f9acac12000)
libglib-2.0.so.0 => /usr/lib/libglib-2.0.so.0 (0x00007f9aca904000)
libz.so.1 => /usr/lib/libz.so.1 (0x00007f9aca6ee000)
librt.so.1 => /usr/lib/librt.so.1 (0x00007f9aca4e6000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f9aca164000)
libm.so.6 => /usr/lib/libm.so.6 (0x00007f9ac9e66000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f9ac9c50000)
libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007f9ac9a33000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007f9ac968f000)
libpcre.so.1 => /usr/lib/libpcre.so.1 (0x00007f9ac941f000)
/lib64/ld-linux-x86-64.so.2 (0x00007f9acae14000)
The easiest way to fix this is to use a statically linked version of QEMU, which has no dependencies on its environment. Creating one is a matter of grabbing a copy of the source code, running ./configure with --static and then make. After this, the statically linked executable will be in qemu/arm-linux-user/qemu-arm:
git clone git://git.qemu.org/qemu.git
cd qemu
./configure --target-list=arm-linux-user --static
make
With the newly created binary the build is now successful:
Step 3 : RUN /usr/bin/qemu-arm /bin/echo Hello from ARM container
---> Running in 3966c442f619
Hello from ARM container
---> a2b50782c6f8
These are not the binaries you’re looking for
While the previous example worked, it’s still far from done. Let’s try a slightly more complex Dockerfile. For example, the same echo, this time invoked through a shell.
This gives a new error:
Step 3 : RUN /usr/bin/qemu-arm-static /bin/sh -c /bin/echo Hello from ARM container
---> Running in 92e10b7eb1f2
/bin/sh: 1: /bin/echo: Exec format error
This error happens when trying to run an ARM binary on x86. But wait! The whole thing is prefixed by the QEMU emulator. What is going on here? The difference between the previous Dockerfile and this one is that this one starts a child process. On Linux, child processes are started by forking and then doing the execve() system call from the child process. Since QEMU merely translates system calls from the guest process to the host kernel, when the emulated /bin/sh calls execve("/bin/echo", ..), QEMU will happily pass this on to the kernel, but the kernel has no idea what to do with this file since /bin/echo is an ARM binary!
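To make the fork-then-execve() pattern concrete, here is a minimal sketch in plain C, unrelated to the QEMU or Docker sources. The helper name exec_errno is hypothetical. It forks a child, attempts the execve() there, and hands the resulting errno back to the parent over a pipe. On a typical x86 host without any binfmt_misc setup, pointing it at an ARM binary should be expected to report ENOEXEC, which is the "Exec format error" above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, try to execve() `path` in it, and return the errno that
 * execve() failed with, 0 if the exec succeeded, or -1 on setup errors.
 * exec_errno is a hypothetical helper used only for illustration.
 */
int exec_errno(const char *path)
{
    assert(path != NULL);

    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: the write end is closed automatically if execve() succeeds. */
        close(fds[0]);
        fcntl(fds[1], F_SETFD, FD_CLOEXEC);

        char *const argv[] = { (char *)path, NULL };
        char *const envp[] = { NULL };
        execve(path, argv, envp);

        /* Only reached when execve() failed: report errno to the parent. */
        int err = errno;
        ssize_t written = write(fds[1], &err, sizeof err);
        (void)written;
        _exit(127);
    }

    /* Parent: an empty read means the exec succeeded and closed the pipe. */
    close(fds[1]);
    int err = 0;
    ssize_t n = read(fds[0], &err, sizeof err);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == (ssize_t)sizeof err ? err : 0;
}
```

The close-on-exec pipe is just one way of telling a failed exec apart from a successful one. A shell does something simpler and prints the error from the child directly.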
To fix this issue, the kernel needs to know what to do when requested to run ARM ELF binaries. This requires a kernel with binfmt_misc support, either built in or compiled as a module. With it, you can associate an interpreter, like qemu-arm-static, with a binary pattern. If a file matches the pattern, binfmt_misc will run it using the specified interpreter. Let’s load this module and set all ARM binaries to be run using /usr/bin/qemu-arm-static.
mount binfmt_misc -t binfmt_misc /proc/sys/fs/binfmt_misc
echo ':arm:M::\x7fELF\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x28\x00:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff:/usr/bin/qemu-arm-static:' > /proc/sys/fs/binfmt_misc/register
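The registration line has the form :name:type:offset:magic:mask:interpreter:flags, where the M type means "match on magic bytes". As a rough illustration of what that match amounts to, the sketch below compares the first 20 bytes of a file against the magic, ignoring any bit that is cleared in the mask. The byte values are copied from the line above; the function name matches_arm_elf is hypothetical, and this is a userspace approximation rather than the kernel's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Magic and mask bytes copied from the binfmt_misc registration line. */
static const unsigned char arm_magic[20] = {
    0x7f, 'E',  'L',  'F',  0x01, 0x01, 0x01, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x02, 0x00, 0x28, 0x00
};
static const unsigned char arm_mask[20] = {
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0xfe, 0xff, 0xff, 0xff
};

/*
 * Return 1 if `header` (the first `len` bytes of a file) matches the
 * ARM ELF pattern, 0 otherwise. Only bits set in the mask are compared:
 * byte 7 (the OS ABI) is ignored entirely, and the low bit of byte 16 is
 * ignored so that both e_type 2 (executable) and 3 (shared object) match.
 * Bytes 18-19 hold e_machine, 0x28 for ARM.
 */
int matches_arm_elf(const unsigned char *header, size_t len)
{
    assert(header != NULL);

    if (len < sizeof arm_magic)
        return 0;

    for (size_t i = 0; i < sizeof arm_magic; i++) {
        if ((header[i] ^ arm_magic[i]) & arm_mask[i])
            return 0;
    }
    return 1;
}
```

An x86-64 executable fails this check at byte 4 (ELF class) and again at byte 18 (machine type), so it keeps running natively.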
The build of the forking Dockerfile now works like a charm:
Step 3 : RUN /usr/bin/qemu-arm-static /bin/sh -c /bin/echo Hello from ARM container
---> Running in 75efe5a4be03
Hello from ARM container
---> fc276c0321f3
As you may have noticed, I have been using the exec form for the RUN statements up to now. This was required because the alternative shell form always invokes a default shell (/bin/sh) and passes whatever is after the RUN statement as a parameter. This means that even if you wrote RUN /usr/bin/qemu-arm-static /bin/echo foobar it would end up running /bin/sh -c '/usr/bin/qemu-arm-static /bin/echo foobar', which would bypass the emulator and give the Exec format error. With the binfmt_misc module loaded this is no longer required, and you don’t even have to prefix every command with qemu-arm-static. When Docker attempts to run /bin/sh, the kernel will automatically detect it is an ARM executable and invoke QEMU!
In fact, all our base images include the QEMU binary, so if you have binfmt_misc set up you can just do this:
Dropping the kernel dependency
This has been great so far. With a correctly configured kernel you can run ARM Docker containers transparently. But what happens if you want to run ARM builds on some other system, like a hosted CI service? Or use the automated builds of Docker Hub? You can’t expect to have access to configure the kernel, and so far the only option has been to host a custom CI system with a modified kernel.
Let’s think about how binfmt_misc works for a moment. Some process calls execve() with the path of an ARM executable. The kernel tries to load it with its standard binary format handlers, i.e. native ELF and shebang scripts (yes, this is handled in the kernel), and if they fail it tries to load it with binfmt_misc. Then, binfmt_misc matches the executable signature with the one registered to run with /usr/bin/qemu-arm-static, and creates a new exec request to the kernel, this time requesting to run the interpreter and passing the original ARM executable as a parameter. Can this be done without specific kernel support and configuration?
Enter QEMU, for real this time. I mentioned previously that QEMU emulates the foreign architecture and translates system calls and signals. This means that when the guest process makes a system call, what really happens is a function call in QEMU which then does the real system call after some processing. Specifically, do_syscall() in qemu/linux-user/syscall.c handles all guest system calls.
What if you intercepted the translation of every execve() call and did something similar to what binfmt_misc does? Let’s define a qemu_execve() function and then replace the original execve() call with it:
The above function is a very simple version of binfmt_misc and has the same signature as the normal execve(), so it can just replace calls to the real execve(). Let’s see what it does in more detail.
return get_errno(execve("/usr/bin/qemu-arm-static", new_argp, envp));
The first thing to observe is that when it does the real execve() call, it runs /usr/bin/qemu-arm-static unconditionally. It also uses the original environment (envp) unmodified, but uses a slightly modified argument vector.
The new argument vector is created with 3 more slots than the original one. Then, the original arguments are copied into the new argument vector, offset by 3 slots. At this point, new_argp[0], new_argp[1] and new_argp[2] are undefined and new_argp[3] has the original argv[0] of the guest process.
new_argp[0] = strdup("/usr/bin/qemu-arm-static");
new_argp[1] = strdup("-0");
new_argp[2] = argv[0];
new_argp[3] = filename;
These are the arguments passed to the emulator so that it can correctly emulate the guest process. new_argp[0] is the argv[0] of the emulator, which is just the path to it.
Next, the original argv[0] for the guest process is preserved using the -0 parameter of QEMU, followed by the argv[0] we want. This is very important for some binaries, with busybox being a prime example, since it decides which applet to run based on argv[0].
Finally, new_argp[3] is the original filename, which is the path to the ARM executable. After that, new_argp[4] will contain the guest process’ argv[1], and so on.
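Putting the description above together, the argument rewriting can be sketched as a small standalone function. This is not the code from the QEMU patch: build_emulator_argv is a hypothetical helper, it takes the emulator path as a parameter instead of hard-coding it, and it borrows the caller's string pointers rather than calling strdup(). It only shows the resulting layout of the new argument vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Build the argument vector used to re-run a guest binary under the
 * emulator:
 *
 *   { emulator, "-0", argv[0], filename, argv[1], ..., NULL }
 *
 * The returned array is allocated with calloc() and must be released with
 * free(). The strings themselves are borrowed from the caller.
 * Returns NULL if the allocation fails.
 */
char **build_emulator_argv(const char *emulator, const char *filename,
                           char *const argv[])
{
    assert(emulator != NULL && filename != NULL && argv != NULL);

    size_t argc = 0;
    while (argv[argc] != NULL)
        argc++;

    /* Number of guest arguments that follow argv[0]. */
    size_t rest = argc > 0 ? argc - 1 : 0;

    /* 4 leading slots + remaining guest arguments + NULL terminator. */
    char **new_argp = calloc(4 + rest + 1, sizeof *new_argp);
    if (new_argp == NULL)
        return NULL;

    new_argp[0] = (char *)emulator;   /* argv[0] of the emulator itself */
    new_argp[1] = "-0";               /* QEMU option: set the guest argv[0] */
    /* Fall back to the filename if the caller passed an empty argv. */
    new_argp[2] = argc > 0 ? argv[0] : (char *)filename;
    new_argp[3] = (char *)filename;   /* path of the guest executable */

    for (size_t i = 0; i < rest; i++)
        new_argp[4 + i] = argv[1 + i];

    /* calloc() zeroed the array, so the last slot is already NULL. */
    return new_argp;
}
```

For example, a guest call of execve("/bin/sh", {"sh", "-c", "foobar"}, envp) would turn into the vector {"/usr/bin/qemu-arm-static", "-0", "sh", "/bin/sh", "-c", "foobar"}.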
So what can you do with this? Basically, if an ARM process starts under the modified QEMU emulator there is no way for it to escape! Any execve calls will result in a re-instantiation of the emulator through the qemu_execve function. In reality, there is a bit more that you have to take care of in the handler. Specifically, shebang scripts need to be handled in QEMU, before reaching the kernel. See the full code for all the details.
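To give an idea of what handling shebang scripts involves, here is a minimal sketch of the parsing step only. It is not taken from the full code: parse_shebang is a hypothetical name, and it assumes the usual Linux behaviour where the first word after #! is the interpreter and everything after it on the line is passed as a single optional argument.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Parse the first line of a script. If it starts with "#!", copy the
 * interpreter path into `interp` and the optional argument (or an empty
 * string) into `arg`, and return 1. Return 0 if the line is not a shebang
 * line or if either part does not fit into its buffer.
 */
int parse_shebang(const char *line, char *interp, size_t interp_size,
                  char *arg, size_t arg_size)
{
    assert(line != NULL && interp != NULL && arg != NULL);

    if (strncmp(line, "#!", 2) != 0)
        return 0;

    /* Skip whitespace between "#!" and the interpreter path. */
    const char *p = line + 2;
    p += strspn(p, " \t");

    /* The interpreter runs up to the next whitespace or end of line. */
    size_t n = strcspn(p, " \t\n");
    if (n == 0 || n >= interp_size)
        return 0;
    memcpy(interp, p, n);
    interp[n] = '\0';

    /* The rest of the line, trimmed, is one single argument. */
    p += n;
    p += strspn(p, " \t");
    size_t m = strcspn(p, "\n");
    while (m > 0 && (p[m - 1] == ' ' || p[m - 1] == '\t'))
        m--;
    if (m >= arg_size)
        return 0;
    memcpy(arg, p, m);
    arg[m] = '\0';

    return 1;
}
```

A handler built on this would run the interpreter with the optional argument, then the script path, then the original arguments, much like the kernel does for native scripts.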
At this point you can use this to write ARM Dockerfiles that will work on any Docker host, as long as the initial process is run under QEMU. This requirement will make your Dockerfiles look like this:
Improving the syntax
While the above works fine, it’s cumbersome to write everything in exec notation and on top of that prefix all the commands with the emulator path. Let’s see if the syntax can be improved a bit.
As mentioned earlier, Docker converts RUN foobar to the following command line: /bin/sh -c 'foobar'. In an ARM image, however, /bin/sh will be an ARM binary and will cause the now well-known Exec format error. What if you could replace /bin/sh with something that the host system can run natively, which then calls the emulator and runs the original /bin/sh?
It turns out that if you rename /bin/sh to /bin/sh.real and then put the following contents in /bin/sh and /usr/bin/sh-shim, an interesting chain reaction happens:
Let’s walk through the steps the system goes through when Docker runs /bin/sh -c 'foobar':
- The kernel receives the execve("/bin/sh", ..) from Docker
- This file starts with #!, so the kernel parses the first line to get the interpreter
- The kernel runs /usr/bin/qemu-arm-static /bin/sh.real /bin/sh -c 'foobar'
- QEMU starts emulating /bin/sh.real, which is an ARM binary, with /bin/sh -c 'foobar' as parameters
- /bin/sh.real reads its first parameter, /bin/sh, and starts interpreting it, ignoring the first line starting with #!
- /bin/sh.real runs cp /bin/sh.real /bin/sh which temporarily restores /bin/sh to its original contents
- /bin/sh.real runs exec /bin/sh "$@" which gets expanded to exec /bin/sh -c 'foobar'
- QEMU intercepts the execve("/bin/sh") and runs /usr/bin/qemu-arm-static -0 /bin/sh /bin/sh -c 'foobar' instead
- /bin/sh runs foobar
- QEMU intercepts the execve("foobar") and runs /usr/bin/qemu-arm-static -0 foobar foobar instead
- QEMU starts emulating foobar
- After foobar exits, cp /usr/bin/sh-shim /bin/sh restores the shim
Whoa, that was a lot of steps, but in the end it did what it should! It ran foobar under the emulator using just RUN foobar in the Dockerfile. Using the method above, your Dockerfiles will now look like this:
Much better.
The reason cross-build-end is needed is to rename /bin/sh.real back to /bin/sh. You can find the full source code for the two cross-build scripts here.
Creating an automated ARM build on Docker Hub
Using all the above you now have a way of writing Dockerfiles that build ARM images and can run anywhere.
As an example, I have created a GitHub repo that builds Python 2.7 from source. Afterwards I followed the normal procedure to create an automated build on Docker Hub.
I hope you enjoyed this hack, happy Christmas hacking!
If you have questions or just want to say hi, you can hang out with us in the balena forums.