Using NVIDIA Jetson NGC containers on balenaOS

Leverage pre-built containers from NVIDIA’s NGC Catalog and deploy them successfully on balenaOS.

Balena’s container-based technology streamlines development, deployment, and management of IoT Linux devices. However, when those devices include NVIDIA Jetson-based AI carrier boards and modules, there is the additional challenge of configuring an ML software stack such as PyTorch or TensorFlow along with all of its underlying dependencies. The BSP drivers, JetPack version, and CUDA libraries must match exactly, or things usually won’t work as expected. There is a potential alternative though…

Luckily, NVIDIA makes containers available that are “performance-optimized, tested and ready to deploy” on Jetson devices including all required dependencies which you can find in their NGC Catalog. In order for these containers to access an NVIDIA Jetson’s GPU, they require proprietary software provided by NVIDIA such as NVIDIA Docker, NVIDIA Container Toolkit, and NVIDIA Container Runtime. These tools mount certain Jetson-specific libraries (such as BSP libraries) and device nodes from the host into the container so the main application has everything it needs to run and access the Jetson’s Tegra GPU.

Since none of these Jetson libraries are found on a balenaOS host, the toolkit will not work as intended and the container will fail to access the GPU. We’ve written some blog posts with workarounds that involve building containers from scratch and then installing all of the necessary NVIDIA drivers and software while making sure they match exactly the BSP version on the host OS.

Wouldn’t it be great though if we could somehow leverage these pre-built containers from NVIDIA and execute them successfully on balenaOS? It’s a request we get fairly regularly in our forums, so we’ve outlined a process for doing just that below. If you don’t want to follow along, you can jump to the working Dockerfiles in our companion repository, but if you’d like to know the details and potentially enable other Catalog containers to work on balenaOS, read on…

Trial and error

Setting up the Container Runtime

Since the containers in question are meant to run on Jetson OS (which is based on Ubuntu 18.04) using the Container Runtime, I set that up first to see what was actually being installed on the host and, hopefully, what was being mounted into the container. We could then try to replicate that on balenaOS. (For this guide, I only had access to a Jetson TX2, which is currently limited to JetPack 4.6.3 and L4T 32.7.3.)

The SDK Manager is NVIDIA’s recommended tool for downloading the Jetson software, flashing the device, and installing the SDK components, so I installed it on an x86 desktop host. To flash a Jetson TX2 or Nano, the host needs to run Ubuntu 18.04, have at least 8 GB of RAM, and have a minimum of 80 GB of free disk space. With this setup, I successfully downloaded and flashed the TX2 with:
* Jetson OS, drivers, and file system v32.7.3
* CUDA Toolkit for L4T 10.2
* cuDNN 8.2
* NVIDIA Container Runtime 0.10.0
(Note: the SDK Manager was not able to successfully flash the TX2 from inside a VM, so you may need a bare-metal setup!)
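
Once flashed, you can confirm the BSP version that actually landed on the Jetson by reading the release file that ships with Jetson Linux; for this setup, the first line should report R32 with revision 7.3 (the GCID and date fields will vary):

cat /etc/nv_tegra_release
# R32 (release), REVISION: 7.3, ...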

Once the Jetson software environment was set up with a keyboard, mouse and monitor attached, it was time to try the first container.

L4T Base Container

For this I chose the simple L4T Base container, using the following command on the Jetson:

sudo docker run -it --rm --net=host -v /usr/local/cuda-10.2/samples:/usr/src/examples --runtime nvidia nvcr.io/nvidia/l4t-base:r32.7.1


Note that I mapped the CUDA samples from the host and also invoked the Container Runtime. To test this container’s GPU access, I ran the “deviceQuery” example from inside the container, which returned a lot of data, including these lines:

Detected 1 CUDA Capable device(s)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
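
For reference, the deviceQuery binary can be built and run from the mounted samples tree with something like the following (this assumes the stock CUDA 10.2 samples layout; adjust the paths if your copy of the samples is organized differently):

cd /usr/src/examples/1_Utilities/deviceQuery
make
./deviceQuery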

To see which device nodes were being mapped into the container, I installed strace and ran the “smokeParticles” demo with the following command:

/usr/src/examples# strace /usr/src/examples/smokeParticles 2>&1 | grep \"/dev

This returned the following output:

openat(AT_FDCWD, "/dev/mods", O_RDWR) = -1 ENOENT (No such file or directory)
faccessat(AT_FDCWD, "/dev/nvhost-ctrl-gpu", R_OK|W_OK) = 0
openat(AT_FDCWD, "/dev/nvgpu-pci", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/dev/nvhost-ctrl-gpu", O_RDWR|O_CLOEXEC) = 4
openat(AT_FDCWD, "/dev/nvmap", O_RDWR|O_SYNC|O_CLOEXEC) = 5
faccessat(AT_FDCWD, "/dev/nvhost-dbg-gpu", R_OK|W_OK) = 0
faccessat(AT_FDCWD, "/dev/nvhost-prof-gpu", R_OK|W_OK) = 0
openat(AT_FDCWD, "/dev/nvhost-ctrl", O_RDWR|O_SYNC|O_CLOEXEC) = 15
statfs("/dev/shm/", {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=16384, f_bfree=16384, f_bavail=16384, f_files=507234, f_ffree=507233, f_fsid={val=[0, 0]}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_NOSUID|ST_NODEV|ST_NOEXEC|ST_RELATIME}) = 0
openat(AT_FDCWD, "/dev/shm/cuda_injection_path_shm", O_RDWR|O_NOFOLLOW|O_CLOEXEC) = -1 ENOENT (No such file or directory)
faccessat(AT_FDCWD, "/dev/nvhost-ctrl-gpu", R_OK|W_OK) = 0
openat(AT_FDCWD, "/dev/nvgpu-pci", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/dev/nvhost-ctrl-gpu", O_RDWR|O_CLOEXEC) = 17
faccessat(AT_FDCWD, "/dev/nvhost-dbg-gpu", R_OK|W_OK) = 0
faccessat(AT_FDCWD, "/dev/nvhost-prof-gpu", R_OK|W_OK) = 0
openat(AT_FDCWD, "/dev/nvhost-prof-gpu", O_RDWR|O_CLOEXEC) = 18
openat(AT_FDCWD, "/dev/nvhost-ctrl", O_RDWR|O_SYNC|O_CLOEXEC) = 19

Reviewing the valid device nodes (ones without “ENOENT”) yields the following list:

/dev/nvhost-ctrl-gpu
/dev/nvmap
/dev/nvhost-dbg-gpu
/dev/nvhost-prof-gpu
/dev/nvhost-ctrl

Now that we have this information, we can try running the container on balenaOS without the Container Toolkit. The equivalent docker-compose file to test this would be the following:

version: '2'
services:
  cuda-toolkit:
    image: nvcr.io/nvidia/l4t-base:r32.7.1
    devices:
      - "/dev/nvhost-ctrl-gpu:/dev/nvhost-ctrl-gpu"
      - "/dev/nvmap:/dev/nvmap"
      - "/dev/nvhost-dbg-gpu:/dev/nvhost-dbg-gpu"
      - "/dev/nvhost-prof-gpu:/dev/nvhost-prof-gpu"
      - "/dev/nvhost-ctrl:/dev/nvhost-ctrl"
      - "/dev/shm:/dev/shm"

Before we push this, we need to make sure that the container is running the same version of L4T as the balenaOS host device, in our case a Jetson TX2. We’re using balenaOS version 2.113.33, and when we issue the command uname -a we see l4t-r32.7.3, so the nvcr.io/nvidia/l4t-base:r32.7.1 image from the Catalog is the closest match.
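
For example (run uname from the host OS terminal, not inside a container; the fleet name below is just a placeholder):

# On the balenaOS device: the kernel version string includes the L4T release, e.g. "l4t-r32.7.3"
uname -a
# From your workstation, in the directory containing the docker-compose.yml
balena push my-jetson-fleet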

However, if we push this to our balenaOS Jetson TX2 and run deviceQuery, we get:

-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

This tells us that the container expects to get the CUDA driver from the host, so we’ll need to install the same version as the SDK Manager installed when we were testing on Jetson OS with the Container Toolkit.

In addition, if we are OK with making the container privileged, we can remove the individual device nodes since privileged containers have access to everything in /dev.

The service in our compose file then becomes:

  cuda-toolkit:
    build: ./cuda-toolkit
    privileged: true

And in our Dockerfile (in the cuda-toolkit folder) we install BSP 32.7.3:

FROM nvcr.io/nvidia/l4t-base:r32.7.1 

# Don't prompt with any configuration questions 
ENV DEBIAN_FRONTEND noninteractive
# Install some utils

RUN apt-get update && apt-get install -y lbzip2 git wget unzip jq xorg tar python3 libegl1 binutils xz-utils bzip2

ENV UDEV=1

# Download and install BSP binaries for L4T 32.7.3, note new nonstandard URL

RUN cd /tmp/ && \
    wget https://developer.nvidia.com/downloads/remksjetpack-463r32releasev73t186jetsonlinur3273aarch64tbz2 && \
    tar xf remksjetpack-463r32releasev73t186jetsonlinur3273aarch64tbz2 && \
    rm remksjetpack-463r32releasev73t186jetsonlinur3273aarch64tbz2 && \
    cd Linux_for_Tegra && \
    sed -i 's/config.tbz2\"/config.tbz2\" --exclude=etc\/hosts --exclude=etc\/hostname/g' apply_binaries.sh && \
    sed -i 's/install --owner=root --group=root \"${QEMU_BIN}\" \"${L4T_ROOTFS_DIR}\/usr\/bin\/\"/#install --owner=root --group=root \"${QEMU_BIN}\" \"${L4T_ROOTFS_DIR}\/usr\/bin\/\"/g' nv_tegra/nv-apply-debs.sh && \
    sed -i 's/chroot . \// /g' nv_tegra/nv-apply-debs.sh && \
    ./apply_binaries.sh -r / --target-overlay && \
    cd .. && \
    rm -rf Linux_for_Tegra && \
    echo "/usr/lib/aarch64-linux-gnu/tegra" > /etc/ld.so.conf.d/nvidia-tegra.conf && \
    ldconfig

CMD [ "sleep", "infinity" ]

(All of these examples are available in the balena-jetson-catalog repository.)

Now when we run deviceQuery, we get (in part):

Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X2"
CUDA Driver Version / Runtime Version 10.2 / 10.2

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS

Success! We are now running a container from NVIDIA NGC on balenaOS without the Container Runtime.

Note that we can get some clues about the device nodes, files, and folders that are merged into the container via the Container Runtime by inspecting the CSV files in /etc/nvidia-container-runtime/host-files-for-container.d/ on our TX2 running Jetson OS. You can also try running an example with the ldd command to see which shared libraries are being accessed. This information can be handy later when you need to install missing dependencies (ones that are usually mapped from the host) into the container to run on balenaOS.
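
For example, on the TX2 running Jetson OS (the exact CSV filenames and entries vary by JetPack version, so treat these paths as a starting point):

# List the mount specifications used by the NVIDIA Container Runtime
ls /etc/nvidia-container-runtime/host-files-for-container.d/
# Show only the device-node entries that would be mapped into containers
grep '^dev' /etc/nvidia-container-runtime/host-files-for-container.d/*.csv
# Inside the container, list the shared libraries a sample binary pulls in
ldd /usr/src/examples/smokeParticles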

For the smokeParticles example above, ldd returned:

linux-vdso.so.1 (0x0000007fa49dd000)
libGL.so.1 => /usr/lib/aarch64-linux-gnu/libGL.so.1 (0x0000007fa44c4000)
libGLU.so.1 => /usr/lib/aarch64-linux-gnu/libGLU.so.1 (0x0000007fa4452000)
libglut.so.3 => /usr/lib/aarch64-linux-gnu/libglut.so.3 (0x0000007fa43ff000)
librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000007fa43e8000)
libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000007fa43bc000)
libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000007fa43a7000)
libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000007fa4213000)
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000007fa415a000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000007fa4136000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000007fa3fdd000)
/lib/ld-linux-aarch64.so.1 (0x0000007fa49b1000)
libGLX.so.0 => /usr/lib/aarch64-linux-gnu/libGLX.so.0 (0x0000007fa3f9d000)
libGLdispatch.so.0 => /usr/lib/aarch64-linux-gnu/libGLdispatch.so.0 (0x0000007fa3e71000)
libX11.so.6 => /usr/lib/aarch64-linux-gnu/libX11.so.6 (0x0000007fa3d48000)
libXi.so.6 => /usr/lib/aarch64-linux-gnu/libXi.so.6 (0x0000007fa3d2a000)
libXxf86vm.so.1 => /usr/lib/aarch64-linux-gnu/libXxf86vm.so.1 (0x0000007fa3d15000)
libxcb.so.1 => /usr/lib/aarch64-linux-gnu/libxcb.so.1 (0x0000007fa3ce5000)
libXext.so.6 => /usr/lib/aarch64-linux-gnu/libXext.so.6 (0x0000007fa3cc5000)
libXau.so.6 => /usr/lib/aarch64-linux-gnu/libXau.so.6 (0x0000007fa3cb2000)
libXdmcp.so.6 => /usr/lib/aarch64-linux-gnu/libXdmcp.so.6 (0x0000007fa3c9d000)
libbsd.so.0 => /lib/aarch64-linux-gnu/libbsd.so.0 (0x0000007fa3c7b000)

Additional containers

Below is a quick summary of the other containers we have running on balenaOS…

TensorRT

With the TensorRT container, we’ll also download and install the BSP binaries for L4T 32.7.3, just as with the previous container. We can then test the container’s GPU access by compiling and running the sampleOnnxMNIST example, roughly as sketched below.
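
This sketch assumes the TensorRT samples tree has been copied into the image at /usr/src/app/tensorrt-samples (the path matching the --datadir argument in the output below) and that the sample Makefile drops the compiled binary into the tree’s bin directory, as the stock TensorRT samples do; adjust the paths for your own layout:

cd /usr/src/app/tensorrt-samples/samples/sampleOnnxMNIST
make
cd /usr/src/app/tensorrt-samples/bin
./sample_onnx_mnist --datadir /usr/src/app/tensorrt-samples/data/mnist

When executed, the sample produces (in part) the following output: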

&&&& RUNNING TensorRT.sample_onnx_mnist [TensorRT v8201] # ./sample_onnx_mnist --datadir /usr/src/app/tensorrt-samples/data/mnist
[04/12/2023-02:38:25] [I] Building and running a GPU inference engine for Onnx MNIST
[04/12/2023-02:38:26] [I] [TRT] [MemUsageChange] Init CUDA: CPU +240, GPU +0, now: CPU 258, GPU 6335 (MiB)
[04/12/2023-02:38:26] [I] [TRT] ----------------------------------------------------------------

... (ASCII-art rendering of the input MNIST digit omitted) ...
[04/12/2023-02:38:31] [I] Output:
[04/12/2023-02:38:31] [I] Prob 0 0.9998 Class 0: ****
[04/12/2023-02:38:31] [I] Prob 1 0.0000 Class 1:
[04/12/2023-02:38:31] [I] Prob 2 0.0000 Class 2:
[04/12/2023-02:38:31] [I] Prob 3 0.0000 Class 3:
[04/12/2023-02:38:31] [I] Prob 4 0.0000 Class 4:
[04/12/2023-02:38:31] [I] Prob 5 0.0000 Class 5:
[04/12/2023-02:38:31] [I] Prob 6 0.0002 Class 6:
[04/12/2023-02:38:31] [I] Prob 7 0.0000 Class 7:
[04/12/2023-02:38:31] [I] Prob 8 0.0000 Class 8:
[04/12/2023-02:38:31] [I] Prob 9 0.0000 Class 9:
[04/12/2023-02:38:31] [I]
[04/12/2023-02:38:31] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 914, GPU 7002 (MiB)
&&&& PASSED TensorRT.sample_onnx_mnist [TensorRT v8201] # ./sample_onnx_mnist --datadir /usr/src/app/tensorrt-samples/data/mnist

PyTorch

Next we’ll take the PyTorch container, add the L4T BSP binaries as before, and run a PyTorch test on balenaOS, which initially returns the following errors:

OSError: libcurand.so.10: cannot open shared object file: No such file or directory

OSError: /usr/lib/aarch64-linux-gnu/libcudnn.so.8: file too short

The libcurand file is part of the CUDA Toolkit and libcudnn is part of cuDNN, so we’ll install both of those packages in our container. To do that, we’ll also need to add the proper NVIDIA apt sources. (See the final Dockerfile in the companion repository.)
Remember to use the same versions of these packages as the SDK Manager installed when we were using Jetson OS.
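
As a rough sketch, the additions to the Dockerfile look something like the following. The repository URLs and package names are the ones NVIDIA publishes for JetPack 4.6.x / L4T r32.7 on a TX2 (the t186 feed), and the snippet assumes wget and CA certificates are already present in the image, so double-check everything against your own BSP and SDK Manager versions:

# Add NVIDIA's Jetson apt repositories (common + t186 for the TX2) and install
# the CUDA Toolkit and cuDNN versions the container expects to find
RUN wget -qO /etc/apt/trusted.gpg.d/jetson-ota-public.asc https://repo.download.nvidia.com/jetson/jetson-ota-public.asc && \
    echo "deb https://repo.download.nvidia.com/jetson/common r32.7 main" > /etc/apt/sources.list.d/nvidia-l4t-apt-source.list && \
    echo "deb https://repo.download.nvidia.com/jetson/t186 r32.7 main" >> /etc/apt/sources.list.d/nvidia-l4t-apt-source.list && \
    apt-get update && \
    apt-get install -y --no-install-recommends cuda-toolkit-10-2 libcudnn8 && \
    rm -rf /var/lib/apt/lists/*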

Now our container is properly accessing the Tegra GPU:

Checking for CUDA device(s) for PyTorch...
Torch CUDA available: True
Torch CUDA current device: 0
Torch CUDA device info:
Torch CUDA device count: 1
Torch CUDA device name: NVIDIA Tegra X2

TensorFlow

For the NVIDIA L4T TensorFlow container, we’ll also install the CUDA Toolkit and cuDNN to avoid error messages similar to those we saw with the PyTorch container. Running the included TensorFlow example returns (in part) the following results, confirming that the container is performing calculations on the GPU while running on balenaOS:

root@31b21c30ed70:/usr/src# python3 tf-test.py
Num GPUs Available: 1
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
<function is_built_with_cuda at 0x7f67f3a510>
2023-04-15 02:47:31.705430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /device:GPU:0 with 5014 MB memory: -> device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2
/device:GPU:0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2023-04-15 02:47:31.722144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5014 MB memory: -> device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2
2023-04-15 02:47:34.167682: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
2023-04-15 02:47:34.170760: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
2023-04-15 02:47:34.184665: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
[49. 64.]], shape=(2, 2), dtype=float32)
2023-04-15 02:47:35.958929: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
[49. 64.]], shape=(2, 2), dtype=float32)
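
If you just want a quick sanity check before running the full example, listing the visible GPUs from inside the container should show the Tegra device (this assumes the TF2 variant of the image):

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"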

Going further

So far, we’ve looked at examples of running containers from the NVIDIA NGC Catalog on a Jetson TX2 with L4T 32.7.3. However, the same process outlined above can be followed to see whether containers for newer Jetson devices and updated versions of JetPack will also work on balenaOS. In fact, some more recent containers from the Catalog no longer expect CUDA and other dependencies to be installed on the host, potentially making the conversion to running on balenaOS even easier. In an upcoming follow-up to this post, we’ll look at some options for NGC containers using a Jetson AGX Orin.

The repository for this guide has all of the Dockerfiles and examples listed above. You can use them as-is, build further functionality on top of them, or use them as a guide to get other official containers working on balenaOS.

Here’s a summary of the steps from above; note that not all of them will be necessary, and some trial and error may be required:

  • Choose a container that is based on the same JetPack version as the balenaOS host
  • Install the appropriate BSP binaries in the container for the JetPack version used by the container and host
  • Make the container privileged; otherwise, run the container on Jetson OS and use strace to see which device nodes you need to map individually
  • If you receive errors about missing files or dependencies when running on balenaOS, install them into the container
  • If all else fails, try running the container on Jetson OS and inspect the CSV files or use ldd as described above to find files that the container expects

Feedback

Let us know if you have successfully run these or other containers from the NGC Catalog on balenaOS. Were the steps outlined here useful or do you have any to add? Please let us know in the comments below or contact us in our forums.
