CUDA Unknown Error with Podman | Thoughts | Yin Jun, Phua -- Assistant Professor at Tokyo Tech

As I was setting up my new deep learning workstation with Debian 12, I found out that NVIDIA now supports Podman. With all kinds of bad rumors surrounding Docker, I decided to try out Podman.

Once I had worked through the installation and set up CDI for Podman, I was able to get a CUDA container up and running. Running nvidia-smi in the container confirmed that the GPU was available. However, every CUDA program I tried to run failed with an "Unknown error" message. The message itself contained no useful information, so I had no lead to start troubleshooting from.

At first, I suspected a CUDA version mismatch between the host and the container. But even with the same CUDA version on both, it still didn't work. Searching on Google didn't yield anything helpful either.

I had another Debian machine running CUDA in a Docker container just fine, so I figured I might as well check whether there were any differences between the two containers. I stumbled upon a GitHub issue about a totally different problem, but it suggested that missing device files under /dev can cause CUDA to fail. This prompted me to look further into it.

Interestingly enough, there was a difference. The /dev folder in the Docker container included /dev/nvidia-uvm and /dev/nvidia-uvm-tools, whereas the Podman container did not. I realized that the nvidia-uvm kernel module wasn't loaded on the new workstation. So I tried to modprobe nvidia-uvm and check to see if it would solve the issue... and it didn't.
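The comparison described above can be scripted. The following is a minimal sketch, not something from the original setup: the function names `missing_uvm_nodes` and `uvm_module_loaded` are made up for illustration, and both take an optional path argument only so they can be pointed at a directory or file other than the real /dev and /proc/modules.

```shell
#!/bin/sh
# Hypothetical helpers for spotting the missing UVM pieces.

# Print the name of each expected UVM device node that is absent.
# Usage: missing_uvm_nodes [DEVDIR]   (DEVDIR defaults to /dev)
missing_uvm_nodes() {
    devdir="${1:-/dev}"
    for node in nvidia-uvm nvidia-uvm-tools; do
        # -e is true for any existing file type, including character devices
        [ -e "$devdir/$node" ] || echo "$node"
    done
}

# Succeed if the nvidia_uvm module is listed in a /proc/modules-style file.
# The kernel lists the module with an underscore, even though
# modprobe also accepts the hyphenated name.
# Usage: uvm_module_loaded [MODULES_FILE]   (defaults to /proc/modules)
uvm_module_loaded() {
    grep -q '^nvidia_uvm ' "${1:-/proc/modules}"
}
```

Running `missing_uvm_nodes` with no argument inside each container shows which device files that container lacks, and `uvm_module_loaded` on the host tells whether the kernel module is present at all.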

It turns out that manually loading the nvidia-uvm kernel module doesn't create the device files. I found a script on NVIDIA's website that creates the necessary device files, and ran the relevant commands as root:

# grep nvidia-uvm /proc/devices
<device-id> nvidia-uvm
# mknod -m 666 /dev/nvidia-uvm c <device-id> 0
# mknod -m 666 /dev/nvidia-uvm-tools c <device-id> 1
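To avoid copying the device ID by hand, the commands above can be wrapped in a small script. This is only a sketch under a few assumptions: `uvm_major` and `create_uvm_nodes` are names invented here, the major number is taken from the first `nvidia-uvm` entry in /proc/devices, and the minor numbers 0 and 1 simply mirror the commands above.

```shell
#!/bin/sh
# Hypothetical helpers; not part of NVIDIA's script.

# Print the major number registered for nvidia-uvm.
# Usage: uvm_major [DEVICES_FILE]   (defaults to /proc/devices)
uvm_major() {
    awk '$2 == "nvidia-uvm" { print $1; exit }' "${1:-/proc/devices}"
}

# Create both UVM device nodes. Must run as root, after the module is loaded.
create_uvm_nodes() {
    major="$(uvm_major)"
    if [ -z "$major" ]; then
        echo "nvidia-uvm not found in /proc/devices; is the module loaded?" >&2
        return 1
    fi
    # Minor 0 is nvidia-uvm and minor 1 is nvidia-uvm-tools, as in the commands above.
    [ -e /dev/nvidia-uvm ] || mknod -m 666 /dev/nvidia-uvm c "$major" 0
    [ -e /dev/nvidia-uvm-tools ] || mknod -m 666 /dev/nvidia-uvm-tools c "$major" 1
}
```

As root, `modprobe nvidia-uvm && create_uvm_nodes` would then load the module and create any node that does not exist yet.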

Then, after regenerating the CDI configuration file, it finally worked.
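For reference, regenerating the CDI configuration and checking that it picked up the new device files might look like the sketch below. The `nvidia-ctk cdi generate` command comes from the NVIDIA Container Toolkit; the output path /etc/cdi/nvidia.yaml is an assumption about where the spec lives, and `regenerate_cdi` and `cdi_has_uvm` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helpers; the spec path is an assumption, adjust it to your setup.

# Regenerate the CDI spec so it includes the newly created device nodes (run as root).
# Usage: regenerate_cdi [SPEC_FILE]
regenerate_cdi() {
    nvidia-ctk cdi generate --output="${1:-/etc/cdi/nvidia.yaml}"
}

# Succeed only if the spec mentions both /dev/nvidia-uvm and /dev/nvidia-uvm-tools.
# Usage: cdi_has_uvm [SPEC_FILE]
cdi_has_uvm() {
    spec="${1:-/etc/cdi/nvidia.yaml}"
    # The first pattern must not be satisfied by the longer -tools path alone.
    grep -Eq '/dev/nvidia-uvm([^-[:alnum:]]|$)' "$spec" &&
        grep -q '/dev/nvidia-uvm-tools' "$spec"
}
```

If `cdi_has_uvm` fails after `regenerate_cdi`, the device files were most likely still missing when the spec was generated.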

nvidia-uvm seems to be loaded automatically when a program initializes CUDA. But for whatever reason, this wasn't happening with programs running within the Podman container. I have yet to determine whether this is down to running the container rootless or to not passing the --privileged flag, but for now, manually loading the kernel module and creating the device files seems to fix the issue.
