NVIDIA Container Runtime May Cause Dynamic Library Resolution Issues

Another boring day. I received feedback from a user who couldn’t execute the curl command in his container.

It was a symbol lookup error: undefined symbol curl_mime_free.

I created a container from the same image on my machine, ran the same command, and it worked fine.

This didn’t make sense intuitively. If the same image is used, the file systems should at least be identical.

To investigate further, I checked the dynamic libraries linked by the curl executable in both environments using the ldd command.

Then I found the issue: the curl executable in the user’s container was linked to a libcurl.so at a different path than the one in my container.
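
A rough sketch of the comparison (the exact paths here are hypothetical, for illustration only):

# In my container (default runtime)
$ ldd /usr/bin/curl | grep libcurl
        libcurl.so.4 => /usr/local/lib/libcurl.so.4

# In the user's container (NVIDIA runtime)
$ ldd /usr/bin/curl | grep libcurl
        libcurl.so.4 => /usr/lib/x86_64-linux-gnu/libcurl.so.4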

I reran the ldd command with the LD_DEBUG environment variable set to libs to see how the dynamic linker resolves the dynamic libraries.
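
For reference, the output looks roughly like this (abbreviated, with the same hypothetical paths):

$ LD_DEBUG=libs ldd /usr/bin/curl
      1234:     find library=libcurl.so.4 [0]; searching
      1234:      search cache=/etc/ld.so.cache
      1234:       trying file=/usr/lib/x86_64-linux-gnu/libcurl.so.4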

It turned out that the entry for libcurl.so in ld.so.cache differed between the two environments.
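
The cache can also be inspected directly with ldconfig -p, which made the difference obvious (hypothetical paths again):

# My container
$ ldconfig -p | grep libcurl.so.4
        libcurl.so.4 (libc6,x86-64) => /usr/local/lib/libcurl.so.4

# The user's container
$ ldconfig -p | grep libcurl.so.4
        libcurl.so.4 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcurl.so.4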

But why? I used the same image and the same run command arguments; if something had triggered an ld.so.cache update, it should have been the same in both environments.

I suddenly realized that the user’s container was created with the NVIDIA container runtime, while mine was created with the default container runtime.

I had overlooked the --runtime parameter because the user’s dockerd was configured to use NVIDIA’s runtime by default.
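
That default lives in dockerd’s daemon.json; the user’s presumably contained a line like this, alongside the runtimes block shown further below:

{
    "default-runtime": "nvidia"
}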

To find out when ld.so.cache was updated, I checked the source code of the NVIDIA Container Toolkit.

ldconfig is executed in the function nvc_ldcache_update, with the default libs_dir being /usr/lib/x86_64-linux-gnu:

int
nvc_ldcache_update(struct nvc_context *ctx, const struct nvc_container *cnt)
{
        // ...
        /* Rebuild the container's ld.so.cache. The directories passed as
         * positional arguments (libs_dir, libs32_dir) are scanned in
         * addition to those listed in /etc/ld.so.conf. */
        argv = (char * []){cnt->cfg.ldconfig, "-f", "/etc/ld.so.conf", "-C", "/etc/ld.so.cache", cnt->cfg.libs_dir, cnt->cfg.libs32_dir, NULL};
        // ...
}
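
In other words, with the default libs_dir the toolkit effectively runs the following inside the container (libs32_dir omitted for brevity):

ldconfig -f /etc/ld.so.conf -C /etc/ld.so.cache /usr/lib/x86_64-linux-gnu

A directory given on the ldconfig command line is scanned in addition to those in /etc/ld.so.conf, and its entries can take precedence in the rebuilt cache, which is enough to flip which libcurl.so the cache points at.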

That’s why the curl executable in the user’s container was linked to libcurl.so at the wrong path: the NVIDIA runtime rebuilt the cache with libs_dir passed explicitly, and the entry from that directory won.

The nvc_ldcache_update function is called when the nvidia-container-cli configure subcommand is executed.
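
For context, that subcommand is invoked roughly like this; I’m quoting the flags from memory, so treat the exact set and paths as illustrative:

nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --pid=<container-pid> <container-rootfs>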

The nvidia-container-cli configure subcommand, in turn, is invoked by the nvidia-container-runtime-hook acting as a prestart hook.
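
In the container’s OCI runtime spec, that hook registration looks approximately like this (paths may vary by installation):

"hooks": {
    "prestart": [
        {
            "path": "/usr/bin/nvidia-container-runtime-hook",
            "args": ["nvidia-container-runtime-hook", "prestart"]
        }
    ]
}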

And the nvidia-container-runtime-hook is injected by the nvidia-container-runtime, which is configured in the daemon.json file:

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

Enough; my curiosity was satisfied. The solution is simple: just run ldconfig again inside the container, without specifying the libs_dir parameter, so the cache is rebuilt from /etc/ld.so.conf alone.
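
A minimal sketch of the workaround (the container name is hypothetical):

$ docker exec mycontainer ldconfig
$ docker exec mycontainer ldd /usr/bin/curl | grep libcurl
$ docker exec mycontainer curl --version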