NVIDIA Container Toolkit Vulnerabilities

Executive Summary

In July 2025, NVIDIA disclosed two serious vulnerabilities impacting its Container Toolkit and GPU Operator components. These issues affect systems running GPU workloads within containerized environments, posing critical risks such as container breakout, arbitrary code execution, and privilege escalation.

The vulnerabilities reside in runtime hooks used during container startup and shared library linking. An attacker who is able to run containers on an affected system could exploit these flaws to gain root-level privileges on the host or disrupt system operations.

Vulnerability Breakdown

The first vulnerability, identified as CVE-2025-23266, carries a CVSS score of 9.0, categorizing it as critical. It affects the enable-cuda-compat hook, a feature used during container runtime initialization. A malicious container image can abuse this hook to execute arbitrary code outside the intended container boundary. If exploited, this can lead to full host compromise, unauthorized file manipulation, or denial-of-service (DoS) conditions. The attacker needs only low privileges and the ability to deploy a container image within a shared cluster or environment.

The second vulnerability, CVE-2025-23267, has a CVSS score of 8.5, marking it as high severity. This flaw exists in the update-ldcache hook, which can be tricked by specially crafted symbolic links. This enables attackers to conduct link-following attacks, allowing them to overwrite sensitive files or trigger unintended behavior during dynamic linking processes.
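To make the link-following idea concrete, here is a minimal Python sketch of the general defensive check: resolve every symbolic link in a path and confirm the result still sits under the intended root. This is an illustration only, not NVIDIA's fix; the function name resolves_inside is hypothetical.

```python
import os


def resolves_inside(root, candidate):
    """Return True if `candidate`, after following all symlinks,
    still points to a location under `root`."""
    root_real = os.path.realpath(root)
    # realpath follows every symlink component, so a link such as
    # root/evil -> /somewhere/else is expanded before the comparison.
    target_real = os.path.realpath(os.path.join(root_real, candidate))
    return os.path.commonpath([root_real, target_real]) == root_real
```

A check like this is only a starting point: if the path is opened after the check, an attacker may swap the link in between, which is the kind of race described under Historical Context.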

Impacted Versions

Systems using NVIDIA Container Toolkit version 1.17.7 or earlier and NVIDIA GPU Operator version 25.3.0 or earlier are vulnerable to both CVEs.
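The version boundaries above can be checked programmatically. The following Python sketch compares a reported version string against the last affected release of each component; the component keys and function names are hypothetical, and how you obtain the installed version (package manager, Helm release metadata) is left to your environment.

```python
import re

# Last affected releases, taken from the version ranges stated above.
LAST_AFFECTED = {
    "container-toolkit": (1, 17, 7),
    "gpu-operator": (25, 3, 0),
}


def parse_version(text):
    """Turn '1.17.7', 'v25.3.0' or '1.17.8-1' into a tuple of three ints."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(component, version):
    """True if `version` is at or below the last affected release."""
    return parse_version(version) <= LAST_AFFECTED[component]
```

Comparing integer tuples rather than raw strings matters here: as strings, "1.9.0" sorts after "1.17.7", which would wrongly mark an old release as safe. Any suffix after the third number (a package revision or pre-release tag) is ignored by this sketch.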

Patch and Remediation

NVIDIA has released patched versions to address these vulnerabilities. The Container Toolkit has been updated to version 1.17.8, and the GPU Operator has been updated to version 25.3.1. These versions include critical security fixes that mitigate both vulnerabilities.

Immediate upgrade to the latest versions is strongly recommended, especially in production environments with shared GPU access or multi-tenant workloads.

Temporary Workarounds

For environments where immediate patching is not feasible, NVIDIA has provided a temporary mitigation strategy:

The vulnerable enable-cuda-compat hook can be disabled:
- For systems using the legacy NVIDIA Container Runtime, this is done by modifying the config.toml file located at /etc/nvidia-container-toolkit/. You must add or edit the following:
      [features]
      disable-cuda-compat-lib-hook = true
- For Kubernetes deployments using the GPU Operator via Helm, you can set the environment variable during installation or upgrade:
      --set "toolkit.env[0].name=NVIDIA_CONTAINER_TOOLKIT_OPT_IN_FEATURES" \
      --set "toolkit.env[0].value=disable-cuda-compat-lib-hook"

Additionally, Helm users can specify the exact version of the toolkit to use during deployment:
      --set "toolkit.version=1.17.8"

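These actions disable the vulnerable enable-cuda-compat hook until a permanent update is applied. The workaround described here does not touch the update-ldcache hook, so upgrading to the patched versions remains necessary to address both vulnerabilities.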

Historical Context

This is not the first time NVIDIA’s container ecosystem has faced security concerns. In February 2025, CVE-2025-23359 described a Time-of-Check to Time-of-Use (TOCTOU) vulnerability that allowed host access through unsafe file operations. Before that, CVE-2024-0132, disclosed in September 2024, exposed a flaw that could allow container breakout via race conditions. These cases underline the persistent risks around dynamic hook injection and container lifecycle management in NVIDIA’s stack.

Security Recommendations

- Upgrade immediately to Container Toolkit 1.17.8 and GPU Operator 25.3.1 or newer.
- Disable the enable-cuda-compat hook if patching is delayed.
- Restrict container execution capabilities in shared environments using Kubernetes RBAC and Pod Security Standards.
- Avoid untrusted container images, especially community-maintained or public images built on NVIDIA base images.
- Implement image signing and verification in CI/CD pipelines to ensure integrity and authenticity.
- Isolate GPU workloads on dedicated nodes using node selectors or taints to prevent lateral movement from compromised containers.
- Regularly audit NVIDIA container hooks and runtime permissions for signs of tampering or unusual behavior.
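As an illustration of the node-isolation recommendation, the following Python sketch builds a Pod manifest that is pinned to labeled GPU nodes and tolerates a taint reserved for them. The label key example.com/gpu-pool and the taint key example.com/gpu-only are hypothetical placeholders; the nodes themselves would need to be labeled and tainted to match.

```python
import json


def gpu_pod_manifest(name, image,
                     node_label=("example.com/gpu-pool", "dedicated"),
                     taint_key="example.com/gpu-only"):
    """Build a Pod manifest scheduled only onto dedicated GPU nodes.
    The default label and taint keys are placeholders, not real names."""
    label_key, label_value = node_label
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # nodeSelector keeps the pod on nodes carrying the GPU label.
            "nodeSelector": {label_key: label_value},
            # The toleration lets the pod onto nodes tainted for GPU use;
            # pods without it are kept off those nodes by the taint.
            "tolerations": [{
                "key": taint_key,
                "operator": "Exists",
                "effect": "NoSchedule",
            }],
            "containers": [{
                "name": name,
                "image": image,
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }],
        },
    }


if __name__ == "__main__":
    # JSON output can be saved to a file and applied like a YAML manifest.
    print(json.dumps(gpu_pod_manifest("trainer", "registry.example.com/trainer:1.0"), indent=2))
```

The taint keeps ordinary workloads off the GPU nodes, while the node selector keeps GPU workloads off everything else; both halves are needed for the isolation to hold in each direction.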

Conclusion

The vulnerabilities disclosed represent serious threats to GPU-enabled container workloads, especially in cloud-native and AI/ML environments. Since these attacks exploit the very foundation of container-to-host interactions, they demand immediate attention. Updating to the patched versions or applying the prescribed mitigations is critical for maintaining operational security and system integrity.