nvidia-device-plugin failed to start on OKE with cgroupv2 enabled

Yuan Zhuang, Jun 3 2025

Hi there,

We are trying to use cgroupv2-enabled GPU nodes in an OKE cluster, but the nvidia-device-plugin DaemonSet fails to start. The same setup works with cgroupv1.

The full context:

  1. We are using GPU-enabled nodes in OKE and followed this guide to create a custom cgroupv2-enabled image for them: https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengcreatingcgroupv2enabledworkernodes.htm. The base image is Oracle-Linux-8.10-Gen2-GPU-2025.03.18-0-OKE-1.32.1-763.
  2. The node is provisioned successfully, but no GPU device is detected on it. The cause is that the nvidia device plugin crashes with: nvidia-container-cli: container error: cgroup subsystem devices not found.
  3. While investigating, we found that the cgroupv2 setup in OCI does not expose the devices control group, which nvidia-device-plugin requires. This control group does exist under cgroupv1.
  4. We tried to follow this guide to add the devices control group: https://docs.oracle.com/en/operating-systems/oracle-linux/8/boot/cgroups-CgroupsV2AppResMgt.html?source=%3Aow%3Ams%3Apt%3A%3A, but the attempt is rejected with Invalid Argument.
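
For anyone who wants to reproduce the check from steps 3 and 4 on their own node, here is a minimal sketch in Python. It assumes the cgroup filesystem is mounted at /sys/fs/cgroup, and the helper names (`parse_controllers`, `describe_cgroup`) are made up for this post. As far as I understand, the unified cgroupv2 hierarchy has no `devices` controller at all, because the kernel enforces device access there through eBPF programs attached to the cgroup. That would explain why enabling it is rejected, but treat this as a diagnostic aid, not a fix.

```python
from pathlib import Path

# Assumption: the standard mount point; adjust if your image mounts it elsewhere.
CGROUP_ROOT = Path("/sys/fs/cgroup")


def parse_controllers(text):
    """Split the space-separated contents of cgroup.controllers into a set."""
    return set(text.split())


def describe_cgroup(root=CGROUP_ROOT):
    """Report the cgroup mode under `root` and whether 'devices' is listed."""
    root = Path(root)
    controllers_file = root / "cgroup.controllers"
    if controllers_file.is_file():
        # cgroup.controllers at the mount root only exists on a v2 hierarchy.
        controllers = parse_controllers(controllers_file.read_text())
        mode = "v2"
    else:
        # On v1 each controller is mounted as its own subdirectory.
        controllers = (
            {p.name for p in root.iterdir() if p.is_dir()} if root.is_dir() else set()
        )
        mode = "v1-or-unknown"
    return {
        "mode": mode,
        "controllers": controllers,
        "has_devices": "devices" in controllers,
    }


if __name__ == "__main__":
    print(describe_cgroup())
```

On our cgroupv2 nodes we would expect `has_devices` to be false, which matches the nvidia-container-cli error in step 2.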

Could you please confirm whether this is supported in OCI? Both cgroupv2 and GPU support are essential for our use case, and we hope OCI can expose the devices control group.
