Hi there,
We are trying to use cgroupv2-enabled GPU nodes in OKE, but the nvidia-device-plugin DaemonSet fails to start. The same setup works with cgroupv1.
The full context:
- We are using GPU-enabled nodes in OKE and followed this guide to create a custom image for the GPU nodes: https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengcreatingcgroupv2enabledworkernodes.htm. The base image is Oracle-Linux-8.10-Gen2-GPU-2025.03.18-0-OKE-1.32.1-763.
- The node is provisioned successfully, but no GPU device is detected on it. The cause is the nvidia device plugin crashing with this error: nvidia-container-cli: container error: cgroup subsystem devices not found.
- During the investigation, we found that the cgroupv2 setup in OCI does not expose the devices cgroup filesystem, which nvidia-device-plugin requires. This control group does exist under cgroupv1.
- We tried to follow this guide to add the devices control group, but the operation is blocked and fails with Invalid Argument: https://docs.oracle.com/en/operating-systems/oracle-linux/8/boot/cgroups-CgroupsV2AppResMgt.html?source=%3Aow%3Ams%3Apt%3A%3A
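
For anyone who wants to reproduce the check on a worker node, here is a minimal sketch in Python of how the missing devices controller can be confirmed. It assumes the standard Linux locations /sys/fs/cgroup/cgroup.controllers (present only when the unified cgroupv2 hierarchy is mounted there) and /proc/cgroups. The function names are made up for illustration and are not part of any NVIDIA or OCI tooling.

```python
from pathlib import Path


def parse_v1_controllers(proc_cgroups_text):
    """Return the enabled controller names from /proc/cgroups content."""
    names = []
    for line in proc_cgroups_text.splitlines():
        # Skip blank lines and the "#subsys_name hierarchy ..." header.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # Columns: subsys_name, hierarchy, num_cgroups, enabled.
        if len(fields) >= 4 and fields[3] == "1":
            names.append(fields[0])
    return names


def parse_v2_controllers(controllers_text):
    """Return controller names from a cgroup.controllers file (space separated)."""
    return controllers_text.split()


def describe_node(root="/sys/fs/cgroup", proc_cgroups="/proc/cgroups"):
    """Summarize which cgroup mode is mounted and which controllers are visible."""
    unified = Path(root) / "cgroup.controllers"
    report = {"mode": "v2" if unified.exists() else "v1 or hybrid"}
    if unified.exists():
        report["v2_controllers"] = parse_v2_controllers(unified.read_text())
    kernel = Path(proc_cgroups)
    if kernel.exists():
        report["kernel_controllers"] = parse_v1_controllers(kernel.read_text())
    return report


if __name__ == "__main__":
    for key, value in describe_node().items():
        print(f"{key}: {value}")
```

On our cgroupv2 nodes we would expect devices to be missing from the v2_controllers list, which matches the nvidia-container-cli error above.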
Could you please confirm whether this is supported in OCI? Both cgroupv2 and GPU support are essential for our use case, and we hope OCI can expose the devices control group.