Hi there,
We are trying to use cgroupv2-enabled GPU nodes in an OKE cluster, but the nvidia-device-plugin daemonset fails to start. The same setup works with cgroupv1.
The full context:
- We are using GPU-enabled nodes in OKE and followed this guide to create a custom image for the GPU nodes: https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengcreatingcgroupv2enabledworkernodes.htm. The base image is Oracle-Linux-8.10-Gen2-GPU-2025.03.18-0-OKE-1.32.1-763
- The node is provisioned successfully, but no GPU device is detected on it. This is caused by the nvidia-device-plugin crashing with the following error:
nvidia-container-cli: container error: cgroup subsystem devices not found.
- During the investigation, we found that the cgroupv2 setup in OCI does not expose the devices control group, which nvidia-device-plugin requires. This control group does exist under cgroupv1.
- We tried to follow this guide to add the devices control group, but it appears to be fully blocked and we get Invalid Argument: https://docs.oracle.com/en/operating-systems/oracle-linux/8/boot/cgroups-CgroupsV2AppResMgt.html?source=%3Aow%3Ams%3Apt%3A%3A
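For anyone who wants to reproduce the check on a node, here is a minimal sketch of how the cgroup layout can be inspected. It assumes the hierarchy is mounted at the usual /sys/fs/cgroup path; the function names (parse_controllers, inspect_cgroups) are hypothetical helpers written for this post, not part of any NVIDIA or OCI tooling. It only reports what the kernel exposes and does not change anything.

```python
from pathlib import Path
from typing import Dict, List


def parse_controllers(text: str) -> List[str]:
    """Split the space-separated contents of a cgroup.controllers file."""
    return text.split()


def inspect_cgroups(root: str = "/sys/fs/cgroup") -> Dict[str, object]:
    """Report the cgroup layout under `root` and whether `devices` is visible.

    Assumption: `root` is where the cgroup hierarchy is mounted.
    """
    base = Path(root)
    controllers_file = base / "cgroup.controllers"
    if controllers_file.is_file():
        # Unified (v2) hierarchy: available controllers are listed in one file.
        controllers = parse_controllers(controllers_file.read_text())
        version = 2
    elif base.is_dir():
        # Legacy (v1) layout: one directory per mounted controller.
        controllers = sorted(p.name for p in base.iterdir() if p.is_dir())
        version = 1 if controllers else None
    else:
        controllers = []
        version = None
    return {
        "version": version,
        "controllers": controllers,
        "has_devices": "devices" in controllers,
    }


if __name__ == "__main__":
    print(inspect_cgroups())
```

Running it on a cgroupv1 node and on a cgroupv2 node side by side should show whether the devices entry is present in one and missing in the other.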
Could you please confirm whether this is supported in OCI? Both cgroupv2 and GPU support are essential for our use case, and we hope OCI can expose the devices control group.