Prometheus currently has no native service discovery for Oracle Cloud Infrastructure (OCI), unlike for other major cloud platforms. There has been an upstream effort to address this, but it has not yet been integrated.
To close this gap in a production environment, I implemented a service discovery layer using Prometheus’ HTTP Service Discovery interface, without modifying Prometheus itself.
Approach
The solution is a lightweight service that:
- Integrates with OCI APIs across multiple tenancies and compartments
- Filters compute instances using defined tagging conventions
- Exposes dynamically generated scrape targets via an HTTP SD endpoint
This allows Prometheus to remain unchanged while enabling OCI-native discovery behavior.
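As a rough illustration of the last point, here is a minimal sketch of an HTTP SD endpoint using only the Python standard library. It is not the production service: the instance fields (name, private_ip, compartment), the __meta_oci_* label names, the /targets path and the default port 9100 are all assumptions made for the example. What it does follow is the HTTP SD contract: a 200 response with a JSON list of target groups, each holding "targets" and optional string-valued "labels".

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_target_groups(instances, default_port=9100):
    """Convert instance records (plain dicts, hypothetical shape) into
    Prometheus HTTP SD target groups."""
    groups = []
    for inst in instances:
        port = inst.get("port", default_port)
        groups.append({
            "targets": [f"{inst['private_ip']}:{port}"],
            # Label names are illustrative; label values must be strings.
            "labels": {
                "__meta_oci_instance_name": inst["name"],
                "__meta_oci_compartment": inst["compartment"],
            },
        })
    return groups


class SDHandler(BaseHTTPRequestHandler):
    # Assumed to be replaced by a background discovery loop.
    instances = []

    def do_GET(self):
        if self.path != "/targets":
            self.send_error(404)
            return
        body = json.dumps(build_target_groups(self.instances)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8080):
    """Start the endpoint; blocks until interrupted."""
    HTTPServer((host, port), SDHandler).serve_forever()
```

On the Prometheus side, a scrape job would point an http_sd_configs entry at this URL, so the only change to Prometheus is ordinary configuration.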
Operational Model
The design goal was to make observability enrollment deterministic and low-friction:
- An instance is provisioned with a discovery tag (e.g. prometheus:scrape=true)
- Network access to the metrics endpoint is defined at the security layer
- The instance is automatically discovered and scraped within the next refresh cycle
This removes configuration drift and eliminates manual intervention as infrastructure scales.
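The tag check behind the first step can be sketched roughly as follows. This assumes instance records are plain dicts carrying OCI-style defined_tags (namespace, then key, then value) and freeform_tags (flat key to value), and that prometheus:scrape=true may appear in either form. The function names and the case-insensitive comparison are choices made for the example, not necessarily how the real service does it.

```python
def wants_scrape(instance, namespace="prometheus", key="scrape", value="true"):
    """Return True if the instance carries the discovery tag.

    Checks a defined tag (namespace -> key -> value) first, then a
    freeform tag whose key is written as "namespace:key".
    """
    defined = instance.get("defined_tags") or {}
    freeform = instance.get("freeform_tags") or {}

    defined_value = (defined.get(namespace) or {}).get(key, "")
    if str(defined_value).lower() == value:
        return True

    freeform_value = freeform.get(f"{namespace}:{key}", "")
    return str(freeform_value).lower() == value


def select_targets(instances):
    """Keep only the instances that opted in to scraping via the tag."""
    return [inst for inst in instances if wants_scrape(inst)]
```

Because enrollment depends only on the tag, removing the tag (or setting it to anything other than true) takes the instance back out of the target list on the next refresh.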
Implementation Considerations
Key aspects that were necessary for reliability in OCI environments:
- Multi-tenancy support: a single deployment handles discovery across isolated tenancies and compartments
- API-aware rate limiting and retries: designed around OCI API behavior to avoid throttling-related gaps in discovery
- Caching layer: ensures consistent target responses during transient OCI API latency or failures
- Security boundaries: token-based access, internal-only exposure, and a minimal runtime footprint
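To make the retry and caching points concrete, here is a minimal sketch of how the two can fit together: exponential backoff with jitter around the API call, and a cache that keeps serving the last known-good target list when a refresh fails. TransientAPIError is a hypothetical stand-in for a throttling or server-side error from the OCI API, and the attempt count, delays and TTL are placeholder values rather than the ones used in production.

```python
import random
import time


class TransientAPIError(Exception):
    """Hypothetical stand-in for a throttling (429) or 5xx API error."""


def call_with_backoff(fn, attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientAPIError:
            if attempt == attempts - 1:
                raise  # out of retries, let the caller decide
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter spreads out retries


class TargetCache:
    """Serve cached targets; on refresh failure keep the last good list."""

    def __init__(self, fetch, ttl=60.0, clock=time.monotonic):
        self._fetch = fetch      # callable returning the target list
        self._ttl = ttl
        self._clock = clock
        self._value = []         # empty until the first successful fetch
        self._fetched_at = None

    def get(self):
        now = self._clock()
        fresh = (self._fetched_at is not None
                 and now - self._fetched_at < self._ttl)
        if not fresh:
            try:
                self._value = self._fetch()
                self._fetched_at = now
            except TransientAPIError:
                pass  # stale data is preferable to dropping all targets
        return self._value
```

A typical wiring would be TargetCache(lambda: call_with_backoff(list_instances)), where list_instances is whatever function queries OCI. One design point worth noting: before the first successful fetch this sketch returns an empty list, which Prometheus would treat as "no targets". A real service may prefer to answer with an error status in that case, since Prometheus keeps its current target list when an HTTP SD refresh fails.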
Production Context
This approach is currently running across multiple OCI tenancies and regions in a production setting. It has been stable under typical operational conditions, including API throttling scenarios and IAM policy edge cases.
From an operational perspective, the outcome is predictable: new compute instances become observable automatically, without requiring updates to Prometheus configuration or deployment workflows.
Closing
Sharing this as a practical approach for teams running Prometheus on OCI today.
Also interested in whether there are ongoing internal or upstream efforts to standardize OCI service discovery within Prometheus, or if others have taken different approaches in production.