Prometheus Service Discovery for OCI via HTTP SD (Production-Scale Implementation)

Amaan Ul Haq Siddiqui, 27 hours ago

Prometheus currently has no native service discovery mechanism for Oracle Cloud Infrastructure (OCI), unlike for other major cloud platforms. There has been an upstream effort to address this, but it has not yet been integrated.

To close this gap in a production environment, I implemented a service discovery layer using Prometheus’ HTTP Service Discovery interface, without modifying Prometheus itself.

Approach

The solution is a lightweight service that:

- Integrates with OCI APIs across multiple tenancies and compartments
- Filters compute instances using defined tagging conventions
- Exposes dynamically generated scrape targets via an HTTP SD endpoint
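
For illustration, the payload that an HTTP SD endpoint returns can be sketched as follows. Prometheus expects a JSON array of target groups, each with a "targets" list and an optional "labels" map. This is a minimal sketch and not the code from the linked repository: the instance record fields (private_ip, name, compartment), the label names, and the default port 9100 are assumptions made for the example.

```python
import json


def build_target_groups(instances, port=9100):
    """Convert instance records into Prometheus HTTP SD target groups.

    `instances` is a list of plain dicts; the keys used here are
    hypothetical and chosen only for this example.
    """
    groups = []
    for inst in instances:
        groups.append({
            # Each target is a "host:port" string.
            "targets": [f"{inst['private_ip']}:{port}"],
            # Label values must be strings; names are illustrative.
            "labels": {
                "instance_name": inst["name"],
                "compartment": inst["compartment"],
            },
        })
    return groups


def render_sd_response(instances, port=9100):
    """Serialize the target groups as the JSON body of the SD response."""
    return json.dumps(build_target_groups(instances, port))
```

The endpoint would serve this body with an HTTP 200 status and a Content-Type of application/json. An empty list is a valid response and simply means there are no targets.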

This allows Prometheus to remain unchanged while enabling OCI-native discovery behavior.

Operational Model

The design goal was to make observability enrollment deterministic and low-friction:

1. An instance is provisioned with a discovery tag (e.g. prometheus:scrape=true)
2. Network access to the metrics endpoint is allowed at the network security layer (for example, via security lists or network security groups)
3. The instance is automatically discovered and scraped on the next discovery cycle
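
The tag-based enrollment step above could look roughly like this. It is a sketch under stated assumptions: it treats prometheus:scrape=true as a defined tag with namespace "prometheus" and key "scrape", and it works on plain dicts shaped like OCI instance records (defined_tags, lifecycle_state). The actual tagging convention used by the proxy may differ.

```python
def is_enrolled(instance, namespace="prometheus", key="scrape", expected="true"):
    """Return True if a running instance carries the discovery tag.

    Assumes the tag is a defined tag (namespace -> key -> value);
    the namespace and key names are illustrative.
    """
    defined = instance.get("defined_tags") or {}
    value = (defined.get(namespace) or {}).get(key)
    tagged = str(value).lower() == expected
    # Only running instances can serve a metrics endpoint.
    running = instance.get("lifecycle_state") == "RUNNING"
    return tagged and running


def select_enrolled(instances):
    """Keep only the instances that opted in to scraping."""
    return [inst for inst in instances if is_enrolled(inst)]
```

Keeping the filter a pure function of the instance record makes the enrollment rule easy to test without calling the OCI API.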

This removes configuration drift and eliminates manual intervention as infrastructure scales.

Implementation Considerations

Key aspects that were necessary for reliability in OCI environments:

- Multi-tenancy support: a single deployment handles discovery across isolated tenancies and compartments
- API-aware rate limiting and retries: designed around OCI API behavior to avoid throttling-related gaps in discovery
- Caching layer: ensures consistent target responses during transient OCI API latency or failures
- Security boundaries: token-based access, internal-only exposure, and a minimal runtime footprint
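
As a rough illustration of how retries, caching, and token checks can fit together, here is a minimal sketch. It is not the proxy's actual code: the class name TargetCache, the parameter names, the TTL, and the backoff values are all assumptions. The fetch callable stands in for whatever function queries the OCI API.

```python
import hmac
import time


class TargetCache:
    """Cache discovery results and serve stale data if refresh fails."""

    def __init__(self, fetch, ttl_seconds=60.0, max_attempts=3,
                 base_delay=0.5, sleep=time.sleep, clock=time.monotonic):
        self._fetch = fetch            # hypothetical callable that queries OCI
        self._ttl = ttl_seconds
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep            # injectable for testing
        self._clock = clock            # injectable for testing
        self._value = None
        self._has_value = False
        self._fetched_at = 0.0

    def get(self):
        # Serve from cache while the entry is still fresh.
        if self._has_value and self._clock() - self._fetched_at < self._ttl:
            return self._value
        for attempt in range(self._max_attempts):
            try:
                self._value = self._fetch()
                self._has_value = True
                self._fetched_at = self._clock()
                return self._value
            except Exception:
                # A real service would likely catch only transient or
                # throttling errors. Back off exponentially between tries.
                if attempt < self._max_attempts - 1:
                    self._sleep(self._base_delay * 2 ** attempt)
        if self._has_value:
            # Refresh failed: return the last known good targets.
            return self._value
        raise RuntimeError("upstream fetch failed and cache is empty")


def token_is_valid(presented, expected):
    """Compare bearer tokens in constant time."""
    return hmac.compare_digest(presented.encode(), expected.encode())
```

Serving the last known good result means a throttled or failing OCI call does not make Prometheus drop targets it was already scraping.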

Production Context

This approach is currently running across multiple OCI tenancies and regions in a production setting. It has been stable under typical operational conditions, including API throttling scenarios and IAM policy edge cases.

From an operational perspective, the outcome is predictable: new compute instances become observable automatically, without requiring updates to Prometheus configuration or deployment workflows.

References

GitHub: https://github.com/amaanx86/oci-prometheus-sd-proxy
Documentation: https://oci-prometheus-sd-proxy.readthedocs.io
Detailed write-up: https://amaanx86.github.io/blog/oci-prometheus-service-discovery/

Closing

Sharing this as a practical approach for teams running Prometheus on OCI today.

I am also interested in whether there are ongoing internal or upstream efforts to standardize OCI service discovery within Prometheus, or whether others have taken different approaches in production.

#devops, #monitoring, #observability, #oracle-cloud, #prometheus
