Prometheus Service Discovery for OCI via HTTP SD (Production-Scale Implementation)

Prometheus does not currently ship a native service discovery mechanism for Oracle Cloud Infrastructure (OCI), unlike for other major cloud platforms. There has been an upstream effort to address this, but it has not yet been integrated.

To close this gap in a production environment, I implemented a service discovery layer on top of Prometheus's HTTP Service Discovery (HTTP SD) interface, without modifying Prometheus itself.

Approach

The solution is a lightweight service that:

  • Integrates with OCI APIs across multiple tenancies and compartments
  • Filters compute instances using defined tagging conventions
  • Exposes dynamically generated scrape targets via an HTTP SD endpoint

This allows Prometheus to remain unchanged while enabling OCI-native discovery behavior.
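
To make the target format concrete, here is a minimal sketch of the filtering and target-generation step. It is not the production code. It assumes the instance data has already been fetched from OCI and flattened into plain dicts; the field names, the `__meta_oci_*` label names, the port 9100, and the reading of the discovery tag as a defined tag with namespace `prometheus` and key `scrape` are all illustrative assumptions. The one fixed part is the output shape: HTTP SD expects a JSON list of objects with `targets` and `labels`.

```python
import json

# Assumed tag convention: defined-tag namespace "prometheus", key "scrape".
TAG_NAMESPACE = "prometheus"
TAG_KEY = "scrape"
METRICS_PORT = 9100  # assumed exporter port, adjust to your exporter


def is_enrolled(instance):
    """Return True if the instance carries the discovery tag set to "true"."""
    tags = instance.get("defined_tags", {}).get(TAG_NAMESPACE, {})
    return str(tags.get(TAG_KEY, "")).lower() == "true"


def build_target_groups(instances):
    """Convert flattened instance records into HTTP SD target groups."""
    groups = []
    for inst in instances:
        if not is_enrolled(inst):
            continue
        groups.append({
            "targets": [f"{inst['private_ip']}:{METRICS_PORT}"],
            # Label values must be strings. The label names are hypothetical.
            "labels": {
                "__meta_oci_tenancy": inst["tenancy"],
                "__meta_oci_compartment": inst["compartment"],
                "__meta_oci_instance_name": inst["name"],
            },
        })
    return groups


def render_response(instances):
    """Serialize the target groups as the JSON body Prometheus will parse."""
    return json.dumps(build_target_groups(instances))
```

The body returned by `render_response` is what the endpoint would serve with a 200 status and `Content-Type: application/json`.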

Operational Model

The design goal was to make observability enrollment deterministic and low-friction:

  1. An instance is provisioned with a discovery tag (e.g. prometheus:scrape=true)
  2. Network access to the metrics endpoint is defined at the security layer
  3. The instance is automatically discovered and scraped on the next discovery refresh cycle

This removes configuration drift and eliminates manual intervention as infrastructure scales.
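
Step 3 works because Prometheus polls the HTTP SD endpoint at its configured refresh interval. Below is a rough sketch of such an endpoint with a bearer-token check, using only the Python standard library. The token value, the function names, and the choice of `http.server` are assumptions for illustration; the actual service may be built quite differently.

```python
import hmac
import json
from http.server import BaseHTTPRequestHandler

EXPECTED_TOKEN = "change-me"  # hypothetical; load from a secret store in practice


def respond(auth_header, target_groups, expected_token=EXPECTED_TOKEN):
    """Return (status, content_type, body) for one HTTP SD request."""
    supplied = (auth_header or "").encode()
    expected = f"Bearer {expected_token}".encode()
    # Constant-time comparison avoids leaking token contents through timing.
    if not hmac.compare_digest(supplied, expected):
        return 401, "text/plain", b"unauthorized\n"
    return 200, "application/json", json.dumps(target_groups).encode()


def make_handler(get_target_groups):
    """Build a request handler class bound to a target-group provider."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, content_type, body = respond(
                self.headers.get("Authorization"), get_target_groups()
            )
            self.send_response(status)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler
```

To serve it, pass `make_handler(provider)` to `http.server.HTTPServer` bound to an internal-only address. On the Prometheus side, the matching `http_sd_configs` entry would carry the same token in its authorization settings.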

Implementation Considerations

Key aspects that were necessary for reliability in OCI environments:

  • Multi-tenancy support
    Single deployment handling discovery across isolated tenancies and compartments
  • API-aware rate limiting and retries
    Designed around OCI API behavior to avoid throttling-related gaps in discovery
  • Caching layer
    Ensures consistent target responses during transient OCI API latency or failures
  • Security boundaries
    Token-based access, internal-only exposure, and minimal runtime footprint
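
The retry and caching points can be sketched together. This is a simplified illustration, not the production code: it retries a fetch with exponential backoff and jitter, and if every attempt fails it keeps serving the last good target list. The class name, parameters, and defaults are assumptions. A real implementation would catch only throttling (HTTP 429) and transient server errors rather than every exception.

```python
import random
import time


class CachedDiscovery:
    """Serve cached targets, refreshing with retries once the TTL expires."""

    def __init__(self, fetch, ttl_seconds=60.0, max_attempts=4, base_delay=0.5,
                 sleep=time.sleep, clock=time.monotonic):
        self._fetch = fetch              # callable returning target groups
        self._ttl = ttl_seconds
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep              # injectable for testing
        self._clock = clock              # injectable for testing
        self._targets = None
        self._fetched_at = None

    def _fetch_with_retries(self):
        for attempt in range(self._max_attempts):
            try:
                return self._fetch()
            except Exception:
                if attempt == self._max_attempts - 1:
                    raise
                # Exponential backoff with full jitter between attempts.
                delay = self._base_delay * (2 ** attempt)
                self._sleep(random.uniform(0, delay))

    def get_targets(self):
        now = self._clock()
        if self._fetched_at is not None and now - self._fetched_at < self._ttl:
            return self._targets         # cache still fresh
        try:
            self._targets = self._fetch_with_retries()
            self._fetched_at = now
        except Exception:
            if self._targets is None:
                raise                    # nothing cached yet, surface the error
            # Otherwise fall through and serve the stale list.
        return self._targets
```

Serving stale targets during an outage means Prometheus keeps scraping known instances instead of dropping them. The trade-off is that terminated instances linger as targets until the next successful refresh.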

Production Context

This approach is currently running across multiple OCI tenancies and regions in a production setting. It has been stable under typical operational conditions, including API throttling scenarios and IAM policy edge cases.

From an operational perspective, the outcome is predictable: new compute instances become observable automatically, without requiring updates to Prometheus configuration or deployment workflows.

Closing

Sharing this as a practical approach for teams running Prometheus on OCI today.

Also interested in whether there are ongoing internal or upstream efforts to standardize OCI service discovery within Prometheus, or if others have taken different approaches in production.

Added 27 hours ago