Prometheus Monitoring
imgforge exposes Prometheus-compatible metrics so you can observe throughput, latency, cache efficacy, and error rates in real time. This guide shows you how to scrape the service, interpret the provided counters and histograms, and build actionable alerts.
Exposing the endpoint
- Default listener – `/metrics` is served on the main HTTP listener. Set `IMGFORGE_BIND` (default `0.0.0.0:3000`) to match your environment.
- Dedicated listener – Provide `IMGFORGE_PROMETHEUS_BIND` (for example `0.0.0.0:9600`) to expose metrics on a separate port. The endpoint remains `/metrics`.
- Authentication – The metrics endpoint never requires URL signatures but inherits bearer-token protection when `IMGFORGE_SECRET` is set. Grant your scraper a token or whitelist the Prometheus network path at the proxy layer.
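As an illustration of the listener settings above, the following Docker Compose fragment is a minimal sketch of running the service with a dedicated metrics port. The image name and the published ports are assumptions made for the example; only the three `IMGFORGE_*` variables come from this guide.

```yaml
services:
  imgforge:
    image: example/imgforge:latest        # hypothetical image name
    environment:
      IMGFORGE_BIND: "0.0.0.0:3000"             # main HTTP listener
      IMGFORGE_PROMETHEUS_BIND: "0.0.0.0:9600"  # dedicated metrics listener
      IMGFORGE_SECRET: "${IMGFORGE_SECRET}"     # interpolated by Compose from the host environment
    ports:
      - "3000:3000"   # publish only the public listener
    expose:
      - "9600"        # metrics port reachable on the internal network only
```

Keeping the metrics port unpublished is one way to restrict scraping to the internal network; whether that is sufficient depends on your own proxy and network layout.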
Metrics flow diagram
┌──────────────────────────────────────────────────────────────────────────┐
│ Prometheus Metrics Pipeline │
└──────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ │
│ Request arrives ──▶ Timer starts │
│ │ │
│ ├──▶ Check cache ──▶ cache_hits_total++ OR │
│ │ cache_misses_total++ │
│ │ │
│ ├──▶ Fetch source ──▶ source_image_fetch_duration_ │
│ │ seconds (observe) │
│ │ │
│ ├──▶ Transform ──▶ image_processing_duration_seconds │
│ │ (observe) │
│ │ │
│ └──▶ Response ──▶ http_requests_duration_seconds │
│ status_codes_total{status="200"}++ │
│ processed_images_total{format="..."}++│
└─────────────────────────────────────────────────────────────────┘
Core metrics
| Metric name | Type | Labels | Insight |
|---|---|---|---|
| `http_requests_duration_seconds` | Histogram | method, path | Latency across the full request lifecycle, including cache hits and misses. |
| `image_processing_duration_seconds` | Histogram | format | Time spent transforming images, segmented by requested output format. |
| `processed_images_total` | Counter | format | Throughput per encoded format; increments on successful responses. |
| `source_image_fetch_duration_seconds` | Histogram | none | Download latency from upstream sources. |
| `source_images_fetched_total` | Counter | status | Counts of successful (`status="success"`) and failed (`status="error"`) source fetches. |
| `cache_hits_total` / `cache_misses_total` | Counter | cache_type | Cache effectiveness across memory, disk, or hybrid backends. |
| `status_codes_total` | Counter | status | Aggregated HTTP responses (ideal for alerting on spikes in 4xx/5xx). |
Tip: Combine counters into rates using `rate()` or `irate()` when graphing over time, and apply `histogram_quantile()` to histogram buckets for percentile views.
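The tip above can be captured once as Prometheus recording rules so that dashboards query precomputed series. This is a sketch only: the group name, the evaluation interval, and the `imgforge:`-prefixed rule names are assumptions chosen for illustration, while the metric names come from the table above.

```yaml
groups:
  - name: imgforge-recording          # hypothetical group name
    interval: 30s                     # assumed evaluation interval
    rules:
      # Per-status response rate derived from the counter.
      - record: imgforge:status_codes:rate5m
        expr: sum by (status) (rate(status_codes_total[5m]))
      # 95th percentile of end-to-end request latency from histogram buckets.
      - record: imgforge:http_requests_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(http_requests_duration_seconds_bucket[5m]))
          )
```

Aggregating by `le` before calling `histogram_quantile()` is what lets the percentile be computed across all instances rather than per series.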
Example Prometheus configuration
scrape_configs:
  - job_name: imgforge
    static_configs:
      - targets: ["imgforge.example.com:3000"]
    metrics_path: /metrics
    scheme: https
    authorization:
      # Placeholder: Prometheus does not expand environment variables in its
      # config file, so render the token at deploy time or use credentials_file.
      credentials: ${IMGFORGE_PROM_TOKEN}

When running a dedicated metrics listener, adjust `targets` to the alternate port. Use service discovery (Kubernetes, Consul, etc.) in production to track dynamic endpoints.
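For the service-discovery case, a scrape job might look like the sketch below. It assumes imgforge pods carry an `app=imgforge` label, that the dedicated listener is bound to port 9600 as in the earlier example, and that the bearer token is mounted at a hypothetical file path; none of these come from imgforge itself.

```yaml
scrape_configs:
  - job_name: imgforge-metrics
    metrics_path: /metrics
    authorization:
      # Hypothetical path; only needed when IMGFORGE_SECRET is set.
      credentials_file: /etc/prometheus/secrets/imgforge-token
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods labelled app=imgforge (label is an assumption).
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: imgforge
        action: keep
      # Rewrite the scrape address to the dedicated metrics port (IPv4 pod IPs).
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: "$1:9600"
```

Reading the token from a file avoids embedding the secret in the Prometheus configuration.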
Suggested dashboards
- Request overview – Plot `sum(rate(status_codes_total[5m])) by (status)` to visualise success versus error responses.
- Processing latency – Use `histogram_quantile(0.95, sum(rate(image_processing_duration_seconds_bucket[5m])) by (le, format))` to watch for regressions after deploys.
- Cache efficiency – Visualize hit ratio: `sum(rate(cache_hits_total[5m])) / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))`.
- Source reliability – Track `sum(rate(source_images_fetched_total{status="error"}[5m]))` to spot upstream outages.
- Instance saturation – Overlay CPU, memory, and worker semaphore utilisation (exported via node/system exporters) with imgforge latency histograms to understand headroom.
Alerting patterns
- Error spike – Trigger when `sum(rate(status_codes_total{status=~"5.."}[5m]))` exceeds a baseline for 10 minutes.
- Cache miss surge – Alert when the miss ratio stays above 70% for sustained intervals, indicating cache warmup or configuration drift.
- Slow processing – Page when the 95th percentile of `image_processing_duration_seconds` remains above an agreed SLA for 15 minutes.
- Source failures – Notify when `rate(source_images_fetched_total{status="error"}[5m])` climbs, hinting at upstream instability.
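The four patterns above could be expressed as a Prometheus rule file along the following lines. Treat it as a starting sketch: the alert names, the `severity` label values, and every numeric threshold (1 error/s, 2 s, 0.1 failures/s, the 30-minute window) are placeholders to replace with your own baselines and SLA.

```yaml
groups:
  - name: imgforge-alerts              # hypothetical group name
    rules:
      - alert: ImgforgeErrorSpike
        # Placeholder baseline of 1 server error per second.
        expr: sum(rate(status_codes_total{status=~"5.."}[5m])) > 1
        for: 10m
        labels:
          severity: page
      - alert: ImgforgeCacheMissSurge
        # Miss ratio above 70%; the 30m window is an assumed "sustained interval".
        expr: |
          sum(rate(cache_misses_total[5m]))
            /
          (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))
            > 0.70
        for: 30m
        labels:
          severity: ticket
      - alert: ImgforgeSlowProcessing
        # Placeholder SLA of 2 seconds at the 95th percentile.
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(image_processing_duration_seconds_bucket[5m]))
          ) > 2
        for: 15m
        labels:
          severity: page
      - alert: ImgforgeSourceFailures
        # Placeholder threshold of 0.1 failed fetches per second.
        expr: sum(rate(source_images_fetched_total{status="error"}[5m])) > 0.1
        for: 5m
        labels:
          severity: ticket
```

The `for` clause keeps short bursts from firing, which matches the "for 10 minutes" and "for 15 minutes" wording in the list.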
Connecting with tracing and logs
Correlate request IDs emitted in logs (via the X-Request-ID header) with spikes in histogram buckets. Pair this document with Request Lifecycle to map metrics anomalies back to lifecycle stages, and with infrastructure metrics (CPU, memory, I/O) for holistic visibility.
Next steps
- Surface these dashboards in Grafana or your preferred visualization tool.
- Feed alerts into your incident workflow (PagerDuty, Opsgenie, Slack). Use quiet hours and grouping strategies to avoid alert fatigue.
- Share runbooks that link to the playbooks in Error Troubleshooting so responders can remediate issues quickly.
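For the grouping and quiet-hours suggestion, an Alertmanager configuration might be sketched as follows. The receiver names, the `severity` label, the timing values, and the quiet-hours window are all assumptions; the receivers are left empty and would need `pagerduty_configs`, `opsgenie_configs`, or `slack_configs` filled in for your environment.

```yaml
route:
  receiver: imgforge-oncall
  group_by: [alertname, job]     # one notification per alert type and job
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Lower-urgency alerts go to chat and are muted overnight.
    - matchers:
        - severity = "ticket"
      receiver: imgforge-chat
      mute_time_intervals: [quiet-hours]

time_intervals:
  - name: quiet-hours            # times are UTC unless a location is set
    time_intervals:
      - times:
          - start_time: "22:00"
            end_time: "24:00"
          - start_time: "00:00"
            end_time: "07:00"

receivers:
  - name: imgforge-oncall        # add pagerduty_configs or opsgenie_configs here
  - name: imgforge-chat          # add slack_configs here
```

Muting only the lower-severity route keeps paging alerts deliverable at all hours while still reducing overnight noise.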