Prometheus Integration¶
Detailed guide for monitoring ArqonHPO with Prometheus.
Enable Metrics¶
CLI¶
Dashboard Server¶
Metrics available at: http://127.0.0.1:9898/metrics
Prometheus Configuration¶
Add to prometheus.yml:
scrape_configs:
- job_name: "arqonhpo"
static_configs:
- targets: ["localhost:9898"]
scrape_interval: 15s
metrics_path: /metrics
Available Metrics¶
Counters¶
# Total operations
arqonhpo_ask_calls_total
arqonhpo_tell_calls_total
arqonhpo_candidates_emitted_total
arqonhpo_results_ingested_total
# Safety events
arqonhpo_violations_total{type="DeltaTooLarge"}
arqonhpo_violations_total{type="RateLimitExceeded"}
arqonhpo_violations_total{type="DirectionFlipViolation"}
arqonhpo_rollbacks_total
Gauges¶
# Current state
arqonhpo_history_len
arqonhpo_budget_remaining
arqonhpo_best_value
arqonhpo_config_generation
# SPSA state
arqonhpo_spsa_iteration
arqonhpo_spsa_learning_rate
arqonhpo_spsa_perturbation_scale
# Safety
arqonhpo_safe_mode_active # 1 = in safe mode
Histograms¶
# Latency distributions (seconds)
arqonhpo_eval_duration_seconds_bucket{le="0.001"}
arqonhpo_eval_duration_seconds_bucket{le="0.01"}
arqonhpo_eval_duration_seconds_bucket{le="0.1"}
arqonhpo_eval_duration_seconds_bucket{le="1"}
arqonhpo_eval_duration_seconds_bucket{le="10"}
arqonhpo_eval_duration_seconds_bucket{le="+Inf"}
arqonhpo_ask_duration_seconds_bucket
arqonhpo_apply_duration_seconds_bucket
PromQL Examples¶
Throughput¶
# Evaluations per second
rate(arqonhpo_tell_calls_total[5m])
# Candidates per second
rate(arqonhpo_candidates_emitted_total[5m])
Latency¶
# 50th percentile eval latency
histogram_quantile(0.5, rate(arqonhpo_eval_duration_seconds_bucket[5m]))
# 99th percentile eval latency
histogram_quantile(0.99, rate(arqonhpo_eval_duration_seconds_bucket[5m]))
# Average ask latency
rate(arqonhpo_ask_duration_seconds_sum[5m]) / rate(arqonhpo_ask_duration_seconds_count[5m])
Safety¶
# Violation rate
rate(arqonhpo_violations_total[5m])
# Violations by type
sum by (type) (rate(arqonhpo_violations_total[5m]))
# Rollback frequency
increase(arqonhpo_rollbacks_total[1h])
# Safe mode duration
changes(arqonhpo_safe_mode_active[1h])
Progress¶
# Budget consumption rate
rate(arqonhpo_history_len[5m])
# Time to completion (estimate)
arqonhpo_budget_remaining / rate(arqonhpo_history_len[5m])
Grafana Dashboard¶
Import JSON¶
{
"title": "ArqonHPO Monitoring",
"panels": [
{
"title": "Throughput",
"type": "graph",
"targets": [
{
"expr": "rate(arqonhpo_tell_calls_total[5m])",
"legendFormat": "evals/s"
}
]
},
{
"title": "Best Objective",
"type": "stat",
"targets": [{ "expr": "arqonhpo_best_value", "legendFormat": "best" }]
},
{
"title": "Eval Latency p99",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.99, rate(arqonhpo_eval_duration_seconds_bucket[5m]))",
"legendFormat": "p99"
}
]
},
{
"title": "Violations",
"type": "graph",
"targets": [
{
"expr": "sum by (type) (rate(arqonhpo_violations_total[5m]))",
"legendFormat": "{{type}}"
}
]
},
{
"title": "Budget Progress",
"type": "gauge",
"targets": [
{
"expr": "1 - (arqonhpo_budget_remaining / (arqonhpo_history_len + arqonhpo_budget_remaining))",
"legendFormat": "progress"
}
]
},
{
"title": "Safe Mode",
"type": "stat",
"targets": [
{ "expr": "arqonhpo_safe_mode_active", "legendFormat": "active" }
],
"valueMappings": [
{ "value": 0, "text": "Normal", "color": "green" },
{ "value": 1, "text": "SAFE MODE", "color": "red" }
]
}
]
}
Alerting Rules¶
Add to Prometheus rules:
groups:
- name: arqonhpo
rules:
- alert: ArqonHighViolationRate
expr: rate(arqonhpo_violations_total[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High violation rate in ArqonHPO"
description: "{{ $value }} violations/s over 5 minutes"
- alert: ArqonSafeModeActive
expr: arqonhpo_safe_mode_active == 1
for: 1m
labels:
severity: critical
annotations:
summary: "ArqonHPO in safe mode"
description: "Optimization paused due to safety trigger"
- alert: ArqonHighEvalLatency
expr: histogram_quantile(0.99, rate(arqonhpo_eval_duration_seconds_bucket[5m])) > 60
for: 10m
labels:
severity: warning
annotations:
summary: "High evaluation latency"
description: "p99 latency {{ $value }}s"
- alert: ArqonStalled
expr: rate(arqonhpo_tell_calls_total[30m]) == 0
for: 30m
labels:
severity: warning
annotations:
summary: "ArqonHPO appears stalled"
description: "No evaluations in 30 minutes"
Kubernetes ServiceMonitor¶
For Prometheus Operator:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: arqonhpo
labels:
app: arqonhpo
spec:
selector:
matchLabels:
app: arqonhpo
endpoints:
- port: metrics
interval: 15s
path: /metrics
Next Steps¶
- Observability Overview — All observability features
- Dashboard — Web UI
- Kubernetes — K8s deployment