Observability¶
ArqonHPO provides comprehensive observability through structured logs, Prometheus metrics, and real-time dashboards.
Structured Logs¶
The CLI uses tracing and emits structured logs to stderr.
Enable JSON Output¶
Log Levels¶
| Level | Content |
|---|---|
| error | Failures only |
| warn | Violations, guardrail triggers |
| info | Ask/tell events, phase changes |
| debug | Strategy decisions, proposal details |
| trace | SPSA iterations, config snapshots |
Log Fields¶
| Field | Description |
|---|---|
| command | Current CLI command |
| config | Path to config file |
| state | Path to state file |
| artifact | Path to artifact file |
| phase | Current PCR phase |
| iteration | SPSA iteration count |
| generation | Config generation number |
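The levels and fields above can be used to filter log output after the fact. The following is a minimal sketch, assuming that JSON output is one object per line with a top-level level string and a fields object (the layout commonly produced by the tracing-subscriber JSON formatter). The sample lines and the exact key names are assumptions for illustration, not captured ArqonHPO output.

```python
import json

# Severity order, from most to least severe, matching the Log Levels table.
LEVELS = ["error", "warn", "info", "debug", "trace"]


def parse_log_line(line):
    """Parse one JSON log line; return None for blank or non-JSON lines."""
    line = line.strip()
    if not line:
        return None
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    return record if isinstance(record, dict) else None


def at_least(record, threshold):
    """True if the record is at `threshold` severity or more severe."""
    level = str(record.get("level", "")).lower()
    return level in LEVELS and LEVELS.index(level) <= LEVELS.index(threshold)


# Hypothetical sample lines; the real key layout may differ.
sample = [
    '{"level": "INFO", "fields": {"message": "tell", "phase": "refine", "iteration": 12}}',
    '{"level": "WARN", "fields": {"message": "guardrail triggered", "generation": 4}}',
    '{"level": "TRACE", "fields": {"message": "spsa step", "iteration": 13}}',
    "not json",
]

records = [r for r in map(parse_log_line, sample) if r is not None]
warnings = [r for r in records if at_least(r, "warn")]
iterations = [
    r["fields"]["iteration"] for r in records if "iteration" in r.get("fields", {})
]
print(len(records), len(warnings), iterations)
```

In practice the lines would come from the CLI's stderr (for example, redirected to a file) rather than from an inline list.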
Prometheus Metrics¶
Enable metrics endpoint:
arqonhpo --metrics-addr 127.0.0.1:9898 <command>
# Or with dashboard
arqonhpo dashboard --state state.json --metrics-addr 127.0.0.1:9898
Scrape at http://127.0.0.1:9898/metrics.
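For quick checks without a Prometheus server, the endpoint's response can be parsed directly. This is a minimal sketch of a parser for the Prometheus text exposition format that handles only simple cases (no timestamps, no commas or escaped quotes inside label values). The sample payload and the label value "bounds" are hypothetical.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {(name, labels): value}.

    Simplified: skips comments, assumes no trailing timestamps, and assumes
    label values contain no commas or escaped quotes.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        if "{" in series:
            name, raw = series.split("{", 1)
            pairs = [p.split("=", 1) for p in raw.rstrip("}").split(",") if p]
            labels = tuple(sorted((k, v.strip('"')) for k, v in pairs))
        else:
            name, labels = series, ()
        samples[(name, labels)] = float(value)
    return samples


# Hypothetical payload; a live one could be fetched with
# urllib.request.urlopen("http://127.0.0.1:9898/metrics").read().decode()
payload = """\
# HELP arqonhpo_tell_calls_total Total tell() invocations
# TYPE arqonhpo_tell_calls_total counter
arqonhpo_tell_calls_total 42
arqonhpo_violations_total{type="bounds"} 3
arqonhpo_best_value 0.125
"""

samples = parse_metrics(payload)
print(samples[("arqonhpo_best_value", ())])
```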
Counters¶
| Metric | Labels | Description |
|---|---|---|
| arqonhpo_ask_calls_total | - | Total ask() invocations |
| arqonhpo_tell_calls_total | - | Total tell() invocations |
| arqonhpo_candidates_emitted_total | - | Total candidates generated |
| arqonhpo_results_ingested_total | - | Total results processed |
| arqonhpo_violations_total | type | Safety violations by type |
| arqonhpo_rollbacks_total | - | Rollback operations |
Gauges¶
| Metric | Labels | Description |
|---|---|---|
| arqonhpo_history_len | - | Current history size |
| arqonhpo_budget_remaining | - | Remaining evaluation budget |
| arqonhpo_best_value | - | Current best objective value |
| arqonhpo_config_generation | - | Current config generation |
| arqonhpo_spsa_iteration | - | Current SPSA iteration |
| arqonhpo_safe_mode_active | - | 1 if in safe mode, 0 otherwise |
Histograms¶
| Metric | Buckets | Description |
|---|---|---|
| arqonhpo_eval_duration_seconds | 0.001, 0.01, 0.1, 1, 10 | Evaluation latency |
| arqonhpo_ask_duration_seconds | 0.0001, 0.001, 0.01, 0.1 | Ask latency |
| arqonhpo_apply_duration_seconds | 0.00001, 0.0001, 0.001 | Config apply latency |
Example Queries (PromQL)¶
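# Results reported per second (5-minute average)
rate(arqonhpo_tell_calls_total[5m])
# 99th percentile eval latency
histogram_quantile(0.99, sum by (le) (rate(arqonhpo_eval_duration_seconds_bucket[5m])))
# Violation rate by type
sum by (type) (rate(arqonhpo_violations_total[5m]))
# Config updates in the last minute (the metric is a gauge, so use changes(), not rate())
changes(arqonhpo_config_generation[1m])
# Safe mode transitions (enter or exit) in the last hour
changes(arqonhpo_safe_mode_active[1h])
# Seconds spent in safe mode over the last hour
avg_over_time(arqonhpo_safe_mode_active[1h]) * 3600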
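To make the latency query easier to interpret, here is a minimal sketch of the linear interpolation that histogram_quantile performs over cumulative buckets. It uses the arqonhpo_eval_duration_seconds bucket bounds from the table above; the bucket counts are hypothetical, and the function is a simplified model of the PromQL behavior rather than a full reimplementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound and
    ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread uniformly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for the eval-duration buckets.
eval_buckets = [
    (0.001, 0),
    (0.01, 10),
    (0.1, 60),
    (1.0, 95),
    (10.0, 100),
    (math.inf, 100),
]

p50 = histogram_quantile(0.50, eval_buckets)
p99 = histogram_quantile(0.99, eval_buckets)
print(round(p50, 3), round(p99, 3))
```

Because the estimate is interpolated within a bucket, a p99 that lands in the 1 to 10 second bucket can be off by several seconds. Treat such values as coarse indicators.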
Grafana Dashboard¶
Import this JSON or build from these panels:
Recommended Panels¶
- Throughput: rate(arqonhpo_tell_calls_total[5m])
- Best Value: arqonhpo_best_value
- Eval Latency p99: histogram_quantile(0.99, ...)
- Violations: rate(arqonhpo_violations_total[5m]) by type
- Budget Remaining: arqonhpo_budget_remaining
- Safe Mode Status: arqonhpo_safe_mode_active
Alert Rules¶
groups:
  - name: arqonhpo
    rules:
      - alert: HighViolationRate
        expr: rate(arqonhpo_violations_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High violation rate in ArqonHPO"
      - alert: SafeModeActive
        expr: arqonhpo_safe_mode_active == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ArqonHPO entered safe mode"
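The for: clause in these rules means an alert fires only after its expression has held continuously for that long; a brief blip resets the timer. A minimal sketch of that behavior, using hypothetical 30-second samples of arqonhpo_safe_mode_active and the 1m window from the SafeModeActive rule:

```python
def firing_times(samples, condition, for_seconds):
    """Return the timestamps at which an alert with a `for:` window is firing.

    samples: list of (timestamp_seconds, value) in time order.
    condition: function returning True when the alert expression matches.
    """
    firing = []
    pending_since = None
    for ts, value in samples:
        if condition(value):
            if pending_since is None:
                pending_since = ts  # Alert becomes pending.
            if ts - pending_since >= for_seconds:
                firing.append(ts)  # Held long enough: alert is firing.
        else:
            pending_since = None  # Condition cleared: reset the timer.
    return firing


# Hypothetical gauge samples, one every 30 seconds.
safe_mode = [(0, 0), (30, 1), (60, 1), (90, 0), (120, 1), (150, 1), (180, 1), (210, 1)]

fired = firing_times(safe_mode, lambda v: v == 1, for_seconds=60)
print(fired)
```

The first safe-mode episode (30 s to 60 s) clears before the window elapses, so only the second episode produces a firing alert.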
Tracing Spans¶
When log-level=trace, spans are emitted:
| Span | Parent | Description |
|---|---|---|
| ask | - | Full ask() operation |
| probe | ask | Probe phase sampling |
| classify | ask | Landscape classification |
| refine | ask | Strategy execution |
| tell | - | Full tell() operation |
| apply | tell | Config application |
| guardrails | apply | Safety checks |
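The parent column defines two span trees, rooted at ask and tell. As a rough illustration, the sketch below encodes the table as a parent map and derives the full path and children of each span. The helper names are hypothetical and not part of ArqonHPO.

```python
# Parent of each span, transcribed from the table above (None = root span).
SPAN_PARENTS = {
    "ask": None,
    "probe": "ask",
    "classify": "ask",
    "refine": "ask",
    "tell": None,
    "apply": "tell",
    "guardrails": "apply",
}


def span_path(name):
    """Return the root-to-span path, e.g. 'tell/apply/guardrails'."""
    path = []
    while name is not None:
        path.append(name)
        name = SPAN_PARENTS[name]
    return "/".join(reversed(path))


def children(name):
    """Return the direct children of a span, in table order."""
    return [span for span, parent in SPAN_PARENTS.items() if parent == name]


roots = children(None)
print(roots, span_path("guardrails"), children("ask"))
```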
OpenTelemetry (Planned)¶
OTel export planned for v0.4. Track: Issue #XX
TUI Monitoring¶
Real-time terminal dashboard:
See TUI Reference for details.
Web Dashboard¶
Browser-based monitoring:
See Dashboard Reference for API endpoints.
Python Observability Patterns¶
The ArqonHPO solver logic runs in-memory. To enable real-time monitoring via the TUI or Dashboard (which watch the filesystem), you must explicitly "mirror" the solver's state to disk.
The "State Mirror" Pattern¶
Use the dump_state utility provided by the SDK to broadcast internal state updates.
import arqonhpo
# In your optimization loop:
while (batch := solver.ask()):
    # ... evaluate batch ...
    # Mirror state to disk for TUI/Dashboard
    arqonhpo.dump_state(
        config=config,
        history=history,
        run_id="my-run-001",
        delay=0.1,  # Optional: slight delay for smooth TUI visualization
    )
    solver.tell(results)
[!TIP] Human-in-the-Loop: Adding a small delay (e.g., 0.1 s to 0.5 s) in the loop slows the visualization enough for a person to follow the optimization in real time.
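Since the TUI and Dashboard watch the filesystem, a reader may open the state file while it is being rewritten. If you ever write a mirror file yourself, one common way to avoid exposing a half-written file is to write to a temporary file and rename it into place. The sketch below shows that general technique only; it is not a description of how dump_state works, and the state layout shown is hypothetical.

```python
import json
import os
import tempfile


def write_state_atomically(path, state):
    """Write `state` as JSON so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # Atomic swap on POSIX and Windows.
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    state_path = os.path.join(workdir, "state.json")
    # Hypothetical state layout, for illustration only.
    write_state_atomically(state_path, {"run_id": "my-run-001", "history": [0.9, 0.4]})
    write_state_atomically(state_path, {"run_id": "my-run-001", "history": [0.9, 0.4, 0.3]})
    with open(state_path, encoding="utf-8") as handle:
        mirrored = json.load(handle)
    leftovers = [name for name in os.listdir(workdir) if name.endswith(".tmp")]

print(len(mirrored["history"]), leftovers)
```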
Invariant: Minimization Only!¶
[!WARNING] ArqonHPO is a Minimizer. It strictly seeks the lowest possible objective value. If your metric is a "Score" or "Accuracy" (where higher is better), you MUST invert it before reporting.
The Maximization Trap: If you report accuracy (e.g., 0.95), the solver will try to minimize it towards 0.0.
Correct Usage: Use arqonhpo.invert_score or manually negate.
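A minimal sketch of manual negation, using hypothetical accuracy values. It shows which configuration a minimizer would prefer with and without inversion. The behavior of arqonhpo.invert_score is not shown here, since its exact signature is not covered on this page.

```python
def to_minimization_objective(score):
    """Negate a higher-is-better score so that lower is better."""
    return -score


# Hypothetical accuracies for three configurations.
accuracies = {"config_a": 0.91, "config_b": 0.95, "config_c": 0.88}

# What a minimizer would pick if the objective is negated first.
reported = {name: to_minimization_objective(acc) for name, acc in accuracies.items()}
best_if_inverted = min(reported, key=reported.get)

# What a minimizer would pick if raw accuracy were reported by mistake.
best_if_raw = min(accuracies, key=accuracies.get)

print(best_if_inverted, best_if_raw)
```

With negation the minimizer favors the most accurate configuration; without it, the least accurate one.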
Audit Events¶
All safety-relevant events are logged to the audit trail:
| Event | Trigger |
|---|---|
| apply_success | Config update applied |
| apply_rejected | Proposal violated guardrails |
| rollback | Reverted to baseline |
| safe_mode_enter | Entered safe mode |
| safe_mode_exit | Exited safe mode |
| baseline_set | New baseline established |
Access via:
- Dashboard API: GET /api/events
- CLI export: arqonhpo export --state state.json
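Once events are retrieved, they can be summarized offline. The sketch below counts events by type and pairs safe_mode_enter with safe_mode_exit to recover safe-mode intervals. The record shape (an event name plus a numeric ts timestamp) is an assumption for illustration; check the actual export or API response for the real field names.

```python
from collections import Counter

# Hypothetical audit records; real field names may differ.
events = [
    {"event": "baseline_set", "ts": 0.0},
    {"event": "apply_success", "ts": 5.0},
    {"event": "apply_rejected", "ts": 9.0},
    {"event": "safe_mode_enter", "ts": 10.0},
    {"event": "rollback", "ts": 10.5},
    {"event": "safe_mode_exit", "ts": 40.0},
    {"event": "apply_success", "ts": 55.0},
]


def safe_mode_intervals(records):
    """Pair each safe_mode_enter with the next safe_mode_exit."""
    intervals = []
    entered = None
    for record in records:
        if record["event"] == "safe_mode_enter" and entered is None:
            entered = record["ts"]
        elif record["event"] == "safe_mode_exit" and entered is not None:
            intervals.append((entered, record["ts"]))
            entered = None
    return intervals


counts = Counter(record["event"] for record in events)
intervals = safe_mode_intervals(events)
total_safe_mode = sum(end - start for start, end in intervals)
print(counts["apply_success"], intervals, total_safe_mode)
```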
Next Steps¶
- Dashboard — Web UI reference
- Hotpath API — Internal telemetry APIs
- Safety — Understanding violations