Observability¶

ArqonHPO provides comprehensive observability through structured logs, Prometheus metrics, and real-time dashboards.

Structured Logs¶

The CLI uses tracing and emits structured logs to stderr.

Enable JSON Output¶

arqonhpo --log-format json --log-level info <command>

Log Levels¶

Level	Content
`error`	Failures only
`warn`	Violations, guardrail triggers
`info`	Ask/tell events, phase changes
`debug`	Strategy decisions, proposal details
`trace`	SPSA iterations, config snapshots

Log Fields¶

Field	Description
`command`	Current CLI command
`config`	Path to config file
`state`	Path to state file
`artifact`	Path to artifact file
`phase`	Current PCR phase
`iteration`	SPSA iteration count
`generation`	Config generation number

Prometheus Metrics¶

Enable metrics endpoint:

arqonhpo --metrics-addr 127.0.0.1:9898 <command>

# Or with dashboard
arqonhpo dashboard --state state.json --metrics-addr 127.0.0.1:9898

Scrape at http://127.0.0.1:9898/metrics.

Counters¶

Metric	Labels	Description
`arqonhpo_ask_calls_total`	—	Total ask() invocations
`arqonhpo_tell_calls_total`	—	Total tell() invocations
`arqonhpo_candidates_emitted_total`	—	Total candidates generated
`arqonhpo_results_ingested_total`	—	Total results processed
`arqonhpo_violations_total`	`type`	Safety violations by type
`arqonhpo_rollbacks_total`	—	Rollback operations

Gauges¶

Metric	Labels	Description
`arqonhpo_history_len`	—	Current history size
`arqonhpo_budget_remaining`	—	Remaining evaluation budget
`arqonhpo_best_value`	—	Current best objective value
`arqonhpo_config_generation`	—	Current config generation
`arqonhpo_spsa_iteration`	—	Current SPSA iteration
`arqonhpo_safe_mode_active`	—	1 if in safe mode, 0 otherwise

Histograms¶

Metric	Buckets	Description
`arqonhpo_eval_duration_seconds`	0.001, 0.01, 0.1, 1, 10	Evaluation latency
`arqonhpo_ask_duration_seconds`	0.0001, 0.001, 0.01, 0.1	Ask latency
`arqonhpo_apply_duration_seconds`	0.00001, 0.0001, 0.001	Config apply latency

Example Queries (PromQL)¶

# Average evaluations per second
rate(arqonhpo_tell_calls_total[5m])

# 99th percentile eval latency
histogram_quantile(0.99, rate(arqonhpo_eval_duration_seconds_bucket[5m]))

# Violation rate by type
rate(arqonhpo_violations_total[5m])

# Config update frequency
rate(arqonhpo_config_generation[1m])

# Safe mode duration
changes(arqonhpo_safe_mode_active[1h])

Grafana Dashboard¶

Import this JSON or build from these panels:

Recommended Panels¶

Throughput — rate(arqonhpo_tell_calls_total[5m])
Best Value — arqonhpo_best_value
Eval Latency p99 — histogram_quantile(0.99, ...)
Violations — rate(arqonhpo_violations_total[5m]) by type
Budget Remaining — arqonhpo_budget_remaining
Safe Mode Status — arqonhpo_safe_mode_active

Alert Rules¶

groups:
  - name: arqonhpo
    rules:
      - alert: HighViolationRate
        expr: rate(arqonhpo_violations_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High violation rate in ArqonHPO"

      - alert: SafeModeActive
        expr: arqonhpo_safe_mode_active == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ArqonHPO entered safe mode"

Tracing Spans¶

When log-level=trace, spans are emitted:

Span	Parent	Description
`ask`	—	Full ask() operation
`probe`	`ask`	Probe phase sampling
`classify`	`ask`	Landscape classification
`refine`	`ask`	Strategy execution
`tell`	—	Full tell() operation
`apply`	`tell`	Config application
`guardrails`	`apply`	Safety checks

OpenTelemetry (Planned)¶

OTel export planned for v0.4. Track: Issue #XX

TUI Monitoring¶

Real-time terminal dashboard:

arqonhpo tui --state state.json --events events.jsonl

See TUI Reference for details.

Web Dashboard¶

Browser-based monitoring:

arqonhpo dashboard --state state.json --addr 127.0.0.1:3030

See Dashboard Reference for API endpoints.

Python Observability Patterns¶

The ArqonHPO solver logic runs in-memory. To enable real-time monitoring via the TUI or Dashboard (which watch the filesystem), you must explicitly "mirror" the solver's state to disk.

The "State Mirror" Pattern¶

Use the dump_state utility provided by the SDK to broadcast internal state updates.

import arqonhpo

# In your optimization loop:
while (batch := solver.ask()):
    # ... evaluate batch ...

    # Mirror state to disk for TUI/Dashboard
    arqonhpo.dump_state(
        config=config, 
        history=history, 
        run_id="my-run-001",
        delay=0.1  # Optional: slight delay for smooth TUI visualization
    )

    solver.tell(results)

[!TIP] Human-in-the-Loop: Adding a small delay (e.g., 0.1s - 0.5s) in the loop makes the visualization smooth enough for humans to follow the optimization journey in real-time.

Invariant: Minimization Only!¶

[!WARNING] ArqonHPO is a Minimizer. It strictly seeks the lowest possible objective value. If your metric is logical "Score" or "Accuracy" (where higher is better), you MUST invert it before reporting.

The Maximization Trap: If you report accuracy (e.g., 0.95), the solver will try to minimize it towards 0.0.

Correct Usage: Use arqonhpo.invert_score or manually negate.

accuracy = 0.95
loss = arqonhpo.invert_score(accuracy) # Returns 0.05
# OR
loss = -accuracy

Audit Events¶

All safety-relevant events are logged to the audit trail:

Event	Trigger
`apply_success`	Config update applied
`apply_rejected`	Proposal violated guardrails
`rollback`	Reverted to baseline
`safe_mode_enter`	Entered safe mode
`safe_mode_exit`	Exited safe mode
`baseline_set`	New baseline established

Access via:

Dashboard API: GET /api/events
CLI export: arqonhpo export --state state.json

Next Steps¶

Dashboard — Web UI reference
Hotpath API — Internal telemetry APIs
Safety — Understanding violations