Skip to content

Arqon Runtime Optimizer

Inference

Initializing search

novelbytelabs/ArqonHPO

Home
Why
Demos
Benchmarks
Docs
ArqonShip
Community
Project

Arqon Runtime Optimizer

novelbytelabs/ArqonHPO

Home
Why
Why
- Product Overview
- Architecture
Demos
Demos
- Solutions Overview
- Inference Inference
  Table of contents
- Systems
- Edge
Benchmarks
Benchmarks
- Benchmarks
- Methodology
Docs
Docs
- Index
- Installation
- Quickstart
- Concepts
  Concepts
- Reference
  Reference
  - Python API
  - Rust Core
  - Hotpath API
  - CLI
  - TUI
  - Dashboard
  - Observability
  - Error Codes
- Cookbook
  Cookbook
  - Cookbook
  - ML Tuning
  - Sim Tuning
  - Warm Starting
  - Kubernetes
  - FastAPI
  - Ray Tune
  - MLflow
- Integrations
  Integrations
  - Integrations
  - Prometheus
- Migration
- Troubleshooting
- FAQ
ArqonShip
ArqonShip
Community
Project
Project
- About
- Roadmap
- Changelog
- Contributing
- License
- CI/CD Runbook
- Test Coverage
- Constitution

Table of contents

The Challenge
What to Optimize
Example: Dynamic Batch Sizing
Integration Points
Safety for Production
Next Steps

Inference Serving¶

ArqonHPO optimizes LLM serving infrastructure in real-time, adapting to traffic patterns and model behavior.

The Challenge¶

LLM inference costs dominate AI budgets. Static configurations leave performance on the table:

Batch size too small = GPU underutilized
Batch size too large = p99 latency blows SLA
KV-cache miss = expensive recomputation

ArqonHPO continuously tunes these parameters as traffic shifts.

What to Optimize¶

Parameter	Impact
`batch_size`	Throughput vs. latency
`max_tokens`	Memory vs. completion length
`kv_cache_size`	Hit rate vs. memory
`speculative_decoding_k`	Speed vs. accuracy
`router_weights`	Model selection (small vs. large)

Example: Dynamic Batch Sizing¶

import json
from arqonhpo import ArqonSolver

config = {
    "seed": 42,
    "budget": 10000,
    "bounds": {
        "batch_size": {"min": 1, "max": 64, "scale": "Log"},
        "max_wait_ms": {"min": 10, "max": 500}
    }
}

solver = ArqonSolver(json.dumps(config))

while True:
    candidate = solver.ask_one()
    if candidate is None:
        break

    # Apply new batch config
    serving_engine.set_batch_config(
        batch_size=int(candidate["batch_size"]),
        max_wait_ms=candidate["max_wait_ms"]
    )

    # Measure over 60 seconds
    metrics = serving_engine.collect_metrics(duration_s=60)

    # Objective: maximize throughput while keeping p99 < SLA
    if metrics.p99_latency_ms > SLA_MS:
        reward = -1000  # Penalty for SLA violation
    else:
        reward = -metrics.cost_per_token  # Minimize cost

    solver.seed(json.dumps([{
        "params": candidate,
        "value": reward,
        "cost": 60.0  # Cost = seconds spent
    }]))

Integration Points¶

vLLM¶

# Tune PagedAttention parameters
config["bounds"]["block_size"] = {"min": 8, "max": 32}
config["bounds"]["gpu_memory_utilization"] = {"min": 0.7, "max": 0.95}

TensorRT-LLM¶

# Tune inflight batching
config["bounds"]["max_batch_size"] = {"min": 1, "max": 256}
config["bounds"]["max_queue_delay_ms"] = {"min": 1, "max": 100}

Triton Inference Server¶

# Tune dynamic batching
config["bounds"]["preferred_batch_size"] = {"min": 1, "max": 128}
config["bounds"]["max_queue_delay_microseconds"] = {"min": 100, "max": 10000}

Safety for Production¶

Use Guardrails to prevent catastrophic configurations:

{
  "bounds": [
    [1, 64],
    [10, 500]
  ],
  "max_delta": [8, 100],
  "max_updates_per_second": 1.0
}

max_delta prevents wild swings in batch size
rate limit prevents oscillation

Next Steps¶

Batch vs Online Mode
Safety Deep Dive
Observability

January 10, 2026 December 14, 2025