Ultimate VOS Reference Architecture

This document captures the target architecture for Arqon Maestro as the universal Voice OS substrate for the Arqon ecosystem.

It is intentionally a living design document. It records the current strategic shape, the target-state runtime boundaries, and the architectural decisions that should guide iteration. It does not claim that every component described here is already implemented.

Candidate Stack Options (Exploratory)

The stack options discussed during architecture exploration are possibilities only. They are not approved defaults, implementation commitments, or final decisions for Arqon Maestro.

They should be treated as candidate directions to evaluate against runtime evidence, operational constraints, and governance requirements before adoption.

Purpose

Arqon Maestro should not evolve into a generic voice assistant.

It should evolve into a voice-native operating substrate with:

  • a deterministic reflex lane for fast, validated commands
  • a cognitive lane for agentic reasoning and multi-step orchestration
  • ArqonMCP as the centralized command fabric and policy core
  • constitutive integrity gates before high-impact execution
  • a swappable shell, speech, and voice-output stack
  • multi-agent voice identity as a first-class UX and systems concern

Design Principles

  1. Voice is an OS control fabric, not a chatbot skin.
  2. Deterministic execution outranks fashionable model-centric design.
  3. The hot path stays local, low-latency, and interruption-safe.
  4. Speech input, reasoning, and voice output are separate contracts.
  5. No single vendor, shell, or TTS engine may become a hard lock-in.
  6. Agent identity includes voice identity.
  7. High-impact actions fail closed when truth or policy is uncertain.
  8. ArqonMCP is the unified control plane, but hot execution should remain distributable.
  9. MCP may be JSON-RPC at the edge, but internal Arqon contracts should trend protobuf-first.

Maestro Identity

Arqon Maestro should be treated as an AGO whose identity is the Voice Operating System.

Its job is to answer:

  • how speech becomes command
  • how spoken command becomes governed action
  • how wake, barge-in, dictation, and mode switching behave
  • how users operate software, code, and tools by voice

This is a narrower and stronger identity than "assistant."

Architectural rule:

  • Maestro is the spoken operating substrate
  • it should not collapse into a generic personal assistant

Maestro And Nexus

If Arqon Nexus emerges as the intelligent personal assistant AGO, then the clean relationship is:

  • Maestro = the Voice Operating System
  • Nexus = the intelligent personal assistant

These should be modeled as sibling AGOs and co-processors on Arqon Bus.

Maestro should own:

  • the hot voice-operating path
  • interruption-safe command handling
  • spoken operating grammar
  • deterministic execution handoff

Nexus should own:

  • personal context
  • assistant continuity
  • long-horizon planning
  • contextual suggestions
  • agentic guidance

Architectural rule:

  • Nexus must not swallow Maestro
  • Maestro must not try to absorb the whole assistant role

The clean division is:

Maestro speaks, hears, commands, and operates. Nexus knows, assists, remembers, and guides.

Current Anchors

This reference architecture is grounded in the current repository and runtime evidence:

  • legacy local voice path through core, speech-engine, and code-engine
  • Arqon Bus transport on 9100
  • ArqonMCP integration as the intended skill and command substrate
  • address-first routing and CFH work already documented in the voice plane
  • integrity handshake and fail-closed review behavior
  • control-plane coordinator with idempotency and arbitration
  • provider-based TTS abstraction with Kokoro sidecar direction

These anchors must be preserved while the system evolves.

Near-Term Named Defaults

The following stack elements are explicitly selected as near-term defaults for Maestro:

  • Wake word: openWakeWord
  • VAD: Silero VAD
  • STT (command-fast): whisper.cpp as leading local default
  • STT (dictation-accurate): separate benchmark-driven provider track (not locked)
  • TTS: Kokoro (primary) with Piper as local fallback

These defaults are implementation targets, not a permanent lock-in. They still sit behind provider/service contracts so components can evolve without rewriting core runtime boundaries.

Architectural Thesis

The core split is:

  • Reflex Lane: fast, deterministic, grammar-driven, interrupt-safe
  • Cognitive Lane: planner-driven, tool-using, memory-aware, slower

The system should route every utterance into one of a small number of execution classes:

  • reflex command
  • structured command
  • cognitive request
  • dictation
  • ignore / ambient

This classification boundary is one of the main moats of the system.
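
As a sketch of that boundary, the classification can be expressed as a small closed enum plus a routing function. This is illustrative Python only: the phrase table, the command prefixes, and the mode handling are placeholder assumptions, not the real classifier.

```python
from enum import Enum, auto

class ExecutionClass(Enum):
    REFLEX_COMMAND = auto()
    STRUCTURED_COMMAND = auto()
    COGNITIVE_REQUEST = auto()
    DICTATION = auto()
    IGNORE_AMBIENT = auto()

# Hypothetical reflex phrases and command prefixes, for illustration only.
REFLEX_PHRASES = {"stop", "cancel", "undo", "mute", "wake"}
COMMAND_PREFIXES = ("open ", "run ", "focus ")

def classify(transcript: str, mode: str) -> ExecutionClass:
    """Route one utterance into exactly one execution class."""
    text = transcript.strip().lower()
    if not text:
        return ExecutionClass.IGNORE_AMBIENT
    if text in REFLEX_PHRASES:
        return ExecutionClass.REFLEX_COMMAND
    if mode == "dictation":
        return ExecutionClass.DICTATION
    if text.startswith(COMMAND_PREFIXES):
        return ExecutionClass.STRUCTURED_COMMAND
    return ExecutionClass.COGNITIVE_REQUEST
```

The important property is that the function is total: every utterance lands in exactly one class, and the cognitive lane is the residual, not the default.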

Revised Architectural Core

The correct future-state framing is not:

voice stack -> tools

It is:

voice ingress plane -> ArqonMCP -> execution fabric -> speech / UI response

ArqonMCP should be treated as:

  • the command fabric
  • the routing authority
  • the governance boundary
  • the skill and version registry
  • the provenance and rollback spine

Voice is therefore a front-end plane feeding a governed execution kernel.

Top-Level Runtime

flowchart LR
  Mic[Microphone] --> Audio[maestro-audio]
  Audio --> Turn[maestro-turn]
  Turn --> STT[maestro-stt]
  STT --> Adapter[maestro-voice-adapter]
  Adapter --> ReflexArbiter[privileged reflex arbiter]
  ReflexArbiter --> MCP[ArqonMCP]

  MCP --> Sense[Sense]
  Sense --> Sentinel[Sentinel]
  Sentinel --> Ladder[L0/L1/L2/L3/L4 ladder]
  Ladder --> Zero[Zero execution]
  Ladder --> Cortex[Cortex compiler]
  Zero --> Executor[maestro-executor]
  Cortex --> Executor
  Executor --> Targets[Editors / Desktop / Tools / Agents]

  Zero --> Memory[maestro-memory]
  Cortex --> Memory
  Memory --> TTSBroker[maestro-tts-broker]
  TTSBroker --> Speaker[Speaker output]

  Audio -. barge-in .-> TTSBroker

Service Map

maestro-shell

Owns the desktop shell, system tray behavior, permissions UX, active-app awareness, settings, and operator-facing control surfaces.

Target direction:

  • migrate from Electron to Tauri shell
  • keep a thin shell layer
  • move hot-path runtime concerns out of the shell

Why:

  • Electron is a useful recovery shell and shipping vehicle
  • Tauri is a better long-term shell for a hardened local Voice OS because it reduces runtime bulk and aligns better with a Rust-heavy systems core

Architectural rule:

  • shell migration must not rewrite the voice runtime contracts
  • shell is a host, not the brainstem

maestro-audio

Owns:

  • microphone capture
  • PCM framing
  • timestamps
  • device selection
  • gain normalization
  • optional denoise / echo cleanup

This should be treated as a hot-path systems component and should trend Rust-native.

maestro-turn

Owns:

  • speech start / stop inference
  • turn completion
  • interruption detection
  • barge-in
  • cancel / stop priority handling
  • distinction between dictation pauses and command completion

This is a first-class service, not a helper.

Weak turn logic makes a voice system feel stupid even when the models are strong.

maestro-stt

Owns speech recognition under at least two explicit profiles:

  • command-fast
  • dictation-accurate

Future optional profiles:

  • noisy-environment
  • low-power
  • secure-speaker-verified

Near-term baseline:

  • command-fast should target whisper.cpp
  • dictation-accurate remains provider-flexible and benchmark-selected

Architectural rule:

  • STT is a provider contract, not a single permanent engine choice

maestro-voice-adapter

Owns the translation from speech output into a structured ingress envelope for ArqonMCP.

It should attach:

  • transcript
  • partials
  • acoustic confidence
  • session id
  • speaker id when available
  • mode
  • timestamp range
  • interruption flag
  • dictation versus command hypothesis
  • probable route hint

Architectural rule:

  • voice input should enter ArqonMCP as structured metadata, not as naked text
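
A minimal sketch of that envelope as a Python dataclass. Field names paraphrase the list above; the actual Arqon schema is assumed, not confirmed.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class VoiceIngressEnvelope:
    """Structured ingress metadata attached by maestro-voice-adapter (illustrative)."""
    transcript: str
    partials: list = field(default_factory=list)
    acoustic_confidence: float = 0.0
    session_id: str = ""
    speaker_id: Optional[str] = None      # when available
    mode: str = "command"
    t_start_ms: int = 0
    t_end_ms: int = 0
    interruption: bool = False
    dictation_hypothesis: bool = False    # dictation vs command hypothesis
    route_hint: Optional[str] = None      # probable route hint
```

Everything downstream of the adapter operates on this structure, never on bare text.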

privileged reflex arbiter

Owns the smallest and fastest class of privileged voice reflexes:

  • stop
  • cancel
  • undo
  • mute
  • sleep
  • wake
  • pause
  • resume
  • push-to-talk controls
  • mode switches

Architectural rule:

  • these should not pay the full skill-routing cost
  • this is the shortest possible path in the system
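
The arbiter can be as small as a set-membership check ahead of all routing. A hedged sketch: the reflex set here is a subset of the list above, and real matching would also cover push-to-talk controls and mode switches.

```python
# Smallest privileged reflex set, checked before any skill routing.
PRIVILEGED_REFLEXES = {
    "stop", "cancel", "undo", "mute", "sleep", "wake", "pause", "resume",
}

def arbitrate(transcript: str):
    """Return the reflex to fire immediately, or None to fall through to ArqonMCP."""
    word = transcript.strip().lower()
    return word if word in PRIVILEGED_REFLEXES else None
```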

ArqonMCP

Owns the centralized command fabric.

It should be treated as the core operating substrate of Maestro rather than as a side integration.

Core responsibilities:

  • MCP ingress and request lifecycle
  • internal command-envelope normalization
  • routing policy
  • skill discovery and version selection
  • provenance and rollback hooks
  • deterministic execution preference
  • escalation to Cortex only when needed

Architectural rule:

  • ArqonMCP is logically centralized for policy and semantics
  • hot execution may still be distributed and cached locally

Sense

Owns low-overhead request shaping and contract attachment.

Responsibilities:

  • AgentEnvelope
  • BindingContract
  • routing hints
  • latency budgets
  • LLM allowance flags
  • confirmation and reversibility requirements

Sentinel

Owns constitutional safety and governance enforcement.

Responsibilities:

  • capability checks
  • cost and rate policy
  • destructive action policy
  • environment-state gating
  • speaker / operator authorization checks where required

speed ladder

Owns tiered routing through:

  • L0: exact cache
  • L1: SAS / Reflex address lookup
  • L2: pattern FSM
  • L3: classifier
  • L4: Cortex compiler

Architectural rule:

  • tiers should be treated as provenance sources, not just accuracy hacks
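
The ladder can be sketched as a straight tier-by-tier walk whose return value carries the matching tier as provenance. All of the lookup structures here are stand-ins for the real cache, SAS index, FSM, and classifier.

```python
def route(utterance, exact_cache, sas_index, fsm, classifier, contract_allows_llm):
    """Walk the ladder tier by tier; the tier that matched is the provenance."""
    if utterance in exact_cache:                 # L0: exact cache
        return ("L0", exact_cache[utterance])
    if utterance in sas_index:                   # L1: SAS / reflex address lookup
        return ("L1", sas_index[utterance])
    hit = fsm(utterance)                         # L2: pattern FSM
    if hit is not None:
        return ("L2", hit)
    hit = classifier(utterance)                  # L3: classifier
    if hit is not None:
        return ("L3", hit)
    if contract_allows_llm:                      # L4: Cortex, only if the contract allows
        return ("L4", "escalate-to-cortex")
    return ("refused", None)                     # fail closed rather than guess
```

Note the final branch: when no deterministic tier matches and the contract forbids LLM escalation, the route is refused rather than forced.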

maestro-router

Owns the top-level user-intent classes before and around ArqonMCP ingress:

  • wake / addressed-to-system
  • reflex command
  • structured command
  • cognitive request
  • dictation
  • ambient ignore

Architectural rule:

  • the router must prefer address-first resolution and deterministic grammar paths before escalating to cognitive reasoning

This lane should classify command, dictation, and conversation distinctly rather than collapsing them into one routing path.

maestro-reflex

Owns grammar-driven and address-first resolution on the deterministic path that feeds ArqonMCP and the execution plane.

This lane should absorb what is currently strongest in Serenade and extend it into broader OS control.

maestro-cortex

Owns:

  • multi-step planning
  • explanation
  • tool composition
  • research
  • agentic task execution
  • delegation across named agents

Architectural rule:

  • this lane must not sit in front of all commands
  • it is invoked when needed, not by default
  • Cortex is for novelty, synthesis, and skill creation; it is not the default route for operating commands.

maestro-integrity

Owns constitutive review of actions that may be:

  • destructive
  • high-trust
  • security-sensitive
  • externally consequential

The existing fail-closed action review pattern should remain the standard.

maestro-coordinator

Owns:

  • arbitration under contention
  • per-agent fairness
  • idempotency
  • retries
  • dead-lettering
  • fail-closed refusal when backend authority is unavailable

This becomes more important as the system gains more agents and more voice-triggered concurrency.
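
Idempotency in the coordinator amounts to keying execution on a request id and replaying the stored result for duplicates. A toy sketch: the cache here is an in-process dict, whereas the real coordinator would persist and scope this state.

```python
# Illustrative idempotency guard: a repeated request id returns the cached
# result instead of re-executing the action.
_results: dict = {}

def execute_once(request_id: str, action, *args):
    if request_id in _results:
        return _results[request_id]
    result = action(*args)
    _results[request_id] = result
    return result
```

The same key space is also where retries and dead-lettering attach: a retry reuses the request id, so a success that raced a timeout is never executed twice.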

maestro-executor

Owns actuation across three classes:

  • editor executor
  • desktop executor
  • tool / agent executor

Each class should have its own rollback and approval semantics.

maestro-memory

Owns:

  • session continuity
  • episodic memory
  • operator preferences
  • agent state snapshots
  • provenance
  • evidence trails
  • reversible checkpoints

This is where the Voice OS becomes persistent and trustworthy rather than stateless and theatrical.

maestro-tts-broker

Owns:

  • provider selection
  • streaming TTS playback
  • fallback policy
  • interruption-safe playback
  • voice identity resolution
  • persona / voice routing rules

Architectural rule:

  • the OS must not be tied to one TTS engine

Kokoro may be the preferred default, but the broker contract is the durable asset.

ArqonMCP Core Flow

Within ArqonMCP, the preferred flow should be:

voice envelope
-> MCP ingress
-> internal protobuf envelope
-> Sense
-> Sentinel
-> privileged reflex check if still applicable
-> L0 exact
-> L1 SAS
-> L2 pattern FSM
-> L3 classifier
-> L4 Cortex if contract allows
-> Zero / executor
-> provenance + rollback trace
-> UI/TTS/memory update

This is the actual command kernel of the Voice OS.

Process Boundaries

The runtime should be separated into four zones.

1. Desktop Host Zone

  • maestro-shell
  • plugin bridge
  • operator settings UI

2. Hot Path Local Runtime

  • maestro-audio
  • maestro-turn
  • maestro-voice-adapter
  • privileged reflex arbiter
  • maestro-router
  • maestro-reflex
  • maestro-tts-broker

This zone must remain responsive under load and should minimize dependency sprawl.

3. Governed Execution Zone

  • ArqonMCP
  • Sense
  • Sentinel
  • maestro-integrity
  • maestro-coordinator
  • maestro-executor
  • maestro-memory

This zone owns truth, policy, arbitration, and reversibility.

4. Heavy / Swappable Compute Zone

  • maestro-stt
  • maestro-cortex
  • TTS providers
  • optional diarization / denoise / voice-cloning tooling

These components may evolve independently as long as they satisfy the contracts above them.

Primary Message Flows

Reflex Command

audio -> turn detection -> fast STT -> voice adapter -> privileged reflex arbiter -> ArqonMCP deterministic ladder -> executor -> short acknowledgment

Examples:

  • undo that
  • open terminal
  • run cargo build
  • focus editor

Cognitive Request

audio -> turn detection -> accurate STT -> voice adapter -> ArqonMCP -> Sense -> Sentinel -> L4 Cortex if allowed -> executor -> voiced response

Examples:

  • compare these modules and refactor the parser safely
  • explain this error and propose the least risky fix

Dictation

audio -> turn detection tuned for dictation -> accurate STT -> voice adapter -> dictation route -> editor executor

Barge-In

audio during playback -> interruption detection -> immediate TTS stop -> router prioritizes cancel/override/reflex

Internal Protocol Rule

External MCP compatibility may require JSON-RPC at the edge.

Internal Arqon transport should convert as early as possible into a protobuf envelope and should stay protobuf-first afterward.

Why:

  • stronger schema contracts
  • lower serialization overhead
  • cleaner Rust/Python interoperability
  • better auditability and replay
  • clearer event evolution over time

Command Modes

The Voice OS should maintain explicit top-level modes:

  • asleep
  • listening
  • command
  • dictation
  • coding
  • conversational
  • secure
  • presentation
  • silent-acknowledgment

Mode state should affect:

  • routing
  • allowed skills
  • TTS behavior
  • confirmation requirements
  • accepted speakers and security posture
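
Mode gating can be modeled as a per-mode policy table consulted before skill dispatch. The table entries below are invented examples, not the real mode policies.

```python
# Illustrative per-mode policy table; real mode state would live in the runtime.
# "skills": None means all skills are allowed in that mode.
MODE_POLICY = {
    "secure":       {"tts": "terse",  "confirm": "typed",    "skills": {"status"}},
    "presentation": {"tts": "silent", "confirm": "spoken",   "skills": {"slides", "status"}},
    "command":      {"tts": "normal", "confirm": "implicit", "skills": None},
}

def skill_allowed(mode: str, skill: str) -> bool:
    """Check the active mode's allow-list before dispatching a skill."""
    allowed = MODE_POLICY.get(mode, MODE_POLICY["command"])["skills"]
    return allowed is None or skill in allowed
```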

Confirmation And Reversibility

Confirmation and undo should be platform primitives, not ad hoc skill behavior.

Every skill should declare:

  • reversible or not
  • rollback strategy
  • rollback window
  • compensation action
  • confirmation tier

Suggested confirmation tiers:

  • none
  • implicit
  • spoken
  • typed
  • secure / privileged

The phrase "undo that" should work as broadly as possible across the platform.
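
A sketch of a per-skill safety declaration and the escalation rule it implies. The tier names follow the list above; the escalation rule itself (irreversible plus uncompensated means at least spoken confirmation) is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Confirmation tiers in ascending strictness, per the suggested tiers above.
TIER_ORDER = ["none", "implicit", "spoken", "typed", "secure"]

@dataclass(frozen=True)
class SkillSafetyDecl:
    reversible: bool
    rollback_strategy: str              # e.g. "checkpoint", "compensate", "none"
    rollback_window_s: int
    compensation_action: Optional[str]
    confirmation_tier: str              # one of TIER_ORDER

def required_confirmation(decl: SkillSafetyDecl) -> str:
    """Escalate irreversible, uncompensated skills to at least spoken confirmation."""
    if not decl.reversible and decl.compensation_action is None:
        floor = TIER_ORDER.index("spoken")
        current = TIER_ORDER.index(decl.confirmation_tier)
        return TIER_ORDER[max(floor, current)]
    return decl.confirmation_tier
```

Because the declaration is platform-level metadata, the platform can enforce the floor uniformly instead of trusting each skill to ask.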

Skill Families And Versioning

Normal routing should not go directly to raw versioned skills.

It should go through:

IntentKey -> SkillFamily -> ActiveVersionSelector -> VersionedSkill

Each SkillFamily should carry:

  • stable skill id
  • active version pointer
  • canonical address
  • exemplar address set
  • alias address set
  • slot skeleton signatures
  • parameter schema
  • policy metadata

This allows append-only versioning, rollback, and audit without making routing brittle.
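
The resolution chain can be sketched as a family object holding an append-only version map plus an active-version pointer; rollback is just moving the pointer. All names here are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class SkillFamily:
    skill_id: str
    active_version: str
    versions: dict = field(default_factory=dict)  # append-only: version -> handler
    aliases: set = field(default_factory=set)     # alias address set

def resolve(intent_key: str, families: dict):
    """IntentKey -> SkillFamily -> ActiveVersionSelector -> VersionedSkill."""
    family = families[intent_key]
    return family.versions[family.active_version]
```

Routing only ever sees the stable intent key; version churn and rollback happen behind the pointer.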

Routing Confidence

SAS should not be forced to imitate cosine similarity.

Confidence should instead be a routing-provenance object built from evidence such as:

  • tier that matched
  • exact versus alias hit
  • parameter completeness
  • speaker and mode compatibility
  • historical execution success
  • active version status
  • policy fit
  • contract compliance

This produces a better ZERO/HYBRID/CORTEX decision than pretending every route is a similarity score.
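
A hedged sketch of turning that evidence into the ZERO/HYBRID/CORTEX decision. The tier groupings and the 0.9 success threshold are invented for illustration, not tuned values.

```python
def decide(provenance: dict) -> str:
    """Map routing-provenance evidence to a ZERO / HYBRID / CORTEX decision."""
    # Exact or address-level hits with complete parameters and policy fit
    # execute deterministically.
    if provenance["tier"] in ("L0", "L1") \
            and provenance["params_complete"] \
            and provenance["policy_fit"]:
        return "ZERO"
    # Pattern or classifier hits need supporting execution history.
    if provenance["tier"] in ("L2", "L3") \
            and provenance["historical_success"] >= 0.9:
        return "HYBRID"
    # Everything else escalates to the cognitive lane.
    return "CORTEX"
```

The decision is a function of discrete evidence, not a scalar similarity threshold, which is the point of the provenance object.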

Shell Strategy: Electron To Tauri

This should be treated as an explicit architectural migration, not a cosmetic rewrite.

Decision

The long-term shell target is Tauri.

Why

  • better fit for a Rust-heavy systems core
  • lower shell overhead
  • cleaner long-term posture for a local-first desktop Voice OS
  • less pressure to keep runtime-critical logic in Node/Electron land

Migration Constraints

  1. Preserve current working voice path while migrating.
  2. Keep bus contracts stable.
  3. Keep settings compatibility during transition.
  4. Do not fuse shell migration with a full runtime rewrite.
  5. Move hot-path logic into reusable services first, then swap shells.

Migration Shape

Phase 1:

  • treat Electron as compatibility shell
  • isolate shell-specific concerns from voice runtime services

Phase 2:

  • move audio, turn-taking, and TTS broker responsibilities behind stable local contracts

Phase 3:

  • introduce Tauri shell over the same contracts

Phase 4:

  • retire Electron once feature and operational parity are proven

Voice Output Strategy: Multi-Voice By Design

This must be a first-class architecture decision.

The OS should never assume a single canonical voice.

Core Voice Identity Model

Voice output should be resolved through three separate concepts:

  1. provider: the engine that synthesizes speech
     examples: Kokoro, fallback provider, future providers

  2. voice: a concrete voice asset offered by a provider
     examples: af_heart or another provider-specific speaker ID

  3. persona: a higher-level semantic identity mapped to an actual provider voice
     examples: default_system, architect_agent, research_agent, sentinel_agent

The OS should route by persona, not by raw provider voice IDs.

Required Voice Classes

The architecture should explicitly support:

  • one user-selected default voice for common non-agentic communication
  • a library of agent voices for named agents
  • optional context-specific voices for alerts, confirmations, and critical warnings

Examples:

  • default system voice
  • architect voice
  • operator / governance voice
  • research voice
  • memory / continuity voice
  • warning / sentinel voice

Policy

The user should control:

  • default common voice
  • voice pack enablement
  • per-agent voice overrides
  • whether certain categories use distinct voices or collapse to the default

The system should control:

  • safe fallback when a preferred voice is unavailable
  • consistency of named agent voice identity across sessions
  • enforcement of voice routing rules for critical event categories

Architectural Rule

Agent voices are not a cosmetic add-on.

They are part of:

  • cognitive legibility
  • multi-agent coordination UX
  • trust separation
  • fast recognition of who is speaking

If multiple agents operate in a shared spoken environment, each one should be audibly distinguishable.

TTS Broker Contract

Every TTS provider integrated into the Voice OS should satisfy as many of these requirements as possible:

  • local or local-sidecar execution
  • streaming chunk output
  • low time-to-first-audio
  • interruption-safe playback
  • deterministic request IDs
  • telemetry hooks
  • voice enumeration
  • stable voice identifiers
  • configurable timeout and fallback policy

Required TTS Broker Capabilities

  • speak(message, persona, priority, interruptible)
  • stop(playback_id)
  • listProviders()
  • listVoices(provider)
  • resolvePersona(persona)
  • setDefaultVoice(voice_profile_id)
  • setAgentVoice(agent_id, voice_profile_id)
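
The capability list can be sketched as a structural interface plus a trivial in-memory stand-in. Method names are snake_case renderings of the list above; NullBroker and its persona mapping are invented for illustration.

```python
from typing import Protocol

class TtsBroker(Protocol):
    """Minimal broker surface mirroring the required capabilities."""
    def speak(self, message: str, persona: str, priority: int, interruptible: bool) -> str: ...
    def stop(self, playback_id: str) -> None: ...
    def list_providers(self) -> list: ...
    def list_voices(self, provider: str) -> list: ...
    def resolve_persona(self, persona: str) -> str: ...

class NullBroker:
    """Trivial in-memory broker used only to show the contract shape."""
    def __init__(self):
        self.personas = {"default_system": "kokoro:af_heart"}
        self._next = 0
    def speak(self, message, persona, priority, interruptible):
        self._next += 1
        return f"playback-{self._next}"   # deterministic request / playback id
    def stop(self, playback_id):
        pass
    def list_providers(self):
        return ["kokoro"]
    def list_voices(self, provider):
        return ["af_heart"]
    def resolve_persona(self, persona):
        return self.personas.get(persona, "kokoro:af_heart")
```

Any real provider that satisfies this shape is swappable behind the broker without touching callers.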

Voice Profile Registry

The system should maintain a registry of voice profiles with fields such as:

  • voice_profile_id
  • provider
  • voice_id
  • persona_tags
  • style_tags
  • latency_class
  • quality_class
  • available
  • fallback_voice_profile_id

This registry is the correct place to support a large voice library without coupling the OS to one provider.
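
A sketch of the registry with fallback-chain resolution. The profile ids and the Piper voice id are invented; af_heart comes from the voice example earlier in this document.

```python
# Minimal registry sketch; field names follow the list above, values are illustrative.
REGISTRY = {
    "vp_default": {"provider": "kokoro", "voice_id": "af_heart",
                   "available": True, "fallback_voice_profile_id": "vp_piper"},
    "vp_piper":   {"provider": "piper", "voice_id": "local_voice",
                   "available": True, "fallback_voice_profile_id": None},
}

def resolve_profile(profile_id, registry=REGISTRY):
    """Follow fallback pointers until an available profile is found (cycle-safe)."""
    seen = set()
    while profile_id and profile_id not in seen:
        seen.add(profile_id)
        profile = registry[profile_id]
        if profile["available"]:
            return profile_id
        profile_id = profile["fallback_voice_profile_id"]
    return None   # no available profile anywhere in the chain
```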

Multi-Agent Voice Policy

Each named agent should be able to speak with its own assigned persona voice.

That implies:

  • voice identity is part of agent registration
  • the routing layer can request a response from a specific agent persona
  • the TTS broker resolves persona to a live provider voice
  • fallback preserves distinction where possible

Example:

user_default -> calm neutral voice
architect_agent -> analytical voice
research_agent -> exploratory voice
sentinel_agent -> terse warning voice
memory_agent -> reflective continuity voice

The user may still choose to collapse all agent voices to the default voice, but that should be a preference, not the system default.

Voice Packs

The architecture should support installable or declarative voice packs.

A voice pack should be able to provide:

  • provider mappings
  • recommended persona assignments
  • quality / latency metadata
  • fallback chains
  • preview samples

This allows the Voice OS to grow a large voice library without hardcoding all voices into the core runtime.

Hard Architectural Requirements

The target system should preserve the following:

  1. Reflex/cognitive split
  2. Fail-closed governance for high-impact actions
  3. Swappable shell
  4. Swappable STT
  5. Swappable TTS
  6. Multi-agent voice identity
  7. Replayability and evidence trails
  8. Strong rollback paths
  9. Local-first hot path
  10. Deterministic routing before agentic escalation

Near-Term Evolution Priorities

  1. Promote this architecture into an explicit roadmap and implementation plan.
  2. Define stable local contracts for audio, turn detection, routing, and TTS broker behavior.
  3. Design the Tauri migration around those contracts instead of around UI rewrites.
  4. Introduce persona-based voice routing and a voice profile registry.
  5. Expand the TTS abstraction from provider switching into full voice-library management.
  6. Harden barge-in and turn-taking as first-class runtime behavior.

Open Design Questions

These questions should be refined in follow-up iterations:

  • what exact service boundary should exist between shell and hot-path runtime?
  • should the TTS broker run in-process with the shell or as its own local service?
  • how should voice packs be packaged, versioned, and signed?
  • how much of the voice profile registry belongs in user config versus system config?
  • when multiple agents speak in sequence, what arbitration and interruption rules apply?
  • what latency budget is acceptable for agent-voice responses versus reflex acknowledgments?
  • how should secure or privileged modes change accepted speakers, voices, and approval flows?

Summary

Arqon Maestro should become a Voice OS substrate with:

  • a thin desktop shell, ultimately Tauri-based
  • a local hot path for audio, turn-taking, routing, and speech output
  • deterministic reflex execution
  • governed cognitive execution
  • a provider-agnostic voice layer
  • a large persona-driven voice library for agents

The system should not be optimized around one shell or one model.

It should be optimized around durable contracts, fast local operation, constitutive reliability, and unmistakable multi-agent voice identity.