The assistant runtime
Build a voice assistant on smart glasses with glasses.assistant — wake/sleep, tools, vision, barge-in, memory, on the managed AI gateway. Phase-4 preview.
The assistant runtime (glasses.assistant.*) is the canonical way to build a voice AI on Extentos. It wraps an end-to-end speech-to-speech provider — OpenAI Realtime by default — so the model itself owns wake detection, turn-taking, intent parsing, and confirmation speech. Your code shrinks to one block: declare instructions, register tool(name, description) { ... } bodies that act on your app's own state, and wire a wake trigger. The model decides which tool to call from each tool's natural-language description; there is no keyword routing, no when (transcript) ladder, no spec file.
Preview snapshot, not on Maven Central yet. The assistant runtime ships in the 1.4.0-phase4-dogfood preview snapshot, resolved via mavenLocal(); it is not on Maven Central (the published Android SDK is 1.3.0, which does not include glasses.assistant.*). Build and publish the snapshot locally to dogfood it; see SDK install. iOS is pending: the Swift port of the wake/sleep state machine is in flight and not at parity. This page documents the Android surface.
The shape of an assistant
import com.extentos.glasses.core.assistant.AssistantProvider
import com.extentos.glasses.core.assistant.AssistantSession
import com.extentos.glasses.core.assistant.ToolResult
import com.extentos.glasses.core.assistant.tool
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.launch

class RunCompanion(private val glasses: ExtentosGlasses, private val scope: CoroutineScope) {
    private var session: AssistantSession? = null
    fun start() = scope.launch {
        session = glasses.assistant.start(
            // model + voice come from your dashboard Agent settings; pass them
            // here only to hard-pin in code (a code value wins over the dashboard).
            // Defaults if neither is set: gpt-realtime-2 + alloy.
            provider = AssistantProvider.OpenAi(),
        ) {
            instructions = "You are a running companion. Speak briefly — they're running."
            // routeTracker is your app's own object, not part of the SDK.
            tool("get_pace", "The runner's current average pace in minutes per km.") {
                ToolResult.Ok("${routeTracker.avgPaceMinKm()} min per km")
            }
        }
        // Any trigger that calls session.wake() works — here, a wake phrase.
        glasses.voice.onPhrase("hey coach") { session?.wake() }
    }
}
Two registration forms exist and produce identical behavior:
- Sugar: glasses.assistant.start(provider) { ... }. The trailing lambda is an AssistantConfigBuilder; idiomatic for the common case. It creates a session and starts it for you.
- Raw: glasses.assistant.createSession(AssistantConfig(...)) then session.start(). For programmatic construction (tools loaded from config, conditional registration). The sugar is implemented over the raw form, so you can always skip the builder.
tool(...) is a plain Kotlin builder extension, not a runtime-interpreted tree: it appends a ToolDefinition to the config. You could replace every tool(...) { ... } line with tools += ToolDefinition(...) and get the same result.
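To make the "plain builder extension" point concrete, here is a minimal sketch of the pattern in isolation. Every type in it (ToolResult, ToolDefinition, ConfigBuilder, buildConfig) is a hypothetical stand-in written for this example, not the SDK's actual declaration, and the body is a regular lambda where the real one is a suspend lambda.

```kotlin
// Hypothetical stand-ins for illustration; the SDK's real types may differ.
sealed interface ToolResult {
    data class Ok(val output: String) : ToolResult
    data class Err(val message: String) : ToolResult
}

data class ToolDefinition(
    val name: String,
    val description: String,
    val body: () -> ToolResult, // the real body is a suspend lambda
)

class ConfigBuilder {
    var instructions: String = ""
    val tools = mutableListOf<ToolDefinition>()
}

// The sugar: an extension that does nothing except append a definition.
fun ConfigBuilder.tool(name: String, description: String, body: () -> ToolResult) {
    tools += ToolDefinition(name, description, body)
}

// Runs the builder lambda against a fresh config and returns it.
fun buildConfig(block: ConfigBuilder.() -> Unit): ConfigBuilder = ConfigBuilder().apply(block)
```

Calling tool("get_pace", "...") { ... } inside buildConfig { ... } and writing tools += ToolDefinition("get_pace", "...") { ... } by hand leave the same entry in the list, which is why the two registration forms are interchangeable.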
Tools
A tool is a name, a description the model reads verbatim, and a suspend body that returns a ToolResult. The body is ordinary app code running on Dispatchers.IO — it sees glasses.camera, glasses.audio, your repositories, your DB, third-party SDKs. There is no sandbox; the security boundary is the app/OS level (Android permissions), not the SDK.
tool("take_photo", "Take a photo when the user asks to capture or remember a moment.") {
    val photo = glasses.camera.capturePhoto().valueOrNull()
        ?: return@tool ToolResult.Err("camera failed")
    library.add(photo)
    ToolResult.Ok("photo saved") // short factual strings — the model reads this aloud
}
Three overloads:
| Overload | Signature | Use |
|---|---|---|
| No-arg | tool(name, description) { -> ToolResult } | Most tools — camera, status, simple actions |
| Typed-args | tool<Args>(name, description) { args -> ToolResult } | Args is @Serializable; the JSON Schema is inferred from its descriptor |
| Explicit-schema | tool(name, description, schema) { args -> ToolResult } | Polymorphic types or format constraints (date-time, minLength) |
The typed-args overload needs the org.jetbrains.kotlin.plugin.serialization plugin applied in your app's build.gradle.kts (the serialization runtime ships transitively, but the @Serializable compiler plugin cannot).
ToolResult.Ok(output) feeds a short string back to the model to read or weave into its reply; ToolResult.Err(message) surfaces a failure the model explains ("sorry, the camera failed"). For structured data, emit JSON as a string: ToolResult.Ok("""{"distance_km": 12.4}""").
By default the model speaks a "let me check…" filler while a tool runs. Set blocking = true on a tool that returns in well under 100 ms (a no-arg "what time is it") so the model waits silently and the filler doesn't feel awkward.
Write descriptions for the model. Be specific about when to call ("Take a photo when the user asks to capture a moment"), not what it does internally ("Captures imagery via the camera SDK"). The same description also drives the Mock provider's deterministic test matcher.
Wake and sleep
A session is created Idle. start() moves it to Dormant — set up, but no provider connection open and $0 token spend. You wire any trigger to session.wake(), which opens the WebSocket and transitions to Active. The model runs the conversation; sleep() returns it to Dormant; stop() is terminal.
Idle ──start()──▶ Dormant ──wake()──▶ Activating ──▶ Active
▲ │
└──── Sleeping ◀── sleep() / sleepAfterSilence / sleepOnPhrase / end_conversation
│
any non-Stopped state ──stop()──▶ Stopped (terminal)
Active ⇄ Reconnecting is a transparent, library-owned reconnect (OpenAI Realtime caps sessions at ~60 min, and the SDK also reconnects proactively for stability). The conversation history replays on the new connection and the customer never leaves the Active surface; an AssistantEvent.Reconnected fires only for observability. Set startActive = true to skip Dormant and open the connection immediately at start().
The Dormant/Active split is deliberate: it lets you pick any wake mechanism without the library prescribing one.
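The lifecycle above can be read as a small transition table. The sketch below models it in plain Kotlin so the legal moves are easy to check; the State names follow the diagram, while the Signal enum and the next function are hypothetical names invented for this example and are not SDK API.

```kotlin
enum class State { Idle, Dormant, Activating, Active, Reconnecting, Sleeping, Stopped }

// Hypothetical inputs: API calls (Start, Wake, Sleep, Stop) and connection outcomes.
enum class Signal { Start, Wake, Connected, ConnectionLost, Sleep, Slept, Stop }

fun next(state: State, signal: Signal): State = when {
    state == State.Stopped -> State.Stopped          // terminal: nothing leaves Stopped
    signal == Signal.Stop -> State.Stopped           // stop() applies from any other state
    state == State.Idle && signal == Signal.Start -> State.Dormant
    state == State.Dormant && signal == Signal.Wake -> State.Activating
    state == State.Activating && signal == Signal.Connected -> State.Active
    state == State.Active && signal == Signal.ConnectionLost -> State.Reconnecting
    state == State.Reconnecting && signal == Signal.Connected -> State.Active
    state == State.Active && signal == Signal.Sleep -> State.Sleeping
    state == State.Sleeping && signal == Signal.Slept -> State.Dormant
    else -> state                                    // ignore signals that do not apply here
}
```

In this model a wake() on an Active session or a sleep() on a Dormant one is a no-op; whether the SDK ignores or rejects such calls is not stated on this page.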
Wiring the wake
The canonical wake is a voice phrase, reusing the existing voice-trigger system:
glasses.voice.onPhrase("hey coach") { session?.wake() }
onPhrase defaults to VoiceScope.WhenDormant, so the phrase won't double-fire during an active conversation. Swap that line for a button onClick, a gesture handler, or an MCP call: anything that calls session.wake(). See voice triggers for how onPhrase matches transcripts; this page won't re-explain it.
Sleeping
| Mechanism | How |
|---|---|
| Model-driven (default) | endOnIntent = true registers a hidden end_conversation tool the model calls when it hears the user wrap up ("bye", "thanks, I'm good") — in any language, no phrase list to maintain. Its body calls session.sleep(). |
| Deterministic phrase | sleepOnPhrase("that's all") — case-insensitive substring on final transcripts, wired through the same voice-command system. |
| Silence timeout | sleepAfterSilence(30.seconds) — auto-sleep after contiguous user silence. Assistant speech pauses (never truncates) the timer. |
| Explicit | Call session.sleep() from your own code. |
Set endOnIntent = false for strict, deterministic-only sleep.
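The two deterministic mechanisms in the table are simple enough to model directly. This is a sketch of one plausible reading, not the SDK's implementation: matchesSleepPhrase and SilenceTimer are hypothetical names, and "pauses the timer" is interpreted here as "time spent while the assistant is speaking is not counted as user silence".

```kotlin
// Case-insensitive substring match, applied only to final transcripts.
fun matchesSleepPhrase(transcript: String, phrase: String, isFinal: Boolean): Boolean =
    isFinal && transcript.contains(phrase, ignoreCase = true)

// Counts contiguous user silence; assistant speech pauses the count without resetting it.
class SilenceTimer(private val timeoutMs: Long) {
    private var silentMs = 0L

    // Any user speech breaks the contiguous run of silence.
    fun onUserSpeech() {
        silentMs = 0L
    }

    /** Advances the clock by [elapsedMs] and returns true once the timeout is reached. */
    fun tick(elapsedMs: Long, assistantSpeaking: Boolean): Boolean {
        if (!assistantSpeaking) silentMs += elapsedMs
        return silentMs >= timeoutMs
    }
}
```

Under this reading, a 30-second timeout with 20 seconds of silence, a long assistant reply, and then 10 more seconds of silence fires right after the reply's trailing silence, never in the middle of the reply.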
Greeting
On every wake the SDK automatically speaks a greeting, generated out-of-band from the user's memory context — so it greets fresh and can never accidentally continue the prior (ended) conversation.
greeting = Greeting.Custom("Greet the runner warmly in one short sentence and ask how you can help.")
Greeting.Default uses the SDK's built-in directive (memory-aware); Greeting.Custom(directive) supplies your own; Greeting.Off opts out (greet manually in onWake { say(...) }, or not at all). This replaces hand-wiring onWake { greet(...) }.
Vision
session.includeImage(uri, prompt?) adds an image to the conversation and auto-triggers a response — the model speaks about it in its configured voice. The image stays in context for follow-up turns. Typical use is from a tool body:
tool("describe_scene", "Describe what the user is looking at. Call for 'what do you see' / 'describe this'.") {
    val photo = glasses.camera.capturePhoto().valueOrNull()
        ?: return@tool ToolResult.Err("camera failed")
    val uri = photo.uri ?: return@tool ToolResult.Err("photo had no uri")
    glasses.assistant.activeSession?.includeImage(uri) // non-null while a tool dispatches
    ToolResult.Ok("looking")
}
uri accepts a data: URI (what capturePhoto().uri returns under the sim transport), an http(s) URL (the provider fetches it), or a file:// / content:// / absolute path (the library base64-encodes it). To reach the session from a tool body, use glasses.assistant.activeSession or a field your handler captured from start(...).
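If you want to validate a uri before handing it to includeImage, the three accepted families can be told apart by prefix. The helper below is a hypothetical app-side sketch (ImageSource and classifyImageUri are not SDK names), and it assumes an absolute path is one that starts with "/".

```kotlin
enum class ImageSource { InlineData, RemoteUrl, LocalFile }

// Returns the family a uri belongs to, or null if it matches none of the accepted forms.
fun classifyImageUri(uri: String): ImageSource? = when {
    uri.startsWith("data:") -> ImageSource.InlineData
    uri.startsWith("http://") || uri.startsWith("https://") -> ImageSource.RemoteUrl
    uri.startsWith("file://") || uri.startsWith("content://") || uri.startsWith("/") ->
        ImageSource.LocalFile
    else -> null
}
```

A tool body could return ToolResult.Err for a null result instead of passing an unsupported string through to the session.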
Speaking and barge-in
- session.say(text) speaks fixed text in the provider's voice. Use this instead of glasses.audio.speak(...) for assistant speech, so it matches the model's own voice rather than the jarringly different platform TTS engine.
- session.greet(prompt?) speaks a model-generated, memory-personalized greeting (the manual primitive behind the automatic greeting above).
- session.cancelSpeak() cancels the model's in-flight utterance. It is the app/tool-driven barge-in primitive (the model also handles user-voice barge-in automatically via VAD). Use it when a tool result is ready and you want to interrupt the filler to deliver it.
Mid-session setters exist for setReasoningEffort, setVoice, setModel, and updateInstructions. Voice and model bind to the connection, so setVoice/setModel take effect on the next wake, not mid-conversation; updateInstructions and setReasoningEffort apply to the next response immediately.
Provider and configuration
AssistantProvider is a sealed type. v1 ships:
- AssistantProvider.OpenAi(model = null, voice = null, turnDetection = ServerVad(), reasoningEffort = Low): production. Leave model/voice null (the canonical OpenAi() form) to take the values configured for this project in the dashboard, falling back to the SDK defaults gpt-realtime-2 / alloy. A value passed in code hard-pins it and wins over the dashboard. reasoningEffort defaults to Low (OpenAI's own recommendation for voice agents; higher settings add noticeable latency).
- AssistantProvider.Mock(...): deterministic, in-process, $0. Substring-matches injected utterances against tool descriptions. Powers unit tests and the MCP injectAssistantUtterance path.
There is no Claude or Cascaded provider (Anthropic ships no Realtime API as of v1); Gemini Live is future, not present. Precedence for every model-side knob is code > dashboard > SDK default — leave a field null to defer to the dashboard.
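The precedence rule is a null-coalescing chain. The following sketch shows it for model and voice only; ModelSettings, Resolved, and resolve are hypothetical names for this example, and the default values are the ones quoted above.

```kotlin
const val SDK_DEFAULT_MODEL = "gpt-realtime-2"
const val SDK_DEFAULT_VOICE = "alloy"

// A null field means "not set at this layer".
data class ModelSettings(val model: String? = null, val voice: String? = null)

data class Resolved(val model: String, val voice: String)

// code > dashboard > SDK default, decided independently for each field.
fun resolve(code: ModelSettings, dashboard: ModelSettings): Resolved = Resolved(
    model = code.model ?: dashboard.model ?: SDK_DEFAULT_MODEL,
    voice = code.voice ?: dashboard.voice ?: SDK_DEFAULT_VOICE,
)
```

Because each field resolves on its own, pinning only the voice in code still lets the dashboard choose the model.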
Key AssistantConfig / builder fields:
| Field | Default | Meaning |
|---|---|---|
| instructions | "" | The full system prompt. You own all of it; the library adds nothing |
| startActive | false | true opens the connection at start() (skips Dormant) |
| onWake {} / onSleep {} | — | Hooks run as coroutines with the session as receiver (onWake { say("…") }) |
| sleepAfterSilence(Duration) | off | Auto-sleep after contiguous user silence |
| sleepOnPhrase(phrase) | — | Deterministic sleep phrase |
| endOnIntent | true | Register the hidden model-driven end_conversation tool |
| greeting | Greeting.Default | Auto-greeting policy (see Greeting) |
| historyCap | 100 | Local replay-buffer cap, in turns |
| historyCompaction | Auto | What happens as the buffer fills (see Memory) |
The managed gateway is the default
The assistant carries no API key in your app. The SDK opens a WebSocket to Extentos's managed gateway, which relays the realtime session to OpenAI on Extentos's key and meters usage. There is no setOpenaiApiKey — it was removed when the assistant moved to gateway-only. To run on your own OpenAI account, upload your key in the dashboard Credentials section and the gateway swaps it in server-side (BYOK); your handler code doesn't change.
All gateway, BYOK, identity/attestation, metering, and the planned credit billing live in one place — see the managed AI gateway. This page won't re-derive them.
PII note: Phase 4 events carry verbatim transcripts in the dev event log (user_spoke / assistant_spoke) — fine for development, but the transcript is yours to govern in production; document retention in your app's privacy policy. (The gateway meters usage but does not persist conversation content.)
Memory
The assistant has two independent memory layers, both configured on AssistantConfig — neither is a separate capability.
Within-session history and compaction
The SDK keeps a local replay buffer of recent turns (capped at historyCap, default 100) and replays it to the provider on every reconnect — so a conversation survives wake/sleep and the transparent reconnects. The buffer lives in memory; it is not persisted to disk (persist to your own storage if you need it across launches, then restore with replaceHistory).
historyCompaction controls what happens as the buffer fills (it fires in the background near ~80% of the cap):
| Policy | Behavior |
|---|---|
| Auto (default) | Summarizes the oldest ~50% of turns via a cheap chat model (compactionModel, default gpt-4o-mini) into one summary turn, so the conversation continues indefinitely without silent forgetting. ~$0.001 per compaction. |
| DropOldest | Drop the oldest turn when full. Free, lossy. |
| Custom(compact) | Your own suspend (List<Turn>) -> List<Turn> compactor: bring your own summarizer, model, or vector-DB recall. |
| None | No compaction; the buffer holds at historyCap and you manage it manually via clearHistory / appendHistory / replaceHistory. |
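To show what an Auto-style policy does to the buffer, here is a simplified, synchronous model: once the buffer reaches about 80% of the cap, the oldest half is collapsed into one summary turn. Turn, Compactor, summarizeOldestHalf, and HistoryBuffer are hypothetical stand-ins; the real compactor is a suspend function that runs in the background, and the real summary comes from a chat model rather than a local lambda.

```kotlin
data class Turn(val role: String, val text: String)

// Same shape as a Custom compactor, minus the suspend modifier.
typealias Compactor = (List<Turn>) -> List<Turn>

// Replaces the oldest half of the turns with a single summary turn.
fun summarizeOldestHalf(summarize: (List<Turn>) -> String): Compactor = { turns ->
    val cut = turns.size / 2
    if (cut == 0) turns
    else listOf(Turn("summary", summarize(turns.take(cut)))) + turns.drop(cut)
}

class HistoryBuffer(private val cap: Int, private val compactor: Compactor) {
    var turns: List<Turn> = emptyList()
        private set

    fun append(turn: Turn) {
        turns = turns + turn
        // Compact once the buffer reaches roughly 80% of the cap.
        if (turns.size * 10 >= cap * 8) turns = compactor(turns)
    }
}
```

With cap = 10, the eighth append triggers compaction and leaves five turns: one summary of the first four, followed by the four most recent turns verbatim.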
Session history methods — conversationHistory(limit), clearHistory(), appendHistory(turn), replaceHistory(turns) — let you snapshot, wipe (e.g. onWake { session.clearHistory() } for a fresh-each-wake notetaker), inject app-context hints, or restore persisted turns.
Cross-session persistent memory (v0 preview, Android-only)
persistentMemory = true loads this end-user's stored profile at session start and merges durable signal back at session end — so the agent remembers the user across sessions (the automatic greeting personalizes from it).
- Opt-in and consent-gated. It stores a person's data, so it requires the end-user's consent — which you can only obtain in your app. It is therefore a code-side switch, never a dashboard toggle.
- Keyed per-device by default. memoryUserId = null keys the profile on the SDK's attested per-device id (memory follows the device). Set it to your app's stable id for the signed-in user to make memory follow the person across devices and reinstalls (isolated per user on a shared device). The profile is always scoped to your project by the attestation JWT, so one app can never reach another's memory.
- Managed-gateway only unless you supply a MemoryStore. By default the profile lives on the Extentos backend behind the gateway; BYOK has no Extentos store. Provide a MemoryStore to keep profiles entirely in your own infrastructure; that also enables persistent memory under BYOK.
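The keying rule in the list above can be summarized as "project first, then user id if set, else device id". The sketch below is a hypothetical model of that rule for a custom store; MemoryKey and memoryKey are names invented here, and how the SDK actually derives or encodes its keys is not documented on this page.

```kotlin
// A profile is always scoped to a project, then to a user or a device.
data class MemoryKey(val projectId: String, val subjectId: String)

// memoryUserId wins when set; otherwise memory follows the device.
fun memoryKey(projectId: String, deviceId: String, memoryUserId: String?): MemoryKey =
    MemoryKey(projectId, memoryUserId ?: deviceId)
```

Two users signed in on the same device get different keys once memoryUserId is set, and the same user on two devices gets the same key.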
Errors and events
Open-time failures throw AssistantException wrapping an AssistantError (NoApiKey, AlreadyActive, SessionEnded, NotReady, NetworkError, ProviderError) — pattern-match on .error. Starting a second session while one is active throws AlreadyActive (the runtime is singleton-active per ExtentosGlasses). Once a session is Active, transient errors surface as events and the session rides them out through the reconnection state machine. Full table: error reference.
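Pattern-matching on .error might look like the sketch below. The declarations are local stand-ins that mirror the names above so the example compiles on its own; the SDK's AssistantError may well be a sealed type with payload fields rather than an enum, and the user-facing strings are purely illustrative.

```kotlin
// Hypothetical stand-ins mirroring the names above; the SDK's real declarations may differ.
enum class AssistantError { NoApiKey, AlreadyActive, SessionEnded, NotReady, NetworkError, ProviderError }

class AssistantException(val error: AssistantError) : Exception("assistant error: $error")

// Maps an open-time failure to a message the app can show or speak.
fun userMessage(e: AssistantException): String = when (e.error) {
    AssistantError.AlreadyActive -> "An assistant session is already running."
    AssistantError.NetworkError -> "Check your connection and try again."
    AssistantError.SessionEnded -> "That session has ended; start a new one."
    AssistantError.NoApiKey,
    AssistantError.NotReady,
    AssistantError.ProviderError -> "The assistant is unavailable right now."
}
```

Keeping the when exhaustive, with no else branch, means the compiler flags any variant added later that the app does not yet handle.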
Lifecycle flows through the shared glasses.runtime.events stream as RuntimeEvent.Assistant wrapping an AssistantEvent (SessionStarted, SessionEnded, UserSpoke, AssistantSpoke, ToolCalled, ToolResultEvent, Reconnected, Error, WentDormant). In the simulator these land on the ai event-log chip (an Error climbs to the errors chip automatically). Capture transcripts off this stream:
glasses.runtime.events
    .filterIsInstance<RuntimeEvent.Assistant>()
    .onEach { (it.event as? AssistantEvent.UserSpoke)?.let { spoke -> notes.append(spoke.transcript) } }
    .launchIn(scope)
glasses.conversation.* (the Phase 3 cascaded runtime) is removed on current Android; use glasses.assistant.*.
Related
- Build a voice assistant: the task guide (wake phrase → tools acting on app state → vision → sleep)
- The managed AI gateway: gateway default, BYOK, metering, and the planned credit billing
- Voice triggers: glasses.voice.onPhrase, the canonical wake mechanism
- The display capability: assistant tools can render on the Ray-Ban Display via glasses.display.*
- Capabilities: the full vendor-agnostic SDK vocabulary
- Error reference: AssistantError and the no-DisplayError model