The assistant runtime

Build a voice assistant on smart glasses with glasses.assistant — wake/sleep, tools, vision, barge-in, memory, on the managed AI gateway. Kotlin and Swift.

Camera and display need the Meta vendor module. com.extentos:glasses carries no vendor SDK, so add implementation("com.extentos:glasses-meta") alongside it — see install. Without it the build still succeeds and voice still works, but capabilities.camera is false and captures return errors. The SDK logs a warning at startup when it spots that combination.

The assistant runtime (glasses.assistant.*) is the canonical way to build a voice AI on Extentos. It wraps an end-to-end speech-to-speech provider — OpenAI Realtime by default — so the model itself owns wake detection, turn-taking, intent parsing, and confirmation speech. Your code shrinks to one block: declare instructions, register tool(name, description) { ... } bodies that act on your app's own state, and wire a wake trigger. The model decides which tool to call from each tool's natural-language description; there is no keyword routing, no when (transcript) ladder, no spec file.

Ships on both platforms, on the managed gateway. On Android, the assistant runtime has shipped in com.extentos:glasses since 1.4.0; install the current release (Android install). On iOS, it ships in the SDK's GlassesCore product from github.com/extentos/swift-glasses (iOS install). The code on this page is Kotlin, with Swift counterparts folded in under the main sections. The fields and behaviour are the same on both platforms, but several spellings differ; the full list is under Swift call-shape deltas.

The shape of an assistant

import com.extentos.glasses.core.assistant.AssistantProvider
import com.extentos.glasses.core.assistant.AssistantSession
import com.extentos.glasses.core.assistant.ToolResult
import com.extentos.glasses.core.assistant.tool
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.launch

class RunCompanion(private val glasses: ExtentosGlasses, private val scope: CoroutineScope) {
    private var session: AssistantSession? = null

    fun start() = scope.launch {
        session = glasses.assistant.start(
            // model + voice come from your dashboard Agent settings; pass them
            // here only to hard-pin in code (a code value wins over the dashboard).
            // Defaults if neither is set: gpt-realtime-2 + alloy.
            provider = AssistantProvider.Managed(),
        ) {
            instructions = "You are a running companion. Speak briefly — they're running."

            // routeTracker is your app's own object; a tool body is ordinary app code.
            tool("get_pace", "The runner's current average pace in minutes per km.") {
                ToolResult.Ok("${routeTracker.avgPaceMinKm()} min per km")
            }
        }
        // Any trigger that calls session.wake() works — here, a wake phrase.
        glasses.voice.onPhrase("hey coach") { session?.wake() }
    }
}

The same thing in Swift

import GlassesCore

final class RunCompanion {
    private let glasses: any ExtentosGlasses
    private var session: (any AssistantSession)?

    init(glasses: any ExtentosGlasses) { self.glasses = glasses }

    func start() async throws {
        session = try await glasses.assistant.start(provider: .managed()) {
            $0.instructions = "You are a running companion. Speak briefly — they're running."

            $0.tool("get_pace", description: "The runner's current average pace in minutes per km.") {
                .ok("\(routeTracker.avgPaceMinKm()) min per km")
            }
        }
        _ = glasses.voice.onPhrase(phrase: "hey coach", label: "Wake", stops: []) { [weak self] in
            try? await self?.session?.wake()
        }
    }
}

The config block takes the builder as a parameter, so every field is $0.something rather than a bare assignment. start(provider:_:) is async throws — an open-time failure raises AssistantError directly, with no exception wrapper to unwrap.

Two registration forms exist and produce identical behavior:

Sugar — glasses.assistant.start(provider) { ... }. The trailing lambda is an AssistantConfigBuilder; idiomatic for the common case. It creates a session and starts it for you.
Raw — glasses.assistant.createSession(AssistantConfig(...)) then session.start(). For programmatic construction (tools loaded from config, conditional registration). The sugar is implemented over the raw form — you can always skip the builder.

tool(...) is a plain Kotlin builder extension, not a runtime-interpreted tree: it appends a ToolDefinition to the config. You could replace every tool(...) { ... } line with tools += ToolDefinition(...) and get the same result.
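To make that equivalence concrete, here is a minimal, self-contained sketch of the pattern. The types below are simplified stand-ins written for this example (hypothetical shapes, not the SDK's actual ToolDefinition or AssistantConfigBuilder declarations); the only point is that a builder method like tool(...) can be ordinary code that appends to a list.

```kotlin
// Stand-in types for illustration only; the real SDK classes differ.
data class ToolDefinition(val name: String, val description: String, val body: () -> String)

class ConfigBuilder {
    var instructions: String = ""
    val tools = mutableListOf<ToolDefinition>()

    // The "sugar": nothing is interpreted at runtime, the call just appends.
    fun tool(name: String, description: String, body: () -> String) {
        tools += ToolDefinition(name, description, body)
    }
}

// Sugar form: configure through a receiver lambda.
fun buildWithSugar(): ConfigBuilder = ConfigBuilder().apply {
    instructions = "Speak briefly."
    tool("get_pace", "The runner's current average pace.") { "5:30 min per km" }
}

// Raw form: append the definition directly.
fun buildRaw(): ConfigBuilder = ConfigBuilder().apply {
    instructions = "Speak briefly."
    tools += ToolDefinition("get_pace", "The runner's current average pace.") { "5:30 min per km" }
}
```

Both functions end up with the same single entry in tools, which is why the raw form is convenient when the tool list is computed rather than written out by hand.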

Tools

A tool is a name, a description the model reads verbatim, and a suspend body that returns a ToolResult. The body is ordinary app code running on Dispatchers.IO — it sees glasses.camera, glasses.audio, your repositories, your DB, third-party SDKs. There is no sandbox; the security boundary is the app/OS level (Android permissions), not the SDK.

tool("take_photo", "Take a photo when the user asks to capture or remember a moment.") {
    val photo = glasses.camera.capturePhoto().valueOrNull()
        ?: return@tool ToolResult.Err("camera failed")
    library.add(photo)
    ToolResult.Ok("photo saved")          // short factual strings — the model reads this aloud
}

The same tool in Swift

$0.tool("take_photo", description: "Take a photo when the user asks to capture or remember a moment.") {
    let result = await glasses.camera.capturePhoto()
    guard case .success(let photo) = result else { return .err("camera failed") }
    library.add(photo)
    return .ok("photo saved")          // short factual strings — the model reads this aloud
}

Three spellings differ from Kotlin: the description is a labelled argument (description:), results are .ok / .err, and the typed-args overload takes an explicit schema: — Swift can't infer a JSON Schema from a type the way Kotlin does from @Serializable.

Three overloads:

Overload	Signature	Use
No-arg	`tool(name, description) { -> ToolResult }`	Most tools — camera, status, simple actions
Typed-args	`tool<Args>(name, description) { args -> ToolResult }`	`Args` is `@Serializable`; the JSON Schema is inferred from its descriptor
Explicit-schema	`tool(name, description, schema) { args -> ToolResult }`	Polymorphic types or format constraints (`date-time`, `minLength`)

The typed-args overload needs the org.jetbrains.kotlin.plugin.serialization plugin applied in your app's build.gradle.kts (the serialization runtime ships transitively, but the @Serializable compiler plugin cannot).
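In a Gradle Kotlin DSL build, applying that plugin typically looks like the fragment below. The version string is a placeholder: it should match the Kotlin version your project already uses, and it can be dropped entirely if the plugin version is declared in your root build or version catalog.

```kotlin
// app/build.gradle.kts (fragment)
plugins {
    // ... your existing Android and Kotlin plugins stay as they are ...
    id("org.jetbrains.kotlin.plugin.serialization") version "<your-kotlin-version>"
}
```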

ToolResult.Ok(output) feeds a short string back to the model to read or weave into its reply; ToolResult.Err(message) surfaces a failure the model explains ("sorry, the camera failed"). For structured data, emit JSON as a string: ToolResult.Ok("""{"distance_km": 12.4}""").
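As a small illustration of the two result shapes, the sketch below uses a stand-in ToolResult (a reduced copy for this example, not the SDK's declaration) and a hypothetical distanceResult helper that returns structured data as a JSON string or a short failure message.

```kotlin
// Stand-in for the SDK's ToolResult, reduced to the two cases described above.
sealed class ToolResult {
    data class Ok(val output: String) : ToolResult()
    data class Err(val message: String) : ToolResult()
}

// Hypothetical tool body: JSON for the success case, a short message for failure.
fun distanceResult(distanceKm: Double?): ToolResult =
    if (distanceKm == null) ToolResult.Err("no route recorded yet")
    else ToolResult.Ok("""{"distance_km": $distanceKm}""")
```

Hand-built JSON like this is fine for numbers; for strings that may contain quotes, a JSON library is the safer choice.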

Built-in tool. The SDK registers one tool of its own by default, get_device_info, so the model can check the glasses' current capabilities (camera, microphone, display availability, device model) on demand. The alternative, injecting them into every turn, leads the model to mention them unprompted mid-conversation. Turn the tool off with includeDeviceInfoTool = false (for example, if you register your own), or use deviceInfoNote to rewrite the one-line system-prompt hint that points the model at it. Both live on AssistantConfig, not on the start(provider) { … } builder: to change either, build the config directly with assistant.createSession(AssistantConfig(provider = …, includeDeviceInfoTool = false)) and then call session.start(). (end_conversation, covered under Sleeping below, is the other built-in.)

By default the model speaks a "let me check…" filler while a tool runs. Set blocking = true on a tool that returns in well under 100 ms (a no-arg "what time is it") so the model waits silently and the filler doesn't feel awkward.

Write descriptions for the model. Be specific about when to call ("Take a photo when the user asks to capture a moment"), not what it does internally ("Captures imagery via the camera SDK"). The same description also drives the Mock provider's deterministic test matcher.

Wake and sleep

A session is created Idle. start() moves it to Dormant — set up, but no provider connection open and $0 token spend. You wire any trigger to session.wake(), which opens the WebSocket and transitions to Active. The model runs the conversation; sleep() returns it to Dormant; stop() is terminal.

Idle ──start()──▶ Dormant ──wake()──▶ Activating ──▶ Active
                    ▲                                  │
                    └──── Sleeping ◀── sleep() / sleepAfterSilence / sleepOnPhrase / end_conversation
                                                       │
   any non-Stopped state ──stop()──▶ Stopped (terminal)

Active ⇄ Reconnecting is a transparent, library-owned reconnect (OpenAI Realtime caps sessions at ~60 min, and the SDK also reconnects proactively for stability). The conversation history replays on the new connection and the customer never leaves the Active surface; an AssistantEvent.Reconnected fires only for observability. Set startActive = true to skip Dormant and open the connection immediately at start().

The Dormant/Active split is deliberate: it lets you pick any wake mechanism without the library prescribing one.
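The lifecycle above can be read as a small transition table. The sketch below models it with plain enums; the state names follow the diagram, while the Signal names (Connected, ConnectionLost, SleepComplete) are hypothetical labels invented for this example, and startActive is not modelled. It illustrates the diagram and is not the SDK's implementation.

```kotlin
enum class State { Idle, Dormant, Activating, Active, Reconnecting, Sleeping, Stopped }
enum class Signal { Start, Wake, Connected, ConnectionLost, Sleep, SleepComplete, Stop }

// Returns the next state, or null when the signal is not valid in the current state.
fun next(state: State, signal: Signal): State? = when {
    state == State.Stopped -> null                        // terminal
    signal == Signal.Stop -> State.Stopped                // allowed from any non-Stopped state
    state == State.Idle && signal == Signal.Start -> State.Dormant
    state == State.Dormant && signal == Signal.Wake -> State.Activating
    state == State.Activating && signal == Signal.Connected -> State.Active
    state == State.Active && signal == Signal.ConnectionLost -> State.Reconnecting
    state == State.Reconnecting && signal == Signal.Connected -> State.Active
    state == State.Active && signal == Signal.Sleep -> State.Sleeping
    state == State.Sleeping && signal == Signal.SleepComplete -> State.Dormant
    else -> null
}
```

Note that Wake is only meaningful from Dormant in this model, which matches the idea that any trigger may call session.wake() without the library prescribing which one.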

Wiring the wake

The canonical wake is a voice phrase, reusing the existing voice-trigger system:

glasses.voice.onPhrase("hey coach") { session?.wake() }

On Android, onPhrase defaults to VoiceScope.WhenDormant, so the phrase won't double-fire during an active conversation (the Swift onPhrase has no scope parameter — gate re-wakes in your handler if needed). Swap that line for a button onClick, a gesture handler, or an MCP call — anything that calls session.wake(). See voice triggers for how onPhrase matches transcripts; this page won't re-explain it.

Sleeping

Mechanism	How
Model-driven (default)	`endOnIntent = true` registers a hidden `end_conversation` tool the model calls when it hears the user wrap up ("bye", "thanks, I'm good") — in any language, no phrase list to maintain. Its body calls `session.sleep()`.
Deterministic phrase	`sleepOnPhrase("that's all")` — case-insensitive substring on final transcripts, wired through the same voice-command system.
Silence timeout	`sleepAfterSilence(30.seconds)` — auto-sleep after contiguous user silence. Assistant speech pauses (never truncates) the timer.
Explicit	Call `session.sleep()` from your own code.

Set endOnIntent = false for strict, deterministic-only sleep.
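The two deterministic mechanisms are simple enough to sketch. The code below is an illustration under stated assumptions, not the SDK's code: matchesSleepPhrase mirrors the "case-insensitive substring on final transcripts" rule, and SilenceTimer is a hypothetical helper showing one way user speech can reset the clock while assistant speech only pauses it.

```kotlin
// Deterministic phrase check: case-insensitive substring on a final transcript.
fun matchesSleepPhrase(finalTranscript: String, phrase: String): Boolean =
    finalTranscript.contains(phrase, ignoreCase = true)

// Hypothetical silence tracker, driven by the caller once per elapsed interval.
class SilenceTimer(private val limitMs: Long) {
    private var silentMs = 0L

    // Returns true when the accumulated user silence has reached the limit.
    fun tick(elapsedMs: Long, userSpeaking: Boolean, assistantSpeaking: Boolean): Boolean {
        when {
            userSpeaking -> silentMs = 0L        // silence is no longer contiguous
            assistantSpeaking -> Unit            // paused: keep the accumulated value
            else -> silentMs += elapsedMs
        }
        return silentMs >= limitMs
    }
}
```

With a 30-second limit, 20 seconds of silence followed by 20 seconds of assistant speech still leaves 10 seconds on the clock, because the pause neither resets nor advances the count.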

Greeting

On every wake the SDK automatically speaks a greeting, generated out-of-band from the user's memory context — so it greets fresh and can never accidentally continue the prior (ended) conversation.

greeting = Greeting.Custom("Greet the runner warmly in one short sentence and ask how you can help.")

Greeting.Default uses the SDK's built-in directive (memory-aware); Greeting.Custom(directive) supplies your own; Greeting.Off opts out (greet manually in onWake { say(...) }, or not at all). This replaces hand-wiring onWake { greet(...) }.

Vision

session.includeImage(uri, prompt?) adds an image to the conversation and auto-triggers a response — the model speaks about it in its configured voice. The image stays in context for follow-up turns. Typical use is from a tool body:

tool("describe_scene", "Describe what the user is looking at. Call for 'what do you see' / 'describe this'.") {
    val photo = glasses.camera.capturePhoto().valueOrNull()
        ?: return@tool ToolResult.Err("camera failed")
    val uri = photo.uri ?: return@tool ToolResult.Err("photo had no uri")
    glasses.assistant.activeSession?.includeImage(uri)   // non-null while a tool dispatches
    ToolResult.Ok("looking")
}

uri accepts a data: URI (what capturePhoto().uri returns under the sim transport), an http(s) URL (the provider fetches it), or a file:// / content:// / absolute path (the library base64-encodes it). To reach the session from a tool body, use glasses.assistant.activeSession or a field your handler captured from start(...).
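If a tool receives image references from elsewhere in your app, it can help to check which of these forms a string takes before handing it to the session. The classifier below is a hypothetical helper written for this page; the SDK performs its own handling internally and its exact rules may differ.

```kotlin
// The three handling cases described above.
enum class ImageSource { InlineData, RemoteUrl, LocalFile }

// Illustrative classifier; returns null for anything outside the documented forms.
fun classifyImageUri(uri: String): ImageSource? = when {
    uri.startsWith("data:") -> ImageSource.InlineData
    uri.startsWith("http://") || uri.startsWith("https://") -> ImageSource.RemoteUrl
    uri.startsWith("file://") || uri.startsWith("content://") || uri.startsWith("/") ->
        ImageSource.LocalFile
    else -> null
}
```

A tool body could return a ToolResult.Err early when the result is null, so the model explains the failure instead of waiting on an image that never arrives.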

Speaking and barge-in

session.say(text) speaks fixed text in the provider's voice — use this instead of glasses.audio.speak(...) for assistant speech, so it matches the model's own voice rather than the jarringly-different platform TTS engine. (Outside a session, glasses.audio.speak(...) can itself speak in the on-device high-quality voice via SpeakConfig(voice = "kokoro") — see local models.)
session.greet(prompt?) speaks a model-generated, memory-personalized greeting (the manual primitive behind the automatic greeting above).
session.cancelSpeak() cancels the model's in-flight utterance — the app/tool-driven barge-in primitive (the model also handles user-voice barge-in automatically via VAD). Use it when a tool result is ready and you want to interrupt the filler to deliver it.

Mid-session setters exist for setReasoningEffort, setVoice, setModel, and updateInstructions. Voice and model bind to the connection, so setVoice/setModel take effect on the next wake, not mid-conversation; updateInstructions and setReasoningEffort apply to the next response immediately.

Provider and configuration

AssistantProvider is a sealed type. It ships:

AssistantProvider.Managed(model = null, voice = null, turnDetection = ServerVad(), reasoningEffort = Low) — production. Leave model/voice null to take the values configured for this project in the dashboard, falling back to the SDK defaults gpt-realtime-2 / alloy. A value passed in code hard-pins it and wins over the dashboard. reasoningEffort defaults to Low (OpenAI's own recommendation for voice agents — higher settings add noticeable latency).
AssistantProvider.Mock(...) — deterministic, in-process, $0. Substring-matches injected utterances against tool descriptions. Powers unit tests and the MCP injectAssistantUtterance path.

The model id picks the vendor. AssistantProvider.Managed is the managed realtime provider — the resolved model id (code-pin or dashboard) selects which upstream actually runs the session: gpt-* → OpenAI Realtime, grok-* → xAI Grok, gemini-* → Google Gemini Live. Switching vendors is a dashboard dropdown, not a code change: the SDK speaks each vendor's wire protocol natively (the shared core carries a protocol adapter per vendor) and every session exposes the same events, tools, transcripts, history, and barge-in regardless of vendor. Grok and Gemini run managed-gateway only.
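The prefix rule is easy to mirror in app code, for example to label the active vendor in a debug screen. The sketch below is a hypothetical helper, not an SDK function; it covers only the three cloud prefixes named in the paragraph above and returns null for anything else, including on-device ids.

```kotlin
enum class Vendor { OpenAI, XAI, Google }

// Maps a resolved model id to its upstream vendor by prefix.
fun vendorFor(modelId: String): Vendor? = when {
    modelId.startsWith("gpt-") -> Vendor.OpenAI
    modelId.startsWith("grok-") -> Vendor.XAI
    modelId.startsWith("gemini-") -> Vendor.Google
    else -> null   // not a cloud id covered by this sketch
}
```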

Supported realtime models

Model	Vendor	Stands out for	Video input
`gpt-realtime-2.1`	OpenAI	strongest reasoning, live effort knob	—
`gpt-realtime-2.1-mini`	OpenAI	reasoning at mini price	—
`gpt-realtime-2` (default)	OpenAI	strongest reasoning, live effort knob	—
`gpt-realtime-1.5`	OpenAI	flagship audio quality	—
`gpt-realtime-mini`	OpenAI	lowest latency and price	—
`grok-voice-think-fast-1.0`	xAI	reasons while speaking	—
`gemini-3.1-flash-live-preview`	Google	streaming video input — the assistant sees what you see	✓
`gemini-2.5-flash-native-audio-preview-12-2025`	Google	most expressive native speech	—
`local-auto` (Automatic)	Extentos	picks the best on-phone model each device can run, cloud when none fits	—
`local-qwen3-8b`	Extentos	ladder top (Qwen 3 8B), $0 — beyond today's phones	—
`local-qwen3-4b`	Extentos	strongest on-phone tier (Qwen 3 4B), $0 · ~2.9 GB device RAM on iOS	—
`local-qwen3-1.7b`	Extentos	on-phone tier (Qwen 3 1.7B), $0 · ~1.2 GB device RAM on iOS	—
`local-qwen25-1.5b`	Extentos	lightest tool-capable tier (Qwen 2.5 1.5B), $0 · ~1.0 GB on iOS	—
`local-qwen3-0.6b`	Extentos	ultra-light (Qwen 3 0.6B), $0 — chat-capable, weak tool calling	—

The local-* models need extra setup. They are not available with the base dependency: add com.extentos:glasses-local (plus glasses-local-voice for the neural voice) and call ExtentosLocalTier.register(applicationContext) before creating the SDK — see local models. Selecting a local-* model in the dashboard without those will fail at wake.

Pick local-auto unless you have a reason not to. Device memory varies enormously and you cannot know what each user's phone can hold; Automatic resolves it per device and reports what it chose. Full behaviour — the ladder, the end-user download path, and on-device voices — is in On-device models. Note the RAM figures above are the iOS ones; Android runs different quantisations and needs more.

The selected model is the served model — always. The SDK never substitutes a different local model: deviceFit(for:) reports whether a model fits the current device so your app can present download choices honestly, and a selected-but-unfit model refuses to load with a clear error rather than silently serving something else. In the browser simulator, local models need no download at all — the session is served the same model from Extentos infrastructure, so you can evaluate every local rung (including ones your phone can't hold) before shipping; only real hardware runs inference on-device. The single exception is local-auto, whose entire meaning is "resolve for me" — and it reports the concrete model it chose, every session, via AssistantEvent.AutoModelResolved. Delegation, never a silent swap.

The one way to get a silent swap is to forget ExtentosLocalTier.register(...): iOS then has no on-device brain to reach and serves from the cloud without saying so, while Android refuses loudly. Register before you create the SDK handle, and verify with AssistantEvent.AutoModelResolved rather than trusting the absence of an error — see local models.

Local models carry an Extentos conduct layer. Cloud vendors bake a conversational alignment layer beneath whatever instructions you write; small on-device models have no vendor underneath, so Extentos provides that floor — a short core-owned preamble that keeps the model conversational and reaches for tools only when asked, sitting under your instructions exactly like vendor alignment does on cloud models. Write no instructions and a sensible default applies; set localConductFloor: false in AssistantConfig for raw model behavior. The built-in get_device_info tool and its system note work on local models the same as cloud — same names, same wording, so switching models never changes what the assistant knows about the hardware.

Local models price in device resources rather than dollars: the dashboard shows each model's on-device RAM, and the local voice options price the same way — System (default, no extra RAM; the phone's own voice, which iOS upgrades automatically to the best voice the user has installed) or High-quality (Kokoro), an on-device neural voice that adds ~300 MB and extra reply latency (synthesis runs on the phone; faster devices feel it less). End users can improve the System voice themselves by downloading a premium voice under Settings → Accessibility → Spoken Content.

Every model is selectable per-project in the dashboard's Agent tab; pass the id to AssistantProvider.Managed(model = …) only to hard-pin it in code. Gemini Live models require SDK 1.8.0+ — earlier SDK versions predate the Gemini protocol adapter, and a session on a gemini-* model will fail at connect. Reasoning effort maps per vendor: a live knob on the OpenAI 2.x line (gpt-realtime-2, gpt-realtime-2.1, gpt-realtime-2.1-mini); on Gemini it sets the thinking level at the next wake (Minimal is Google's own speed-optimized default — recommended for snappy voice agents); Grok manages its own.

Despite the name, gpt-realtime-2.1-mini is a reasoning model with the full 128k context window — it is a distilled 2.1, not a newer gpt-realtime-mini. The older gpt-realtime-mini has no effort knob and a 32k window.

There is no Claude or Cascaded provider (Anthropic ships no Realtime API). Precedence for every model-side knob is code > dashboard > SDK default — leave a field null to defer to the dashboard.
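The precedence rule reduces to a null-coalescing chain. The function below is a hypothetical illustration of that rule (resolveKnob is not an SDK API): a value pinned in code wins, then the dashboard value, then the SDK default.

```kotlin
// Precedence for a model-side knob: code > dashboard > SDK default.
fun <T : Any> resolveKnob(codeValue: T?, dashboardValue: T?, sdkDefault: T): T =
    codeValue ?: dashboardValue ?: sdkDefault
```

For the model id, passing null for both the code and dashboard values yields the documented default, gpt-realtime-2.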

Vendor parity notes (Gemini Live)

The event vocabulary and session lifecycle are identical across vendors. A few knobs behave differently on Gemini Live, where the session config is fixed at connect time:

Surface	OpenAI / Grok	Gemini Live
Wake→listen→answer, barge-in, tools, transcripts, history, sleep-on-silence	✓	✓ identical
`say(text)`	verbatim	near-verbatim (instruction-driven)
`cancelSpeak()`	cancels at the source	stops playback locally; the turn's remaining audio is discarded
`updateInstructions()`	applies immediately	applies at the next wake/reconnect
`setVoice` / `setModel`	next wake	next wake
`setReasoningEffort`	next response on the OpenAI 2.x line; Grok manages its own	next wake
`withinSessionMemory` (dashboard)	smart = SDK summarization	handled natively by the model's sliding window (knob is a no-op; no summarizer cost)
`includeImage(uri)`	✓	✓
`sendVideoFrame(jpegBytes)` — streaming video input	✗ (error event)	✓ on `gemini-3.1-flash-live-preview` — the only realtime model that can see what the user sees

Streaming video input (Gemini 3.1 Flash Live)

session.sendVideoFrame(jpegBytes) streams a camera frame into the live conversation — the assistant comments on what the glasses see, mid-dialogue. It is gated per-model: only gemini-3.1-flash-live-preview ingests video (other models emit an Error(kind = "video_input_unsupported") event instead of silently dropping the frame). glasses.camera.videoFrames() delivers JPEG frames by default — identical on simulator and hardware — so frame.data feeds it directly:

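// sample() is the kotlinx.coroutines.flow operator; it is marked @FlowPreview, so opt in where you call it.
val frames = glasses.camera.videoFrames(VideoFrameConfig(frameRate = 2, resolution = Resolution.LOW))
frames.sample(1_000).collect { frame ->                  // ~1 fps is plenty (and cheap)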
    glasses.assistant.activeSession?.sendVideoFrame(frame.data)
}

At the session's low media resolution a frame costs on the order of tens of input tokens, so a 1 fps loop is inexpensive; continuous camera + duplex audio is the fragile coexistence path on real glasses, so prefer on-demand bursts over always-on streaming.

Gate your video tools with session.modelSupportsVideoInput. Only video-capable models ingest frames, and the honest failure belongs in your tool so the model speaks it:

tool("start_watching", "Stream the user's view so you can see it live.") {
    if (!session.modelSupportsVideoInput) {
        return@tool ToolResult.Err(
            "this model can't see video — switch the project to Gemini 3.1 Flash Live " +
            "in the dashboard's Agent settings")
    }
    // ... start the frame loop ...
}

Even without the check, the SDK never lets a blind session pretend: a sendVideoFrame on an unsupported model emits AssistantEvent.Error(kind = "video_input_unsupported") to your app and injects a one-time note into the conversation telling the model it cannot see — so the agent answers "I can't watch video on this model" instead of hallucinating a description.

Resolution: recording quality vs. agent frames

The glasses run one camera stream, and its resolution locks at first arm (it can be re-armed after the firmware retires the stream, ~5 min). Two facts make this easy to wire well:

Gemini tokenizes frames at the session's low media resolution regardless of what you send — feeding it high-resolution frames costs upload bandwidth, not extra tokens, and doesn't degrade its vision. The agent adapts to whatever the stream provides.
The first camera call's config decides the armed stream — you program this. If your app both records video and streams to the agent, arm HIGH (make the recording-quality call shape first, or request HIGH in your videoFrames config): the recording keeps full quality and the agent watches the same stream for free. Arming LOW first is the one trap — a later recording is stuck at LOW until the stream retires. camera.activeStreamInfo() tells you what the stream is currently armed at — it's a function, and it returns null when nothing is armed.
Want one quality for every stream instead of per-call control? Set the standing override — glasses.camera.preferredStreamConfig = PreferredStreamConfig(Resolution.HIGH) (or the same field on ExtentosConfig at startup): every arm then uses it and per-call values are ignored. And if a call ever requests better quality than the armed stream can give, the SDK emits one stream_config_conflict warning on runtime.events naming the armed config and the fix — it never re-arms behind your back.
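The arming rules above can be summarised in a toy model. Everything below is a simplified stand-in written for this page (StreamModel, its warning text, and the warn-once behaviour are assumptions, and the Resolution enum here lists only the two values this page mentions); it shows the ordering trap, not how the SDK is implemented.

```kotlin
enum class Resolution { LOW, HIGH }   // declared lowest to highest, so HIGH > LOW

// Toy model of the single camera stream.
class StreamModel(private val preferred: Resolution? = null) {
    var armed: Resolution? = null      // null while nothing is armed
        private set
    val warnings = mutableListOf<String>()

    // Returns the resolution the caller actually gets.
    fun request(requested: Resolution): Resolution {
        val current = armed
        if (current == null) {
            // First arm decides; a standing override wins over the per-call value.
            val chosen = preferred ?: requested
            armed = chosen
            return chosen
        }
        // Already armed: never re-arm; warn once if the caller wanted better quality.
        if (requested > current && warnings.isEmpty()) {
            warnings += "stream_config_conflict: armed at $current, requested $requested"
        }
        return current
    }

    // Models the firmware retiring the stream, after which it can be re-armed.
    fun retire() { armed = null }
}
```

Arming LOW and then requesting HIGH returns LOW with a single warning; arming HIGH first, or setting the standing override, avoids the trap.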

Key AssistantConfig / builder fields:

Field	Default	Meaning
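`instructions`	`""`	The system prompt, written entirely by you. The SDK adds only the pieces described elsewhere on this page: the one-line `deviceInfoNote` hint and, on local models, the conduct floor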
`startActive`	`false`	`true` opens the connection at `start()` (skips Dormant)
`onWake {}` / `onSleep {}`	—	Hooks run as coroutines with the session as receiver (`onWake { say("…") }`)
`sleepAfterSilence(Duration)`	off	Auto-sleep after contiguous user silence
`sleepOnPhrase(phrase)`	—	Deterministic sleep phrase
`endOnIntent`	`true`	Register the hidden model-driven `end_conversation` tool
`greeting`	`Greeting.Default`	Auto-greeting policy (see Greeting)
`historyCap`	`100`	Local replay-buffer cap, in turns
`historyCompaction`	`Auto`	What happens as the buffer fills (see Memory)

What language does the assistant speak and hear?

There is no language setting on the assistant — not on AssistantProvider, not in the start(provider) { … } block, not on AssistantSession. What happens instead depends on which brain is serving:

Cloud models (gpt-*, grok-*, gemini-*). Speech recognition happens inside the provider, and Extentos sends no language field — so input language is auto-detected. Output language is whatever the model decides, which means you steer it through instructions:

instructions = "You are an interpreter. Detect the language the user speaks " +
    "and reply in the other one. Speak only the translation."

That is prompt-level, not an API guarantee, but for multilingual conversation it's the working mechanism — and it's why an interpreter-style app is practical on the cloud tier today.

On-device models (local-*). The ears are the SDK's own on-device recognizer, which the assistant path invokes with default settings — meaning en-US, with no way to change it from the assistant API. On real glasses and the audio baseline that recognizer is English-only regardless (see audio streaming). The bundled on-device voices are US/UK English as well.

If your product is multilingual, the local tier is not ready for it. On-device gives you privacy and $0 inference in English. Anything else needs a cloud model, where the language handling above applies. This is a real limitation, not a configuration you're missing — don't spend a day looking for the setting.

Swift call-shape deltas

The Swift surface carries the same fields and behaviour; the spellings differ more than "field-for-field" suggests. Assume these deltas rather than transliterating the Kotlin:

The config block is an inout closure, not a receiver lambda. Kotlin's instructions = … is Swift's $0.instructions = ….
Names differ in case. AssistantProvider.Managed(...) is .managed(...); ToolResult.Ok / .Err are .ok / .err.
Tool registration labels its description. Kotlin tool("name", "description") { … } is Swift $0.tool("name", description: "…") { … }.
Pinning a model looks the same on both: AssistantProvider.Managed(model = "local-auto") / .managed(model: "local-auto").
glasses.assistant.start(provider:_:) is async throws — open-time failures throw AssistantError directly (Swift enums conform to Error, so there's no AssistantException wrapper to unwrap).
sleepAfterSilence(_:) takes a TimeInterval in seconds instead of Kotlin's Duration.
Session state is observed via session.state, an ObservableState<AssistantState> — a current snapshot plus a replaying stream (StateFlow semantics).
The typed tool overload takes an explicit schema: parameter — Swift can't infer a JSON Schema from a type the way Kotlin does from @Serializable. The no-arg tool(name, description) { ... } form is identical, and the schema sent to the provider is the same on both platforms.

The managed gateway is the default

The assistant carries no API key in your app. The SDK opens a WebSocket to Extentos's managed gateway, which relays the realtime session to the upstream provider (OpenAI by default) on Extentos's key and meters usage. There is no setOpenaiApiKey; it was removed when the assistant moved to gateway-only. There is also no option to run the assistant on your own provider account: today it always runs on Extentos's managed key (see the gateway).

All gateway, identity/attestation, metering, and credit billing live in one place — see the managed AI gateway. This page won't re-derive them.

PII note: Phase 4 events carry verbatim transcripts in the dev event log (user_spoke / assistant_spoke) — fine for development, but the transcript is yours to govern in production; document retention in your app's privacy policy. (The gateway meters usage but does not persist conversation content.)

Memory

The assistant has two independent memory layers, both configured on AssistantConfig — neither is a separate capability.

Within-session history and compaction

The SDK keeps a local replay buffer of recent turns (capped at historyCap, default 100) and replays it to the provider on every reconnect — so a conversation survives wake/sleep and the transparent reconnects. The buffer lives in memory; it is not persisted to disk (persist to your own storage if you need it across launches, then restore with replaceHistory).

historyCompaction controls what happens as the buffer fills (it fires in the background near ~80% of the cap):

Policy	Behavior
`Auto` (default)	Summarizes the oldest ~50% of turns via a cheap chat model (`compactionModel`, default `gpt-4o-mini`) into one summary turn — the conversation continues indefinitely without silent forgetting. ~$0.001 per compaction.
`DropOldest`	Drop the oldest turn when full. Free, lossy.
`Custom(compact)`	Your own `suspend (List<Turn>) -> List<Turn>` compactor — bring your own summarizer, model, or vector-DB recall.
`None`	No compaction; the buffer holds at `historyCap` and you manage it manually via `clearHistory` / `appendHistory` / `replaceHistory`.
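For a sense of what a Custom compactor might do, the sketch below folds the oldest half of the buffer into one summary turn. It is an assumption-laden illustration: Turn is a stand-in type with a made-up "summary" role, the functions are non-suspending to stay dependency-free (the real hook is a suspend function), and the summarize parameter is whatever summarizer you bring.

```kotlin
// Stand-in turn type; the SDK's Turn has its own shape.
data class Turn(val role: String, val text: String)

// True once the buffer is at or above 80% of the cap (the background trigger point).
fun shouldCompact(size: Int, historyCap: Int): Boolean = size * 10 >= historyCap * 8

// Replaces the oldest half of the turns with a single summary turn.
fun compactOldestHalf(turns: List<Turn>, summarize: (List<Turn>) -> String): List<Turn> {
    val cut = turns.size / 2
    if (cut == 0) return turns            // nothing worth summarizing
    val summary = Turn(role = "summary", text = summarize(turns.take(cut)))
    return listOf(summary) + turns.drop(cut)
}
```

Six turns become four: one summary of the first three, followed by the three most recent turns unchanged.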

Session history methods — conversationHistory(limit), clearHistory(), appendHistory(turn), replaceHistory(turns) — let you snapshot, wipe (e.g. onWake { session.clearHistory() } for a fresh-each-wake notetaker), inject app-context hints, or restore persisted turns.

Cross-session persistent memory (v0 preview)

persistentMemory = true loads this end-user's stored profile at session start and merges durable signal back at session end — so the agent remembers the user across sessions (the automatic greeting personalizes from it).

Opt-in and consent-gated. It stores a person's data, so it requires the end-user's consent — which you can only obtain in your app. It is therefore a code-side switch, never a dashboard toggle.
Keyed per-device by default. memoryUserId = null keys the profile on the SDK's attested per-device id (memory follows the device). Set it to your app's stable id for the signed-in user to make memory follow the person across devices and reinstalls (isolated per user on a shared device). The profile is always scoped to your project by the attestation JWT, so one app can never reach another's memory.
Gateway-backed unless you supply a MemoryStore. By default the profile lives on the Extentos backend behind the gateway. Provide a MemoryStore to keep profiles entirely in your own infrastructure.

Errors and events

Open-time failures throw AssistantException wrapping an AssistantError (NoApiKey, AlreadyActive, SessionEnded, NotReady, NetworkError, ProviderError) — pattern-match on .error. Starting a second session while one is active throws AlreadyActive (the runtime is singleton-active per ExtentosGlasses). Once a session is Active, transient errors surface as events and the session rides them out through the reconnection state machine. Full table: error reference.
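The catch-and-match pattern looks roughly like the sketch below. The two types are stand-ins that mirror only the variant names listed above (the SDK's real declarations may carry payload fields), and the user-facing messages and the openSafely helper are hypothetical.

```kotlin
// Stand-ins for illustration; payloads are omitted.
enum class AssistantError { NoApiKey, AlreadyActive, SessionEnded, NotReady, NetworkError, ProviderError }

class AssistantException(val error: AssistantError) : Exception("assistant open failed: $error")

// Hypothetical mapping from an open-time failure to a message for the user.
fun describe(error: AssistantError): String = when (error) {
    AssistantError.AlreadyActive -> "A session is already running."
    AssistantError.NetworkError -> "No connection. Try again in a moment."
    AssistantError.NotReady -> "The assistant is not ready yet."
    AssistantError.SessionEnded -> "That session has ended. Start a new one."
    AssistantError.NoApiKey, AssistantError.ProviderError -> "The assistant is unavailable right now."
}

// Runs an open attempt and returns a message on failure, or null on success.
fun openSafely(open: () -> Unit): String? =
    try {
        open()
        null
    } catch (e: AssistantException) {
        describe(e.error)
    }
```

Because the when is exhaustive over the enum, adding a variant to the stand-in forces the mapping to be updated at compile time.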

Lifecycle flows through the shared glasses.runtime.events stream as RuntimeEvent.Assistant wrapping an AssistantEvent (SessionStarted, SessionEnded, UserSpoke, AssistantSpoke, ToolCalled, ToolResultEvent, Reconnected, Error, WentDormant). In the simulator these land on the voice event-log chip — assistant lifecycle, STT and TTS all share it (an Error climbs to the errors chip automatically). The ai chip is reserved for customer-side BYOK calls wrapped in glasses.observability.aiCall(...), not the assistant runtime. Capture transcripts off this stream:

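import kotlinx.coroutines.flow.filterIsInstance
import kotlinx.coroutines.flow.launchIn
import kotlinx.coroutines.flow.onEach

glasses.runtime.events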
    .filterIsInstance<RuntimeEvent.Assistant>()
    .onEach { (it.event as? AssistantEvent.UserSpoke)?.let { spoke -> notes.append(spoke.transcript) } }
    .launchIn(scope)

glasses.conversation.* (the Phase 3 cascaded runtime) is removed on current Android — use glasses.assistant.*.

Build a voice assistant — the task guide: wake phrase → tools acting on app state → vision → sleep
The managed AI gateway — gateway default, metering, and credit billing
Voice triggers — glasses.voice.onPhrase, the canonical wake mechanism
The display capability — assistant tools can render on the Ray-Ban Display via glasses.display.*
Capabilities — the full vendor-agnostic SDK vocabulary
Error reference — AssistantError and the no-DisplayError model
