Build a voice assistant

Build a wake-word voice assistant on Meta Ray-Ban smart glasses with glasses.assistant.start. The model owns wake detection, turn-taking, and intent parsing; you write tool bodies that read and act on your app's state, add a vision tool for "what am I looking at", and let the conversation sleep on intent. Runs on the Extentos managed gateway with no API key in your app.

This guide builds a working voice assistant: a wake phrase opens a conversation, the model calls tools that read and mutate your app's own state (route stats, a clip library, the camera), a vision tool answers "what am I looking at," and the conversation sleeps when the user is done. The model handles wake detection, turn-taking, intent parsing, and confirmation speech — you only write the tool bodies.

For the full API and the wake/sleep state machine, read the assistant runtime first. This page is the task walkthrough.

Preview snapshot. glasses.assistant.* ships in the 1.4.0-phase4-dogfood snapshot via mavenLocal(), not on Maven Central (the published Android release is 1.3.0). Publish the snapshot locally to build against it; see SDK install. iOS is pending, so this guide covers Android only.
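
As a rough sketch of the repository wiring the note above implies, a Gradle Kotlin DSL settings file could list mavenLocal() alongside the usual repositories. Placing mavenLocal() first is an assumption about the resolution order you want. The snapshot's artifact coordinates are not given in this guide, so no dependency line is shown; see SDK install for those.

```kotlin
// settings.gradle.kts (sketch; repository order is an assumption)
dependencyResolutionManagement {
    repositories {
        mavenLocal()    // locally published 1.4.0-phase4-dogfood snapshot
        google()
        mavenCentral()  // published releases such as 1.3.0
    }
}
```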

1. Bootstrap

Create the ExtentosGlasses handle once at startup (after RECORD_AUDIO is granted) and start your handler. ExtentosGlasses.create() is synchronous, so it needs no coroutine. The coroutine is for the assistant's suspending start and wake calls, and the handler launches it internally.

class MyApp : Application() {
    lateinit var glasses: ExtentosGlasses

    override fun onCreate() {
        super.onCreate()
        glasses = ExtentosGlasses.create(ExtentosConfig(applicationContext = this))
        // No OpenAI key anywhere — the assistant runs on the Extentos managed
        // gateway by default. To use your own key, add it in the dashboard
        // Credentials section; the gateway swaps it in server-side.
        RunAssistantHandler(glasses, routeTracker, library).start()
    }
}

There are no ONNX models, model paths, or opt-in config — the assistant is always available and runs end-to-end over the provider's WebSocket. See the managed gateway for the AI plumbing.

2. Start the session and register tools

The whole assistant is one block. Declare instructions, register tools, and pick a greeting. Leave the provider as AssistantProvider.OpenAi() to take your dashboard's model + voice (defaults gpt-realtime-2 / alloy).

import com.extentos.glasses.core.ExtentosGlasses
import com.extentos.glasses.core.RuntimeEvent
import com.extentos.glasses.core.VideoClip
import com.extentos.glasses.core.VideoConfig
import com.extentos.glasses.core.valueOrNull
import com.extentos.glasses.core.assistant.AssistantEvent
import com.extentos.glasses.core.assistant.AssistantProvider
import com.extentos.glasses.core.assistant.AssistantSession
import com.extentos.glasses.core.assistant.Greeting
import com.extentos.glasses.core.assistant.ToolResult
import com.extentos.glasses.core.assistant.tool
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.filterIsInstance
import kotlinx.coroutines.flow.launchIn
import kotlinx.coroutines.flow.onEach
import kotlin.time.Duration.Companion.seconds

class RunAssistantHandler(
    private val glasses: ExtentosGlasses,
    private val routeTracker: RouteTracker,   // your app state
    private val library: ClipLibrary,         // your app state
    private val scope: CoroutineScope = CoroutineScope(SupervisorJob() + Dispatchers.IO),
) {
    private var session: AssistantSession? = null
    private var activeVideo: Deferred<*>? = null

    fun start() {
        scope.launch {
            session = glasses.assistant.start(provider = AssistantProvider.OpenAi()) {
                instructions = """
                    You are a running companion. Help with route stats and capture
                    moments. Speak briefly — they're running. Don't narrate what
                    you're doing; just do it and confirm. When the user clearly
                    wants to stop talking, call end_conversation.
                """.trimIndent()

                // The SDK speaks this greeting automatically on every wake,
                // generated out-of-band from memory (never continues the prior chat).
                greeting = Greeting.Custom(
                    "Greet the runner warmly in one short sentence and ask how you can help."
                )
                // Deterministic backup sleep after 30s of user silence.
                sleepAfterSilence(30.seconds)

                // Read tools — instant data the model reads aloud.
                tool("get_route_remaining", "How much of the planned route is left, in km.") {
                    ToolResult.Ok("${routeTracker.kmRemaining()} km remaining")
                }
                tool("get_average_pace", "The runner's current average pace in minutes per km.") {
                    ToolResult.Ok("${routeTracker.avgPaceMinKm()} min per km")
                }

                // Action tools — side effects on your own state. The model
                // manages the take/stop pair from conversational context.
                tool("take_video", "Start recording a video clip of the runner's view.") {
                    if (activeVideo?.isActive == true) {
                        return@tool ToolResult.Err("a recording is already in progress")
                    }
                    activeVideo = scope.async {
                        glasses.camera.captureVideo(VideoConfig(maxDurationSeconds = 30))
                    }
                    ToolResult.Ok("recording started")
                }
                tool("stop_video", "Stop the current video recording.") {
                    val capture = activeVideo ?: return@tool ToolResult.Err("nothing was recording")
                    activeVideo = null
                    glasses.camera.stopVideo()                 // resumes the await naturally
                    @Suppress("UNCHECKED_CAST")
                    val clip = (capture as Deferred<com.extentos.glasses.core.ExtentosResult<VideoClip, *>>)
                        .await().valueOrNull() ?: return@tool ToolResult.Err("video capture failed")
                    library.add(clip)
                    ToolResult.Ok("video saved")
                }
            }

            // Wake mechanism — the same voice-trigger system as everywhere.
            // Defaults to VoiceScope.WhenDormant so it won't double-fire mid-chat.
            glasses.voice.onPhrase("hey coach") { session?.wake() }
        }
    }
}

Here the model does several things you would otherwise hand-wire. It detects the wake intent, decides which tool to call from each description, and speaks a "let me check…" filler while a tool runs (suppress per-tool with blocking = true). When the user wraps up, it ends the conversation via the hidden end_conversation tool (endOnIntent defaults to true), so no rigid "goodbye" phrase is required.

Clean stop pattern. To stop the recording, call glasses.camera.stopVideo() and await() the in-flight capture — it resumes with a partial clip. Don't cancel() the Deferred: Kotlin's cancelled state is sticky and await() would throw even when a partial exists.
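
The difference between the two stop strategies can be seen with plain kotlinx.coroutines, independent of the SDK. In this minimal sketch, fakeCapture and stopSignal are hypothetical stand-ins for the camera capture and for glasses.camera.stopVideo(); only the Deferred behavior is the point.

```kotlin
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.CompletableDeferred
import kotlinx.coroutines.async
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.runBlocking

// Stand-in for a capture that suspends until told to stop, then returns what it
// recorded so far. `stopSignal` is hypothetical; in the guide,
// glasses.camera.stopVideo() plays this role.
suspend fun fakeCapture(stopSignal: CompletableDeferred<Unit>): String {
    stopSignal.await()
    return "partial clip"
}

// Cooperative stop: signal the capture, then await its result.
suspend fun stopThenAwait(): String = coroutineScope {
    val stop = CompletableDeferred<Unit>()
    val capture = async { fakeCapture(stop) }
    stop.complete(Unit)
    capture.await()
}

// Cancelling instead: the Deferred ends up cancelled, so await() throws and
// there is no result to read.
suspend fun cancelThenAwait(): String = coroutineScope {
    val capture = async { fakeCapture(CompletableDeferred()) }
    capture.cancel()
    try {
        capture.await()
    } catch (e: CancellationException) {
        "await() threw CancellationException"
    }
}

fun main(): Unit = runBlocking {
    println(stopThenAwait())    // partial clip
    println(cancelThenAwait())  // await() threw CancellationException
}
```

The cooperative path lets the coroutine finish normally, which is why stop_video above signals the camera first and only then awaits the Deferred.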

3. Add a vision tool

session.includeImage(uri) adds a photo to the conversation and auto-triggers a spoken response in the model's voice. Reach the running session from a tool body via glasses.assistant.activeSession (non-null while a tool dispatches). Keep "describe" and "save" as two tools so the model can disambiguate — and call both back-to-back when the user wants both.

tool(
    "describe_scene",
    "Describe what the runner is looking at without saving the photo. " +
        "Call for 'what do you see' / 'describe this' / 'tell me about this'.",
) {
    val photo = glasses.camera.capturePhoto().valueOrNull()
        ?: return@tool ToolResult.Err("camera failed")
    val uri = photo.uri ?: return@tool ToolResult.Err("photo had no uri")
    glasses.assistant.activeSession?.includeImage(uri)
    ToolResult.Ok("looking")
}

Camera tools need the simulator browser tab attached when you test them — see step 5.

4. Capture transcripts (optional)

If you want the user's words for notes, captions, or a journal, subscribe to the shared runtime-event stream. Phase 4 events carry verbatim transcripts (yours to govern; document retention in your privacy policy).

glasses.runtime.events
    .filterIsInstance<RuntimeEvent.Assistant>()
    .onEach { evt ->
        (evt.event as? AssistantEvent.UserSpoke)?.let { notes.append(it.transcript) }
    }
    .launchIn(scope)

5. Test it in the simulator

An agent can run the whole test loop with no human and no hardware. Drive the wake through the real wake path, inject a user utterance, then assert that the tool fired:

// Wake exactly like a real user saying the phrase:
await injectTranscript({ sessionId, text: "hey coach" });
// ... wait for assistant.session_started in getEventLog(filter: "ai"), then:
const inj = await injectAssistantUtterance({ sessionId, text: "how much further?" });
await assertToolCalled({ sessionId, name: "get_route_remaining", sinceCursor: inj.watchCursor });

  • injectAssistantUtterance drives both the Mock provider (deterministic, $0) and the real OpenAi provider with the same call. Pick via a build flavor, never useMock in handler code.
  • The session must be Active when you inject — wake it first and watch for assistant.session_started.
  • Always thread inj.watchCursor into assertToolCalled({ sinceCursor }) — the model fires the tool 0.5–2 s later, and anchoring the wait before the inject avoids a false "tool not called."
  • Camera/vision tools (describe_scene) need the browser tab attached — call ensureSimulatorBrowser({ sessionId }) first, and budget a longer timeoutMs (camera + upload + reasoning).
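
One way to picture why the cursor matters is a toy append-only log. The EventLog class and the event names below are hypothetical and are not the simulator's API; the sketch only illustrates that a cursor captured at inject time still sees a tool call that landed before the assertion started, while a cursor captured late misses it.

```kotlin
// Hypothetical in-memory model of an append-only event log; names are
// illustrative only and do not mirror the simulator's real event schema.
class EventLog {
    private val events = mutableListOf<String>()

    fun append(event: String) { events += event }

    // Position of the next event to be appended.
    fun cursor(): Int = events.size

    // True if `name` appears at or after `cursor`.
    fun anySince(cursor: Int, name: String): Boolean =
        events.drop(cursor).any { it == name }
}

fun main() {
    val log = EventLog()

    val watchCursor = log.cursor()                  // captured at inject time
    log.append("user_utterance")
    log.append("tool_called:get_route_remaining")   // model fires the tool shortly after

    val lateCursor = log.cursor()                   // captured only when the assertion starts

    // Anchored at inject time: the tool call is found.
    println(log.anySince(watchCursor, "tool_called:get_route_remaining")) // true
    // Anchored late: the call already landed, so the check misses it.
    println(log.anySince(lateCursor, "tool_called:get_route_remaining"))  // false
}
```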

Lifecycle and transcripts land on the simulator's ai event-log chip (an assistant error climbs to errors). See the MCP tools reference for the full agent-test surface.

Sim fidelity. The simulator runs the same library code as production with only the transport and audio path swapped, so it's designed to behave identically to hardware. That fidelity is under active validation on Android right now — treat the sim as the iteration loop and a real-hardware run as the final gate, not a guarantee.

Where to go next

  • Render on the glasses screen. On a Ray-Ban Display, a tool can show results visually instead of (or alongside) speaking — see render on the display; gate on glasses.display.isAvailable.
  • Tune the model. AssistantProvider.OpenAi(reasoningEffort = …), mid-session setReasoningEffort / updateInstructions, and historyCompaction for long conversations — all in the assistant runtime.
  • Persist memory across sessions. persistentMemory = true (opt-in, consent-gated, Android preview) — see memory.
  • Bring your own key. Upload an OpenAI key in the dashboard to move spend to your account — see the managed AI gateway.