---
title: Build a voice assistant
description: Build a wake-word voice assistant on Meta Ray-Ban smart glasses with glasses.assistant.start. The model owns wake detection, turn-taking, and intent parsing; you write tool bodies that read and act on your app's state, add a vision tool for "what am I looking at", and let the conversation sleep on intent. Runs on the Extentos managed gateway with no API key in your app.
type: guide
platform: android
vendor: meta
related:
  - /docs/concepts/assistant
  - /docs/concepts/ai-gateway
  - /docs/guides/voice-triggers
  - /docs/guides/display
  - /docs/concepts/capabilities
---

This guide builds a working voice assistant: a wake phrase opens a conversation, the model calls **tools** that read and mutate your app's own state (route stats, a clip library, the camera), a **vision** tool answers "what am I looking at," and the conversation **sleeps** when the user is done. The model handles wake detection, turn-taking, intent parsing, and confirmation speech — you only write the tool bodies.

For the full API and the wake/sleep state machine, read [the assistant runtime](/docs/concepts/assistant) first. This page is the task walkthrough.

> **Preview snapshot.** `glasses.assistant.*` ships in the **`1.4.0-phase4-dogfood`** snapshot via `mavenLocal()` — **not** on Maven Central (published Android is `1.3.0`). Publish the snapshot locally to build against it; see [SDK install](/docs/sdk/android/install). **iOS is pending** — this guide is Android.

## 1. Bootstrap

Create the `ExtentosGlasses` handle once at startup (after `RECORD_AUDIO` is granted) and start your handler. `ExtentosGlasses.create()` is synchronous, so it needs no coroutine. The coroutine is for the assistant's suspending `start` and `wake` calls, and the handler launches it internally.

```kotlin
class MyApp : Application() {
    lateinit var glasses: ExtentosGlasses

    override fun onCreate() {
        super.onCreate()
        glasses = ExtentosGlasses.create(ExtentosConfig(applicationContext = this))
        // No OpenAI key anywhere — the assistant runs on the Extentos managed
        // gateway by default. To use your own key, add it in the dashboard
        // Credentials section; the gateway swaps it in server-side.
        // routeTracker and library are your own app-state objects, created elsewhere.
        RunAssistantHandler(glasses, routeTracker, library).start()
    }
}
```

There are no ONNX models, model paths, or opt-in config — the assistant is always available and runs end-to-end over the provider's WebSocket. See [the managed gateway](/docs/concepts/ai-gateway) for the AI plumbing.

## 2. Start the session and register tools

The whole assistant is one block. Declare `instructions`, register tools, and pick a greeting. Leave the provider as `AssistantProvider.OpenAi()` to use the model and voice configured in your dashboard (defaults: `gpt-realtime-2` and `alloy`).

```kotlin
import com.extentos.glasses.core.ExtentosGlasses
import com.extentos.glasses.core.RuntimeEvent
import com.extentos.glasses.core.VideoClip
import com.extentos.glasses.core.VideoConfig
import com.extentos.glasses.core.valueOrNull
import com.extentos.glasses.core.assistant.AssistantEvent
import com.extentos.glasses.core.assistant.AssistantProvider
import com.extentos.glasses.core.assistant.AssistantSession
import com.extentos.glasses.core.assistant.Greeting
import com.extentos.glasses.core.assistant.ToolResult
import com.extentos.glasses.core.assistant.tool
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.filterIsInstance
import kotlinx.coroutines.flow.launchIn
import kotlinx.coroutines.flow.onEach
import kotlin.time.Duration.Companion.seconds

class RunAssistantHandler(
    private val glasses: ExtentosGlasses,
    private val routeTracker: RouteTracker,   // your app state
    private val library: ClipLibrary,         // your app state
    private val scope: CoroutineScope = CoroutineScope(SupervisorJob() + Dispatchers.IO),
) {
    private var session: AssistantSession? = null
    private var activeVideo: Deferred<*>? = null

    fun start() {
        scope.launch {
            session = glasses.assistant.start(provider = AssistantProvider.OpenAi()) {
                instructions = """
                    You are a running companion. Help with route stats and capture
                    moments. Speak briefly — they're running. Don't narrate what
                    you're doing; just do it and confirm. When the user clearly
                    wants to stop talking, call end_conversation.
                """.trimIndent()

                // The SDK speaks this greeting automatically on every wake,
                // generated out-of-band from memory (never continues the prior chat).
                greeting = Greeting.Custom(
                    "Greet the runner warmly in one short sentence and ask how you can help."
                )
                // Deterministic backup sleep after 30s of user silence.
                sleepAfterSilence(30.seconds)

                // Read tools — instant data the model reads aloud.
                tool("get_route_remaining", "How much of the planned route is left, in km.") {
                    ToolResult.Ok("${routeTracker.kmRemaining()} km remaining")
                }
                tool("get_average_pace", "The runner's current average pace in minutes per km.") {
                    ToolResult.Ok("${routeTracker.avgPaceMinKm()} min per km")
                }

                // Action tools — side effects on your own state. The model
                // manages the take/stop pair from conversational context.
                tool("take_video", "Start recording a video clip of the runner's view.") {
                    if (activeVideo?.isActive == true) {
                        return@tool ToolResult.Err("a recording is already in progress")
                    }
                    activeVideo = scope.async {
                        glasses.camera.captureVideo(VideoConfig(maxDurationSeconds = 30))
                    }
                    ToolResult.Ok("recording started")
                }
                tool("stop_video", "Stop the current video recording.") {
                    val capture = activeVideo ?: return@tool ToolResult.Err("nothing was recording")
                    activeVideo = null
                    glasses.camera.stopVideo()                 // resumes the await naturally
                    @Suppress("UNCHECKED_CAST")
                    val clip = (capture as Deferred<com.extentos.glasses.core.ExtentosResult<VideoClip, *>>)
                        .await().valueOrNull() ?: return@tool ToolResult.Err("video capture failed")
                    library.add(clip)
                    ToolResult.Ok("video saved")
                }
            }

            // Wake mechanism — the same voice-trigger system as everywhere.
            // Defaults to VoiceScope.WhenDormant so it won't double-fire mid-chat.
            glasses.voice.onPhrase("hey coach") { session?.wake() }
        }
    }
}
```

The model handles several things here that you would otherwise hand-wire. It detects the wake intent, chooses which tool to call from each `description`, speaks a "let me check…" filler while a tool runs (suppress it per tool with `blocking = true`), and ends the conversation through the hidden `end_conversation` tool (`endOnIntent` defaults to `true`) when the user wraps up, so no rigid "goodbye" phrase is required.

> **Clean stop pattern.** To stop the recording, call `glasses.camera.stopVideo()` and `await()` the in-flight capture — it resumes with a partial clip. Don't `cancel()` the `Deferred`: Kotlin's cancelled state is sticky and `await()` would throw even when a partial exists.
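
The difference between the two stop paths can be reproduced without glasses. The following is a minimal sketch that uses only `kotlinx.coroutines`. `FakeRecorder` is a hypothetical stand-in for the camera, not part of the SDK: its `capture()` suspends until `stop()` is called and then returns a partial result.

```kotlin
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.CompletableDeferred
import kotlinx.coroutines.async
import kotlinx.coroutines.runBlocking

// Hypothetical stand-in for a recording (not an SDK class):
// capture() suspends until stop() is called, then returns what it has.
class FakeRecorder {
    private val stopSignal = CompletableDeferred<Unit>()

    suspend fun capture(): String {
        stopSignal.await()
        return "partial clip"
    }

    fun stop() {
        stopSignal.complete(Unit)
    }
}

// Clean stop: signal the recorder, then await the in-flight capture.
fun stopThenAwait(): String = runBlocking {
    val recorder = FakeRecorder()
    val capture = async { recorder.capture() }
    recorder.stop()
    capture.await()
}

// Anti-pattern: cancel the Deferred, then await it.
fun cancelThenAwait(): String = runBlocking {
    val recorder = FakeRecorder()
    val capture = async { recorder.capture() }
    capture.cancel()
    try {
        capture.await()
    } catch (e: CancellationException) {
        "await threw CancellationException"
    }
}

fun main() {
    println(stopThenAwait())
    println(cancelThenAwait())
}
```

Stopping and then awaiting yields the partial clip. Cancelling makes `await()` throw, so any partial result is unreachable through the `Deferred`.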

## 3. Add a vision tool

`session.includeImage(uri)` adds a photo to the conversation and auto-triggers a spoken response in the model's voice. Reach the running session from a tool body via `glasses.assistant.activeSession` (non-null while a tool dispatches). Keep "describe" and "save" as **two tools** so the model can disambiguate — and call both back-to-back when the user wants both.

```kotlin
tool(
    "describe_scene",
    "Describe what the runner is looking at without saving the photo. " +
        "Call for 'what do you see' / 'describe this' / 'tell me about this'.",
) {
    val photo = glasses.camera.capturePhoto().valueOrNull()
        ?: return@tool ToolResult.Err("camera failed")
    val uri = photo.uri ?: return@tool ToolResult.Err("photo had no uri")
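    // Fail loudly instead of silently skipping the image if no session is active.
    val session = glasses.assistant.activeSession
        ?: return@tool ToolResult.Err("no active session")
    session.includeImage(uri)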
    ToolResult.Ok("looking")
}
```

Camera tools need the simulator browser tab attached when you test them — see step 5.

## 4. Capture transcripts (optional)

If you want the user's words for notes, captions, or a journal, subscribe to the shared runtime-event stream. Phase 4 events carry **verbatim** transcripts. Storing them is yours to govern, so document retention in your privacy policy. In the snippet, `notes` stands for your own store.

```kotlin
glasses.runtime.events
    .filterIsInstance<RuntimeEvent.Assistant>()
    .onEach { evt ->
        (evt.event as? AssistantEvent.UserSpoke)?.let { notes.append(it.transcript) }
    }
    .launchIn(scope)
```
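
To unit-test this subscription without a live session, one option is to pull the operator chain into a function and feed it a hand-built flow. The sketch below assumes only the shape used above: a `RuntimeEvent.Assistant` wrapping an `event`, and `AssistantEvent.UserSpoke` carrying a `transcript`. The types declared here are local stand-ins so the example runs without the SDK, and the `SessionStarted` and `Other` variants are hypothetical.

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.filterIsInstance
import kotlinx.coroutines.flow.flowOf
import kotlinx.coroutines.flow.mapNotNull
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.runBlocking

// Local stand-ins mirroring only the fields the guide uses.
// The real classes live in the SDK and may differ.
sealed interface AssistantEvent {
    data class UserSpoke(val transcript: String) : AssistantEvent
    object SessionStarted : AssistantEvent // hypothetical variant
}

sealed interface RuntimeEvent {
    data class Assistant(val event: AssistantEvent) : RuntimeEvent
    object Other : RuntimeEvent // hypothetical non-assistant event
}

// Keep only assistant events, then only the user's spoken transcripts.
fun userTranscripts(events: Flow<RuntimeEvent>): Flow<String> =
    events
        .filterIsInstance<RuntimeEvent.Assistant>()
        .mapNotNull { (it.event as? AssistantEvent.UserSpoke)?.transcript }

// Run the chain over a fixed sequence of events and collect the result.
fun collectSample(): List<String> = runBlocking {
    val events = flowOf<RuntimeEvent>(
        RuntimeEvent.Other,
        RuntimeEvent.Assistant(AssistantEvent.SessionStarted),
        RuntimeEvent.Assistant(AssistantEvent.UserSpoke("how much further?")),
    )
    userTranscripts(events).toList()
}

fun main() {
    println(collectSample())
}
```

Only the `UserSpoke` transcript survives the filter. The non-assistant event and the lifecycle event are dropped.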

## 5. Test it in the simulator

An agent can run this loop end to end without a human or hardware. Trigger the **wake** through the real wake path, inject a user utterance, then assert that the tool fired:

```ts
// Wake exactly like a real user saying the phrase:
await injectTranscript({ sessionId, text: "hey coach" });
// ... wait for assistant.session_started in getEventLog(filter: "ai"), then:
const inj = await injectAssistantUtterance({ sessionId, text: "how much further?" });
await assertToolCalled({ sessionId, name: "get_route_remaining", sinceCursor: inj.watchCursor });
```

- `injectAssistantUtterance` drives **both** the `Mock` provider (deterministic, $0) and the real `OpenAi` provider with the same call — pick via a build flavor, never `useMock` in handler code.
- The session must be **Active** when you inject — wake it first and watch for `assistant.session_started`.
- Always thread `inj.watchCursor` into `assertToolCalled({ sinceCursor })`. The model fires the tool 0.5–2 s after the inject, and anchoring the wait at the cursor captured before the inject avoids a false "tool not called."
- Camera/vision tools (`describe_scene`) need the browser tab attached — call `ensureSimulatorBrowser({ sessionId })` first, and budget a longer `timeoutMs` (camera + upload + reasoning).
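
The reason for threading the cursor can be seen in a toy model. The sketch below is written in Kotlin to match the rest of this guide, and it is an assumption about how a cursor over an append-only log behaves, not the simulator's implementation. `EventLog`, `toolCalledSince`, and the event strings are all hypothetical.

```kotlin
// Toy append-only event log with an integer cursor (hypothetical model).
class EventLog {
    private val entries = mutableListOf<String>()

    // A cursor is the position of the next event to be appended.
    fun cursor(): Int = entries.size

    fun append(event: String) {
        entries.add(event)
    }

    // Every event appended at or after the given cursor.
    fun since(cursor: Int): List<String> = entries.drop(cursor)
}

// True if the named tool was called at or after the cursor.
fun toolCalledSince(log: EventLog, cursor: Int, tool: String): Boolean =
    log.since(cursor).any { it == "tool_called:$tool" }

// Returns (anchored result, late result) for the same log.
fun demo(): Pair<Boolean, Boolean> {
    val log = EventLog()
    val watchCursor = log.cursor() // captured at inject time
    log.append("user_utterance:how much further?")
    log.append("tool_called:get_route_remaining") // the model fires the tool shortly after
    val lateCursor = log.cursor() // a watch that starts only now
    val anchored = toolCalledSince(log, watchCursor, "get_route_remaining")
    val late = toolCalledSince(log, lateCursor, "get_route_remaining")
    return anchored to late
}

fun main() {
    println(demo())
}
```

In this model the anchored check sees the call, while the check that starts after the tool has already fired misses it and would report a false "tool not called."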

Lifecycle and transcripts land on the simulator's **`ai`** event-log chip (an assistant error climbs to **`errors`**). See [the MCP tools reference](/docs/reference/mcp-tools) for the full agent-test surface.

> **Sim fidelity.** The simulator runs the *same* library code as production with only the transport and audio path swapped, so it's designed to behave identically to hardware. That fidelity is under active validation on Android right now — treat the sim as the iteration loop and a real-hardware run as the final gate, not a guarantee.

## Where to go next

- **Render on the glasses screen.** On a Ray-Ban Display, a tool can show results visually instead of (or alongside) speaking — see [render on the display](/docs/guides/display); gate on `glasses.display.isAvailable`.
- **Tune the model.** `AssistantProvider.OpenAi(reasoningEffort = …)`, mid-session `setReasoningEffort` / `updateInstructions`, and `historyCompaction` for long conversations — all in [the assistant runtime](/docs/concepts/assistant).
- **Persist memory across sessions.** `persistentMemory = true` (opt-in, consent-gated, Android preview) — see [memory](/docs/concepts/assistant#memory).
- **Bring your own key.** Upload an OpenAI key in the dashboard to move spend to your account — see [the managed AI gateway](/docs/concepts/ai-gateway).