Build a voice assistant

Build a wake-word voice assistant on Meta Ray-Ban smart glasses with glasses.assistant.start. A phone-side wake phrase opens the conversation; the model owns turn-taking and intent parsing; you write tool bodies that read and act on your app's state, add a vision tool for "what am I looking at", and let the conversation sleep on intent. Runs on the Extentos managed gateway with no API key in your app.

Camera needs the Meta vendor module. com.extentos:glasses carries no vendor SDK, so add implementation("com.extentos:glasses-meta") alongside it — see install §1. Without it your build still succeeds and voice still works, but capabilities.camera is false and every capture returns an error. The SDK logs a warning at startup when it spots that combination.

This guide builds a working voice assistant: a wake phrase opens a conversation, the model calls tools that read and mutate your app's own state (route stats, a clip library, the camera), a vision tool answers "what am I looking at," and the conversation sleeps when the user is done. The wake phrase is recognised on the phone, on-device. The model handles turn-taking, intent parsing, and confirmation speech, so you only write the tool bodies.

For the full API and the wake/sleep state machine, read the assistant runtime first. This page is the task walkthrough.

glasses.assistant.* ships on both platforms — in com.extentos:glasses since 1.4.0 (SDK install) and in the Extentos Swift package — a standard dependency, no snapshot setup. Kotlin leads the snippets here, with the Swift counterpart in a fold under each. The fields and behaviour are the same on both; a handful of spellings differ, and those are called out where they appear rather than left for you to discover at the compiler.

1. Bootstrap

The assistant needs the microphone, and RECORD_AUDIO is a runtime permission your app owns — the library never requests it for you. So the order matters, and it is the part people get wrong:

Create the ExtentosGlasses handle in Application.onCreate, but start the assistant only after the grant. Creating the handle early is fine and cheap; starting a session before the grant leaves you with an assistant that is running and deaf.

MyApp.kt

class MyApp : Application() {
    lateinit var glasses: ExtentosGlasses

    override fun onCreate() {
        super.onCreate()
        // Cheap: resolves the transport, wires the clients. No mic yet.
        // No OpenAI key anywhere — the assistant runs on the Extentos managed
        // gateway (there is no option to run it on your own OpenAI account).
        glasses = ExtentosGlasses.create(ExtentosConfig(applicationContext = this))
    }
}

MainActivity.kt

class MainActivity : ComponentActivity() {

    private val micPermission = registerForActivityResult(
        ActivityResultContracts.RequestPermission(),
    ) { granted -> if (granted) onMicGranted() }

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        if (checkSelfPermission(Manifest.permission.RECORD_AUDIO) == PackageManager.PERMISSION_GRANTED) {
            onMicGranted()
        } else {
            micPermission.launch(Manifest.permission.RECORD_AUDIO)
        }
    }

    private fun onMicGranted() {
        // Only now: the foreground service (so the mic survives backgrounding)
        // and the assistant handler.
        GlassesForegroundService.start(this)
        // Your handler — step 2 defines it. routeTracker / library are your own app state.
        RunAssistantHandler((application as MyApp).glasses, routeTracker, library).start()
    }
}

Do not start the assistant or the foreground service from Application.onCreate. On first launch the grant cannot exist that early, and Android 14+ rejects a microphone-type foreground service started without it. See lifecycle.
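
One detail the Activity snippet above leaves open is that onCreate runs again whenever the Activity is recreated (a rotation, for example), so onMicGranted() can fire more than once. A minimal sketch of a start-once guard, assuming you want a single foreground service and handler per process; StartOnce is a hypothetical helper, not part of the SDK:

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

/** Hypothetical helper (not SDK API): runs `start` at most once per process. */
class StartOnce(private val start: () -> Unit) {
    private val started = AtomicBoolean(false)

    /** Returns true only for the call that actually ran `start`. */
    fun onMicGranted(): Boolean {
        // compareAndSet makes the check-and-flip atomic, so two callers cannot both win.
        if (!started.compareAndSet(false, true)) return false
        start()
        return true
    }
}

fun main() {
    var starts = 0
    val gate = StartOnce { starts++ }   // in an app, the lambda would start the service and handler
    gate.onMicGranted()                 // first grant callback
    gate.onMicGranted()                 // onCreate running again after a rotation
    println("starts=$starts")
}
```

Holding the guard on the Application object rather than the Activity keeps it alive across Activity recreation.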

If you render the packaged ExtentosConnectionPage it handles this flow for you as part of pairing — but a voice-only app has no connection page, so the sequence above is yours to write.

The same bootstrap in Swift

iOS needs no foreground service — UIBackgroundModes: ["audio"] in your Info.plist covers background listening. Two things differ from Android and both matter:

usedCapabilities picks the transport. Leave it empty and .auto resolves to real glasses, so a voice app dies with NoEligibleDevice on a phone with none. Declare [.microphone, .speaker].
The microphone grant is yours. requestSpeechRecognitionAuthorization() covers speech recognition only, not audio capture.

MyApp.swift

import AVFAudio
import GlassesCore
import SwiftUI

@main
struct MyApp: App {
    let glasses: any ExtentosGlasses

    init() {
        glasses = Extentos.create(config: ExtentosConfig(
            usedCapabilities: [.microphone, .speaker]
        ))
        // `debug` is left at its false default — setting it true opens a pending
        // browser-simulator socket and waits for the MCP bridge to bind it.
    }

    var body: some Scene {
        WindowGroup { ContentView(glasses: glasses) }
    }
}

Starting the assistant, after both grants

func startAssistant(_ glasses: any ExtentosGlasses) async {
    // AVAudioApplication is iOS 17+; on an iOS 16 target use
    // AVAudioSession.sharedInstance().requestRecordPermission { … } instead.
    let mic = await AVAudioApplication.requestRecordPermission()
    let speech = await Extentos.requestSpeechRecognitionAuthorization()
    guard mic, speech else { return }   // surface your own explanation

    await RunAssistantHandler(glasses: glasses).start()
}

If your app also uses the camera

The tools, the session block and the wake phrase are unchanged. The bootstrap is not — one thing moves and two things are added, all consequences of the same fact: TransportChoice.Auto picks the transport inside create(...) and keeps it for the life of the handle. Whatever is true at that moment is what your app gets.

What moves: create(...) leaves Application.onCreate. In the voice bootstrap above it can run at app start, because nothing it depends on needs a grant. A camera app can't do that — Auto needs BLUETOOTH_CONNECT in hand to see the glasses, and on first launch no grant exists that early. So the handle becomes lateinit and is built from the permission callback instead. Keeping the eager create() is the mistake this section exists to prevent: it resolves to the audio baseline, capture returns errors forever, and nothing throws.

1. Grant BLUETOOTH_CONNECT before create(...), not after. The Meta transport claims the session by checking whether Meta glasses are bonded to the phone, and reading the bonded set needs BLUETOOTH_CONNECT on Android 12+. Without the grant the check can't answer, Auto moves down to the audio baseline, and capabilities.camera reports false for the rest of the process — voice works, capture doesn't. Restarting the app after the grant fixes it, but you don't want the first run to be the broken one.

2. Give create(...) an activityProvider. Meta's registration handoff needs a live Activity to present its consent UI, and the SDK requests BLUETOOTH_CONNECT and CAMERA at runtime through that same Activity. activityProvider defaults to null, and with no connection page nothing else supplies one.

MyApp.kt — the camera variant

class MyApp : Application() {
    lateinit var glasses: ExtentosGlasses
    private val topActivity = java.util.concurrent.atomic.AtomicReference<Activity?>()

    override fun onCreate() {
        super.onCreate()
        registerActivityLifecycleCallbacks(object : ActivityLifecycleCallbacks {
            override fun onActivityResumed(activity: Activity) { topActivity.set(activity) }
            override fun onActivityPaused(activity: Activity) {
                if (topActivity.get() === activity) topActivity.set(null)
            }
            // remaining callbacks: no-op
            override fun onActivityCreated(a: Activity, b: Bundle?) {}
            override fun onActivityStarted(a: Activity) {}
            override fun onActivityStopped(a: Activity) {}
            override fun onActivitySaveInstanceState(a: Activity, b: Bundle) {}
            override fun onActivityDestroyed(a: Activity) {}
        })
    }

    /** Called from the Activity once the permissions below are granted. */
    fun initGlasses() {
        if (::glasses.isInitialized) return
        glasses = ExtentosGlasses.create(
            ExtentosConfig(
                applicationContext = this,
                activityProvider = { topActivity.get() },
                environment = if (BuildConfig.DEBUG) ExtentosEnvironment.DEVELOPMENT
                              else ExtentosEnvironment.PRODUCTION,
            )
        )
    }
}

MainActivity.kt — the camera variant

class MainActivity : ComponentActivity() {

    private val permissions = registerForActivityResult(
        ActivityResultContracts.RequestMultiplePermissions(),
    ) { result ->
        // Create only after the answer, whatever it was — a denied BLUETOOTH_CONNECT
        // still resolves correctly (to the audio baseline), it just isn't a surprise.
        val app = application as MyApp
        app.initGlasses()
        Log.i("Extentos", "transport=${app.glasses.transportChosen} camera=${app.glasses.capabilities.camera}")
        if (result[Manifest.permission.RECORD_AUDIO] == true) onMicGranted()
    }

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        permissions.launch(
            arrayOf(
                Manifest.permission.RECORD_AUDIO,
                Manifest.permission.CAMERA,
                Manifest.permission.BLUETOOTH_CONNECT,
            )
        )
    }

    private fun onMicGranted() {
        GlassesForegroundService.start(this)
        // Your handler — step 2 defines it.
        RunAssistantHandler((application as MyApp).glasses, routeTracker, library).start()
    }
}

Then assert what you got, once, at startup — this is the check that turns a silent capability loss into a one-line log:

Log.i("Extentos", "transport=${glasses.transportChosen} camera=${glasses.capabilities.camera}")

A camera app expects REAL_META / true. SYSTEM_AUDIO / false means Auto fell through to the audio baseline — either the com.extentos:glasses-meta dependency is missing, the glasses aren't bonded, or BLUETOOTH_CONNECT wasn't granted before create(...).
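
The three causes listed above can be turned into a startup diagnostic that names which one applies. This is a sketch only: StartupFacts, Transport and diagnose are hypothetical names that mirror the fall-through described here, and your app would have to gather the three facts itself. It does not call the SDK.

```kotlin
// Hypothetical mirror of the fall-through described above; not SDK code.
enum class Transport { REAL_META, SYSTEM_AUDIO }

data class StartupFacts(
    val metaModulePresent: Boolean,        // com.extentos:glasses-meta is a dependency
    val glassesBonded: Boolean,            // Meta glasses are bonded to the phone
    val bluetoothConnectGranted: Boolean,  // granted before create(...) ran
)

/** Returns the transport you would expect Auto to pick, plus the reasons for any fallback. */
fun diagnose(facts: StartupFacts): Pair<Transport, List<String>> {
    val reasons = buildList {
        if (!facts.metaModulePresent) add("com.extentos:glasses-meta dependency is missing")
        if (!facts.glassesBonded) add("glasses are not bonded")
        if (!facts.bluetoothConnectGranted) add("BLUETOOTH_CONNECT was not granted before create(...)")
    }
    val transport = if (reasons.isEmpty()) Transport.REAL_META else Transport.SYSTEM_AUDIO
    return transport to reasons
}

fun main() {
    val (transport, reasons) = diagnose(
        StartupFacts(metaModulePresent = true, glassesBonded = true, bluetoothConnectGranted = false)
    )
    println("expected transport=$transport reasons=$reasons")
}
```

Logging the reasons next to transportChosen makes a mismatch between what you expected and what Auto picked easier to trace.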

With the base dependency there are no model files or model paths to manage — the assistant is available immediately and runs end-to-end over the gateway's WebSocket. (On-device models are an opt-in you add deliberately: two extra modules and a register() call. See local models.) See the managed gateway for the AI plumbing.

2. Start the session and register tools

The whole assistant is one block. Declare instructions, register tools, and pick a greeting. Leave the provider as AssistantProvider.Managed() to take your dashboard's model + voice (defaults gpt-realtime-2 / alloy).

import com.extentos.glasses.core.ExtentosGlasses
import com.extentos.glasses.core.RuntimeEvent
import com.extentos.glasses.core.VideoClip
import com.extentos.glasses.core.VideoConfig
import com.extentos.glasses.core.valueOrNull
import com.extentos.glasses.core.assistant.AssistantEvent
import com.extentos.glasses.core.assistant.AssistantError
import com.extentos.glasses.core.assistant.AssistantException
import com.extentos.glasses.core.assistant.AssistantProvider
import com.extentos.glasses.core.assistant.AssistantSession
import com.extentos.glasses.core.assistant.Greeting
import com.extentos.glasses.core.assistant.ToolResult
import com.extentos.glasses.core.assistant.tool
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.filterIsInstance
import kotlinx.coroutines.flow.launchIn
import kotlinx.coroutines.flow.onEach
import kotlin.time.Duration.Companion.seconds

class RunAssistantHandler(
    private val glasses: ExtentosGlasses,
    private val routeTracker: RouteTracker,   // your app state
    private val library: ClipLibrary,         // your app state
    private val scope: CoroutineScope = CoroutineScope(SupervisorJob() + Dispatchers.IO),
) {
    private var session: AssistantSession? = null
    private var activeVideo: Deferred<*>? = null

    fun start() {
        scope.launch {
            session = glasses.assistant.start(provider = AssistantProvider.Managed()) {
                instructions = """
                    You are a running companion. Help with route stats and capture
                    moments. Speak briefly — they're running. Don't narrate what
                    you're doing; just do it and confirm. When the user clearly
                    wants to stop talking, call end_conversation.
                """.trimIndent()

                // The SDK speaks this greeting automatically on every wake,
                // generated out-of-band from memory (never continues the prior chat).
                greeting = Greeting.Custom(
                    "Greet the runner warmly in one short sentence and ask how you can help."
                )
                // Deterministic backup sleep after 30s of user silence.
                sleepAfterSilence(30.seconds)

                // Read tools — instant data the model reads aloud.
                tool("get_route_remaining", "How much of the planned route is left, in km.") {
                    ToolResult.Ok("${routeTracker.kmRemaining()} km remaining")
                }
                tool("get_average_pace", "The runner's current average pace in minutes per km.") {
                    ToolResult.Ok("${routeTracker.avgPaceMinKm()} min per km")
                }

                // Action tools — side effects on your own state. The model
                // manages the take/stop pair from conversational context.
                tool("take_video", "Start recording a video clip of the runner's view.") {
                    if (activeVideo?.isActive == true) {
                        return@tool ToolResult.Err("a recording is already in progress")
                    }
                    activeVideo = scope.async {
                        glasses.camera.captureVideo(VideoConfig(maxDurationSeconds = 30))
                    }
                    ToolResult.Ok("recording started")
                }
                tool("stop_video", "Stop the current video recording.") {
                    val capture = activeVideo ?: return@tool ToolResult.Err("nothing was recording")
                    activeVideo = null
                    glasses.camera.stopVideo()                 // resumes the await naturally
                    @Suppress("UNCHECKED_CAST")
                    val clip = (capture as Deferred<com.extentos.glasses.core.ExtentosResult<VideoClip, *>>)
                        .await().valueOrNull() ?: return@tool ToolResult.Err("video capture failed")
                    library.add(clip)
                    ToolResult.Ok("video saved")
                }
            }

            // Wake mechanism — the same voice-trigger system as everywhere.
            // Defaults to VoiceScope.WhenDormant so it won't double-fire mid-chat.
            glasses.voice.onPhrase("hey coach") { session?.wake() }
        }
    }
}

assistant.start(...) throws — it's the exception to the "results never throw" rule. Capability calls return ExtentosResult, but an open-time failure raises AssistantException wrapping an AssistantError. In the bare scope.launch { … } above that would kill the coroutine and take the app with it, and the most common trigger is the ordinary one: a sideloaded build reaching the gateway without a baked project key gives you NoApiKey. Wrap it:

session = try {
    glasses.assistant.start(provider = AssistantProvider.Managed()) { /* … */ }
} catch (e: AssistantException) {
    Log.e("MyApp", "assistant didn't start: ${e.error}", e)
    return@launch          // surface it in your UI; don't let it propagate
}

AssistantException and AssistantError live in com.extentos.glasses.core.assistant. Variants and their meanings are in the error reference.

The same session in Swift

Same fields, same behaviour; four spellings differ. The config block is a closure taking the builder (so $0.), sleepAfterSilence takes a TimeInterval in seconds rather than a Duration, tool results are .ok / .err, and tool takes its description as a labelled argument (description:).

import GlassesCore

final class RunAssistantHandler {
    private let glasses: any ExtentosGlasses
    private var session: (any AssistantSession)?

    init(glasses: any ExtentosGlasses) { self.glasses = glasses }

    func start() async {
        do {
            session = try await glasses.assistant.start(
                provider: .managed()          // .managed(model: "local-auto") to run on-device
            ) {
                $0.instructions = """
                    You are a running companion. Help with route stats and capture
                    moments. Speak briefly — they're running.
                    """
                $0.greeting = .custom(directive: "Greet in one short sentence.")
                $0.sleepAfterSilence(30)      // seconds, not a Duration
                $0.endOnIntent = true

                $0.tool("log_split", description: "Record a lap split for the current run.") {
                    .ok("split recorded")
                }
            }
        } catch let error as AssistantError {
            // start() throws on iOS too. .noApiKey on a sideloaded build means the
            // gateway couldn't identify the caller — see the Info.plist project key.
            print("assistant didn't start: \(error)")
            return
        } catch {
            return
        }

        // No `firesWhen` on Swift — the wake phrase can fire mid-conversation,
        // so gate re-wakes yourself if that matters.
        _ = glasses.voice.onPhrase(phrase: "hey coach", label: "Wake", stops: []) { [weak self] in
            try? await self?.session?.wake()
        }
    }
}

A vision tool is the same shape; note includeImage labels both arguments and prompt has no default on Swift:

$0.tool("describe_scene", description: "Describe what the user is looking at.") { [weak self] in
    guard let self else { return .err("handler went away") }
    let result = await self.glasses.camera.capturePhoto()
    guard case .success(let photo) = result, let uri = photo.uri else {
        return .err("couldn't take a photo right now")
    }
    try? await self.glasses.assistant.activeSession?.includeImage(uri: uri, prompt: nil)
    return .ok("looking")
}

A few things the model does for you here, that you'd otherwise hand-wire: it decides which tool to call from each description, speaks a "let me check…" filler while a tool runs (suppress per-tool with blocking = true), and ends the conversation via the hidden end_conversation tool (endOnIntent defaults true) when the user wraps up — no rigid "goodbye" phrase required.

Clean stop pattern. To stop the recording, call glasses.camera.stopVideo() and await() the in-flight capture — it resumes with a partial clip. Don't cancel() the Deferred: Kotlin's cancelled state is sticky and await() would throw even when a partial exists.
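
The difference between the two approaches can be reproduced without any glasses. The sketch below uses kotlinx.coroutines with a hypothetical FakeRecorder standing in for the camera: capture() suspends until stop() is called and then returns a partial result, which is how the stop pattern above is described as behaving.

```kotlin
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.CompletableDeferred
import kotlinx.coroutines.async
import kotlinx.coroutines.runBlocking

/** Hypothetical stand-in for the camera: capture() suspends until stop() is called. */
class FakeRecorder {
    private val stopSignal = CompletableDeferred<Unit>()

    suspend fun capture(): String {
        stopSignal.await()
        return "partial clip"
    }

    fun stop() { stopSignal.complete(Unit) }
}

/** Clean stop: signal the recorder, then await the same Deferred. */
fun stopThenAwait(): String = runBlocking {
    val recorder = FakeRecorder()
    val capture = async { recorder.capture() }
    recorder.stop()
    capture.await()
}

/** Cancelling instead: the Deferred stays cancelled, so await() throws. */
fun cancelThenAwait(): String = runBlocking {
    val recorder = FakeRecorder()
    val capture = async { recorder.capture() }
    capture.cancel()
    try {
        capture.await()
    } catch (e: CancellationException) {
        "await threw: no clip"
    }
}

fun main() {
    println(stopThenAwait())     // prints "partial clip"
    println(cancelThenAwait())   // prints "await threw: no clip"
}
```

The second function never reaches a result even if the recorder had one ready, which is why the tool body above calls stopVideo() and awaits rather than cancelling.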

3. Add a vision tool

session.includeImage(uri) adds a photo to the conversation and auto-triggers a spoken response in the model's voice. Reach the running session from a tool body via glasses.assistant.activeSession (non-null while a tool dispatches). Keep "describe" and "save" as two tools so the model can disambiguate — and call both back-to-back when the user wants both.

import com.extentos.glasses.core.assistant.orToolError

tool(
    "describe_scene",
    "Describe what the runner is looking at without saving the photo. " +
        "Call for 'what do you see' / 'describe this' / 'tell me about this'.",
) {
    val photo = glasses.camera.capturePhoto().orToolError { return@tool it }
    val uri = photo.uri ?: return@tool ToolResult.Err("photo had no uri")
    glasses.assistant.activeSession?.includeImage(uri)
    ToolResult.Ok("looking")
}

orToolError { return@tool it } unwraps the capture, and on failure short-circuits the tool with a ToolResult.Err that carries the failure's actionable message — so the model relays a real reason instead of a generic "camera failed". If the wearer paused the camera (a temple tap), the model hears "tap the right temple of your glasses to resume the camera" and can tell the user exactly that. It's a thin helper over the typed CaptureError; pattern-match the ExtentosResult yourself if you want custom copy.
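
To make the two options concrete, here is a sketch of how a helper of this shape can work, next to the pattern-matching alternative for custom copy. Every type below is a hypothetical stand-in written for this example; the SDK's real ExtentosResult, CaptureError and ToolResult may be shaped differently.

```kotlin
// All types here are hypothetical stand-ins for illustration, not SDK types.
sealed interface CaptureError { val message: String }
data class CameraPaused(override val message: String) : CaptureError
data class CaptureFailed(override val message: String) : CaptureError

sealed interface CaptureResult<out T> {
    data class Success<out T>(val value: T) : CaptureResult<T>
    data class Failure(val error: CaptureError) : CaptureResult<Nothing>
}

sealed interface ToolResult {
    data class Ok(val text: String) : ToolResult
    data class Err(val text: String) : ToolResult
}

/** Unwraps a success; on failure hands an Err to `onError`, which must return from the caller. */
inline fun <T> CaptureResult<T>.orToolError(onError: (ToolResult.Err) -> Nothing): T =
    when (this) {
        is CaptureResult.Success -> this.value
        is CaptureResult.Failure -> onError(ToolResult.Err(this.error.message))
    }

/** Helper style: the error's own message reaches the model unchanged. */
fun describeScene(capture: CaptureResult<String>): ToolResult {
    val uri = capture.orToolError { return it }   // non-local return, allowed because the helper is inline
    return ToolResult.Ok("looking at $uri")
}

/** Pattern-match style: choose your own wording per error type. */
fun describeSceneCustom(capture: CaptureResult<String>): ToolResult = when (capture) {
    is CaptureResult.Success -> ToolResult.Ok("looking at ${capture.value}")
    is CaptureResult.Failure -> when (capture.error) {
        is CameraPaused -> ToolResult.Err("the camera is paused, ask the wearer to resume it")
        else -> ToolResult.Err(capture.error.message)
    }
}

fun main() {
    val paused = CaptureResult.Failure(CameraPaused("tap the right temple of your glasses to resume the camera"))
    println(describeScene(paused))
    println(describeSceneCustom(paused))
    println(describeScene(CaptureResult.Success("file:///photo.jpg")))
}
```

The (ToolResult.Err) -> Nothing parameter type is what forces the lambda to leave the tool body, so a failed capture can never fall through to the success path.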

Camera tools need the simulator browser tab attached when you test them — see step 5.

4. Capture transcripts (optional)

If you want the user's words for notes, captions, or a journal, subscribe to the shared runtime-event stream. Assistant events carry verbatim transcripts (yours to govern; document retention in your privacy policy).

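// `notes` is your own app state, e.g. a journal buffer.
glasses.runtime.events
    .filterIsInstance<RuntimeEvent.Assistant>()
    .onEach { evt ->
        // Keep only the user's side of the conversation.
        (evt.event as? AssistantEvent.UserSpoke)?.let { notes.append(it.transcript) }
    }
    .launchIn(scope)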

5. Test it in the simulator

The test loop can be driven entirely by an agent, with no human and no hardware. Open the conversation through the real wake path, then inject a user utterance, then assert the tool fired:

// Wake exactly like a real user saying the phrase:
await injectTranscript({ sessionId, text: "hey coach" });
// ... wait for assistant.session_started in getEventLog(filter: "voice"), then:
const inj = await injectAssistantUtterance({ sessionId, text: "how much further?" });
await assertToolCalled({ sessionId, name: "get_route_remaining", sinceCursor: inj.watchCursor });

injectAssistantUtterance drives both the Mock provider (deterministic, $0) and the real OpenAi provider with the same call — pick via a build flavor, never useMock in handler code.
The session must be Active when you inject — wake it first and watch for assistant.session_started.
Always thread inj.watchCursor into assertToolCalled({ sinceCursor }) — the model fires the tool 0.5–2 s later, and anchoring the wait before the inject avoids a false "tool not called."
Camera/vision tools (describe_scene) need the browser tab attached — call ensureSimulatorBrowser({ sessionId }) first, and budget a longer timeoutMs (camera + upload + reasoning).
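
Why the cursor matters is easier to see with a toy log. The sketch below is a hypothetical in-memory model, not the simulator's API: a cursor is the index of the next event, and an assertion only looks at events from its cursor onward. It illustrates one way a late anchor can produce a false "tool not called".

```kotlin
/** Hypothetical in-memory model of an append-only event log; not the simulator's API. */
class EventLog {
    private val events = mutableListOf<String>()

    /** A cursor is the index of the next event to be appended. */
    fun cursor(): Int = events.size

    fun append(event: String) { events += event }

    /** True if `name` was logged at or after the position `since`. */
    fun calledSince(name: String, since: Int): Boolean = events.drop(since).contains(name)
}

fun main() {
    val log = EventLog()
    val atInject = log.cursor()                     // plays the role of inj.watchCursor
    log.append("user: how much further?")
    log.append("tool_call:get_route_remaining")     // the model fires the tool a moment later
    val tooLate = log.cursor()                      // a cursor taken after the tool already fired

    println(log.calledSince("tool_call:get_route_remaining", atInject))  // prints "true"
    println(log.calledSince("tool_call:get_route_remaining", tooLate))   // prints "false"
}
```

The tool call is in the log both times; only the anchor differs.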

Lifecycle and transcripts land on the simulator's voice event-log chip (an assistant error climbs to errors). The ai chip is only for customer-side BYOK calls wrapped in glasses.observability.aiCall(...) — don't watch it for assistant events. See the MCP tools reference for the full agent-test surface.

Sim fidelity. The simulator runs the same library code as production with only the transport and audio path swapped, so it's designed to behave identically to hardware. That fidelity is under active validation on Android right now, so treat the sim as your iteration loop rather than a guarantee, and a real-hardware run as the final gate.

Where to go next

Render on the glasses screen. On a Ray-Ban Display, a tool can show results visually instead of (or alongside) speaking — see render on the display; gate on glasses.display.isAvailable.
Tune the model. AssistantProvider.Managed(reasoningEffort = …), mid-session setReasoningEffort / updateInstructions, and historyCompaction for long conversations — all in the assistant runtime.
Persist memory across sessions. persistentMemory = true (opt-in, consent-gated) — see memory.
