Voice triggers

Wire a voice command on the glasses to an action in your app. Works on Meta Ray-Ban via the phone's speech recognizer over Bluetooth. Phrases auto-surface on the connection page and the simulator's click-to-fire panel.

Camera and display need the Meta vendor module. com.extentos:glasses carries no vendor SDK, so add implementation("com.extentos:glasses-meta") alongside it — see install. Without it the build still succeeds and voice still works, but capabilities.camera is false and captures return errors. The SDK logs a warning at startup when it spots that combination.

A voice trigger on Extentos is just a Kotlin lambda or a Swift closure that runs when the user says a phrase. There is no spec file, no trigger / block tree, no special registration step. Your handler class calls glasses.voice.onPhrase("X") { ... } and the library does the rest: subscribes to the phone's speech recognizer, matches the phrase against incoming transcripts, runs your handler, and surfaces the phrase on the host-app connection page and the simulator's right-rail click-to-fire panel.

This page covers the canonical pattern, the stops cancellation primitive, the three usage tiers (when to drop down to raw transcripts), and the per-vendor caveats.

The canonical pattern

class VisionHandler(private val glasses: ExtentosGlasses) {
    private var registration: VoiceRegistration? = null

    fun start() {
        registration = glasses.voice.onPhrase(
            phrase = "describe what you see",
            label = "Describe scene",
        ) {
            // Your handler — runs when the user says the phrase.
            val photo = glasses.camera.capturePhoto().valueOrNull() ?: return@onPhrase
            // ...vision LLM + speak the description...
        }
    }

    fun stop() { registration?.cancel() }
}

final class VisionHandler: @unchecked Sendable {
    private let glasses: any ExtentosGlasses
    private var registration: VoiceRegistration?

    init(glasses: any ExtentosGlasses) { self.glasses = glasses }

    func start() {
        registration = glasses.voice.onPhrase(
            phrase: "describe what you see",
            label: "Describe scene",
            stops: []
        ) { [glasses] in
            let result = await glasses.camera.capturePhoto()
            guard case .success(let photo) = result else { return }
            // ...vision LLM + speak the description...
        }
    }

    func stop() { registration?.cancel() }
}

That's everything. No register table, no callback ID, no spec edit. The library matches the phrase against glasses.audio.transcriptions() under the hood — case-insensitive substring on FINAL transcripts. The handler runs under the library's coroutine scope (Android) / Task (iOS). Returning from the handler closes the trigger run; the next utterance of the phrase fires it again.
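The matching rule described above can be modelled in a few lines of plain Kotlin. This is only a sketch of the stated behaviour (case-insensitive substring, final transcripts only), not the library's own code. The `Transcript` type here is a local stand-in: `Final` is the name used elsewhere on this page, while `Partial` and the `firesOn` helper are assumptions made for illustration.

```kotlin
// Local stand-in for the SDK's transcript type. Only `Final` is named on this
// page; `Partial` is an assumed name for non-final results.
sealed interface Transcript {
    val text: String
    data class Partial(override val text: String) : Transcript
    data class Final(override val text: String) : Transcript
}

// Hypothetical helper: true when a registered phrase would fire on this transcript
// under the rule "case-insensitive substring on final transcripts".
fun firesOn(phrase: String, transcript: Transcript): Boolean =
    transcript is Transcript.Final &&
        transcript.text.contains(phrase, ignoreCase = true)
```

Under this model a final transcript such as "Hey, DESCRIBE what you see please" fires the phrase, while the same words arriving as a partial result do not.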

Scope note (Android). On Android, onPhrase takes a firesWhen parameter (type VoiceScope) that defaults to VoiceScope.WhenDormant. A registered phrase therefore fires only while the assistant is dormant and not during an active assistant session, so your wake phrases don't race with the model's tool dispatch mid-conversation. Pass firesWhen = VoiceScope.Always for a utility command (e.g. a "cancel everything" panic phrase) that must fire regardless of assistant state. The iOS onPhrase has no such parameter: iOS ships the assistant runtime but hasn't wired VoiceScope gating into its voice client yet.
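As a mental model, the gating rule in the scope note reduces to a two-case decision. The sketch below redeclares `VoiceScope` locally so it compiles on its own; the real type lives in the SDK, and `shouldFire` is a hypothetical helper, not a library function.

```kotlin
// Local mirror of the two scope values named on this page.
enum class VoiceScope { WhenDormant, Always }

// Hypothetical model of the gate: should a phrase registered with `firesWhen`
// run, given whether an assistant session is currently active?
fun shouldFire(firesWhen: VoiceScope, assistantActive: Boolean): Boolean =
    when (firesWhen) {
        VoiceScope.Always -> true                  // utility commands fire regardless
        VoiceScope.WhenDormant -> !assistantActive // default: only while dormant
    }
```

A panic phrase registered with firesWhen = VoiceScope.Always corresponds to the first branch; every phrase registered without the parameter takes the second.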

Stop conditions

A second voice command that cancels the first one mid-flow:

glasses.voice.onPhrase(
    phrase = "play cat video",
    label = "Play cat video",
    stops = listOf("stop the video"),
) {
    catPlayer.play()   // suspending — cancelled when user says "stop the video"
}

registration = glasses.voice.onPhrase(
    phrase: "play cat video",
    label: "Play cat video",
    stops: ["stop the video"]
) { [catPlayer] in
    await catPlayer.play()   // cancelled when user says "stop the video"
}

stops is a list of phrases that, while the handler is running, will cancel the handler's coroutine / Task. The cancellation is plain Kotlin structured concurrency / Swift Task.cancel() — your try/finally or defer blocks run normally. For cleanup that itself suspends (releasing a MediaPlayer, draining a queue, sending a final speak), wrap it in withContext(NonCancellable) { ... } (Kotlin) or use a detached cleanup Task (Swift).
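The cleanup advice above is easiest to see in isolation. The following sketch assumes kotlinx-coroutines-core is on the classpath and uses a hypothetical `FakePlayer` in place of a real media player. Cancelling the job stands in for the user saying a stop phrase; nothing here calls the Extentos SDK.

```kotlin
import kotlinx.coroutines.NonCancellable
import kotlinx.coroutines.cancelAndJoin
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

// Hypothetical stand-in for a player; records what happened to it.
class FakePlayer {
    val events = mutableListOf<String>()

    suspend fun play() {
        events += "playing"
        delay(60_000) // long-running work; suspends until cancelled
    }

    suspend fun release() {
        delay(10) // cleanup that itself suspends
        events += "released"
    }
}

// The shape of a handler body: try/finally with suspending cleanup.
suspend fun playWithCleanup(player: FakePlayer) {
    try {
        player.play()
    } finally {
        // In an already-cancelled coroutine, a bare delay() would throw at once
        // and "released" would never be recorded. NonCancellable lets it finish.
        withContext(NonCancellable) { player.release() }
    }
}

// Starts the handler, then cancels it the way a stop phrase would.
fun runAndCancel(): List<String> = runBlocking {
    val player = FakePlayer()
    val job = launch { playWithCleanup(player) }
    delay(50)           // let the handler start
    job.cancelAndJoin() // stands in for the user saying the stop phrase
    player.events
}
```

Keep the NonCancellable block short: it cannot be interrupted, so anything slow inside it delays the end of the trigger run.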

stops is also a UI affordance: the simulator renders the stop phrases as nested STOP rows under the parent VOICE card, gated by is_active so you can only click them while the parent is running. The host-app connection page renders them as indented italic rows under the parent ("Say to me" section).

The pattern works for any process — playing a video, muting the speaker, starting a recording, running an LLM call. The library doesn't know what the action is; the stops list is just metadata + a cancellation signal.

Three usage tiers

Tier	Customer code	Auto-display	Auto-cancel on stop
1 (default)	`glasses.voice.onPhrase(phrase, label, stops) { ... }`	✅	✅
2 (custom match, visible)	raw `transcriptions()` + `glasses.voice.registerHint(phrase, label, stops)`	✅	manual
3 (raw / hidden)	raw `transcriptions().collect { ... }` only	❌ (by choice)	manual

Drop to tier 2 when the substring match isn't expressive enough — you need regex, state machines, or per-utterance guards. Drop to tier 3 only when you actively don't want the phrase visible on the connection page or in the simulator (rare).

// Tier 2 — raw matching with UI affordance.
//
// registerHint returns a VoiceRegistration, which exposes ONLY cancel() —
// it carries no id. The hint's id lives on the VoiceHint published by the
// `glasses.voice.hints` StateFlow; read it from there to call reportFired().
val registration = glasses.voice.registerHint(
    phrase = "translate this",
    label = "Translate",
)
val hintId = glasses.voice.hints.value
    .firstOrNull { it.phrase == "translate this" }
    ?.id

glasses.audio.transcriptions().collect { t ->
    if (t !is Transcript.Final) return@collect
    if (myRegex.matches(t.text)) {
        hintId?.let { glasses.voice.reportFired(it) }  // keep the sim's "fired N times" honest
        handleMatch(t.text)
    }
}

// ...later, to tear the hint down:
registration.cancel()
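Because tier 2 cancellation is manual, the collector has to track the running job itself. The sketch below shows one way to do that, assuming kotlinx-coroutines-core 1.6 or later. The phrases, the `dispatch` and `demo` functions, and the string log are all hypothetical, and a scripted `Flow<String>` replaces glasses.audio.transcriptions() so the example runs without the SDK.

```kotlin
import kotlinx.coroutines.Job
import kotlinx.coroutines.cancelAndJoin
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

// Collects final-transcript text and starts/stops the work by hand.
suspend fun dispatch(finalTexts: Flow<String>, log: MutableList<String>) {
    coroutineScope {
        var running: Job? = null
        finalTexts.collect { text ->
            if (text.contains("stop translating", ignoreCase = true)) {
                // Manual stop: only meaningful while work is in flight.
                running?.let {
                    it.cancelAndJoin()
                    log += "stopped"
                }
                running = null
            } else if (text.contains("translate this", ignoreCase = true) &&
                running?.isActive != true
            ) {
                running = launch {
                    log += "started"
                    delay(60_000) // stands in for long-running work
                    log += "finished"
                }
            }
        }
        running?.cancelAndJoin() // transcripts ended: leave no work behind
    }
}

// Scripted run: start phrase, short pause, stop phrase.
fun demo(): List<String> = runBlocking {
    val log = mutableListOf<String>()
    val scripted = flow {
        emit("please translate this")
        delay(50) // give the launched job time to start
        emit("stop translating")
    }
    dispatch(scripted, log)
    log
}
```

The `running?.isActive != true` guard ignores a repeated start phrase while work is in flight. Whether a repeat should restart the work instead is a design choice for your app.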

Per-vendor caveats (Meta Ray-Ban)

Hey Meta is not exposed. Meta's system wake word goes to Meta AI; third-party apps cannot intercept it. Your wake phrase is whatever string you match against transcriptions() (directly or via onPhrase).
The phone is the recognizer. Audio captures on the glasses, streams over Bluetooth HFP/SCO to the phone, and the phone runs the recognizer — on-device Vosk on Android hardware, Apple's SFSpeechRecognizer on iOS hardware, and the platform recognizer (SpeechRecognizer/SFSpeechRecognizer) on the dev/sim path. No STT on the glasses themselves.
listening_mode toggle gates STT. When the user flips Voice Activation off on the connection page, transcriptions() stops emitting and onPhrase handlers never fire. This is intentional — it's the user's hard kill-switch.

Testing in the simulator

Open extentos.com/s/[sessionId] after createSimulatorSession. The right rail shows one card per registered phrase:

VOICE pill + the phrase + the registration's stable id + fired N times · Xs ago.
Nested STOP rows under each parent that declares stops, disabled until the parent's is_active is true.
Click the card to inject the phrase as a synthetic stt_transcript — the dispatch path is identical to a real utterance, so any matcher you wrote runs the same way in sim and on real glasses.

The dispatch path is shared, but the recognizer is not. On hardware, transcripts come from the phone-side recognizer (on-device Vosk on Android, SFSpeechRecognizer on iOS) fed over Bluetooth HFP, while the simulator uses the browser's recognizer. The simulator is designed to behave like the glasses here, but that fidelity is still under active validation on hardware. Treat a green sim run as strong evidence, then confirm phrase recognition on real glasses; see Wake phrase not matching for the variance that only shows up on hardware.

Common gotchas

Substring matching is forgiving. onPhrase("stop") matches "I'm going to stop now" and "let me think, stop right there." Phrase carefully or use tier-2 raw matching with a tighter regex.
Overlapping phrases each fire. onPhrase("start") and onPhrase("start recording") both fire on "start recording now" — each registration is independent. Order your phrases from most-specific to least, and have your handlers guard for the overlap if it matters.
Stops only apply during handler execution. Saying "stop the video" before any "play cat video" is dead text — no parent handler running means no listener attached. The simulator UI gates the STOP rows by parent.is_active for exactly this reason.
registerHint does NOT auto-cancel. Stops cancellation is exclusive to onPhrase. If you're writing a tier-2 handler, write the cancellation yourself.
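Two of the gotchas above have small, mechanical workarounds, sketched here in plain Kotlin. Both helpers are hypothetical and belong in your own tier-2 matcher or handler guard, not in the library: `isBareStop` accepts "stop" only when it is the whole utterance, and `mostSpecific` picks the longest registered phrase contained in an utterance.

```kotlin
// Whole-utterance pattern: "stop", "Stop.", " stop! " pass; longer sentences do not.
val stopOnly = Regex("""\s*stop\s*[.!?]?\s*""", RegexOption.IGNORE_CASE)

// Regex.matches() requires the entire input to match, unlike a substring test.
fun isBareStop(utterance: String): Boolean = stopOnly.matches(utterance)

// Overlap guard: of all phrases contained in the utterance (case-insensitive),
// return the longest one, or null when none match.
fun mostSpecific(phrases: List<String>, utterance: String): String? =
    phrases
        .filter { utterance.contains(it, ignoreCase = true) }
        .maxByOrNull { it.length }
```

A handler for "start" could call mostSpecific and return early when the answer is "start recording", leaving that utterance to the more specific registration.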

Related

getCapabilityGuide(feature: "voice_command") — minimal Kotlin + Swift snippets + the full gotcha list.
getCodeExample(pattern: "voice_qa_assistant") — multi-turn wake → speak → record → LLM → speak loop, the canonical voice-glasses flow.
getCodeExample(pattern: "barge_in_speak") — cancel TTS the moment the user starts talking.
Android API reference — the full glasses.voice sub-client surface (onPhrase, registerHint, reportFired, the hints/stats StateFlows, VoiceScope).
Assistant runtime — when you want the model to own wake / turn-taking instead of matching phrases yourself; onPhrase { session.wake() } is the canonical assistant wake.
Wake phrase not matching — the toggle/casing/recognizer checklist when a phrase never fires.
