Voice triggers

Wire a voice command on the glasses to an action in your app. Works on Meta Ray-Ban via the phone's speech recognizer over Bluetooth. Phrases auto-surface on the connection page and the simulator's click-to-fire panel.

A voice trigger on Extentos is just a Kotlin lambda or a Swift closure that runs when the user says a phrase. There is no spec file, no trigger / block tree, no special registration step. Your handler class calls glasses.voice.onPhrase("X") { ... } and the library does the rest: subscribes to the phone's speech recognizer, matches the phrase against incoming transcripts, runs your handler, and surfaces the phrase on the host-app connection page and the simulator's right-rail click-to-fire panel.

This page covers the canonical pattern, the stops cancellation primitive, the three usage tiers (when to drop down to raw transcripts), and the per-vendor caveats.

The canonical pattern

// Kotlin
class VisionHandler(private val glasses: ExtentosGlasses) {
    private var registration: VoiceRegistration? = null

    fun start() {
        registration = glasses.voice.onPhrase(
            phrase = "describe what you see",
            label = "Describe scene",
        ) {
            // Your handler — runs when the user says the phrase.
            val photo = glasses.camera.capturePhoto().valueOrNull() ?: return@onPhrase
            // ...vision LLM + speak the description...
        }
    }

    fun stop() { registration?.cancel() }
}

// Swift
final class VisionHandler: @unchecked Sendable {
    private let glasses: any ExtentosGlasses
    private var registration: VoiceRegistration?

    init(glasses: any ExtentosGlasses) { self.glasses = glasses }

    func start() {
        registration = glasses.voice.onPhrase(
            phrase: "describe what you see",
            label: "Describe scene"
        ) { [glasses] in
            let result = await glasses.camera.capturePhoto()
            guard case .success(let photo) = result else { return }
            // ...vision LLM + speak the description...
        }
    }

    func stop() { registration?.cancel() }
}

That's everything. No registration table, no callback ID, no spec edit. Under the hood, the library matches the phrase against glasses.audio.transcriptions() using a case-insensitive substring match on FINAL transcripts. The handler runs in the library's coroutine scope (Android) or Task (iOS). Returning from the handler ends the trigger run; the next utterance of the phrase fires it again.
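
The matching rule described above can be modeled in a few lines of plain Kotlin. This is a rough sketch under stated assumptions, not the library's source: the Transcript type here is a hypothetical stand-in (only Transcript.Final and its text property appear elsewhere on this page; Partial is assumed), and phraseMatches is a helper name introduced for illustration.

```kotlin
// Hypothetical stand-in for the library's transcript type (illustration only).
sealed class Transcript {
    abstract val text: String
    data class Partial(override val text: String) : Transcript()
    data class Final(override val text: String) : Transcript()
}

// Case-insensitive substring match, applied to FINAL transcripts only.
fun phraseMatches(phrase: String, transcript: Transcript): Boolean =
    transcript is Transcript.Final &&
        transcript.text.contains(phrase, ignoreCase = true)
```

Under this model, phraseMatches("describe what you see", Transcript.Final("Hey, DESCRIBE what you see please")) is true, while the same text delivered as a Partial transcript does not match.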

Stop conditions

To let a second voice command cancel the first one mid-flow, declare it in stops:

// Kotlin
glasses.voice.onPhrase(
    phrase = "play cat video",
    label = "Play cat video",
    stops = listOf("stop the video"),
) {
    catPlayer.play()   // suspending — cancelled when user says "stop the video"
}

// Swift
registration = glasses.voice.onPhrase(
    phrase: "play cat video",
    label: "Play cat video",
    stops: ["stop the video"]
) { [catPlayer] in
    await catPlayer.play()   // cancelled when user says "stop the video"
}

stops is a list of phrases that, while the handler is running, will cancel the handler's coroutine / Task. The cancellation is plain Kotlin structured concurrency / Swift Task.cancel() — your try/finally or defer blocks run normally. For cleanup that itself suspends (releasing a MediaPlayer, draining a queue, sending a final speak), wrap it in withContext(NonCancellable) { ... } (Kotlin) or use a detached cleanup Task (Swift).
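
The Kotlin side of the cleanup advice can be tried without the SDK. The sketch below uses only kotlinx.coroutines; playWithCleanup and runDemo are hypothetical names, and job.cancelAndJoin() stands in for the stop phrase arriving. It is meant to show why suspending cleanup needs NonCancellable, not how the library cancels handlers internally.

```kotlin
import kotlinx.coroutines.*

// Stand-in for a handler body such as catPlayer.play(); not an Extentos API.
suspend fun playWithCleanup(events: MutableList<String>) {
    try {
        events += "playing"
        awaitCancellation()            // suspends until the coroutine is cancelled
    } finally {
        // In a cancelled coroutine, a suspending call made here would throw
        // CancellationException right away unless it runs under NonCancellable.
        withContext(NonCancellable) {
            delay(10)                  // e.g. releasing a player or a final speak
            events += "cleaned up"
        }
    }
}

fun runDemo(): List<String> = runBlocking {
    val events = mutableListOf<String>()
    val job = launch { playWithCleanup(events) }
    delay(50)                          // give the handler time to start
    job.cancelAndJoin()                // plays the role of the stop phrase
    events
}
```

runDemo() returns ["playing", "cleaned up"]. Without the withContext(NonCancellable) wrapper, the delay inside finally would throw and the second entry would never be recorded.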

stops is also a UI affordance: the simulator renders the stop phrases as nested STOP rows under the parent VOICE card, gated by is_active so you can only click them while the parent is running. The host-app connection page renders them as indented italic rows under the parent ("Say to me" section).

The pattern works for any process — playing a video, muting the speaker, starting a recording, running an LLM call. The library doesn't know what the action is; the stops list is just metadata + a cancellation signal.

Three usage tiers

| Tier | Customer code | Auto-display | Auto-cancel on stop |
|---|---|---|---|
| 1 (default) | glasses.voice.onPhrase(phrase, label, stops) { ... } | ✅ | ✅ |
| 2 (custom match, visible) | raw transcriptions() + glasses.voice.registerHint(phrase, label, stops) | ✅ | manual |
| 3 (raw / hidden) | raw transcriptions().collect { ... } only | ❌ (by choice) | manual |

Drop to tier 2 when the substring match isn't expressive enough — you need regex, state machines, or per-utterance guards. Drop to tier 3 only when you actively don't want the phrase visible on the connection page or in the simulator (rare).

// Tier 2 — raw matching with UI affordance:
val hint = glasses.voice.registerHint(
    phrase = "translate this",
    label = "Translate",
)
glasses.audio.transcriptions().collect { t ->
    if (t !is Transcript.Final) return@collect
    if (myRegex.matches(t.text)) {
        glasses.voice.reportFired(hint.id)  // keep stats honest
        handleMatch(t.text)
    }
}
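
For the myRegex placeholder in the snippet above, one option is a pattern that must match the whole utterance, so incidental mentions of the phrase are ignored. A minimal sketch, assuming the command is "translate this" with an optional "to <language>" tail; the pattern and the targetLanguage helper are illustrative and not part of the SDK.

```kotlin
// Whole-utterance pattern: "translate this", optionally followed by
// "to <language>" or "into <language>", then optional trailing punctuation.
val translateCommand = Regex(
    """^\s*translate this(?:\s+(?:in)?to\s+(\w+))?[\s.!?]*$""",
    RegexOption.IGNORE_CASE,
)

// Returns the requested language in lowercase, "" if none was given,
// or null if the utterance is not the command at all.
fun targetLanguage(utterance: String): String? {
    val match = translateCommand.matchEntire(utterance) ?: return null
    return match.groupValues[1].lowercase()
}
```

With this pattern, "Translate this to French." yields "french", a bare "translate this" yields an empty string, and "I will not translate this" yields null, where a plain substring match would have fired.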

Per-vendor caveats (Meta Ray-Ban)

  • Hey Meta is not exposed. Meta's system wake word goes to Meta AI; third-party apps cannot intercept it. Your wake phrase is whatever string you match against transcriptions() (directly or via onPhrase).
  • The phone is the recognizer. Audio is captured on the glasses and streamed over Bluetooth HFP/SCO to the phone, where the phone's SpeechRecognizer (Android) / SFSpeechRecognizer (iOS) emits transcripts. There is no on-device STT on the glasses themselves.
  • listening_mode toggle gates STT. When the user flips Voice Activation off on the connection page, transcriptions() stops emitting and onPhrase handlers never fire. This is intentional — it's the user's hard kill-switch.

Testing in the simulator

Open extentos.com/s/[sessionId] after createSimulatorSession. The right rail shows one card per registered phrase:

  • VOICE pill + the phrase + the registration's stable id + fired N times · Xs ago.
  • Nested STOP rows under each parent with declared stops, disabled until the parent's is_active is true.
  • Click the card to inject the phrase as a synthetic stt_transcript — the dispatch path is identical to a real utterance, so any matcher you wrote runs the same way in sim and on real glasses.

The simulator parity guarantee is strict: clicking a card in the browser fires the same code path a real spoken phrase would. If your handler works in sim, it works on the glasses.

Common gotchas

  • Substring matching is forgiving. onPhrase("stop") matches "I'm going to stop now" and "let me think, stop right there." Phrase carefully or use tier-2 raw matching with a tighter regex.
  • Overlapping phrases each fire. onPhrase("start") and onPhrase("start recording") both fire on "start recording now" — each registration is independent. Order your phrases from most-specific to least, and have your handlers guard for the overlap if it matters.
  • Stops only apply during handler execution. Saying "stop the video" before any "play cat video" is dead text — no parent handler running means no listener attached. The simulator UI gates the STOP rows by parent.is_active for exactly this reason.
  • registerHint does NOT auto-cancel. Stops cancellation is exclusive to onPhrase. If you're writing a tier-2 handler, write the cancellation yourself.
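
One way to guard against the overlap described above is to let only the most specific (longest) matching phrase act. The sketch below is plain Kotlin with hypothetical helper names. It assumes you have the final transcript text in hand, as in a tier-2 collector; whether an onPhrase handler can see the utterance is not covered on this page.

```kotlin
// Registered phrases contained in the utterance (case-insensitive), longest first.
fun matchingPhrases(registered: List<String>, utterance: String): List<String> =
    registered
        .filter { utterance.contains(it, ignoreCase = true) }
        .sortedByDescending { it.length }

// True only when `phrase` is the longest registered phrase found in the utterance.
fun isMostSpecificMatch(
    phrase: String,
    registered: List<String>,
    utterance: String,
): Boolean = matchingPhrases(registered, utterance).firstOrNull() == phrase
```

For the phrases "start" and "start recording" and the utterance "start recording now", isMostSpecificMatch is true for "start recording" and false for "start", so the shorter handler can return early.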

See also

  • getCapabilityGuide(feature: "voice_command"): minimal Kotlin + Swift snippets plus the full gotcha list.
  • getCodeExample(pattern: "voice_qa_assistant"): multi-turn wake → speak → record → LLM → speak loop, the canonical voice-glasses flow.
  • getCodeExample(pattern: "barge_in_speak"): cancel TTS the moment the user starts talking.
  • SDK reference / library_api: the full glasses.voice sub-client surface.