Voice triggers
Wire a voice command on the glasses to an action in your app. Works on Meta Ray-Ban via the phone's speech recognizer over Bluetooth. Phrases auto-surface on the connection page and the simulator's click-to-fire panel.
A voice trigger on Extentos is just a Kotlin lambda or a Swift closure that runs when the user says a phrase. There is no spec file, no trigger / block tree, no special registration step. Your handler class calls glasses.voice.onPhrase("X") { ... } and the library does the rest: subscribes to the phone's speech recognizer, matches the phrase against incoming transcripts, runs your handler, and surfaces the phrase on the host-app connection page and the simulator's right-rail click-to-fire panel.
This page covers the canonical pattern, the stops cancellation primitive, the three usage tiers (when to drop down to raw transcripts), and the per-vendor caveats.
The canonical pattern
Kotlin:

class VisionHandler(private val glasses: ExtentosGlasses) {
    private var registration: VoiceRegistration? = null

    fun start() {
        registration = glasses.voice.onPhrase(
            phrase = "describe what you see",
            label = "Describe scene",
        ) {
            // Your handler: runs when the user says the phrase.
            val photo = glasses.camera.capturePhoto().valueOrNull() ?: return@onPhrase
            // ...vision LLM + speak the description...
        }
    }

    fun stop() { registration?.cancel() }
}

Swift:

final class VisionHandler: @unchecked Sendable {
    private let glasses: any ExtentosGlasses
    private var registration: VoiceRegistration?

    init(glasses: any ExtentosGlasses) { self.glasses = glasses }

    func start() {
        registration = glasses.voice.onPhrase(
            phrase: "describe what you see",
            label: "Describe scene"
        ) { [glasses] in
            let result = await glasses.camera.capturePhoto()
            guard case .success(let photo) = result else { return }
            // ...vision LLM + speak the description...
        }
    }

    func stop() { registration?.cancel() }
}

That's everything. No register table, no callback ID, no spec edit. Under the hood, the library matches the phrase against glasses.audio.transcriptions() using a case-insensitive substring match on FINAL transcripts. The handler runs under the library's coroutine scope (Android) / Task (iOS). Returning from the handler closes the trigger run; the next utterance of the phrase fires it again.
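To make the matching rule concrete, here is a minimal sketch of a case-insensitive substring match restricted to final transcripts. It illustrates the rule described above and is not the library's implementation: the Transcript type is a hypothetical stand-in (only Transcript.Final and its text property appear on this page; Partial is an assumption), and phraseMatches is a made-up helper name.

```kotlin
// Hypothetical stand-in for the library's transcript type. Only Final and
// its `text` property are taken from this page; Partial is assumed.
sealed interface Transcript {
    val text: String
    data class Partial(override val text: String) : Transcript
    data class Final(override val text: String) : Transcript
}

// Case-insensitive substring match, applied to final transcripts only.
fun phraseMatches(phrase: String, transcript: Transcript): Boolean =
    transcript is Transcript.Final &&
        transcript.text.contains(phrase, ignoreCase = true)
```

Under these assumptions, phraseMatches("describe what you see", Transcript.Final("Please DESCRIBE what you see")) is true, while the same text wrapped in Transcript.Partial is false.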
Stop conditions
A second voice command that cancels the first one mid-flow:
Kotlin:

glasses.voice.onPhrase(
    phrase = "play cat video",
    label = "Play cat video",
    stops = listOf("stop the video"),
) {
    catPlayer.play() // suspending; cancelled when user says "stop the video"
}

Swift:

registration = glasses.voice.onPhrase(
    phrase: "play cat video",
    label: "Play cat video",
    stops: ["stop the video"]
) { [catPlayer] in
    await catPlayer.play() // cancelled when user says "stop the video"
}

stops is a list of phrases that, while the handler is running, cancel the handler's coroutine / Task. The cancellation is plain Kotlin structured concurrency / Swift Task.cancel(), so your try/finally or defer blocks run normally. For cleanup that itself suspends (releasing a MediaPlayer, draining a queue, sending a final speak), wrap it in withContext(NonCancellable) { ... } (Kotlin) or use a detached cleanup Task (Swift).
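The suspending-cleanup advice can be sketched with plain kotlinx.coroutines, assuming that library is on the classpath. FakePlayer and runCancelledPlayback are hypothetical names, and cancelAndJoin() stands in for the cancellation a stop phrase would trigger; how the library cancels the handler internally is not shown on this page.

```kotlin
import kotlinx.coroutines.NonCancellable
import kotlinx.coroutines.cancelAndJoin
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

// Hypothetical player whose release() is itself a suspending call.
class FakePlayer {
    val log = mutableListOf<String>()

    suspend fun play() {
        log += "playing"
        delay(60_000) // long-running, cancellable suspension point
    }

    suspend fun release() {
        delay(10) // suspending cleanup work
        log += "released"
    }
}

// Starts playback, cancels it mid-flight, and returns what the player logged.
fun runCancelledPlayback(): List<String> = runBlocking {
    val player = FakePlayer()
    val handler = launch {
        try {
            player.play()
        } finally {
            // Without NonCancellable, the delay() inside release() would
            // throw immediately because this coroutine is already cancelled.
            withContext(NonCancellable) { player.release() }
        }
    }
    delay(50)               // give play() time to start
    handler.cancelAndJoin() // stand-in for the stop phrase being heard
    player.log
}
```

In this sketch the finally block completes even though the coroutine has been cancelled, so the returned log contains both "playing" and "released".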
stops is also a UI affordance: the simulator renders the stop phrases as nested STOP rows under the parent VOICE card, gated by is_active so you can only click them while the parent is running. The host-app connection page renders them as indented italic rows under the parent ("Say to me" section).
The pattern works for any process — playing a video, muting the speaker, starting a recording, running an LLM call. The library doesn't know what the action is; the stops list is just metadata + a cancellation signal.
Three usage tiers
| Tier | Customer code | Auto-display | Auto-cancel on stop |
|---|---|---|---|
| 1 (default) | glasses.voice.onPhrase(phrase, label, stops) { ... } | ✅ | ✅ |
| 2 (custom match, visible) | raw transcriptions() + glasses.voice.registerHint(phrase, label, stops) | ✅ | manual |
| 3 (raw / hidden) | raw transcriptions().collect { ... } only | ❌ (by choice) | manual |
Drop to tier 2 when the substring match isn't expressive enough — you need regex, state machines, or per-utterance guards. Drop to tier 3 only when you actively don't want the phrase visible on the connection page or in the simulator (rare).
// Tier 2: raw matching with UI affordance.
val hint = glasses.voice.registerHint(
    phrase = "translate this",
    label = "Translate",
)
glasses.audio.transcriptions().collect { t ->
    if (t !is Transcript.Final) return@collect
    if (myRegex.matches(t.text)) {
        glasses.voice.reportFired(hint.id) // keep stats honest
        handleMatch(t.text)
    }
}

Per-vendor caveats (Meta Ray-Ban)
- "Hey Meta" is not exposed. Meta's system wake word goes to Meta AI; third-party apps cannot intercept it. Your wake phrase is whatever string you match against transcriptions() (directly or via onPhrase).
- The phone is the recognizer. Audio is captured on the glasses and streamed over Bluetooth HFP/SCO to the phone, where SpeechRecognizer (Android) / SFSpeechRecognizer (iOS) emits transcripts. There is no on-device STT on the glasses themselves.
- listening_mode toggle gates STT. When the user flips Voice Activation off on the connection page, transcriptions() stops emitting and onPhrase handlers never fire. This is intentional: it's the user's hard kill-switch.
Testing in the simulator
Open extentos.com/s/[sessionId] after createSimulatorSession. The right rail shows one card per registered phrase:
- VOICE pill + the phrase + the registration's stable id + fired N times · Xs ago.
- Nested STOP rows under each parent with declared stops, disabled until the parent has is_active = true.
- Click the card to inject the phrase as a synthetic stt_transcript. The dispatch path is identical to a real utterance, so any matcher you wrote runs the same way in sim and on real glasses.
The simulator parity guarantee is strict: clicking a card in the browser fires the same code path a real spoken phrase would. If your handler works in sim, it works on the glasses.
Common gotchas
- Substring matching is forgiving. onPhrase("stop") matches "I'm going to stop now" and "let me think, stop right there." Phrase carefully or use tier-2 raw matching with a tighter regex.
- Overlapping phrases each fire. onPhrase("start") and onPhrase("start recording") both fire on "start recording now" because each registration is independent. Order your phrases from most-specific to least, and have your handlers guard for the overlap if it matters.
- Stops only apply during handler execution. Saying "stop the video" before any "play cat video" is dead text: no parent handler running means no listener attached. The simulator UI gates the STOP rows by parent.is_active for exactly this reason.
- registerHint does NOT auto-cancel. Stops cancellation is exclusive to onPhrase. If you're writing a tier-2 handler, write the cancellation yourself.
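The first two gotchas can be worked around with small helpers inside a tier-2 matcher. The sketch below shows one possible approach using only the Kotlin standard library. It is not library API: exactCommand and mostSpecificMatch are hypothetical names, and the "longest phrase wins" policy is an assumption about what a guard for overlapping phrases might look like.

```kotlin
// Whole-utterance matcher: the transcript must consist of the command alone,
// allowing surrounding whitespace and trailing punctuation.
fun exactCommand(command: String): Regex =
    Regex("""\s*""" + Regex.escape(command) + """[\s.!?]*""", RegexOption.IGNORE_CASE)

// Overlap guard: among the phrases contained in the transcript, keep only the
// longest one, so "start recording" wins over "start".
fun mostSpecificMatch(phrases: List<String>, transcript: String): String? =
    phrases
        .filter { transcript.contains(it, ignoreCase = true) }
        .maxByOrNull { it.length }
```

With these helpers, exactCommand("stop").matches("Stop.") is true while exactCommand("stop").matches("I'm going to stop now") is false, and mostSpecificMatch(listOf("start", "start recording"), "start recording now") returns "start recording".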
Related
- getCapabilityGuide(feature: "voice_command"): minimal Kotlin + Swift snippets + the full gotcha list.
- getCodeExample(pattern: "voice_qa_assistant"): multi-turn wake → speak → record → LLM → speak loop, the canonical voice-glasses flow.
- getCodeExample(pattern: "barge_in_speak"): cancel TTS the moment the user starts talking.
- SDK reference / library_api: the full glasses.voice sub-client surface.