Capabilities
The Extentos capability vocabulary — vendor-agnostic SDK primitives (audio.transcriptions, audio.recordDiscrete, audio.speak, camera.capturePhoto, camera.videoFrames, hardware events) that your handler code subscribes to. How abstract capabilities translate to platform-specific calls on iOS and Android, how permissions derive automatically, how validation negotiates against per-vendor manifests, and why a shared vocabulary plus a standard transport interface is what makes the same code run across Meta Ray-Ban, Mentra G1, Android XR, and future smart-glasses vendors.
A capability in Extentos is a vendor-agnostic SDK primitive your handler code subscribes to — audio.transcriptions() for continuous STT, audio.recordDiscrete() for a silence-VAD-bounded clip with auto-transcription, audio.speak() for TTS, camera.capturePhoto() for a still, camera.videoFrames() for a frame stream, plus hardware-event flows for thermal / hinges / call-state / lifecycle. The capability vocabulary is the contract between your handler code (which calls capability primitives) and the underlying transport (which translates those calls into platform-specific operations — Meta DAT on Ray-Ban Meta today, Mentra's SDK on Mentra G1 tomorrow). This page is the technology behind that contract: the full vocabulary, how per-vendor manifests narrow it, how platform permissions derive automatically, and why a shared vocabulary plus a standard transport interface is the design that makes the same code run across every supported smart-glasses vendor.
The capability layer
Three coordinated layers turn an abstract capability into running code on real hardware:
| Layer | What it does | Owned by |
|---|---|---|
| 1. Capability vocabulary | The set of abstract SDK primitives — audio, camera, speak, hardware-event flows, the toggle surface, the connection-state machine. The same on every vendor. | Extentos (the language) |
| 2. Per-vendor capability manifest | Which capabilities a specific vendor exposes — e.g., Meta Ray-Ban supports capture_photo and transcription_incremental via the DAT public toolkit but not custom_gesture. Some capabilities are per-DEVICE within a vendor: display is available on Ray-Ban Display but not Ray-Ban Meta — apps branch on glasses.display.isAvailable, never the model name. | Each vendor (the subset) |
| 3. Transport implementation | The code that translates an abstract glasses.camera.capturePhoto() call into vendor-specific API calls (Meta DAT, Mentra SDK, etc.). One transport per vendor. | Each vendor's transport (the wiring) |
Your handler is written in layer 1 — pure capability calls, no vendor names. The MCP server's validateIntegration tool checks your extentos.manifest.json's declared capability list against layer 2 for the target vendor — flagging anything the vendor doesn't expose. At runtime, the library's selected transport (layer 3) translates your calls into actual platform operations. Same code, different transport — that's how a handler written for Meta Ray-Ban can later target a vendor with a different SDK shape without rewrites.
At runtime, your installed agent has these live. Once Extentos's MCP server is registered with your agent, the agent calls getPlatformInfo for the capability catalog scoped to the current vendor, getCapabilityGuide(feature) for per-feature call shapes, and getCodeExample(pattern) for canonical compositions in Kotlin + Swift. The static tables on this page are the human-readable reference for pre-install evaluation, SEO, and out-of-context lookup; the live MCP response is authoritative when composing real handler code.
Audio primitives
The audio surface is the most-used part of the SDK. Three primary primitives, barge-in cancellation, and two supporting primitives (raw audio chunks and earcons):
| Primitive | Shape | When to use |
|---|---|---|
| glasses.audio.transcriptions(config) | continuous Flow<Transcript> (Kotlin) / AsyncStream<Transcript> (Swift), emitting Partial + Final | Wake-phrase matching, live captions, continuous STT |
| glasses.audio.recordDiscrete(config) | suspending one-shot, returns AudioRecording with transcript + audioDurationMs | Free-form question capture; silence-VAD turns the mic off when the user pauses |
| glasses.audio.speak(text) | suspending TTS via the phone's native engine | Speaking responses through the glasses speaker (HFP) |
| glasses.audio.cancelSpeak() | fire-and-forget interrupt | Barge-in: kill TTS the moment the user starts speaking |
| glasses.audio.audioChunks(config) | continuous raw chunk stream | Custom on-device STT, passthrough, or non-text audio processing |
| glasses.audio.earcon(sound, volume) | suspending one-shot canned tone | START / COMPLETE / ERROR / NOTIFY confirmations |
The canonical voice-Q&A pattern composes these: register a wake phrase with glasses.voice.onPhrase(phrase, label, stops) { ... } (sugar over transcriptions() that also surfaces the phrase on the connection page and the simulator's click-to-fire panel, with automatic handler cancellation when a stops phrase fires), then call speak() to acknowledge, recordDiscrete() to capture the user's question, and speak() again for the answer. Customers needing regex / stateful matching skip onPhrase and subscribe to transcriptions() directly, optionally calling glasses.voice.registerHint(...) to keep the UI affordance visible. getCodeExample(pattern: "voice_qa_assistant") returns the full ~100-line composition in Kotlin and Swift.
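For the direct-subscription route, the matching logic itself can be sketched without the SDK. The example below is a minimal illustration, not the library's implementation: the `Transcript`, `Partial`, and `Final` types are hypothetical stand-ins for the SDK's transcript values, the wake phrase "hey assistant" is invented, and a plain `Sequence` stands in for the real `Flow<Transcript>`. Matching only on Final transcripts is an assumption made here to avoid reacting to text the recognizer may still revise.

```kotlin
// Hypothetical stand-ins for the SDK's transcript types; the real definitions may differ.
sealed interface Transcript { val text: String }
data class Partial(override val text: String) : Transcript
data class Final(override val text: String) : Transcript

// Assumed wake phrase. The capture group holds whatever the user said after it.
val wakePattern = Regex("""\bhey assistant\b[,\s]*(.*)""", RegexOption.IGNORE_CASE)

/**
 * Returns the text following the wake phrase in the first Final transcript
 * that contains it, or null when no Final transcript matches.
 * Partial transcripts are skipped entirely.
 */
fun firstCommand(transcripts: Sequence<Transcript>): String? =
    transcripts
        .filterIsInstance<Final>()
        .firstNotNullOfOrNull { wakePattern.find(it.text)?.groupValues?.get(1)?.trim() }
```

In a real handler the same predicate would run inside a collector on glasses.audio.transcriptions(), with the matched remainder handed to whatever logic answers the user.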
Camera primitives
| Primitive | Shape | When to use |
|---|---|---|
| glasses.camera.capturePhoto(config) | suspending one-shot, returns Photo (URI + width / height / format) | Vision LLM input, save-to-gallery, single-frame analysis |
| glasses.camera.captureVideo(config) | suspending one-shot, returns VideoClip | Bounded clip recording |
| glasses.camera.videoFrames(config) | continuous Flow<VideoFrame> / AsyncStream<VideoFrame> | Live vision pipelines, frame-by-frame analysis (typically LOW resolution at 2 fps for cost) |
Photo URI helpers (Photos.loadBase64(uri), Photos.loadBytes(uri), Photos.loadBitmap(uri), Photos.mediaTypeFromUri(uri) on Android; Photo.loadImage() on iOS) bridge the data-URI / file-URI scheme variance across transports — write your handler against the helpers and the same code runs on BrowserSim, LocalSim, and real Meta Ray-Ban transports.
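To make the scheme variance concrete, here is a rough sketch of the kind of branching such a helper has to do. It is not the library's Photos.loadBase64 implementation; `loadBase64Sketch` is a hypothetical name, and the sketch assumes only two shapes: base64-encoded `data:` URIs and `file:` URIs on the local filesystem.

```kotlin
import java.io.File
import java.net.URI
import java.util.Base64

/**
 * Returns the base64 payload for a photo URI.
 * Hypothetical helper: handles base64 data URIs and local file URIs only.
 */
fun loadBase64Sketch(uri: String): String = when {
    uri.startsWith("data:") -> {
        // A data URI looks like data:<media type>;base64,<payload>.
        val comma = uri.indexOf(',')
        require(comma > 0 && uri.substring(0, comma).endsWith(";base64")) {
            "only base64-encoded data URIs are handled in this sketch"
        }
        uri.substring(comma + 1)
    }
    uri.startsWith("file:") ->
        // Read the file's bytes and encode them the same way a data URI would carry them.
        Base64.getEncoder().encodeToString(File(URI(uri)).readBytes())
    else -> throw IllegalArgumentException("unsupported URI scheme: $uri")
}
```

Handler code that goes through one function like this never needs to know which transport produced the photo, which is the point of writing against the helpers.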
Speech output
glasses.audio.speak() routes through the phone's native TTS engine (TextToSpeech on Android, AVSpeechSynthesizer on iOS) and plays audio through the glasses speaker over Bluetooth A2DP. The phone is the synthesizer; the glasses are the speaker. This is intentional, and it means TTS quality is bound by the phone engine. That matters when comparing against premium voice providers like ElevenLabs: to use one, your handler fetches the audio itself and plays it through the phone speaker, until direct-audio-bytes routing to the glasses lands as a future SDK feature.
cancelSpeak() interrupts the active utterance immediately for the barge-in flow. See getCodeExample(pattern: "barge_in_speak") for the canonical TaskGroup / structured-concurrency pattern that cancels speak the moment a Final transcript arrives.
Hardware-event flows
Hardware events tell your app the world changed: temperature, hinge state, audio routing, call state, lifecycle, notifications, location. Your handler subscribes via glasses.runtime.events (a Flow<RuntimeEvent> / AsyncStream<RuntimeEvent> that emits typed event values) and pattern-matches the variants. The browser simulator's UI provides clickable buttons that inject each event so you can test what your handler does when (say) thermal throttling kicks in mid-capture; on real glasses, the hardware fires them and the same handler runs.
| Event | What it means | Typical handler behavior |
|---|---|---|
| thermal_warning | The hardware is heating up; severity ranges from light to critical | Flip battery_save_mode toggle; dial back video frame rate |
| hinges_closed | The user folded the glasses (typically removes them) | Pause active streams, end session |
| audio_route_changed | The Bluetooth audio route changed (A2DP ↔ HFP/SCO) | Adjust playback strategy, re-route TTS |
| incoming_call_detected | The phone has an incoming call; audio routing will preempt | Pause TTS, defer voice listening until call ends |
| app_lifecycle_changed | The phone app moved between foreground / background / destroyed | Suspend or resume sessions accordingly |
| connection_state_changed | The glasses connection transitioned states | Surface connection status UI; trigger reconnect logic |
| phone_notification_forwarded | An OS notification was forwarded to the glasses | Read it aloud, suppress repeats |
| location_updated | A configured location threshold was crossed | Geofence-based behaviors |
searchDocs(topic: 'connection_state_model') and searchDocs(topic: 'event_log_schema') cover the event types and the diagnostic surface in full.
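The pattern-matching step can be sketched as follows. The event classes here are hypothetical stand-ins modeled on three rows of the table above (the SDK's actual RuntimeEvent variants and their fields may be named differently), the MODERATE severity level is assumed, and the handler returns plain action descriptions rather than calling real SDK methods.

```kotlin
// Assumed severity levels; the source only states the range "light to critical".
enum class Severity { LIGHT, MODERATE, CRITICAL }

// Hypothetical stand-ins for a few RuntimeEvent variants.
sealed interface RuntimeEvent
data class ThermalWarning(val severity: Severity) : RuntimeEvent
object HingesClosed : RuntimeEvent
object IncomingCallDetected : RuntimeEvent
data class Other(val name: String) : RuntimeEvent

/** Maps an event to the typical handler behavior listed in the table. */
fun reactTo(event: RuntimeEvent): List<String> = when (event) {
    is ThermalWarning -> listOf("enable battery_save_mode", "lower video frame rate")
    HingesClosed -> listOf("pause active streams", "end session")
    IncomingCallDetected -> listOf("pause TTS", "defer voice listening")
    is Other -> emptyList() // events this handler does not care about
}
```

Because the interface is sealed, the `when` is checked for exhaustiveness at compile time, so adding a variant forces the handler to decide what to do with it.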
Toggles — runtime gates the user controls
Eight runtime toggles persist across app restarts; the user owns them via the connection page UI; your handler reads them via glasses.toggles.state.
| Toggle | What it gates | Default |
|---|---|---|
| listening_mode | STT recognizer (off disables transcriptions entirely) | unset = on |
| camera_streaming_enabled | Every camera primitive | true |
| audio_capture_enabled | Every audio-capture primitive | true |
| transcription_enabled | The STT layer on top of audio capture | false |
| privacy_mode | The super-toggle: kills capture + audio + STT + notifications | false |
| battery_save_mode | Clamps videoFrames to LOW + 2 fps | false |
| voice_confirmations | Auto-earcons around voice-initiated handler calls | true |
| audio_video_coexistence_policy | HFP / A2DP conflict policy | prefer_video |
searchDocs(topic: 'toggles') covers each toggle's enforcement status and gotchas in depth.
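As an illustration of how three of these gates could combine for a frame stream, consider the sketch below. It is one reading of the table, not the library's enforcement code: `ToggleState`, `FrameConfig`, and `effectiveFrameConfig` are hypothetical names, the MEDIUM and HIGH resolution values are assumed, and "clamps to LOW + 2 fps" is interpreted as forcing LOW resolution and capping the frame rate at 2.

```kotlin
// Assumed resolution levels; only LOW appears in the source.
enum class Resolution { LOW, MEDIUM, HIGH }
data class FrameConfig(val resolution: Resolution, val fps: Int)

// Hypothetical snapshot of the three toggles relevant to videoFrames.
data class ToggleState(
    val cameraStreamingEnabled: Boolean = true,
    val privacyMode: Boolean = false,
    val batterySaveMode: Boolean = false,
)

/**
 * Returns the frame config a stream would actually run with,
 * or null when the toggles forbid camera capture altogether.
 */
fun effectiveFrameConfig(requested: FrameConfig, toggles: ToggleState): FrameConfig? {
    if (toggles.privacyMode || !toggles.cameraStreamingEnabled) return null
    if (!toggles.batterySaveMode) return requested
    return FrameConfig(Resolution.LOW, minOf(requested.fps, 2))
}
```

A handler that reads glasses.toggles.state before starting a stream could use a check of this shape to explain to the user why capture is unavailable instead of failing silently.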
Permission derivation
Each capability declares its platform-permission requirements once, centrally. The MCP server's getPermissions(capabilities, platform) returns the exact set:
| Capability | Android (manifest) | iOS (Info.plist) |
|---|---|---|
| capture_photo, capture_video, video_frames | CAMERA, BT permissions | NSCameraUsageDescription, BT keys |
| record_audio, audio_chunks | RECORD_AUDIO, BT permissions | NSMicrophoneUsageDescription, BT keys |
| transcription_incremental | RECORD_AUDIO, BLUETOOTH_* | NSMicrophoneUsageDescription, NSSpeechRecognitionUsageDescription, BT keys |
| speak | BT permissions | BT keys |
| display (Ray-Ban Display) | BT permissions only (rendering is outbound over the existing DAT connection) | BT keys (iOS port pending) |
| Any glasses connection | BLUETOOTH_CONNECT, BLUETOOTH_SCAN, BLUETOOTH_ADMIN | NSBluetoothAlwaysUsageDescription, MWDAT plist keys |
Update extentos.manifest.json's capabilities array when you add a primitive to your handler — the next getPermissions call surfaces the new keys; validateIntegration confirms they land in the manifest and Info.plist. A developer who never had to think about iOS speech-recognition entitlements still ends up with a correct NSSpeechRecognitionUsageDescription because the capability said "transcription_incremental" and the toolchain knew what that meant on iOS.
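The derivation amounts to a lookup plus a set union. The sketch below encodes a few rows of the table above to show the idea; it is not the getPermissions implementation, the function and map names are hypothetical, and the MWDAT plist keys are left out because the source does not enumerate them.

```kotlin
enum class Platform { ANDROID, IOS }

// Permissions every glasses connection needs, per the last row of the table.
val bluetoothPermissions: Map<Platform, Set<String>> = mapOf(
    Platform.ANDROID to setOf("BLUETOOTH_CONNECT", "BLUETOOTH_SCAN", "BLUETOOTH_ADMIN"),
    Platform.IOS to setOf("NSBluetoothAlwaysUsageDescription"),
)

// Capability-specific permissions for a subset of the vocabulary.
val capabilityPermissions: Map<String, Map<Platform, Set<String>>> = mapOf(
    "capture_photo" to mapOf(
        Platform.ANDROID to setOf("CAMERA"),
        Platform.IOS to setOf("NSCameraUsageDescription"),
    ),
    "record_audio" to mapOf(
        Platform.ANDROID to setOf("RECORD_AUDIO"),
        Platform.IOS to setOf("NSMicrophoneUsageDescription"),
    ),
    "transcription_incremental" to mapOf(
        Platform.ANDROID to setOf("RECORD_AUDIO"),
        Platform.IOS to setOf("NSMicrophoneUsageDescription", "NSSpeechRecognitionUsageDescription"),
    ),
    "speak" to mapOf(
        Platform.ANDROID to emptySet<String>(),
        Platform.IOS to emptySet<String>(),
    ),
)

/** Unions the Bluetooth baseline with each declared capability's requirements. */
fun derivePermissions(capabilities: List<String>, platform: Platform): Set<String> {
    val result = sortedSetOf<String>()
    result += bluetoothPermissions.getValue(platform)
    for (capability in capabilities) {
        val perPlatform = capabilityPermissions[capability]
            ?: throw IllegalArgumentException("unknown capability: $capability")
        result += perPlatform.getValue(platform)
    }
    return result
}
```

Declaring the same capability twice, or two capabilities that share a permission, yields each key once because the result is a set.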
Validation and capability negotiation
Three MCP tools work together to keep a handler aligned with what the target vendor can actually do:
- getPlatformInfo({ glasses: "<vendor>" }): returns the vendor's capability manifest. Which audio / camera / hardware-event primitives it supports. Which it doesn't. Which are GA versus preview.
- validateIntegration(): checks the project against the vendor's capability list for the configured target. Flags capabilities the manifest declares that the vendor doesn't expose, missing permissions, dependency drift. Returns structured errors the agent can act on.
- getProductionChecklist(): late-stage gate. Verifies permissions are wired, credentials are set, foreground-service hints are present for continuous-capture flows, edge cases are handled. Run before shipping.
In the typical agent flow, getPlatformInfo is the first call (discovery), validateIntegration runs after every structural change (correctness gate), and getProductionChecklist runs once the developer is preparing to ship. The capability layer is what lets these tools be deterministic — they answer yes/no against the manifest rather than guessing.
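The yes/no character of the capability check can be sketched as a set-membership test. This is an illustration of the idea only: `ValidationError`, its fields, and the `capability_unsupported` code are hypothetical, and the real validateIntegration tool also covers permissions and dependency drift, which are not modeled here.

```kotlin
// Hypothetical shape for a structured error; the real tool's schema may differ.
data class ValidationError(val code: String, val capability: String, val message: String)

/**
 * Returns one error per declared capability that the vendor does not expose.
 * An empty list means every declared capability is available on this vendor.
 */
fun validateCapabilities(
    declared: List<String>,
    vendor: String,
    vendorSupports: Set<String>,
): List<ValidationError> =
    declared.distinct()
        .filter { it !in vendorSupports }
        .map { ValidationError("capability_unsupported", it, "$vendor does not expose '$it'") }
```

Using the example from earlier on this page, a manifest that declares capture_photo and custom_gesture against a vendor set containing capture_photo and transcription_incremental would produce a single error naming custom_gesture.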
The transport contract
Each vendor provides a GlassesTransport implementation — the code that translates abstract capability calls into platform-specific API calls. The interface is identical across vendors:
GlassesTransport
├─ connect(deviceId)
├─ capturePhoto(config)
├─ captureVideo(config)
├─ recordAudio(config) ─► returns AudioRecording (transcript + bytes)
├─ videoFrames(config) ─► continuous stream
├─ audioChunks(config) ─► continuous stream
├─ transcriptions(config) ─► continuous stream (Partial + Final)
├─ speak(text, config)
├─ cancelSpeak()
├─ earcon(sound, volume)
└─ events ─► transport state, hardware alerts, errors

A vendor that supports a capability implements the corresponding method against its SDK. A vendor that doesn't support a capability either fails fast at validateIntegration (preferred, because the problem is caught before runtime) or surfaces a typed Result error at runtime (fallback for capabilities that depend on runtime state, like permissions).
This is the engineering boundary that makes "add a vendor = implement the interface" a clean, bounded task — not a sprawling rewrite. For the deep dive on how transports work and what each implementation does, see transport vs app simulation.
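A reduced sketch of that boundary, assuming a two-method interface: `TransportSketch`, the two fake transports, and `UnsupportedCapability` are all hypothetical, the methods are synchronous here whereas the real ones are described as suspending, and Kotlin's standard `Result` type stands in for whatever typed result the library actually returns.

```kotlin
// Hypothetical error type for a capability the vendor cannot serve.
class UnsupportedCapability(capability: String) :
    Exception("capability not supported: $capability")

// A two-method slice of the transport interface, simplified for illustration.
interface TransportSketch {
    fun capturePhoto(): Result<String> // photo URI on success
    fun speak(text: String): Result<Unit>
}

// A fake vendor that exposes both capabilities.
class FullTransport : TransportSketch {
    override fun capturePhoto(): Result<String> = Result.success("file:///tmp/photo.jpg")
    override fun speak(text: String): Result<Unit> = Result.success(Unit)
}

// A fake vendor with no camera: the call fails with a typed error instead of crashing.
class AudioOnlyTransport : TransportSketch {
    override fun capturePhoto(): Result<String> =
        Result.failure(UnsupportedCapability("capture_photo"))
    override fun speak(text: String): Result<Unit> = Result.success(Unit)
}

/** Handler code written once against the interface, with no vendor names in it. */
fun describeScene(transport: TransportSketch): String =
    transport.capturePhoto().fold(
        onSuccess = { uri -> "captured $uri" },
        onFailure = { error -> "fallback: ${error.message}" },
    )
```

`describeScene` is unchanged whichever transport is passed in, which is the property the interface is meant to guarantee.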
Why a shared vocabulary is the right design
Five reasons the capability layer is shaped this way:
- Vendor portability is structural, not negotiated. Because handler code is written against the capability primitives instead of vendor-specific calls, an app targeting Meta Ray-Ban today can target a future vendor by switching the transport — no code rewrite. The portability is a property of the architecture, not something an individual developer has to engineer per project.
- Validation is deterministic. The capability vocabulary is finite and the per-vendor manifest is a known set. validateIntegration answers "does this app run on this vendor?" with a yes/no plus structured errors. That determinism is what lets an AI agent confidently mutate the integration: every change has a clear validation outcome.
- Permissions derive automatically. Each capability declares its platform-permission requirements once, centrally. Add transcription_incremental to your handler and extentos.manifest.json's capabilities array; the iOS Info.plist and the Android manifest get the right keys without the developer learning what NSSpeechRecognitionUsageDescription is.
- Simulators are honest. The browser simulator and the on-device Mock simulator both implement the same capability vocabulary the production transports do. There's no "simulator-only" or "production-only" capability: anything you can run in simulation runs in production, and vice versa.
- New vendor onboarding is bounded work. Adding Mentra G1 or Android XR support is "implement the GlassesTransport interface against the new SDK and declare the capability manifest." No SDK shape changes, no migrations for existing developer handlers, no new MCP tools.
Targeting multiple vendors
The capability layer is what makes this technically possible. The strategic story — what supported and roadmap vendors are, when each ships, how to think about portability when planning your app — lives on /docs/vendors as the section landing page, with per-vendor manifests at /docs/vendors/meta, /docs/vendors/mentra, /docs/vendors/android-xr, and /docs/vendors/apple.
A future page will cover the runtime semantics of multi-vendor apps — graceful degradation when a target vendor doesn't expose a capability the handler uses, fallbacks, validation policies for "this app must run on at least N of these vendors." That's deferred until a second vendor is shipping, when the rules will be concrete enough to commit to. For now: target one vendor at a time, let validateIntegration confirm fit, and rely on the capability layer to keep your handler code portable when the time comes.
Frequently asked questions
Can I add a new capability that isn't in the vocabulary?
Not directly — the capability vocabulary is a coordinated contract across the SDK, the validator, both simulators, and every vendor's transport. Extending it is an Extentos library change. If a capability you need doesn't exist, the path is to file an issue describing the use case; it's added when there's a cross-vendor primitive worth standardizing.
For app-specific behavior that doesn't need a new SDK primitive — custom AI processing, business logic, network calls — that's just code in your handler class. The handler is your code; you can do anything in it. See searchDocs(topic: 'custom_handlers') for the canonical handler shape and searchDocs(topic: 'custom_extensions') for the framing of how to compose around missing primitives (e.g., custom on-device STT against audio.audioChunks()).
How does the manifest know what permissions to derive on iOS vs Android?
The capability vocabulary has a per-platform permission map baked in. transcription_incremental declares NSSpeechRecognitionUsageDescription on iOS and RECORD_AUDIO on Android, plus the Bluetooth keys both platforms need. The MCP server's getPermissions tool returns the current set given the capability list; generateConnectionModule writes them into the manifest and Info.plist on initial scaffold.
Are streams metered the same as one-shot calls?
The library emits stream.started, stream.stopped, and stream.backpressure events into the structured event log regardless of one-shot vs continuous shape. Browser-simulator session minting requires a free email-only account (Google or email + password — see pricing); once linked, sessions and the events they emit are unlimited. MCP tool calls don't require an account at all.
How is glasses.audio.transcriptions() different from "Hey Meta"?
"Hey Meta" is Meta's system-level wake word — third-party apps can't hook it. glasses.audio.transcriptions() is the continuous-transcript primitive: the glasses microphone captures audio, streams it to the phone via Bluetooth HFP/SCO, the phone's native speech recognizer (SpeechRecognizer on Android, SFSpeechRecognizer on iOS) emits Partial + Final transcripts, and your handler matches strings against them to detect wake phrases. No "Hey Meta" prefix; the wake word is whatever string your handler matches against. See vendors/meta for the full audio-architecture story and searchDocs(topic: 'voice_ux_guide') for phrase-design rules.
Does the capability layer add runtime overhead?
Negligible. The library is a thin translation between abstract calls and the vendor's SDK. There's no extra serialization, no extra IPC, no proxy layer. The capability indirection amounts to a single method dispatch into the selected transport; from there, execution is direct SDK calls.
Related concepts
- Architecture — how the capability layer fits into the broader system (MCP, library, backend, transports)
- Transport vs app simulation — the deep dive on how each transport implements the capability vocabulary
- Vendors — the strategic multi-vendor story; per-vendor capability manifests
- Vendors: Meta Ray-Ban — the GA target's full capability manifest
- Quickstart with an AI agent — install the MCP server and see capabilities in action