Capabilities

The Extentos capability vocabulary — vendor-agnostic primitives (capture_photo, capture_video, record_audio, speak_text, voice_command, tap, double_tap, sensor reads, hardware events) that an AppSpec composes from. How abstract capabilities translate to platform-specific calls on iOS and Android, how permissions derive automatically, how validation negotiates against per-vendor manifests, and why a shared vocabulary plus a standard transport interface is what makes the same code run across Meta Ray-Ban, Mentra G1, Android XR, and future smart-glasses vendors.

A capability in Extentos is a vendor-agnostic primitive that your app composes from: capture_photo, capture_video, record_audio, speak_text, voice_command, tap, double_tap, sensor reads, and hardware events. The capability vocabulary is the contract between your AppSpec, which is written in capability terms, and the underlying transport, which translates those capabilities into platform-specific calls (Meta DAT on Meta Ray-Ban today, Mentra's SDK on Mentra G1 tomorrow). This page covers the technology behind that contract: the full vocabulary, how the AppSpec compiler validates it against per-vendor manifests, how platform permissions derive automatically, and why a shared vocabulary plus a standard transport interface lets the same code run across every supported smart-glasses vendor.

The capability layer

Three coordinated layers turn an abstract capability into running code on real hardware:

Layer	What it does	Owned by
1. Capability vocabulary	The set of abstract primitives — block kinds, trigger types, action types, stream types, hardware-event kinds. The same on every vendor.	Extentos (the language)
2. Per-vendor capability manifest	Which capabilities a specific vendor exposes — e.g., Meta Ray-Ban supports `capture_photo` and `voice_command` via the DAT public toolkit but not `display_render` or `custom_gesture`.	Each vendor (the subset)
3. Transport implementation	The code that translates an abstract `glasses.camera.capturePhoto()` call into vendor-specific API calls (Meta DAT, Mentra SDK, etc.). One transport per vendor.	Each vendor's transport (the wiring)

Your AppSpec is written in layer 1 — pure capability primitives, no vendor names. The MCP server's validateIntegration tool checks your spec against layer 2 for the target vendor — flagging anything the vendor doesn't expose. At runtime, the library's selected transport (layer 3) translates your code into actual platform calls. Same spec, same code, different transport — that's how an app written for Meta Ray-Ban can later target a vendor with a different SDK shape without rewrites.
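To make the three layers concrete, here is a minimal Python model. It is a sketch under stated assumptions, not the Extentos schema or library API: `Vendor`, `call`, `vendor_a`, and `vendor_b` are hypothetical names, both vendors are made up, and the vocabulary is trimmed to three block kinds.

```python
from dataclasses import dataclass

# Layer 1: shared vocabulary (trimmed to three block kinds for the sketch).
VOCABULARY = frozenset({"capture_photo", "capture_video", "speak_text"})

@dataclass(frozen=True)
class Vendor:
    name: str
    manifest: frozenset   # Layer 2: the subset this vendor exposes
    transport: dict       # Layer 3: capability name -> vendor-specific callable

def call(vendor, capability, *args):
    """Route an abstract capability call through the vendor's transport."""
    if capability not in VOCABULARY:
        raise ValueError(f"{capability!r} is not in the vocabulary")
    if capability not in vendor.manifest:
        raise LookupError(f"{vendor.name} does not expose {capability!r}")
    return vendor.transport[capability](*args)

# Two made-up vendors whose SDKs have different shapes.
vendor_a = Vendor(
    name="vendor_a",
    manifest=frozenset({"capture_photo", "speak_text"}),
    transport={
        "capture_photo": lambda: "a-sdk://photo",
        "speak_text": lambda text: f"a-sdk says {text}",
    },
)
vendor_b = Vendor(
    name="vendor_b",
    manifest=frozenset({"capture_photo", "capture_video", "speak_text"}),
    transport={
        "capture_photo": lambda: "b-sdk://photo",
        "capture_video": lambda: "b-sdk://video",
        "speak_text": lambda text: f"b-sdk says {text}",
    },
)

# The same capability-level code runs against either transport.
for vendor in (vendor_a, vendor_b):
    print(vendor.name, call(vendor, "capture_photo"))
```

The calling code never names a vendor SDK. Swapping `vendor_a` for `vendor_b` changes which callables run, and a capability outside the manifest is rejected before any vendor code is reached.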

Once Extentos's MCP server is registered with your agent, the agent has this vocabulary available live. It calls searchDocs(topic: "block_types") for blocks (or "trigger_types", "action_types", "stream_types", or "spec_format") and gets the catalog with inline minimal examples, scoped to the current vendor's capability manifest. The static tables on this page are the human-readable reference for pre-install evaluation, SEO, and out-of-context lookup; the live MCP response is authoritative when composing a real spec.

Block kinds — the things glasses do

A block is an action the glasses perform. There are four block kinds in AppSpecV2. Each compiles to a transport call that the vendor's implementation honors.

Block kind	What it does	Maps to (transport-level)
`capture_photo`	Capture a still frame from the glasses camera	`transport.capturePhoto(config)`
`capture_video`	Record a video clip to glasses storage	`transport.captureVideo(config)`
`record_audio`	Record an audio clip from the glasses microphone (streamed via Bluetooth HFP/SCO to the phone)	`transport.recordAudio(config)`
`speak_text`	Synthesize speech on the phone and play it through the glasses speaker via Bluetooth A2DP	`transport.speak(text, config)`

These are intentionally bounded. There's no glasses.run_arbitrary_code block; capabilities are a finite set of well-typed primitives the AppSpec compiler can validate, the simulator can faithfully reproduce, and the transport interface can implement deterministically per vendor.

Why blocks are a finite set: every additional block kind is a contract that has to land in the spec schema, the validator, the browser simulator, the local simulator, and every vendor's transport implementation simultaneously. Keeping the set small keeps the cross-vendor portability story simple. New blocks are added when there's a real cross-vendor capability worth standardizing — not for one-off vendor features.

Trigger types — what starts a flow

A trigger is an event that fires a flow in your AppSpec. Five trigger types:

Trigger type	What fires it	Notes
`voice_command`	The wearer says a configured phrase. Phone's STT recognizes audio captured from the glasses microphone over BT.	Custom phrases only — "Hey Meta" is reserved by Meta and not third-party-exposed
`manual_launch`	Developer explicitly invokes a flow from app code	For app-driven entry points
`capture_button`	The wearer presses the glasses' physical capture button	Vendor-dependent; Meta Ray-Ban supports this
`tap`	A single tap on a designated glasses surface	Not in Meta DAT public preview as of 2026-04
`double_tap`	A double tap on a designated glasses surface	Not in Meta DAT public preview as of 2026-04

The capability manifest tells you which trigger types a target vendor actually supports. validateIntegration rejects a spec that uses a trigger type the target doesn't expose, before runtime — so a developer writing a tap trigger gets a build-time error explaining that Meta Ray-Ban's public toolkit doesn't currently support it, instead of a runtime no-op.
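The build-time check described above can be sketched as a set-membership test that produces one structured error per offending trigger. The function name, the spec shape, and the error fields below are all assumptions made for illustration; the real validateIntegration output format is not documented on this page.

```python
def validate_triggers(spec, vendor, supported_triggers):
    """Return one structured error per trigger the vendor does not expose."""
    errors = []
    for index, trigger in enumerate(spec["triggers"]):
        if trigger["type"] not in supported_triggers:
            errors.append({
                "code": "unsupported_trigger",
                "path": f"triggers[{index}]",
                "trigger_type": trigger["type"],
                "vendor": vendor,
                "message": (
                    f"trigger type {trigger['type']} is not supported "
                    f"by vendor {vendor}"
                ),
            })
    return errors

# Hypothetical manifest excerpt for a made-up vendor without tap support.
supported = {"voice_command", "manual_launch", "capture_button"}
spec = {
    "triggers": [
        {"type": "voice_command", "phrase": "take a note"},
        {"type": "tap"},
    ]
}

errors = validate_triggers(spec, "example_vendor", supported)
for error in errors:
    print(error["path"], error["message"])
```

Because each error carries a path and a machine-readable code, an agent could locate and remove the offending trigger without parsing the message text.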

Action types — what flows do

A flow is a sequence of actions executed when a trigger fires. Four action types:

Action type	What it does
`block_call`	Invoke one of the four block kinds (capture_photo, capture_video, record_audio, speak_text)
`ai_call`	Call out to an AI handler in the developer's app (vision model, LLM, translation, OCR — the developer brings their own provider keys)
`branch`	Conditional execution based on a runtime variable
`set_variable`	Bind a value into the flow's variable scope

Variable substitution uses {{key}} templates. A captured photo bound to {{capture.uri}} is automatically inlined when an ai_call action references it. The library's interpreter resolves templates at runtime; the simulator surfaces unresolved templates as runtime:TemplateUnresolved events so the agent can debug them.
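As a rough illustration of `{{key}}` resolution, the sketch below substitutes known keys and collects the ones it cannot resolve, which is the kind of information a TemplateUnresolved event would need to carry. The `resolve` function, the flat dotted-key variable scope, and the example URI are assumptions; the library's actual interpreter may work differently.

```python
import re

# Matches {{key}} with optional inner whitespace; keys may contain dots.
TEMPLATE = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")

def resolve(text, variables):
    """Replace {{key}} templates; return the text and any unresolved keys."""
    unresolved = []

    def substitute(match):
        key = match.group(1)
        if key in variables:
            return str(variables[key])
        unresolved.append(key)
        return match.group(0)  # leave the template visible for debugging

    return TEMPLATE.sub(substitute, text), unresolved

scope = {"capture.uri": "file:///tmp/photo.jpg"}
text, missing = resolve("Describe {{capture.uri}} in {{language}}", scope)
print(text)
print(missing)
```

Leaving an unresolved template in place, rather than substituting an empty string, keeps the failure visible in whatever consumes the text next.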

ai_call is the bridge between the spec and the developer's own code. Extentos doesn't sit in the AI cost path — the developer's app handler runs the AI call against their chosen provider (Anthropic, OpenAI, local model, whatever) using their own credentials. The spec just declares "an AI handler is needed here"; the handler is the developer's app_callback implementation.
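The four action types can be pictured as a small interpreter loop. This is a sketch under stated assumptions: the action field names (`block`, `handler`, `bind`, `when`, `then`, `else`) are hypothetical and do not reflect the AppSpecV2 schema, and the blocks and the AI handler are stand-in lambdas rather than real transport or provider calls.

```python
def run_flow(actions, blocks, handlers, variables=None):
    """Execute a list of actions in order and return the variable scope."""
    variables = dict(variables or {})
    for action in actions:
        kind = action["type"]
        if kind == "block_call":
            result = blocks[action["block"]](action.get("config", {}))
            variables[action["bind"]] = result
        elif kind == "ai_call":
            # The handler is developer code; the flow only names it.
            variables[action["bind"]] = handlers[action["handler"]](variables)
        elif kind == "set_variable":
            variables[action["key"]] = action["value"]
        elif kind == "branch":
            taken = action["then"] if variables.get(action["when"]) else action.get("else", [])
            variables = run_flow(taken, blocks, handlers, variables)
        else:
            raise ValueError(f"unknown action type: {kind}")
    return variables

# Stand-ins for transport-backed blocks and a developer-supplied AI handler.
blocks = {
    "capture_photo": lambda config: "mock://photo/1",
    "speak_text": lambda config: f"spoke: {config['text']}",
}
handlers = {"describe": lambda scope: f"a description of {scope['photo']}"}

flow = [
    {"type": "block_call", "block": "capture_photo", "bind": "photo"},
    {"type": "ai_call", "handler": "describe", "bind": "caption"},
    {"type": "branch", "when": "caption", "then": [
        {"type": "block_call", "block": "speak_text",
         "config": {"text": "done"}, "bind": "spoken"},
    ]},
]

result = run_flow(flow, blocks, handlers)
print(result["caption"])
```

Note that the interpreter never calls an AI provider itself. It only invokes whatever callable the developer registered under the handler name, which mirrors the point that the AI call runs in the developer's code with the developer's credentials.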

Streams — continuous capability flows

Some capabilities are streams, not one-shot blocks. They produce continuous data the developer's app subscribes to.

Stream type	What it streams	Typical config
Video frame stream	Camera frames at a configured frame rate and resolution	`LOW` resolution / `2 fps` is the typical vision-pipeline default
Audio chunk stream	Microphone audio in fixed-duration chunks for STT or live processing	Configurable chunk cadence (e.g., 20 ms)

Streams have their own backpressure semantics (the library applies a PresentationQueue policy so a slow consumer doesn't block the transport). The simulator emits stream.started, stream.stopped, and stream.backpressure events into the structured event log. Stream config is requested; the transport may downgrade based on hardware policy or coexistence constraints — the actual configuration is reported back via stream.started.negotiatedConfig.
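This page does not say what the PresentationQueue policy does when a consumer falls behind. One common choice, assumed here purely for illustration, is a bounded buffer that discards the oldest item so the producer never blocks. `DropOldestQueue` is a hypothetical name, and the drop counter stands in for whatever would trigger a stream.backpressure event.

```python
from collections import deque

class DropOldestQueue:
    """Bounded buffer: when full, the oldest item is discarded on push."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)
        self.dropped = 0  # how many items the slow consumer never saw

    def push(self, item):
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # deque(maxlen=...) evicts the oldest on append
        self._items.append(item)

    def pop(self):
        return self._items.popleft() if self._items else None

queue = DropOldestQueue(capacity=3)
for frame_number in range(5):   # producer outruns the consumer
    queue.push(frame_number)

oldest_remaining = queue.pop()
print(queue.dropped, oldest_remaining)
```

For a vision pipeline this trade-off usually makes sense, since a stale frame is less useful than a fresh one. An audio stream feeding STT might prefer a different policy, because dropped chunks create gaps in the transcript.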

Hardware events — what happens to the glasses

The glasses themselves emit events the app can listen for. These are vendor-supplied signals about the hardware state.

Event kind	What it means	Use case
`thermal_warning`	The hardware is heating up; throttle stream rates	Dial back video frame rate, pause non-essential capture
`hinges_closed`	The user folded the glasses (typically after taking them off)	Pause active streams, end session, prompt re-pair on unfold
`audio_route_changed`	The Bluetooth audio route changed (A2DP ↔ HFP/SCO)	Adjust playback strategy, re-route TTS
`incoming_call_detected`	The phone has an incoming call; audio routing will preempt	Pause TTS, defer voice triggers until call ends
`app_lifecycle_changed`	The phone app moved between foreground and background	Suspend or resume sessions accordingly

These map to transport.hardware_alert events in the structured event log. The AppSpec can also dispatch a trigger on the same kinds — for example, hinges_closed can fire a trigger that gracefully ends an in-progress flow. That dual surface (event observation + trigger dispatch) is what lets apps be both reactive and resilient to hardware reality.
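The "Use case" column above can be read as a small policy that maps event kinds to adjustments in session state. The sketch below is one hypothetical policy: `SessionState`, its fields, and the choice to halve the frame rate on a thermal warning are assumptions, not library behavior.

```python
from dataclasses import dataclass

@dataclass
class SessionState:
    video_fps: int = 2
    streams_active: bool = True
    tts_paused: bool = False

def on_hardware_event(state, kind):
    """Apply an illustrative reaction to a hardware event kind."""
    if kind == "thermal_warning":
        state.video_fps = max(1, state.video_fps // 2)  # dial back, never to zero
    elif kind == "hinges_closed":
        state.streams_active = False                    # glasses were folded
    elif kind == "incoming_call_detected":
        state.tts_paused = True                         # call audio preempts TTS
    # Other kinds are observed but change nothing in this sketch.
    return state

state = SessionState()
for kind in ("thermal_warning", "incoming_call_detected", "hinges_closed"):
    on_hardware_event(state, kind)
print(state)
```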

Permission derivation

The AppSpec's derived.capabilitiesUsed field is computed automatically by the spec compiler from the blocks, triggers, and streams the spec declares. Each capability has a known set of platform permissions associated with it:

Capability	Android (manifest)	iOS (Info.plist)
`capture_photo`, `capture_video`, video stream	`CAMERA`, BT permissions	`NSCameraUsageDescription`, BT keys
`record_audio`, audio stream	`RECORD_AUDIO`, BT permissions	`NSMicrophoneUsageDescription`, BT keys
`voice_command`	`RECORD_AUDIO`, `BLUETOOTH_*`	`NSMicrophoneUsageDescription`, `NSSpeechRecognitionUsageDescription`, BT keys
`speak_text`	BT permissions	BT keys
Any glasses connection	`BLUETOOTH_CONNECT`, `BLUETOOTH_SCAN`, `BLUETOOTH_ADMIN`	`NSBluetoothAlwaysUsageDescription`, MWDAT plist keys

The MCP server's getPermissions tool returns the exact set for the current spec, per platform, so the agent can keep the Android manifest and the iOS Info.plist in sync without the developer hand-maintaining them. When a capability is added to the spec (e.g., the agent adds a voice_command trigger), the next getPermissions call surfaces the new keys; the agent (or generateConnectionModule) writes them in.

This is the practical payoff of the abstraction: a developer who never had to think about iOS speech-recognition entitlements still ends up with a correct NSSpeechRecognitionUsageDescription because the spec said "voice_command" and the toolchain knew what that meant on iOS.
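Derivation amounts to a union over a per-capability lookup table. The sketch below encodes part of the table above; `derive_permissions` and the dictionary layout are hypothetical, the MWDAT plist keys are omitted because this page does not name them, and the real getPermissions output may be shaped differently.

```python
# Permissions every glasses connection needs, taken from the table above.
BLUETOOTH = {
    "android": {"BLUETOOTH_CONNECT", "BLUETOOTH_SCAN", "BLUETOOTH_ADMIN"},
    "ios": {"NSBluetoothAlwaysUsageDescription"},  # MWDAT plist keys omitted
}

# Extra permissions per capability, beyond the Bluetooth baseline.
PERMISSIONS = {
    "capture_photo": {"android": {"CAMERA"}, "ios": {"NSCameraUsageDescription"}},
    "record_audio": {"android": {"RECORD_AUDIO"}, "ios": {"NSMicrophoneUsageDescription"}},
    "voice_command": {
        "android": {"RECORD_AUDIO"},
        "ios": {"NSMicrophoneUsageDescription", "NSSpeechRecognitionUsageDescription"},
    },
    "speak_text": {"android": set(), "ios": set()},
}

def derive_permissions(capabilities_used, platform):
    """Union the baseline with each used capability's platform permissions."""
    required = set(BLUETOOTH[platform])
    for capability in capabilities_used:
        required |= PERMISSIONS[capability][platform]
    return sorted(required)

ios_keys = derive_permissions(["voice_command"], "ios")
android_keys = derive_permissions(["capture_photo", "speak_text"], "android")
print(ios_keys)
print(android_keys)
```

Because the result is a set union, adding a second capability that needs the same permission (for example record_audio alongside voice_command) does not duplicate the key.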

Validation and capability negotiation

Three MCP tools work together to keep an AppSpec aligned with what the target vendor can actually do:

getPlatformInfo({ glasses: "<vendor>" }) — returns the vendor's capability manifest. Which blocks, triggers, and streams it supports. Which it doesn't. Which are GA versus preview.
validateIntegration() — checks the AppSpec against the vendor manifest for the configured target. Flags capabilities the spec uses that the vendor doesn't expose. Returns structured errors the agent can act on.
getProductionChecklist() — late-stage gate. Verifies permissions are wired, credentials are set, and edge cases are handled. Run before shipping.

In the typical agent flow, getPlatformInfo is the first call (discovery), validateIntegration runs after every spec mutation (correctness gate), and getProductionChecklist runs once the developer is preparing to ship. The capability layer is what lets these tools be deterministic — they answer yes/no against the manifest rather than guessing.

The transport contract

Each vendor provides a GlassesTransport implementation — the code that translates abstract capability calls into platform-specific API calls. The interface is identical across vendors:

GlassesTransport
├─ connect(deviceId)
├─ capturePhoto(config)
├─ captureVideo(config)
├─ recordAudio(config)
├─ videoFrames(config)   ─► continuous stream
├─ audioChunks(config)   ─► continuous stream
├─ speak(text, config)
├─ playEarcon(sound, volume)
└─ events                ─► transport state, hardware alerts, errors

A vendor that supports a capability implements the corresponding method against its SDK. When a vendor doesn't support a capability, a spec that uses it either fails fast at validateIntegration (preferred, because the problem is caught before runtime) or the transport surfaces a TransportError.HardwareUnavailable at runtime (the fallback for capabilities that depend on runtime state, such as permissions).

This is the engineering boundary that makes "add a vendor = implement the interface" a clean, bounded task — not a sprawling rewrite.
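A minimal sketch of that boundary, written as a Python abstract base class with a mock implementation. Only three of the interface's methods are shown, the Python method names and return shapes are assumptions, and `HardwareUnavailable` is a stand-in for TransportError.HardwareUnavailable rather than the library's actual type.

```python
from abc import ABC, abstractmethod

class HardwareUnavailable(Exception):
    """Stand-in for TransportError.HardwareUnavailable."""

class GlassesTransport(ABC):
    """Subset of the transport interface; every vendor implements the same methods."""

    @abstractmethod
    def connect(self, device_id): ...

    @abstractmethod
    def capture_photo(self, config): ...

    @abstractmethod
    def speak(self, text, config): ...

class MockTransport(GlassesTransport):
    """In-memory transport, loosely in the spirit of a simulator."""

    def __init__(self, camera_granted=True):
        self.device_id = None
        self.spoken = []
        self.camera_granted = camera_granted

    def connect(self, device_id):
        self.device_id = device_id

    def capture_photo(self, config):
        if not self.camera_granted:
            # Runtime-state failure: validation could not have caught this.
            raise HardwareUnavailable("camera permission not granted")
        return {"uri": "mock://photo/1", "config": config}

    def speak(self, text, config):
        self.spoken.append(text)

transport = MockTransport()
transport.connect("mock-device")
photo = transport.capture_photo({"resolution": "LOW"})
transport.speak("Photo captured", {})
print(photo["uri"], transport.spoken)
```

Adding a vendor in this model means writing one more subclass. Python refuses to instantiate a subclass that leaves any abstract method unimplemented, which gives a cheap completeness check on a new transport.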

For the deep dive on how transports work and what each implementation does, see transport vs app simulation.

Why a shared vocabulary is the right design

Five reasons the capability layer is shaped this way:

Vendor portability is structural, not negotiated. Because the AppSpec is written in capability primitives instead of vendor-specific calls, an app targeting Meta Ray-Ban today can target a future vendor by switching the transport — no code rewrite, no spec migration. The portability is a property of the architecture, not something an individual developer has to engineer per project.
Validation is deterministic. The capability vocabulary is finite and the per-vendor manifest is a known set. validateIntegration answers "does this spec run on this vendor?" with a yes/no plus structured errors. That determinism is what lets an AI agent confidently mutate the spec — every change has a clear validation outcome.
Permissions derive automatically. Each capability declares its platform-permission requirements once, centrally. Add a voice_command trigger to the spec; the iOS Info.plist and the Android manifest get the right keys without the developer learning what NSSpeechRecognitionUsageDescription is.
Simulators are honest. The browser simulator and the on-device Mock simulator both implement the same capability vocabulary the production transports do. There's no "simulator-only" or "production-only" capability — anything you can run in simulation runs in production, and vice versa.
New vendor onboarding is bounded work. Adding Mentra G1 or Android XR support is "implement the GlassesTransport interface against the new SDK and declare the capability manifest." No spec format changes, no AppSpec migrations for existing developers, no new MCP tools.

Targeting multiple vendors

The capability layer is what makes this technically possible. The strategic story — what supported and roadmap vendors are, when each ships, how to think about portability when planning your app — lives on /docs/vendors as the section landing page, with per-vendor manifests at /docs/vendors/meta, /docs/vendors/mentra, /docs/vendors/android-xr, and /docs/vendors/apple.

A future page will cover the runtime semantics of multi-vendor apps — graceful degradation when a target vendor doesn't expose a capability the spec uses, fallbacks, validation policies for "this app must run on at least N of these vendors." That's deferred until a second vendor is shipping, when the rules will be concrete enough to commit to. For now: target one vendor at a time, let validateIntegration confirm fit, and rely on the capability layer to keep your spec portable when the time comes.

Frequently asked questions

Can I add a new capability that isn't in the vocabulary?

Not directly — the capability vocabulary is a coordinated contract across the spec schema, the validator, both simulators, and every vendor's transport. Extending it is an Extentos library change. If a capability you need doesn't exist, the path is to file an issue describing the use case; it's added when there's a cross-vendor primitive worth standardizing.

For app-specific behavior that doesn't need a new capability — custom AI processing, business logic, network calls — use ai_call actions and app_callback handlers. The handler is your code; you can do anything in it.

How does the spec know what permissions to derive on iOS vs Android?

The capability vocabulary has a per-platform permission map baked in. voice_command declares NSSpeechRecognitionUsageDescription and NSMicrophoneUsageDescription on iOS and RECORD_AUDIO on Android, plus the Bluetooth keys both platforms need. The MCP server's getPermissions tool returns the current set; generateConnectionModule writes them into the Android manifest and the iOS Info.plist for you.

What happens if I add a `tap` trigger to a Meta Ray-Ban spec?

validateIntegration returns a structured error: "trigger type tap is not supported by vendor meta_rayban as of capability manifest version X". The agent surfaces the error, and you either remove the trigger or wait for Meta to expose it in the public DAT toolkit. The spec never silently no-ops.

Are streams metered the same as blocks?

Streams emit stream.started, stream.stopped, and stream.backpressure events into the event log. Each event counts against the simulator runtime-event meter (1000 events free, then sign up for a free account — see pricing). MCP tool calls don't count.

How is `voice_command` different from "Hey Meta"?

"Hey Meta" is Meta's system-level wake word — third-party apps can't hook it. voice_command is Extentos's custom-phrase trigger: the glasses microphone captures audio, streams it to the phone via Bluetooth HFP/SCO, the phone's native speech recognizer (SpeechRecognizer on Android, SFSpeechRecognizer on iOS) recognizes against your configured phrases, and a match dispatches the trigger. No "Hey Meta" prefix; the wake word is your phrase. See vendors/meta for the full audio-architecture story.

Does the capability layer add runtime overhead?

Negligible. The library is a thin translation between abstract calls and the vendor's SDK. There's no extra serialization, no extra IPC, no proxy layer. Capability indirection is compile-time (the AppSpec compiler resolves it once); runtime is direct SDK calls.

Architecture — how the capability layer fits into the broader system (MCP, library, backend, transports)
Transport vs app simulation — the deep dive on how each transport implements the capability vocabulary
Vendors — the strategic multi-vendor story; per-vendor capability manifests
Vendors: Meta Ray-Ban — the GA target's full capability manifest
Quickstart with an AI agent — install the MCP server and see capabilities in action
