Concepts

Capabilities

The Extentos capability vocabulary — vendor-agnostic primitives (capture_photo, capture_video, record_audio, speak_text, voice_command, tap, double_tap, sensor reads, hardware events) that an AppSpec composes from. How abstract capabilities translate to platform-specific calls on iOS and Android, how permissions derive automatically, how validation negotiates against per-vendor manifests, and why a shared vocabulary plus a standard transport interface is what makes the same code run across Meta Ray-Ban, Mentra G1, Android XR, and future smart-glasses vendors.

A capability in Extentos is a vendor-agnostic primitive your app composes from — capture_photo, capture_video, record_audio, speak_text, voice_command, tap, double_tap, sensor reads, hardware events. The capability vocabulary is the contract between your AppSpec (which is written in capability terms) and the underlying transport (which translates those capabilities into platform-specific calls — Meta DAT on Ray-Ban Meta today, Mentra's SDK on Mentra G1 tomorrow). This page is the technology behind that contract: the full vocabulary, how the AppSpec compiler validates it against per-vendor manifests, how platform permissions derive automatically, and why a shared vocabulary plus a standard transport interface is the design that makes the same code run across every supported smart-glasses vendor.

The capability layer

Three coordinated layers turn an abstract capability into running code on real hardware:

| Layer | What it does | Owned by |
| --- | --- | --- |
| 1. Capability vocabulary | The set of abstract primitives — block kinds, trigger types, action types, stream types, hardware-event kinds. The same on every vendor. | Extentos (the language) |
| 2. Per-vendor capability manifest | Which capabilities a specific vendor exposes — e.g., Meta Ray-Ban supports capture_photo and voice_command via the DAT public toolkit but not display_render or custom_gesture. | Each vendor (the subset) |
| 3. Transport implementation | The code that translates an abstract glasses.camera.capturePhoto() call into vendor-specific API calls (Meta DAT, Mentra SDK, etc.). One transport per vendor. | Each vendor's transport (the wiring) |

Your AppSpec is written in layer 1 — pure capability primitives, no vendor names. The MCP server's validateIntegration tool checks your spec against layer 2 for the target vendor — flagging anything the vendor doesn't expose. At runtime, the library's selected transport (layer 3) translates your code into actual platform calls. Same spec, same code, different transport — that's how an app written for Meta Ray-Ban can later target a vendor with a different SDK shape without rewrites.

This catalog is also available to your agent live. Once Extentos's MCP server is registered with your agent, the agent calls searchDocs(topic: "block_types") for blocks (or "trigger_types" / "action_types" / "stream_types" / "spec_format") and gets the catalog with inline minimal examples scoped to the current vendor's capability manifest. The static tables on this page are the human-readable reference for pre-install evaluation, SEO, and out-of-context lookup; the live MCP response is authoritative when composing a real spec.

Block kinds — the things glasses do

A block is an action the glasses perform. There are four block kinds in AppSpecV2. Each compiles to a transport call that the vendor's implementation honors.

| Block kind | What it does | Maps to (transport-level) |
| --- | --- | --- |
| capture_photo | Capture a still frame from the glasses camera | transport.capturePhoto(config) |
| capture_video | Record a video clip to glasses storage | transport.captureVideo(config) |
| record_audio | Record an audio clip from the glasses microphone (streamed via Bluetooth HFP/SCO to the phone) | transport.recordAudio(config) |
| speak_text | Synthesize speech on the phone and play it through the glasses speaker via Bluetooth A2DP | transport.speak(text, config) |

These are intentionally bounded. There's no glasses.run_arbitrary_code block; capabilities are a finite set of well-typed primitives the AppSpec compiler can validate, the simulator can faithfully reproduce, and the transport interface can implement deterministically per vendor.

Why blocks are a finite set: every additional block kind is a contract that has to land in the spec schema, the validator, the browser simulator, the local simulator, and every vendor's transport implementation simultaneously. Keeping the set small keeps the cross-vendor portability story simple. New blocks are added when there's a real cross-vendor capability worth standardizing — not for one-off vendor features.

Trigger types — what starts a flow

A trigger is an event that fires a flow in your AppSpec. Five trigger types:

| Trigger type | What fires it | Notes |
| --- | --- | --- |
| voice_command | The wearer says a configured phrase. Phone's STT recognizes audio captured from the glasses microphone over BT. | Custom phrases only — "Hey Meta" is reserved by Meta and not third-party-exposed |
| manual_launch | Developer explicitly invokes a flow from app code | For app-driven entry points |
| capture_button | The wearer presses the glasses' physical capture button | Vendor-dependent; Meta Ray-Ban supports this |
| tap | A single tap on a designated glasses surface | Not in Meta DAT public preview as of 2026-04 |
| double_tap | A double tap on a designated glasses surface | Not in Meta DAT public preview as of 2026-04 |

The capability manifest tells you which trigger types a target vendor actually supports. validateIntegration rejects a spec that uses a trigger type the target doesn't expose, before runtime — so a developer writing a tap trigger gets a build-time error explaining that Meta Ray-Ban's public toolkit doesn't currently support it, instead of a runtime no-op.
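The check itself is a pure set lookup, which is what makes it deterministic. A minimal sketch, assuming a manifest shape and error format that are illustrative rather than the real validateIntegration output:

```typescript
// Hypothetical shapes: the real manifest and error format may differ.
type CapabilityManifest = {
  vendor: string;
  blocks: string[];
  triggers: string[];
  streams: string[];
};

type SpecError = { code: string; message: string };

// Every trigger type the spec uses must appear in the vendor manifest;
// anything missing becomes a structured, actionable error.
function checkTriggers(
  specTriggers: string[],
  manifest: CapabilityManifest
): SpecError[] {
  const supported = new Set(manifest.triggers);
  return specTriggers
    .filter((t) => !supported.has(t))
    .map((t) => ({
      code: "UNSUPPORTED_TRIGGER",
      message: `trigger type ${t} is not supported by vendor ${manifest.vendor}`,
    }));
}

// Example manifest modeled on Meta Ray-Ban's public toolkit as described above.
const metaManifest: CapabilityManifest = {
  vendor: "meta_rayban",
  blocks: ["capture_photo", "capture_video", "record_audio", "speak_text"],
  triggers: ["voice_command", "manual_launch", "capture_button"],
  streams: ["video_frames", "audio_chunks"],
};

const errors = checkTriggers(["voice_command", "tap"], metaManifest);
// One error for "tap"; "voice_command" passes.
```

Because the answer is computed from a finite set rather than guessed, an agent can run this gate after every spec mutation and trust the result.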

Action types — what flows do

A flow is a sequence of actions executed when a trigger fires. Four action types:

| Action type | What it does |
| --- | --- |
| block_call | Invoke one of the four block kinds (capture_photo, capture_video, record_audio, speak_text) |
| ai_call | Call out to an AI handler in the developer's app (vision model, LLM, translation, OCR — the developer brings their own provider keys) |
| branch | Conditional execution based on a runtime variable |
| set_variable | Bind a value into the flow's variable scope |

Variable substitution uses {{key}} templates. A captured photo bound to {{capture.uri}} is automatically inlined when an ai_call action references it. The library's interpreter resolves templates at runtime; the simulator surfaces unresolved templates as runtime:TemplateUnresolved events so the agent can debug them.
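Template resolution can be sketched as a single pass over the string, assuming a flat string-keyed variable scope; the real interpreter's API and scope model are not shown here:

```typescript
// Sketch of {{key}} resolution. Unresolved keys are collected rather
// than thrown, mirroring how the simulator surfaces them as events.
type Scope = Record<string, string>;

function resolveTemplates(
  input: string,
  scope: Scope
): { text: string; unresolved: string[] } {
  const unresolved: string[] = [];
  const text = input.replace(/\{\{([\w.]+)\}\}/g, (whole, key: string) => {
    if (key in scope) return scope[key];
    unresolved.push(key); // would surface as runtime:TemplateUnresolved
    return whole;         // leave the template intact for debugging
  });
  return { text, unresolved };
}

const out = resolveTemplates("describe {{capture.uri}} for {{user.name}}", {
  "capture.uri": "file:///photo_0042.jpg",
});
// out.text === "describe file:///photo_0042.jpg for {{user.name}}"
// out.unresolved === ["user.name"]
```

Leaving the unresolved template in place, rather than substituting an empty string, is what makes the TemplateUnresolved event debuggable: the agent can see exactly which key was missing.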

ai_call is the bridge between the spec and the developer's own code. Extentos doesn't sit in the AI cost path — the developer's app handler runs the AI call against their chosen provider (Anthropic, OpenAI, local model, whatever) using their own credentials. The spec just declares "an AI handler is needed here"; the handler is the developer's app_callback implementation.
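Put together, a flow is just data: a trigger plus an ordered list of actions. A sketch of that composition, with illustrative field names rather than the real AppSpecV2 schema:

```typescript
// Illustrative only: field names are assumptions, not the AppSpecV2 schema.
type Action =
  | { type: "block_call"; block: "capture_photo" | "capture_video" | "record_audio" | "speak_text"; config?: object; bindTo?: string }
  | { type: "ai_call"; handler: string; input: string; bindTo?: string }
  | { type: "branch"; if: string; then: Action[]; else?: Action[] }
  | { type: "set_variable"; key: string; value: string };

const actions: Action[] = [
  // Capture a photo and bind the result into the variable scope.
  { type: "block_call", block: "capture_photo", bindTo: "capture" },
  // {{capture.uri}} is inlined when the ai_call references it; the
  // handler name is the developer's own app_callback implementation.
  { type: "ai_call", handler: "describe_image", input: "{{capture.uri}}", bindTo: "description" },
  // Speak the AI handler's result back through the glasses.
  { type: "block_call", block: "speak_text", config: { text: "{{description.text}}" } },
];

const describeSceneFlow = {
  trigger: { type: "voice_command", phrase: "describe this" },
  actions,
};
```

Note that nothing in the flow names an AI provider: "describe_image" is a hypothetical handler the developer's app registers, which is where their own model and credentials live.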

Streams — continuous capability flows

Some capabilities are streams, not one-shot blocks. They produce continuous data the developer's app subscribes to.

| Stream type | What it streams | Typical config |
| --- | --- | --- |
| Video frame stream | Camera frames at a configured frame rate and resolution | LOW resolution / 2 fps is the typical vision-pipeline default |
| Audio chunk stream | Microphone audio in fixed-duration chunks for STT or live processing | Configurable chunk cadence (e.g., 20 ms) |

Streams have their own backpressure semantics (the library applies a PresentationQueue policy so a slow consumer doesn't block the transport). The simulator emits stream.started, stream.stopped, and stream.backpressure events into the structured event log. Stream config is a request, not a guarantee: the transport may downgrade it based on hardware policy or coexistence constraints, and the actual configuration is reported back via stream.started.negotiatedConfig.
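The slow-consumer behavior can be sketched as a bounded queue that drops the oldest item when full. This is one assumed policy; the library's actual PresentationQueue may use a different one (e.g., latest-wins or configurable):

```typescript
// Sketch of a drop-oldest bounded queue: an assumed policy, not the
// library's actual PresentationQueue implementation.
class BoundedQueue<T> {
  private items: T[] = [];
  public dropped = 0;

  constructor(private capacity: number) {}

  // Producer side: never blocks the transport. Over capacity, the
  // oldest item is discarded (where stream.backpressure would fire).
  push(item: T): void {
    if (this.items.length >= this.capacity) {
      this.items.shift();
      this.dropped++;
    }
    this.items.push(item);
  }

  // Consumer side: the app drains at its own pace.
  poll(): T | undefined {
    return this.items.shift();
  }

  get size(): number {
    return this.items.length;
  }
}

const frames = new BoundedQueue<number>(3);
[1, 2, 3, 4, 5].forEach((f) => frames.push(f));
// frames 1 and 2 were dropped; the queue holds [3, 4, 5]
```

The design point is that pressure is absorbed between producer and consumer: the transport keeps producing at the negotiated rate regardless of how slowly the app drains.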

Hardware events — what happens to the glasses

The glasses themselves emit events the app can listen for. These are vendor-supplied signals about the hardware state.

| Event kind | What it means | Use case |
| --- | --- | --- |
| thermal_warning | The hardware is heating up; throttle stream rates | Dial back video frame rate, pause non-essential capture |
| hinges_closed | The user folded the glasses (typically removes them) | Pause active streams, end session, prompt re-pair on unfold |
| audio_route_changed | The Bluetooth audio route changed (A2DP ↔ HFP/SCO) | Adjust playback strategy, re-route TTS |
| incoming_call_detected | The phone has an incoming call; audio routing will preempt | Pause TTS, defer voice triggers until call ends |
| app_lifecycle_changed | The phone app moved between foreground and background | Suspend or resume sessions accordingly |

These map to transport.hardware_alert events in the structured event log. The AppSpec can also dispatch a trigger on the same kinds — for example, hinges_closed can fire a trigger that gracefully ends an in-progress flow. That dual surface (event observation + trigger dispatch) is what lets apps be both reactive and resilient to hardware reality.
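The dual surface can be sketched as one dispatcher feeding both passive observers and spec-level trigger bindings. The event kinds come from the table above; the dispatcher shape and binding API are assumptions:

```typescript
// Sketch only: the hub and its API are illustrative, not the real library.
type HardwareEventKind =
  | "thermal_warning"
  | "hinges_closed"
  | "audio_route_changed"
  | "incoming_call_detected"
  | "app_lifecycle_changed";

type Listener = (kind: HardwareEventKind) => void;

class HardwareEventHub {
  private observers: Listener[] = [];
  private triggerBindings = new Map<HardwareEventKind, () => void>();

  // Surface 1: passive observation (transport.hardware_alert in the log).
  observe(fn: Listener): void {
    this.observers.push(fn);
  }

  // Surface 2: a spec-declared trigger bound to a hardware event kind.
  bindTrigger(kind: HardwareEventKind, fire: () => void): void {
    this.triggerBindings.set(kind, fire);
  }

  dispatch(kind: HardwareEventKind): void {
    this.observers.forEach((fn) => fn(kind));
    this.triggerBindings.get(kind)?.();
  }
}

const hub = new HardwareEventHub();
const log: string[] = [];
hub.observe((k) => log.push(`alert:${k}`));
hub.bindTrigger("hinges_closed", () => log.push("flow:end_session"));
hub.dispatch("hinges_closed");
// log: ["alert:hinges_closed", "flow:end_session"]
```

One dispatch reaches both surfaces, which is the reactive-plus-resilient property the section describes: the app can log the alert and gracefully end the flow from the same event.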

Permission derivation

The AppSpec's derived.capabilitiesUsed field is computed automatically by the spec compiler from the blocks, triggers, and streams the spec declares. Each capability has a known set of platform permissions associated with it:

| Capability | Android (manifest) | iOS (Info.plist) |
| --- | --- | --- |
| capture_photo, capture_video, video stream | CAMERA, BT permissions | NSCameraUsageDescription, BT keys |
| record_audio, audio stream | RECORD_AUDIO, BT permissions | NSMicrophoneUsageDescription, BT keys |
| voice_command | RECORD_AUDIO, BLUETOOTH_* | NSMicrophoneUsageDescription, NSSpeechRecognitionUsageDescription, BT keys |
| speak_text | BT permissions | BT keys |
| Any glasses connection | BLUETOOTH_CONNECT, BLUETOOTH_SCAN, BLUETOOTH_ADMIN | NSBluetoothAlwaysUsageDescription, MWDAT plist keys |

The MCP server's getPermissions tool returns the exact set for the current spec, per platform, so the agent can keep the manifest and Info.plist in sync without the developer hand-maintaining them. When a capability is added to the spec (e.g., the agent adds a voice_command trigger), the next getPermissions call surfaces the new keys; the agent (or generateConnectionModule) writes them in.

This is the practical payoff of the abstraction: a developer who never had to think about iOS speech-recognition entitlements still ends up with a correct NSSpeechRecognitionUsageDescription because the spec said "voice_command" and the toolchain knew what that meant on iOS.
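Derivation is a union over a static capability-to-permission map. A sketch with the mappings transcribed from the table above; the toolchain's actual data structure and the getPermissions output shape are not shown here:

```typescript
// Map transcribed (abbreviated) from the permission table above.
const PERMISSION_MAP: Record<string, { android: string[]; ios: string[] }> = {
  capture_photo: { android: ["CAMERA"], ios: ["NSCameraUsageDescription"] },
  capture_video: { android: ["CAMERA"], ios: ["NSCameraUsageDescription"] },
  record_audio: { android: ["RECORD_AUDIO"], ios: ["NSMicrophoneUsageDescription"] },
  voice_command: {
    android: ["RECORD_AUDIO"],
    ios: ["NSMicrophoneUsageDescription", "NSSpeechRecognitionUsageDescription"],
  },
  speak_text: { android: [], ios: [] },
  // Any glasses connection always implies the Bluetooth permissions.
  _connection: {
    android: ["BLUETOOTH_CONNECT", "BLUETOOTH_SCAN", "BLUETOOTH_ADMIN"],
    ios: ["NSBluetoothAlwaysUsageDescription"],
  },
};

function derivePermissions(
  capabilities: string[]
): { android: string[]; ios: string[] } {
  const android = new Set<string>();
  const ios = new Set<string>();
  for (const cap of [...capabilities, "_connection"]) {
    for (const p of PERMISSION_MAP[cap]?.android ?? []) android.add(p);
    for (const p of PERMISSION_MAP[cap]?.ios ?? []) ios.add(p);
  }
  return { android: [...android].sort(), ios: [...ios].sort() };
}

const perms = derivePermissions(["capture_photo", "voice_command"]);
// perms.ios includes NSSpeechRecognitionUsageDescription because the
// spec declared voice_command; the developer never typed that key.
```

Because the map is declared once, centrally, adding a capability to a spec can never leave a platform permission behind: the union is recomputed from scratch every time.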

Validation and capability negotiation

Three MCP tools work together to keep an AppSpec aligned with what the target vendor can actually do:

  • getPlatformInfo({ glasses: "<vendor>" }) — returns the vendor's capability manifest. Which blocks, triggers, and streams it supports. Which it doesn't. Which are GA versus preview.
  • validateIntegration() — checks the AppSpec against the vendor manifest for the configured target. Flags capabilities the spec uses that the vendor doesn't expose. Returns structured errors the agent can act on.
  • getProductionChecklist() — late-stage gate. Verifies permissions are wired, credentials are set, and edge cases are handled. Run before shipping.

In the typical agent flow, getPlatformInfo is the first call (discovery), validateIntegration runs after every spec mutation (correctness gate), and getProductionChecklist runs once the developer is preparing to ship. The capability layer is what lets these tools be deterministic — they answer yes/no against the manifest rather than guessing.

The transport contract

Each vendor provides a GlassesTransport implementation — the code that translates abstract capability calls into platform-specific API calls. The interface is identical across vendors:

GlassesTransport
├─ connect(deviceId)
├─ capturePhoto(config)
├─ captureVideo(config)
├─ recordAudio(config)
├─ videoFrames(config)   ─► continuous stream
├─ audioChunks(config)   ─► continuous stream
├─ speak(text, config)
├─ playEarcon(sound, volume)
└─ events                ─► transport state, hardware alerts, errors

A vendor that supports a capability implements the corresponding method against its SDK. A vendor that doesn't support a capability either fails fast at validateIntegration (preferred — caught before runtime) or surfaces a TransportError.HardwareUnavailable at runtime (fallback for capabilities that depend on runtime state, like permissions).
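In TypeScript terms the contract might look like the following. The method names mirror the tree above, but the signatures, the error shape, and the stub vendor are illustrative assumptions, not the real library API:

```typescript
// Signatures are assumptions; only the method names come from the tree above.
interface GlassesTransport {
  connect(deviceId: string): Promise<void>;
  capturePhoto(config?: object): Promise<{ uri: string }>;
  captureVideo(config?: object): Promise<{ uri: string }>;
  recordAudio(config?: object): Promise<{ uri: string }>;
  videoFrames(config?: object): AsyncIterable<Uint8Array>;
  audioChunks(config?: object): AsyncIterable<Uint8Array>;
  speak(text: string, config?: object): Promise<void>;
  playEarcon(sound: string, volume: number): Promise<void>;
}

class TransportError extends Error {
  constructor(public kind: "HardwareUnavailable" | "Disconnected") {
    super(kind);
  }
}

// A partial vendor stub: photo capture works; video capture surfaces the
// runtime fallback described above instead of silently no-opping.
class PhotoOnlyTransport {
  async capturePhoto(): Promise<{ uri: string }> {
    return { uri: "file:///photo_0001.jpg" };
  }
  async captureVideo(): Promise<never> {
    throw new TransportError("HardwareUnavailable");
  }
}
```

In practice most unsupported capabilities never reach the runtime error: the capability manifest declares the gap and validateIntegration rejects the spec first.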

This is the engineering boundary that makes "add a vendor = implement the interface" a clean, bounded task — not a sprawling rewrite.

For the deep dive on how transports work and what each implementation does, see transport vs app simulation.

Why a shared vocabulary is the right design

Five reasons the capability layer is shaped this way:

  1. Vendor portability is structural, not negotiated. Because the AppSpec is written in capability primitives instead of vendor-specific calls, an app targeting Meta Ray-Ban today can target a future vendor by switching the transport — no code rewrite, no spec migration. The portability is a property of the architecture, not something an individual developer has to engineer per project.
  2. Validation is deterministic. The capability vocabulary is finite and the per-vendor manifest is a known set. validateIntegration answers "does this spec run on this vendor?" with a yes/no plus structured errors. That determinism is what lets an AI agent confidently mutate the spec — every change has a clear validation outcome.
  3. Permissions derive automatically. Each capability declares its platform-permission requirements once, centrally. Add a voice_command trigger to the spec; the iOS Info.plist and the Android manifest get the right keys without the developer learning what NSSpeechRecognitionUsageDescription is.
  4. Simulators are honest. The browser simulator and the on-device Mock simulator both implement the same capability vocabulary the production transports do. There's no "simulator-only" or "production-only" capability — anything you can run in simulation runs in production, and vice versa.
  5. New vendor onboarding is bounded work. Adding Mentra G1 or Android XR support is "implement the GlassesTransport interface against the new SDK and declare the capability manifest." No spec format changes, no AppSpec migrations for existing developers, no new MCP tools.

Targeting multiple vendors

The capability layer is what makes this technically possible. The strategic story — what supported and roadmap vendors are, when each ships, how to think about portability when planning your app — lives on /docs/vendors as the section landing page, with per-vendor manifests at /docs/vendors/meta, /docs/vendors/mentra, /docs/vendors/android-xr, and /docs/vendors/apple.

A future page will cover the runtime semantics of multi-vendor apps — graceful degradation when a target vendor doesn't expose a capability the spec uses, fallbacks, validation policies for "this app must run on at least N of these vendors." That's deferred until a second vendor is shipping, when the rules will be concrete enough to commit to. For now: target one vendor at a time, let validateIntegration confirm fit, and rely on the capability layer to keep your spec portable when the time comes.

Frequently asked questions

Can I add a new capability that isn't in the vocabulary?

Not directly — the capability vocabulary is a coordinated contract across the spec schema, the validator, both simulators, and every vendor's transport. Extending it is an Extentos library change. If a capability you need doesn't exist, the path is to file an issue describing the use case; it's added when there's a cross-vendor primitive worth standardizing.

For app-specific behavior that doesn't need a new capability — custom AI processing, business logic, network calls — use ai_call actions and app_callback handlers. The handler is your code; you can do anything in it.

How does the spec know what permissions to derive on iOS vs Android?

The capability vocabulary has a per-platform permission map baked in. voice_command declares NSSpeechRecognitionUsageDescription on iOS and RECORD_AUDIO on Android, plus the Bluetooth keys both platforms need. The MCP server's getPermissions tool returns the current set; generateConnectionModule writes them into the manifest and Info.plist for you.

What happens if I add a tap trigger to a Meta Ray-Ban spec?

validateIntegration returns a structured error: "trigger type tap is not supported by vendor meta_rayban as of capability manifest version X". The agent surfaces the error, and you either remove the trigger or wait for Meta to expose it in the public DAT toolkit. The spec never silently no-ops.

Are streams metered the same as blocks?

Streams emit stream.started, stream.stopped, and stream.backpressure events into the event log. Each event counts against the simulator runtime-event meter (1000 events free, then sign up for a free account — see pricing). MCP tool calls don't count.

How is voice_command different from "Hey Meta"?

"Hey Meta" is Meta's system-level wake word — third-party apps can't hook it. voice_command is Extentos's custom-phrase trigger: the glasses microphone captures audio, streams it to the phone via Bluetooth HFP/SCO, the phone's native speech recognizer (SpeechRecognizer on Android, SFSpeechRecognizer on iOS) recognizes against your configured phrases, and a match dispatches the trigger. No "Hey Meta" prefix; the wake word is your phrase. See vendors/meta for the full audio-architecture story.

Does the capability layer add runtime overhead?

Negligible. The library is a thin translation between abstract calls and the vendor's SDK. There's no extra serialization, no extra IPC, no proxy layer. Capability indirection is compile-time (the AppSpec compiler resolves it once); runtime is direct SDK calls.