MCP-native

Your AI agent already knows how to call tools. SimDrive exposes the iOS simulator (and paired devices) as 32 tools your agent can call directly over Model Context Protocol. No DSL, no selector framework, no glue code: the agent reads the ticket, calls tap and type_text, and gets back screenshots and structured state.

iOS automation has lived in two camps:

  1. Selector-based (XCUITest, Appium): you write code to find elements by accessibility ID. Brittle, requires app instrumentation, fights with SwiftUI’s runtime element generation.
  2. Click-record-replay (Maestro, classic Studio tools): a human authors the flow once via UI, the tool replays it. Author cost is high.

SimDrive is a third camp: the AI agent is the author. The agent sees the screen (vision), decides what to do (reasoning), and calls a small, well-typed set of action tools. The same agent you already use for code is the agent that drives the simulator.

This only works because MCP standardizes how agents discover and call tools. SimDrive doesn’t ship a special client — it ships an MCP server. Any client that speaks MCP (Claude Code, Claude Desktop, Cursor, future clients) gets the full 32-tool surface for free.

Each tool has a name, a JSON Schema, and a description. Here’s tap:

{
  "name": "tap",
  "description": "Tap a UI element by text label, mark id, or absolute coordinates.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "text": { "type": "string", "description": "Visible text on the target element." },
      "mark_id": { "type": "integer", "description": "Mark id returned by observe()." },
      "x": { "type": "number" },
      "y": { "type": "number" }
    }
  }
}

The client fetches this schema when it connects (MCP's tools/list request) and passes it to the agent. The agent now knows it can tap by text, by mark id, or by coordinates, and it picks the right one for the situation. You never write that mapping yourself.
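On the wire, discovery and invocation are plain JSON-RPC 2.0 messages: MCP defines tools/list for fetching tool schemas and tools/call for invoking a tool. The sketch below builds the messages a client might send for the tap tool. The helper names and request ids are illustrative, not part of SimDrive, and a real client would normally rely on an MCP SDK rather than assemble these by hand.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids; any unique value per request works


def list_tools_request() -> dict:
    """Build the MCP request that asks a server for its tool schemas."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": "tools/list"}


def call_tool_request(name: str, arguments: dict) -> dict:
    """Build the MCP request that invokes one tool with JSON arguments."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# The three targeting modes the tap schema allows.
by_text = call_tool_request("tap", {"text": "Sign In"})
by_mark = call_tool_request("tap", {"mark_id": 7})
by_coords = call_tool_request("tap", {"x": 195, "y": 480})

if __name__ == "__main__":
    print(json.dumps(by_text, indent=2))
```

Because the arguments are ordinary JSON checked against the published schema, the agent needs no SimDrive-specific client code to switch between the three modes.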

| Group | Count | Tools |
|---|---|---|
| Lifecycle | 3 | session_start, session_end, session_status |
| Observe | 1 | observe |
| Act | 6 | tap, swipe, type_text, press_key, clear_field, dismiss_sheet |
| Record & Replay | 5 | record_start, record_stop, replay, list_replays, validate_replay |
| Performance | 4 | perf, perf_baseline, perf_compare, memory |
| Diagnostics | 5 | doctor, app_state, apps, crashes, list_devices |
| Robustness | 3 | dismiss_first_launch_alerts, pre_grant_permissions, set_appearance |
| Logs | 1 | logs |
| Recordings ops | 2 | lint_recordings, migrate_recording |
| Journeys | 1 | load_journey |
| Meta | 1 | version |

Full schemas and examples: MCP Tool Reference.

In practice, the agent finds elements three ways:

  1. observe() returns set-of-marks — every interactable element gets a numbered overlay on the screenshot and a corresponding entry in a structured list ({mark_id, type, frame, text, identifier}). The agent picks a mark id and calls tap({mark_id: 7}).
  2. Text fallback: tap({text: "Sign In"}) does a case-insensitive contains-match against visible labels. For obvious targets this is faster than the observe → pick mark round trip.
  3. Coordinate fallback: tap({x: 195, y: 480}) when neither label nor mark suffices (rare). Recordings prefer text or mark id for portability.
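The three strategies above can be sketched as a small decision function on the agent's side. This is an illustration of the selection logic, not SimDrive's implementation: the sample marks, the frame layout ([x, y, width, height]), the identifier values, and both function names are assumptions made for the example.

```python
from typing import Optional

# Hypothetical observe() output, shaped like the structured list described above.
MARKS = [
    {"mark_id": 3, "type": "textField", "frame": [20, 200, 350, 44],
     "text": "Email", "identifier": "email_field"},
    {"mark_id": 7, "type": "button", "frame": [20, 460, 350, 44],
     "text": "Sign In", "identifier": "sign_in_button"},
]


def find_by_text(marks: list, query: str) -> Optional[dict]:
    """Return the first mark whose label contains `query`, ignoring case."""
    needle = query.casefold()
    for mark in marks:
        if needle in (mark.get("text") or "").casefold():
            return mark
    return None


def choose_tap_args(label: str, marks: Optional[list] = None,
                    fallback_xy: Optional[tuple] = None) -> dict:
    """Pick tap arguments: mark id if observed, else text, else coordinates."""
    if marks is not None:
        # The agent has called observe(): prefer an unambiguous mark id.
        hit = find_by_text(marks, label)
        if hit is not None:
            return {"mark_id": hit["mark_id"]}
    elif label:
        # No observation yet: let the server match the visible label.
        return {"text": label}
    if fallback_xy is not None:
        # Last resort: absolute coordinates (least portable across devices).
        x, y = fallback_xy
        return {"x": x, "y": y}
    raise LookupError(f"no way to target {label!r}")
```

For example, choose_tap_args("sign in", marks=MARKS) selects mark 7, while the same call without marks falls back to a text match.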

See Concepts → Observe for the set-of-marks model in detail.

The agent pays tokens on the record path (it sees screenshots, reasons, calls tools). The replay path is deterministic — SimDrive re-executes the recorded steps without calling any model. This is the unit economics that makes recordings worth saving:

| Path | Cost per run | Variance |
|---|---|---|
| Record (AI authors) | ~$0.05–$0.30 per flow | Some (vision tokens vary by screen complexity) |
| Replay (CI re-runs) | $0 | None (bit-for-bit deterministic) |
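To make the arithmetic concrete, here is a minimal sketch of the amortized cost per execution when a flow is recorded once and then replayed. It assumes the $0 replay figure refers to model spend only (CI compute is not counted), and the function name is illustrative.

```python
def amortized_cost_per_run(record_cost_usd: float, replay_runs: int) -> float:
    """Average model cost per execution: one paid recording plus free replays."""
    if replay_runs < 0:
        raise ValueError("replay_runs must be non-negative")
    # Replays cost $0 in model tokens, so only the recording is paid for.
    return record_cost_usd / (1 + replay_runs)


if __name__ == "__main__":
    # Upper end of the recording range from the table, replayed 99 times in CI.
    print(f"${amortized_cost_per_run(0.30, 99):.4f} per run")
```

Under these assumptions, the per-run cost shrinks in proportion to how often the recording is replayed, which is why a flow that runs on every CI build is cheap to keep.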