Observe & Set-of-Marks
observe() returns two things:
- A screenshot of the simulator’s current screen with numbered overlays on every interactable element (“set-of-marks”).
- A structured element list, [{mark_id, type, frame, text, identifier}, ...], with one entry for every element the overlay marks.
Your agent reads both, then acts by calling tap({mark_id: 7}) or tap({text: "Sign In"}). This is the vision-first replacement for selector frameworks.
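As a rough illustration of how the element list and the two tap forms relate, here is a minimal TypeScript sketch. Only the field names (mark_id, type, frame, text, identifier) and the two tap argument shapes come from this page. The value types, the frame layout, the sample data, and the resolveTarget helper are assumptions made for illustration, not SimDrive's actual API.

```typescript
// Hypothetical shapes: the field names come from the docs above,
// the value types and the frame layout are assumptions.
interface ElementFrame {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface MarkedElement {
  mark_id: number;
  type: string;
  frame: ElementFrame;
  text: string | null;
  identifier: string | null;
}

// A tap target names an element either by its mark or by its visible text.
type TapTarget = { mark_id: number } | { text: string };

// Resolve a tap target against the element list from one observe() call.
// Returns undefined when nothing matches.
function resolveTarget(
  elements: MarkedElement[],
  target: TapTarget
): MarkedElement | undefined {
  if ("mark_id" in target) {
    return elements.find((el) => el.mark_id === target.mark_id);
  }
  return elements.find((el) => el.text === target.text);
}

// Sample element list, invented for illustration.
const sampleElements: MarkedElement[] = [
  {
    mark_id: 6,
    type: "SecureTextField",
    frame: { x: 40, y: 520, width: 310, height: 44 },
    text: null,
    identifier: "password",
  },
  {
    mark_id: 7,
    type: "Button",
    frame: { x: 40, y: 600, width: 310, height: 48 },
    text: "Sign In",
    identifier: "signInButton",
  },
];

const byMark = resolveTarget(sampleElements, { mark_id: 7 });
const byText = resolveTarget(sampleElements, { text: "Sign In" });
// Both target forms resolve to the same element in this sample.
console.log(byMark === byText);
```

In this sketch the mark id is the shortest way to name an element within a single observation, while the text form stays readable without the screenshot at hand.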
Why set-of-marks
The set-of-marks (SoM) pattern was introduced by Microsoft Research for GPT-4V; SimDrive uses it because it solves two real problems:
- Visual grounding: the model sees both the screenshot and a numbered overlay, so it can say “tap 7” instead of describing pixel regions in prose. This is cheap and robust, and it needs no selectors.
- Determinism on replay: the recording can store mark_id: 7 together with the element's text label and frame. On replay, SimDrive re-runs observe, finds the same element by label or by coordinate, and dispatches the tap.
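The replay lookup described above could be sketched as follows. This is not SimDrive's implementation: the page says only that the element is found "by label-or-coordinate", so the order (a unique label match first, then the recorded frame's center point), the RecordedTap shape, the frame layout, and the relocate helper are all assumptions made for illustration.

```typescript
// Hypothetical shapes; only the names mark_id, text, and frame come from the docs.
interface ElementFrame {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface MarkedElement {
  mark_id: number;
  type: string;
  frame: ElementFrame;
  text: string | null;
}

// What a recording might store for one tap (assumed shape).
interface RecordedTap {
  mark_id: number;
  text: string | null;
  frame: ElementFrame;
}

// True when the point (px, py) lies inside the frame.
function containsPoint(frame: ElementFrame, px: number, py: number): boolean {
  return (
    px >= frame.x &&
    px <= frame.x + frame.width &&
    py >= frame.y &&
    py <= frame.y + frame.height
  );
}

// Find the recorded element in a fresh observation.
// Assumed strategy: accept a label match only if it is unique,
// otherwise fall back to the center of the recorded frame.
function relocate(
  step: RecordedTap,
  fresh: MarkedElement[]
): MarkedElement | undefined {
  if (step.text !== null) {
    const sameLabel = fresh.filter((el) => el.text === step.text);
    if (sameLabel.length === 1) {
      return sameLabel[0];
    }
  }
  const cx = step.frame.x + step.frame.width / 2;
  const cy = step.frame.y + step.frame.height / 2;
  return fresh.find((el) => containsPoint(el.frame, cx, cy));
}

// Invented data: the button moved slightly and received a different mark id.
const recorded: RecordedTap = {
  mark_id: 7,
  text: "Sign In",
  frame: { x: 40, y: 600, width: 310, height: 48 },
};

const freshElements: MarkedElement[] = [
  { mark_id: 8, type: "Image", frame: { x: 0, y: 0, width: 390, height: 200 }, text: null },
  { mark_id: 9, type: "Button", frame: { x: 40, y: 612, width: 310, height: 48 }, text: "Sign In" },
];

const relocated = relocate(recorded, freshElements);
console.log(relocated?.mark_id); // mark id of the relocated element in the fresh observation
```

The sketch deliberately does not assume that a mark id stays stable between runs, which is one plausible reason to store the label and frame alongside it.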
Coming soon
- The exact algorithm for assigning mark ids (z-order traversal, hit-testing)
- Element-type taxonomy (Button, SecureTextField, Cell, Image, …)
- How observe handles overlays, modals, and Dynamic Island
- The redaction pass for SecureTextField (no rendered glyphs of a password field ever hit disk; see the redaction spec)