AI Testing (Beta)

VibeView’s AI testing system lets you define automated test cases that run on real iOS and Android simulators. Tests can replay recorded interactions, use an AI agent that understands your app’s UI, or combine both approaches for maximum reliability.

Test Suites

A test suite is a collection of related test cases that belong to a specific app. Suites provide shared configuration that applies to all cases within them.

Key properties

App association — Each suite is associated with a single app. The selected app determines which app build is installed and run; a suite must have an app assigned before it can run (the run button is disabled otherwise). You view and manage suites from the Tests sidebar (click a suite to open its page).
Sub-flows — Attach reusable standalone cases (for example, a login flow) to a case so shared steps are defined once and reused everywhere. See Sub-Flows.
Reset strategy — Control how the app state is cleaned between cases. The clear_data strategy wipes app storage before each case, ensuring a consistent starting point.
Context — A free-text field describing the app and its behavior. This context is included in the AI agent’s prompt, helping it make better decisions about how to interact with your app.
Variables — Key-value pairs shared across all cases in the suite. Variables can be substituted into test steps, allowing you to reuse the same case logic with different data.

All of the above, along with the Visual Regression thresholds, are configured together on the suite’s Configuration tab (Tests → your suite → Configuration) and saved with a single Save Configuration button. See Visual Regression Testing for details on the threshold settings.

Test Cases

Each test case defines a sequence of steps to execute on the simulator. Cases support two execution modes, recorded and AI; hybrid mode (described below) combines them.

Recorded mode

In recorded mode, steps are defined as structured JSON actions (tap at coordinates, type text, swipe in a direction). The runner replays these actions exactly as recorded. This mode is fast and deterministic but can break when the UI layout changes.
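
As a rough illustration, the structured actions could be modeled as a small tagged union. The field names here (`type`, `x`, `y`, `text`, `direction`) are assumptions made for this sketch and are not taken from VibeView’s actual JSON schema.

```typescript
// Hypothetical action shapes; VibeView's real JSON schema may use different fields.
type RecordedAction =
  | { type: "tap"; x: number; y: number }
  | { type: "type_text"; text: string }
  | { type: "swipe"; direction: "up" | "down" | "left" | "right" };

// Summarize an action in one line, in the spirit of "tap @ (120, 340)".
function describeAction(action: RecordedAction): string {
  if (action.type === "tap") {
    return "tap @ (" + action.x + ", " + action.y + ")";
  }
  if (action.type === "type_text") {
    return 'type "' + action.text + '"';
  }
  return "swipe " + action.direction;
}

// A recorded step is an ordered list of actions that the runner replays as-is.
const recordedActions: RecordedAction[] = [
  { type: "tap", x: 120, y: 340 },
  { type: "type_text", text: "testuser@example.com" },
  { type: "swipe", direction: "down" },
];
const summaries: string[] = recordedActions.map(describeAction);
```

Because a tap carries fixed coordinates, a layout change moves the target out from under the recorded action, which is why this mode is fast but brittle.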

AI mode

In AI mode, steps are written as natural language instructions (for example, “Tap the Sign In button” or “Scroll down until you see the Settings option”). The AI agent interprets each step by reading the current UI tree and screenshots, then decides which actions to perform.

Creating test cases

Test cases are created and edited on the suite page. From the Tests sidebar, click a suite to open its page, then use the Cases tab to manage test cases. You can:

Write natural language steps for AI-driven execution.
Define structured JSON steps for recorded replay.
Attach reusable sub-flows (like a shared login flow) alongside case-specific steps.
Reference suite variables in your step definitions using {{key}} placeholders.

Hybrid Mode

Hybrid mode combines recorded replay with AI fallback. The runner first attempts to replay the recorded steps. If a step fails — typically because the UI has changed due to dynamic content, A/B tests, or layout shifts — the AI agent takes over and attempts to accomplish the same goal using vision and the UI tree.

This approach gives you the speed of recorded replay with the resilience of AI-driven testing.
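
The replay-then-fallback flow can be sketched as below. This is only a conceptual outline: `runHybridStep`, the callback signatures, and the `StepOutcome` shape are hypothetical names for illustration, not VibeView’s runner API.

```typescript
// Hypothetical outcome shape; names are illustrative only.
interface StepOutcome {
  passed: boolean;
  mode: "replay" | "ai";
  fallbackReason?: string;
}

// Try recorded replay first; if it fails, hand the same goal to the AI agent.
function runHybridStep(
  replay: () => { ok: boolean; reason?: string },
  runWithAi: () => boolean
): StepOutcome {
  const replayResult = replay();
  if (replayResult.ok) {
    return { passed: true, mode: "replay" };
  }
  return {
    passed: runWithAi(),
    mode: "ai",
    fallbackReason: replayResult.reason || "replay failed",
  };
}

// A step whose UI is unchanged stays on the replay path.
const stable = runHybridStep(() => ({ ok: true }), () => true);

// A step whose target moved falls back to the AI agent.
const shifted = runHybridStep(
  () => ({ ok: false, reason: "low confidence on the target element" }),
  () => true
);
```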

Hybrid step generation. When a recording contains user-defined step ranges (see “Step Ranges” in the device sandbox docs), the AI uses those ranges as deterministic anchors and only infers steps for the unbracketed gaps between them. This means you can isolate the exact step grouping you want — name it, lock its boundaries — and let the AI handle the rest. If you don’t define any ranges, the AI infers all step boundaries from scratch as before.
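
To make the idea of “unbracketed gaps” concrete, the sketch below computes which recorded actions fall outside the user-defined ranges. The `StepRange` shape and the use of inclusive action indices are assumptions for this example; the actual representation in VibeView may differ.

```typescript
// Hypothetical range shape: inclusive indices into the recording's action list.
interface StepRange {
  label: string;
  start: number;
  end: number;
}

// Return the gaps (inclusive index pairs) not covered by any user-defined range.
function findGaps(actionCount: number, ranges: StepRange[]): Array<[number, number]> {
  const sorted = ranges.slice().sort((a, b) => a.start - b.start);
  const gaps: Array<[number, number]> = [];
  let cursor = 0;
  for (const range of sorted) {
    if (range.start > cursor) {
      gaps.push([cursor, range.start - 1]);
    }
    cursor = Math.max(cursor, range.end + 1);
  }
  if (cursor < actionCount) {
    gaps.push([cursor, actionCount - 1]);
  }
  return gaps;
}

// Ten recorded actions with two named ranges leave three gaps for the AI to infer.
const gaps = findGaps(10, [
  { label: "Log in", start: 2, end: 4 },
  { label: "Open settings", start: 7, end: 8 },
]);
// gaps is [[0, 1], [5, 6], [9, 9]]
```

With no ranges defined, the single gap spans the whole recording, which corresponds to the AI inferring every step boundary.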

Per-Step Force AI Mode

In hybrid mode, you can mark individual steps as Force AI to bypass recorded replay for that step and route it directly to the AI agent. This is useful when a step requires the AI to actively verify screen content rather than blindly replaying recorded gestures.

When to use Force AI

Verification steps — When you need the AI to confirm that specific content appears on screen (for example, checking that a profile name was updated correctly).
Dynamic content — Steps where the UI varies between runs and recorded coordinates are unreliable.
Assertion-like checks — Steps that should evaluate the current state rather than perform a fixed interaction.

Custom assertion context

Each Force AI step can include an optional assertion context — a short description of what the AI should verify. This context is appended to the step instruction so the AI knows the specific criteria to check.

For example, if your test records a profile name change from “abc” to “abpd”, you could mark the verification step as Force AI with assertion context: profile name must be abpd, not abc. The AI will then read the screen and confirm the name matches, rather than replaying a tap on the old value.

Behavior

Steps with Force AI enabled skip the replay block entirely and go straight to AI execution.
Steps without Force AI are unaffected: they still attempt recorded replay first and fall back to the AI agent only if replay fails.
Force AI works in both individual test runs and suite runs.
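
A minimal sketch of the routing and of how an assertion context might be appended to the step instruction. The `HybridStep` fields and the `Verify:` prefix are assumptions for illustration; the documentation only states that the context is appended.

```typescript
// Hypothetical step shape; field names are assumptions for this sketch.
interface HybridStep {
  instruction: string;
  forceAi?: boolean;
  assertionContext?: string;
}

// Force AI steps skip replay; all other steps start with replay.
function firstExecutionPath(step: HybridStep): "ai" | "replay" {
  return step.forceAi ? "ai" : "replay";
}

// Build the instruction the agent sees, appending the assertion context if present.
function buildAiInstruction(step: HybridStep): string {
  if (step.assertionContext) {
    return step.instruction + "\nVerify: " + step.assertionContext;
  }
  return step.instruction;
}

const verifyStep: HybridStep = {
  instruction: "Check the profile name",
  forceAi: true,
  assertionContext: "profile name must be abpd, not abc",
};
const verifyPath = firstExecutionPath(verifyStep);
const verifyInstruction = buildAiInstruction(verifyStep);
```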

Editing Recorded Actions in Hybrid Tests

When a test case has a source recording (hybrid mode), you can edit the underlying gestures attached to each step directly from the test case page.

How to edit

Open the test case from the Tests dashboard.
Click Edit to enter edit mode.
For any step with a recording segment, click the Actions toggle (▸ Actions) below the step to expand the gesture list.
Each gesture row shows a summary line (for example, tap @ (120, 340) or swipe ↓) and editable fields specific to its gesture type.
Click Save on the test case to persist your edits.

What you can edit

Taps and long presses — The text, accessibilityId, resourceId, and className element selectors that the replay engine uses to resolve the gesture’s target. Updating these changes how the replay engine finds the element on future runs, without changing the recorded coordinates.
Swipes — Swipes replay the recorded finger motion itself, so they reproduce scrolls, flicks, and drags faithfully — including diagonal swipes and gestures that don’t land on a specific element. The motion is recorded relative to the screen, so a swipe recorded on one device replays at the proportional position on a different screen size, and works across platforms. (When replaying across platforms, the gesture is faithful but momentum scrolling can settle slightly differently because iOS and Android use different fling physics.)
Device button presses — The button name (for example, home, back).
Text input — The typed text.
Delete — The X icon on each row removes the gesture from the recording. Deleting a swipe shows a confirmation prompt; other gestures are removed immediately.

Coordinates and gesture timing are not editable — to change those, re-record the step.
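
For intuition about replaying at “the proportional position” on a different screen size, here is a minimal sketch of proportional scaling. The function and type names are hypothetical, and the real engine may also account for factors such as safe areas or pixel density.

```typescript
interface ScreenSize {
  width: number;
  height: number;
}
interface ScreenPoint {
  x: number;
  y: number;
}

// Map a point recorded on one screen to the same relative position on another.
function scalePoint(point: ScreenPoint, recordedOn: ScreenSize, target: ScreenSize): ScreenPoint {
  return {
    x: Math.round((point.x / recordedOn.width) * target.width),
    y: Math.round((point.y / recordedOn.height) * target.height),
  };
}

// The center of a 390x844 screen maps to the center of a 430x932 screen.
const scaled = scalePoint(
  { x: 195, y: 422 },
  { width: 390, height: 844 },
  { width: 430, height: 932 }
);
// scaled is { x: 215, y: 466 }
```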

Step-to-recording validation

When you edit or delete gestures, VibeView revalidates the mapping between each step and its recording anchor. If a step’s source_action_index no longer points at a valid gesture after your edits, an inline error appears under that step explaining which anchor is missing. Fix the mapping (or restore the deleted gesture) before saving.
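
The check can be pictured as a bounds test on each step’s anchor. This sketch assumes that `source_action_index` is a zero-based index into the gesture list and that steps without a recording segment carry `null`; both are assumptions, and the error wording is invented for the example.

```typescript
// Hypothetical step shape; only source_action_index is named in the docs.
interface MappedStep {
  description: string;
  source_action_index: number | null;
}

// Return one message per step whose anchor no longer points at a gesture.
function validateAnchors(steps: MappedStep[], gestureCount: number): string[] {
  const errors: string[] = [];
  steps.forEach((step, i) => {
    const index = step.source_action_index;
    if (index === null) {
      return; // no recording segment, nothing to validate
    }
    if (index < 0 || index >= gestureCount) {
      errors.push("Step " + (i + 1) + ": anchor " + index + " no longer points at a gesture");
    }
  });
  return errors;
}

// After deleting a gesture, only 3 remain, so the anchor at index 3 is dangling.
const anchorErrors = validateAnchors(
  [
    { description: "Open the form", source_action_index: 0 },
    { description: "Fill in the name", source_action_index: 2 },
    { description: "Submit", source_action_index: 3 },
  ],
  3
);
```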

Wait between steps

Sometimes a step finishes an action (like tapping Submit) but the app is still loading the next screen. By default, VibeView waits for the UI to settle automatically before checking the result. For cases that automatic settling doesn’t detect (such as a steady-state loading spinner whose layout doesn’t change while data loads), you can configure an explicit wait.

In the test editor, hover the gap between two steps and click + Add wait. Enter a duration (in seconds or milliseconds, up to 5 minutes) and save. The wait is applied after the step’s last action and before the next step starts.

The wait travels with the step it follows — moving or deleting the step also moves or deletes the wait. Waits only apply to hybrid and replay-only modes; pure AI mode ignores them (the AI can call wait itself).
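
A small sketch of how a wait duration could be normalized and checked against the 5-minute limit. The function name and the rejection of zero or negative values are assumptions; only the units and the upper bound come from the text above.

```typescript
const MAX_WAIT_MS = 5 * 60 * 1000; // 5-minute cap

// Convert a duration to milliseconds, rejecting values outside (0, 5 minutes].
function toWaitMs(value: number, unit: "s" | "ms"): number {
  const ms = unit === "s" ? value * 1000 : value;
  if (!isFinite(ms) || ms <= 0 || ms > MAX_WAIT_MS) {
    throw new RangeError("wait must be greater than 0 and at most 5 minutes");
  }
  return Math.round(ms);
}

const twoSeconds = toWaitMs(2, "s"); // 2000
const longestWait = toWaitMs(300, "s"); // 300000, exactly 5 minutes
```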

Replay Details & Debugging

When a test step runs in hybrid or replay mode, the step timeline shows an expandable Replay Details section with a comparison table. The table has three columns — Property, Recorded, and Executed — showing how the original recording maps to what actually ran on the simulator.

The comparison covers:

Element — The recorded element identifier (accessibility ID, text, or resource ID) and how the replay engine resolved it (by element match, coordinates, or fuzzy fallback).
Coordinates — The original recorded coordinates alongside the coordinates that were actually executed after element resolution and adapting to the current device’s screen size.
Confidence — A color-coded confidence score showing how well the current UI matched the recorded element. Green (80%+) means a strong match, yellow (50-79%) is acceptable, and red (below 50%) indicates a weak match that may trigger a retry or AI fallback.
Resolved via — The resolution strategy used (coordinates, element, element_fuzzy, structural, or not_found).
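
The confidence color bands above translate directly into a threshold check. The function name is hypothetical; the boundaries are the ones listed.

```typescript
type ConfidenceColor = "green" | "yellow" | "red";

// Bucket a 0-100 confidence score: 80+ green, 50-79 yellow, below 50 red.
function confidenceColor(scorePercent: number): ConfidenceColor {
  if (scorePercent >= 80) return "green";
  if (scorePercent >= 50) return "yellow";
  return "red";
}

const bands = [92, 80, 79, 50, 49].map(confidenceColor);
// bands is ["green", "green", "yellow", "yellow", "red"]
```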

A retry badge appears next to the mode indicator when the replay engine retried a gesture. For example, “1 retry” means the first attempt failed and a second attempt succeeded.

When a replay step falls back to the AI agent, a fallback reason warning appears in the expanded step details explaining why replay was abandoned (for example, low confidence on the target element).

Fuzzy Element Matching

By default, hybrid replay can fall back to a case-insensitive text-substring match when an element’s exact accessibility id or structural selector chain no longer resolves. This keeps tests working through minor label changes (for example, a button renamed from Sign in to Sign In) but occasionally taps the wrong element when two unrelated controls share similar text.

On the AI run panel in the Device Sandbox, the Fuzzy element matching selector controls this behavior per run:

Use test case default — Honors the fuzzy_match flag stored on the test case (defaults to on).
On — text-substring fallback allowed — The substring fallback path runs after exact matching fails.
Off — strict: unknown elements route to AI — The substring fallback is skipped. Any element that doesn’t resolve by exact id or structural chain routes to the AI fallback (or fails the step in pure replay mode).

Use Off for scenarios where a mislabelled replay would be worse than an AI handoff — for example, a screen with several similar-looking buttons where tapping the wrong one would put the app into an unrecoverable state.
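
The selector logic and the substring fallback can be sketched as follows. The function names are hypothetical, and the direction of the substring test (recorded text contained in the candidate’s text) is an assumption.

```typescript
type FuzzySetting = "default" | "on" | "off";

// A per-run "on" or "off" overrides the test case's stored fuzzy_match flag,
// which itself defaults to on when unset.
function fuzzyEnabled(runSetting: FuzzySetting, caseFuzzyMatch: boolean | undefined): boolean {
  if (runSetting === "on") return true;
  if (runSetting === "off") return false;
  return caseFuzzyMatch !== false;
}

// Case-insensitive text-substring match used by the fallback path.
function fuzzyTextMatch(recordedText: string, candidateText: string): boolean {
  return candidateText.toLowerCase().indexOf(recordedText.toLowerCase()) !== -1;
}

const renamedButtonMatches = fuzzyTextMatch("Sign in", "Sign In"); // true
const strictRun = fuzzyEnabled("off", true); // false: the run setting wins
```

The same substring rule explains the failure mode described above: a recorded label of `Sign` would also match an unrelated `Sign out` control.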

The per-case default is persisted by the API and in YAML exports. There is no inline UI toggle on the test case page to change it; update it through YAML import or the API when you want the default flipped for CI runs.

AI Execution

When the AI agent runs a test step, it follows a loop:

Read the UI tree — The agent requests the current accessibility tree from the simulator. The tree is pruned and filtered to show only interactive elements, each tagged with a short reference like @e1, @e2.
Take a screenshot — The agent captures the current screen to provide visual context alongside the structured tree.
Decide an action — Using a tool-calling LLM, the agent selects from the available tools based on the step instruction and current UI state.
Execute the action — The chosen tool runs on the simulator via the native automation layer on each platform.
Repeat — The agent continues observing and acting until the step objective is met or a failure condition is reached.

The agent has a default cap of 8 turns per step. If the goal is not reached within that many iterations, the step is marked failed with a max iterations reached reason. Tight, specific step text (and relevant suite context) keeps the agent on the fast path.
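
The loop and its turn cap can be outlined like this. It is a conceptual sketch with injected callbacks standing in for the simulator and the LLM; the type names, the synchronous style, and the choice to count `step_complete` as a turn are assumptions.

```typescript
// Hypothetical shapes for what the agent observes and decides.
interface AgentObservation {
  uiTree: string;
  screenshot: string;
}
interface AgentDecision {
  tool: string;
  reason?: string;
}
interface AgentStepResult {
  passed: boolean;
  turns: number;
  reason?: string;
}

const DEFAULT_MAX_TURNS = 8;

function runAiStep(
  observe: () => AgentObservation,
  decide: (instruction: string, observation: AgentObservation) => AgentDecision,
  execute: (decision: AgentDecision) => void,
  instruction: string,
  maxTurns: number = DEFAULT_MAX_TURNS
): AgentStepResult {
  for (let turn = 1; turn <= maxTurns; turn++) {
    const observation = observe(); // read the UI tree and take a screenshot
    const decision = decide(instruction, observation); // tool-calling LLM picks a tool
    if (decision.tool === "step_complete") {
      return { passed: true, turns: turn };
    }
    if (decision.tool === "step_failed") {
      return { passed: false, turns: turn, reason: decision.reason };
    }
    execute(decision); // run the chosen tool on the simulator
  }
  return { passed: false, turns: maxTurns, reason: "max iterations reached" };
}

// Stubbed run: two taps, then the agent signals completion on the third turn.
let decisionCount = 0;
const aiResult = runAiStep(
  () => ({ uiTree: "@e1 Button 'Sign In'", screenshot: "<image>" }),
  () => (++decisionCount < 3 ? { tool: "tap" } : { tool: "step_complete" }),
  () => {},
  "Tap the Sign In button"
);
```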

Tap verification

After every tap, the agent compares the screen before and after. It first checks the whole screen for change; if that looks unchanged, it re-checks just the region around the tapped element. The two-stage check catches small state changes — a heart outline filling in, a checkbox toggling, a radio button selecting — that a whole-screen comparison would miss.

The result is surfaced on the tap response as one of:

screen_changed — whole-screen diff detected movement.
localized_change_detected — whole screen looked unchanged, but the local region around the tap moved.
weak_change_detected — sub-threshold movement in the local region (common for small indicators); treated as a successful tap.
unchanged — literally zero pixel change; the agent treats this as a missed tap and takes corrective action.
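
The two-stage check can be sketched as a classification over two diff scores. The numeric thresholds and the idea of expressing each diff as a fraction of changed pixels are assumptions made for illustration; the actual metric and cutoffs are not documented here.

```typescript
type TapVerdict =
  | "screen_changed"
  | "localized_change_detected"
  | "weak_change_detected"
  | "unchanged";

// Illustrative thresholds only (fraction of pixels that changed).
const SCREEN_DIFF_THRESHOLD = 0.01;
const LOCAL_DIFF_THRESHOLD = 0.05;

// Stage 1 looks at the whole screen; stage 2 re-checks the region around the tap.
function classifyTap(screenDiff: number, localDiff: number): TapVerdict {
  if (screenDiff >= SCREEN_DIFF_THRESHOLD) return "screen_changed";
  if (localDiff >= LOCAL_DIFF_THRESHOLD) return "localized_change_detected";
  if (localDiff > 0) return "weak_change_detected"; // still treated as a successful tap
  return "unchanged"; // treated as a missed tap
}

// A checkbox toggle barely registers on the full screen but is obvious locally.
const checkboxTap = classifyTap(0.002, 0.4);
```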

Scrolling

scroll_until_visible is the single scroll-search tool. Pass a text substring or accessibility id — it finds the right scrollable container, detects the scroll direction automatically, and fine-tunes until the element lands fully inside the viewport. To bring a previously-seen element back into view, pass its visible text or accessibility id. For lazy-loaded content that only appears in one direction, the tool falls back to bidirectional scrolling. It also detects end-of-scroll: when the scrollable content stops changing between passes, it stops early and the model receives a “likely not on this page” warning rather than looping indefinitely.
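
The end-of-scroll behavior can be illustrated with a simplified loop that stops when the visible content no longer changes between passes. This sketch omits container selection, direction detection, and viewport fine-tuning; the function signature, the pass limit, and the case-insensitive match are all assumptions.

```typescript
interface ScrollSearchResult {
  found: boolean;
  passes: number;
  warning?: string;
}

function scrollUntilVisibleSketch(
  visibleTexts: () => string[],
  scrollOnce: () => void,
  query: string,
  maxPasses: number = 20
): ScrollSearchResult {
  const needle = query.toLowerCase();
  let previousSignature = "";
  for (let pass = 0; pass <= maxPasses; pass++) {
    const texts = visibleTexts();
    for (const text of texts) {
      if (text.toLowerCase().indexOf(needle) !== -1) {
        return { found: true, passes: pass };
      }
    }
    // If scrolling produced the same content as last time, the end was reached.
    const signature = texts.join("\n");
    if (pass > 0 && signature === previousSignature) {
      return { found: false, passes: pass, warning: "likely not on this page" };
    }
    previousSignature = signature;
    scrollOnce();
  }
  return { found: false, passes: maxPasses, warning: "pass limit reached" };
}

// Fake three-page list standing in for a scrollable container.
const listPages = [["Profile", "Account"], ["Privacy", "Help"], ["About", "Settings"]];
let pageIndex = 0;
const settingsSearch = scrollUntilVisibleSketch(
  () => listPages[pageIndex] || [],
  () => { pageIndex = Math.min(pageIndex + 1, listPages.length - 1); },
  "settings"
);
```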

Hittability gate

On iOS, the agent checks that an element is actually hittable before tapping it. If the element is hidden behind something else or has no size, the tap is rejected with a not hittable error instead of firing at invisible UI — so the agent doesn’t silently tap covered elements.

Available tools

The agent has 22 tools on iOS touch devices and 21 on Android (Android omits the iOS-only select_picker_value). Two additional tools (tap_focused_tv, focus_element_tv) are available on both Apple TV and Android TV sessions: they move focus to a target element (and, for tap_focused_tv, press SELECT) instead of chaining press_button(dpad_*) presses. On Apple TV they use the platform’s built-in focus navigation; on Android TV VibeView navigates the d-pad toward the target until focus lands. On Android TV they require an app whose focus is visible to the device’s accessibility layer (react-native-tvos / Leanback); custom-JS-focus apps fall back to plain d-pad navigation — see TV testing for the full picture.

Interaction:

Tool	Purpose
`tap`	Tap an element by its UI tree ref (e.g. `@e5`)
`tap_coordinates`	Tap at raw pixel coordinates (fallback when no ref available)
`tap_and_type`	Tap an input field and type text in one call — avoids the false `unchanged` signal a bare tap on a text field produces
`long_press`	Touch and hold an element by ref (default 800ms)
`drag`	Drag between two refs or coordinates with configurable velocity
`swipe`	Directional swipe (up/down/left/right) with configurable distance
`scroll_until_visible`	Scroll until an element matching a text substring appears — pass a known element’s text or accessibility id to bring it back into view
`gesture_preset`	Execute a named gesture (`scroll_down`, `back_swipe`, `pull_to_refresh`, etc.)

Text and keys:

Tool	Purpose
`type_text`	Type text into the focused input
`clear_text`	Clear the focused input
`keys`	Send a keyboard shortcut with modifiers (e.g. `cmd+a`, `shift+tab`). iOS 17+

System and device:

Tool	Purpose
`press_button`	Press home, back, lock, or siri; on TV, press dpad_up/down/left/right/center or back
`alert`	Inspect and dismiss an iOS system alert by button label
`select_picker_value`	Set a picker wheel column to a target value (iOS only; Android uses tap/swipe)
`wait`	Pause for N seconds

Query and assertions:

Tool	Purpose
`locate`	Find element coordinates by visible text, label, or identifier
`find_element`	Find an element by text with spatial constraints (below, above, near another element)
`assert_visible`	Assert that an element with given text appears within a timeout
`assert_not_visible`	Assert that an element has disappeared within a timeout
`assert_value`	Assert an element’s text/value matches expected (exact or substring)

Step completion:

Tool	Purpose
`step_complete`	Signal that the step objective is met
`step_failed`	Signal that the step cannot be completed, with a reason

The agent uses element references (@e1, @e2, etc.) from the UI tree to target elements precisely. References are stable within a step — the same element keeps its ref even if the tree order shifts slightly between iterations.

When an assertion fails, the test case is marked as failed and the agent logs the reason along with a screenshot of the actual state.

Visual Regression During AI Runs

AI and hybrid runs capture a baseline-comparison screenshot after every step when the suite has visual regression enabled. Drift that exceeds the suite’s warn threshold is flagged on that step in the timeline; drift past the fail threshold fails the run. Configuration, thresholds, and the baseline acceptance flow are documented in Visual Regression Testing.
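
The relationship between the two thresholds can be sketched as below. The `DriftThresholds` shape and the unit of drift (for example, a percentage of differing pixels) are assumptions for this example.

```typescript
type DriftVerdict = "ok" | "warn" | "fail";

// Hypothetical threshold pair; units are whatever the drift score uses.
interface DriftThresholds {
  warn: number;
  fail: number;
}

// Drift above the fail threshold fails the run; above warn only flags the step.
function classifyDrift(drift: number, thresholds: DriftThresholds): DriftVerdict {
  if (drift > thresholds.fail) return "fail";
  if (drift > thresholds.warn) return "warn";
  return "ok";
}

const suiteThresholds: DriftThresholds = { warn: 1, fail: 5 };
const driftVerdicts = [0.4, 2.5, 7].map((d) => classifyDrift(d, suiteThresholds));
// driftVerdicts is ["ok", "warn", "fail"]
```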

Suite Runs

A suite run executes all test cases in a suite sequentially on a single simulator session. This means:

The app stays installed across cases (unless the reset strategy clears data).
Sub-flows attached to a case run as part of that case (see Sub-Flows).
Cases execute in their defined order.
If one case fails, subsequent cases still run.

Aborting a suite run

While a suite is running, the Suite Run panel shows two abort buttons:

Abort Case — cancels the currently running test case. The case is recorded as cancelled, and the suite continues with the next case. Use this when one case is stuck or known-bad and you want the rest of the suite to run.
Abort Suite — cancels the entire suite. The current case is cancelled, and all remaining cases are recorded as cancelled without executing.

Between cases (during the app reset window), Abort Case is briefly disabled — Abort Suite is always available while the suite is running.
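
The difference between the two abort buttons, and the rule that a failed case does not stop the suite, can be modeled with a small simulation. The `abortDuring` callback stands in for a user pressing a button while a given case runs; all names here are hypothetical.

```typescript
type CaseStatus = "passed" | "failed" | "cancelled";
type AbortRequest = "none" | "case" | "suite";
interface CaseResult {
  caseName: string;
  status: CaseStatus;
}

function runSuite(
  caseNames: string[],
  runCase: (caseName: string) => "passed" | "failed",
  abortDuring: (caseName: string) => AbortRequest
): CaseResult[] {
  const results: CaseResult[] = [];
  let suiteAborted = false;
  for (const caseName of caseNames) {
    const request: AbortRequest = suiteAborted ? "suite" : abortDuring(caseName);
    if (request === "suite") {
      // Abort Suite: this case and every remaining case are recorded as cancelled.
      suiteAborted = true;
      results.push({ caseName: caseName, status: "cancelled" });
    } else if (request === "case") {
      // Abort Case: only this case is cancelled; the suite moves on.
      results.push({ caseName: caseName, status: "cancelled" });
    } else {
      // A failed case does not prevent later cases from running.
      results.push({ caseName: caseName, status: runCase(caseName) });
    }
  }
  return results;
}

const suiteCases = ["login", "edit profile", "checkout", "logout"];
const afterAbortCase = runSuite(
  suiteCases,
  (caseName) => (caseName === "edit profile" ? "failed" : "passed"),
  (caseName) => (caseName === "checkout" ? "case" : "none")
).map((r) => r.status);
// afterAbortCase is ["passed", "failed", "cancelled", "passed"]
```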

When you run a single test case (not a suite), only one Abort button is shown; it cancels that run.

Test Run Results

Each test run produces a detailed result record. Every run has a dedicated page at /tests/runs/:id that you can open directly, bookmark, or share with teammates. The URL works for both suite-attached and standalone test runs.

Run detail page

The run detail page shows:

Status and duration in a prominent stats bar at the top.
Model used and usage credit / cost statistics.
Step timeline — a visual breakdown of each step with pass/fail status, screenshots, and duration.
AI reasoning log — the agent’s decision-making trace showing observations and actions.
Visual regression — side-by-side comparison with baseline screenshots (when available).

For in-progress runs, the page auto-refreshes every few seconds until the run reaches a terminal status (such as passed, failed, or error). You can also set a passing run as the new visual regression baseline directly from the detail page.

Status values

Status	Meaning
`queued`	Run is waiting its turn (for example, behind other cases in a suite run)
`pending`	Run is created but has not started
`running`	Execution is in progress
`passed`	All steps and assertions succeeded
`failed`	One or more assertions or steps failed
`cancelled`	The run was cancelled by the user (via Abort Case or Abort Suite)
`error`	An unexpected error interrupted execution
`stopped_insufficient_credit`	The run was stopped because your organization’s usage credit ran out (shown as “Out of credit”)

Result data

Duration — Total wall-clock time from start to finish.
Screenshots — Captured at key moments: before and after actions, on assertion failures, and at step completion.
Reasoning log — The AI agent’s decision-making trace, showing what it observed and why it chose each action.
Model used — Which LLM provider and model executed the run (Anthropic, OpenAI, Google, or OpenRouter).
Token usage — Input and output token counts for the LLM calls (diagnostic; your bill is the Cost below, not the raw token count).
Cost — For VibeView-provided runs, the usage credit actually deducted for this run (the real billed amount, in USD). For BYOK (bring your own key) runs, VibeView deducts nothing, so the Cost is shown as an estimate (~$X (est.)) computed from public list pricing; your provider bills you directly and the exact amount may differ. Estimated (BYOK) and actual (provided) costs are always shown separately, never combined into one figure. See Usage Credits.
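
As an illustration of the two display forms, the sketch below formats an actual cost and a BYOK estimate differently. The `RunCost` shape and the two-decimal precision are assumptions; only the `~$X (est.)` pattern comes from the description above.

```typescript
// Hypothetical cost record for one run.
interface RunCost {
  amountUsd: number;
  byok: boolean;
}

// Provided runs show the billed amount; BYOK runs show a marked estimate.
function formatCost(cost: RunCost): string {
  const amount = "$" + cost.amountUsd.toFixed(2);
  return cost.byok ? "~" + amount + " (est.)" : amount;
}

const providedLabel = formatCost({ amountUsd: 0.5, byok: false }); // "$0.50"
const byokLabel = formatCost({ amountUsd: 1.25, byok: true }); // "~$1.25 (est.)"
```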

Suite Context and Variables

Context

The suite context field accepts free text that describes the app under test. This information is injected into the AI agent’s system prompt, giving it domain knowledge it would not otherwise have. Good context includes:

What the app does and its primary workflows.
Terminology specific to the app (custom labels, feature names).
Known quirks or behaviors the agent should expect.

Variables

Suite variables are key-value pairs that you reference in step text using {{key}} placeholders. For example, define username=testuser@example.com at the suite level, then write a step like Type {{username}} into the email field. At run time VibeView substitutes the value before the step executes. Substitution applies whether you run a single case or the whole suite, and it also rewrites matching values in recorded actions. This keeps credentials and test data centralized and easy to update.
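
A minimal sketch of `{{key}}` substitution. The function name, the allowed key characters, the tolerance for spaces inside the braces, and the choice to leave unknown placeholders untouched are all assumptions rather than documented VibeView behavior.

```typescript
// Replace {{key}} placeholders with suite variable values.
function substituteVariables(text: string, variables: { [key: string]: string }): string {
  return text.replace(
    /\{\{\s*([A-Za-z0-9_]+)\s*\}\}/g,
    (match: string, key: string): string => {
      if (!Object.prototype.hasOwnProperty.call(variables, key)) {
        return match; // unknown key: keep the placeholder as written
      }
      return variables[key];
    }
  );
}

const suiteVariables = { username: "testuser@example.com" };
const resolvedStep = substituteVariables("Type {{username}} into the email field", suiteVariables);
// resolvedStep is "Type testuser@example.com into the email field"
```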

YAML Export / Import

Test suites can be exported as YAML files for version control, sharing, or migration between VibeView instances.

Exporting

From the suite detail page, click Export YAML. The downloaded file includes:

Suite name, description, context, and reset strategy
All test cases with step descriptions and source_action_index mappings
Source recording data (gestures, element selectors, screen fingerprints)
Source device dimensions for coordinate scaling
App bundle ID for automatic app matching on import

Importing

From the Tests dashboard, click Import YAML and select a .yaml file. The importer:

Creates a new test suite with all cases and recordings
Automatically links to an existing app if the bundle ID matches
Preserves step-to-recording pairing so hybrid replay works immediately

This enables portability: record a test on one machine, export, and import on your deployed instance for demo or CI runs.

React Native Accessibility

The AI agent reads the iOS accessibility tree to understand your app’s UI structure. React Native apps can sometimes produce incomplete UI trees due to how iOS accessibility containers work.

The issue

When a React Native Pressable, TouchableOpacity, or TouchableHighlight wraps a TextInput, the touchable component acts as an accessibility container. iOS treats the entire container as a single element and hides its children from the accessibility tree. This means the native UITextField inside the wrapper becomes invisible to VibeView’s UI tree, and the input appears as a generic element rather than a text field.

This is a known iOS accessibility-container limitation that affects testing tools across the industry, including Appium, Detox, and Maestro.

How VibeView handles it

VibeView’s AI agent uses screenshots as its primary input, not just the UI tree. When the UI tree is incomplete (for example, showing generic elements instead of text fields), the agent falls back to visual understanding of the screenshot to identify inputs, buttons, and other interactive elements. It uses coordinate-based taps and text injection to interact with elements that are missing from the tree.

This means VibeView works with React Native apps even when the accessibility tree is incomplete, though providing better accessibility metadata will improve test reliability.

Recommended fix

For the best testing experience, set accessible={false} on wrapper components that contain interactive children:

```jsx
{/* Before: TextInput hidden from accessibility tree */}
<Pressable onPress={handleFocus}>
  <TextInput placeholder="Email" />
</Pressable>

{/* After: TextInput visible as a proper text field */}
<Pressable accessible={false} onPress={handleFocus}>
  <TextInput
    accessible={true}
    accessibilityLabel="Email"
    testID="email-input"
    placeholder="Email"
  />
</Pressable>
```

This exposes the TextInput as a native text field in the accessibility tree, allowing the AI agent to identify it correctly and use optimized text input methods.

Tips

Start with recorded mode for stable, well-defined flows. Switch to AI mode for screens with dynamic content.
Use hybrid mode as a default when you want reliability with a safety net.
Write specific, actionable step descriptions for AI mode. “Tap the blue Submit button at the bottom of the form” works better than “Submit.”
Keep suite context concise but informative. The agent performs better with clear domain knowledge.
Use the clear_data reset strategy when cases should not depend on state from previous cases.
Review the reasoning log when a test fails unexpectedly. It shows exactly what the agent saw and why it made each decision.
Export your test suite as YAML before major app updates so you can re-import if needed.
For React Native apps, set accessible={false} on Pressable/TouchableOpacity wrappers around TextInput components to improve UI tree accuracy.