Visual Regression Testing

Visual regression compares each step’s screenshot against a stored baseline during test execution. A step that drifts beyond a configured threshold gets a visual verdict (warn or fail), which also rolls up to the run; a failing step aborts the rest of the run.

Enabling on a suite

Open a test suite and go to the Configuration tab (on the suite detail page) to find the Visual Regression controls. There are two independent checkboxes, each with a numeric percentage input:

Warn if differs by more than — default 10%. When checked, any step whose screenshot differs from its baseline by at least this percentage gets a warn verdict.
Fail if differs by more than — default 25%. When checked, any step that crosses this threshold gets a fail verdict; the step fails and the run aborts at that point.

Click Save Configuration to apply (this saves the visual-regression thresholds together with the suite’s app, reset strategy, context, and variables). If the warn threshold is set at or above the fail threshold, the UI shows “Warn threshold should be lower than fail threshold.” Fix it before saving.

Leaving both checkboxes unchecked disables visual regression for the suite. No baselines are written, no comparisons run, no cost is added at execution time.

Equivalent API call:

PATCH /api/v1/tests/suites/{suite_id}/settings
{
  "visual_regression_warn_pct": 10,
  "visual_regression_fail_pct": 25
}

Set either field to null to disable that tier.
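
As a rough illustration, the request above could be assembled from Python with the standard library. This is a sketch, not an official client: the host name, the bearer-token header, and the helper name build_settings_request are assumptions. Only the path and the two field names come from the call shown above. The warn-below-fail check mirrors the UI rule on the client side.

```python
import json
import urllib.request

BASE_URL = "https://vibeview.example.com"  # hypothetical host


def build_settings_request(suite_id, warn_pct, fail_pct, token="YOUR_TOKEN"):
    """Build (but do not send) the PATCH request for a suite's thresholds."""
    # Mirror the UI rule locally: warn must sit below fail when both are set.
    if warn_pct is not None and fail_pct is not None and warn_pct >= fail_pct:
        raise ValueError("Warn threshold should be lower than fail threshold.")
    body = json.dumps({
        "visual_regression_warn_pct": warn_pct,  # None serializes to null
        "visual_regression_fail_pct": fail_pct,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/tests/suites/{suite_id}/settings",
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )
```

Passing the returned object to urllib.request.urlopen would send it. Passing None for one threshold disables only that tier.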

Per-step overrides

Suite thresholds are the default for every step, but individual steps can opt out of visual regression or use their own thresholds. This is the escape hatch for screens that never match a baseline (live video, channel feeds) or that vary within known bounds (timestamps, rotating banners) and need looser tolerances than the rest of the suite.

Open a test case and find the VR badge in each step’s header row, next to the AI and delete buttons. Click it to open the per-step popover with three radio options:

Use suite defaults — the step inherits the suite’s warn/fail thresholds. This is the default; no badge is shown. The popover hint displays the current suite values (e.g. warn 10%, fail 25%).
Disabled — visual regression is skipped entirely for this step. No baseline is written and no comparison runs. The step shows a struck-through VR off badge.
Custom thresholds — set this step’s own warn and/or fail percentages, using the same two checkboxes-with-percent inputs as the suite-level controls. Enable at least one tier (custom mode with both disabled is rejected — use Disabled instead). If the warn threshold is at or above the fail threshold, Apply stays disabled until you fix it. The step shows a VR 40/70% badge (a — means that tier is off).

A custom override wins even when the suite has visual regression turned off — a single step set to Custom 10/25 is still evaluated while every other step is skipped. Conversely, a step set to Disabled is skipped even when the suite has thresholds configured.
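
The precedence between suite thresholds and a step override can be sketched as a small function. This is an illustrative model of the rules described above, not the backend’s code: the function name and the tuple representation are assumptions, and the override dict uses the mode, warn_pct, and fail_pct keys documented in the storage section.

```python
from typing import Optional, Tuple

Thresholds = Tuple[Optional[int], Optional[int]]  # (warn_pct, fail_pct)


def effective_thresholds(suite: Thresholds, override: Optional[dict]) -> Optional[Thresholds]:
    """Return the thresholds a step is evaluated against, or None to skip it."""
    mode = (override or {}).get("mode", "default")
    if mode == "disabled":
        return None  # skipped even when the suite has thresholds
    if mode == "custom":
        # Custom wins even when the suite has visual regression turned off.
        return (override.get("warn_pct"), override.get("fail_pct"))
    # "default", no override, or an unrecognized mode: inherit the suite.
    return suite if any(t is not None for t in suite) else None
```

For example, effective_thresholds((None, None), {"mode": "custom", "warn_pct": 10, "fail_pct": 25}) yields (10, 25), while the same suite with no override yields None.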

Read-only views (run history, the step timeline) show the badge only when a step carries a non-default override, so an all-default test stays uncluttered.

How the override is stored

There is no migration for this feature. The override lives as an optional visual_regression object on each step inside the test case’s steps_json — the same pattern as post_delay_ms. A step with no visual_regression field inherits the suite defaults.

{
  "action": "tap",
  "visual_regression": { "mode": "custom", "warn_pct": 40, "fail_pct": 70 }
}

| Field | Values | Meaning |
| --- | --- | --- |
| `mode` | `"default"` \| `"disabled"` \| `"custom"` | `default` (or field absent) inherits the suite; `disabled` skips VR; `custom` uses the step’s own thresholds. |
| `warn_pct` | int 0–100, or `null` | Only with `mode: "custom"`. `null` disables the warn tier for this step. |
| `fail_pct` | int 0–100, or `null` | Only with `mode: "custom"`. `null` disables the fail tier for this step. |

Validation runs on POST/PATCH of a case (422 on a bad shape): mode must be one of the three values; warn_pct/fail_pct are only allowed when mode is custom; custom requires at least one threshold; and when both are set, warn_pct must be lower than fail_pct. Reads are forgiving — an unknown or malformed shape falls back to suite defaults rather than erroring.
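
The write-side rules can be expressed as a checker that returns error strings. This is a hedged sketch of the rules listed above, useful for pre-validating a step before sending it. The function name and the error wording are made up here, and treating a null threshold on a non-custom mode as acceptable is an assumption.

```python
def validate_override(vr: dict) -> list:
    """Return a list of validation errors; an empty list means the shape is valid."""
    errors = []
    mode = vr.get("mode")
    if mode not in ("default", "disabled", "custom"):
        return ["mode must be default, disabled, or custom"]
    warn, fail = vr.get("warn_pct"), vr.get("fail_pct")
    if mode != "custom":
        if warn is not None or fail is not None:
            errors.append("thresholds are only allowed with mode custom")
        return errors
    for name, value in (("warn_pct", warn), ("fail_pct", fail)):
        if value is not None and not (isinstance(value, int) and 0 <= value <= 100):
            errors.append(f"{name} must be an int between 0 and 100, or null")
    if warn is None and fail is None:
        errors.append("custom requires at least one threshold")
    # Only compare the two tiers once both are known to be valid integers.
    if not errors and warn is not None and fail is not None and warn >= fail:
        errors.append("warn_pct must be lower than fail_pct")
    return errors
```

A non-empty result corresponds to the cases where the API would answer with a 422.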

Overrides apply uniformly across recorded, hybrid, and pure-AI execution modes.

How the drift percentage is computed

VibeView compares each step screenshot against the baseline in three tiers:

Perceptual hash — byte-identical or near-identical frames short-circuit to 0% instantly.
Alignment — the global offset between the two frames is estimated (scroll replays can land a few pixels off; TV rows scroll horizontally). For steps that performed a scroll or swipe, any offset is accepted; for all other steps the offset is only accepted up to ~12px — a larger shift means something moved the layout, and it is reported as drift.
Changed pixels — a pixel counts as changed if it can’t be explained by the other frame at an accepted alignment. To count as explained, it must fall inside the local texture envelope and match the local average brightness. The local-average check catches different content occupying the same busy region (a different poster or paragraph), which texture tolerance alone can miss.

This means a replay that lands 4px lower on the page, a status-bar clock tick, or re-encoded image quality does not inflate the drift score. Real changes (a missing element, changed text, a moved tvOS focus ring) are still counted at their true size. The same logic applies to every platform (iOS, Android, tvOS, Android TV, Roku).
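
To make the alignment idea concrete, here is a deliberately simplified toy model. It is not VibeView’s algorithm: it works on small grayscale grids, searches vertical shifts only, and replaces the texture-envelope and local-average checks with a plain per-pixel tolerance. All names and default values are assumptions chosen for illustration.

```python
def changed_pixel_pct(baseline, current, max_shift=1, tolerance=16):
    """Percentage of pixels that still differ at the best vertical alignment.

    Frames are equal-sized lists of rows of 0-255 grayscale values, and
    max_shift is assumed to be smaller than the frame height.
    """
    height, width = len(baseline), len(baseline[0])

    def changed_at(dy):
        # Compare current[y] with baseline[y - dy]; rows shifted out of
        # range are ignored. Returns (changed, compared).
        changed = compared = 0
        for y in range(height):
            src = y - dy
            if 0 <= src < height:
                for x in range(width):
                    compared += 1
                    if abs(current[y][x] - baseline[src][x]) > tolerance:
                        changed += 1
        return changed, compared

    # Accept whichever shift within the allowed range explains the most pixels.
    best = min(
        (changed_at(dy) for dy in range(-max_shift, max_shift + 1)),
        key=lambda r: r[0] / r[1],
    )
    return 100.0 * best[0] / best[1]
```

In this model a frame that is simply shifted by one row scores 0% when max_shift allows it, but scores as heavy drift when max_shift is 0, which is the behavior the offset limit for non-scrolling steps is meant to produce.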

Dynamic content

Auto-advancing carousels, video thumbnails, and timestamps genuinely change between runs and will contribute real drift. Either keep your warn threshold above their typical contribution (the default 10% accommodates a banner carousel) or split volatile screens into their own steps with a per-step visual_regression threshold override.

How each step is classified

When either threshold is configured, the backend runs a comparison after each step. The changed_pixel_pct it returns is classified:

If the fail threshold is configured and changed_pixel_pct >= fail_pct → fail.
Else if the warn threshold is configured and changed_pixel_pct >= warn_pct → warn.
Otherwise → ok.

Fail takes precedence when both thresholds would fire.
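
The classification rules translate directly into a few comparisons. The sketch below follows the three rules above; the function name is an assumption, and returning None when no threshold is configured reflects that no comparison runs in that case.

```python
def classify(changed_pixel_pct, warn_pct=None, fail_pct=None):
    """Map a drift percentage to "fail", "warn", "ok", or None (VR not configured)."""
    if warn_pct is None and fail_pct is None:
        return None
    # Fail is checked first, so it takes precedence when both would fire.
    if fail_pct is not None and changed_pixel_pct >= fail_pct:
        return "fail"
    if warn_pct is not None and changed_pixel_pct >= warn_pct:
        return "warn"
    return "ok"
```

With the default thresholds, classify(10, 10, 25) is "warn" because the comparison is inclusive, and classify(30, 10, 25) is "fail".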

On fail, the step is marked failed, an error message is recorded, and the run aborts at that step. Remaining steps are skipped.

On warn, the step records the visual verdict and a warning but execution continues.

On ok, nothing user-visible happens.

Reference image selection

For each step, visual regression looks for a reference in this order:

Stored baseline — a VisualBaseline row for this (suite, case, step). Auto-created from the first passing run when a threshold is enabled.
Recording screenshot — hybrid tests attach a screenshotBefore on every recorded gesture. Before any baseline exists, that stands in as the reference.
Skip — if neither exists, the step records visual_verdict: null and execution continues normally.
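
The lookup order above amounts to a simple fallback chain. In this sketch the function name and the use of URLs as arguments are assumptions; the "baseline" and "recording" labels match the visual_reference_source values listed under per-step result fields.

```python
def select_reference(stored_baseline_url, recording_screenshot_url):
    """Return (source, url) for the reference image, or None to skip the step."""
    if stored_baseline_url:
        return ("baseline", stored_baseline_url)
    if recording_screenshot_url:
        # Hybrid tests: the recorded gesture's screenshot stands in.
        return ("recording", recording_screenshot_url)
    return None  # no reference: visual_verdict stays null
```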

If a step’s screenshot came from a device with different dimensions than the baseline, the comparison resizes before comparing, so cross-device replay still works. Comparisons always run at the larger of the two image resolutions, so low-resolution recording references (stored at 328px width) don’t dilute sensitivity.

The reference image on run pages

When a step drifts, the run stores a frozen copy of the exact baseline it was compared against, and the run page displays that copy. Accepting new baselines later does not change what a past run displays: the score, diff overlay, and reference image always agree. Runs from before this feature fall back to showing the current baseline.

How the run verdict aggregates

Every step’s verdict rolls up into a single visual_regression_status on the run, with this priority: fail > warn > ok > null.

fail — at least one step hit the fail threshold.
warn — no fails, but at least one step hit the warn threshold.
ok — at least one comparison ran, and all were ok.
null — no comparisons ran (thresholds disabled, or no reference screenshot existed for any step).

Suite runs aggregate the same way one level up: a suite run’s visual_regression_status is the max-severity verdict across its case runs (same fail > warn > ok > null priority), computed from the case runs at read time rather than stored. A suite run whose cases all passed but where any case drifted shows the orange “passed” everywhere a suite run appears — the Run History row, the trend dots, and the suite run detail header — matching the case rows inside it.
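
Both levels of roll-up are a max-severity reduction. A minimal sketch, assuming verdicts are represented as the strings above or None (the names SEVERITY and aggregate are illustrative):

```python
SEVERITY = {None: 0, "ok": 1, "warn": 2, "fail": 3}


def aggregate(verdicts):
    """Return the highest-severity verdict, or None when nothing was compared."""
    return max(verdicts, key=lambda v: SEVERITY[v], default=None)


# Step verdicts roll up to a case run; case runs roll up to a suite run.
case_runs = [["ok", "ok"], ["ok", "warn", None], [None]]
suite_status = aggregate(aggregate(steps) for steps in case_runs)
```

Here suite_status is "warn": no case failed, one case drifted, and the case with no comparisons contributes None.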

Mid-run comparison in AI and hybrid modes

AI and hybrid runs call compare_step_inline after each step completes. That means:

A step’s visual verdict is known before the next step starts.
A fail verdict aborts the run immediately — no wasted time on the rest of the test.
The step timeline on the run detail page shows the visual badge per step in real time.

Pure replay runs use the same comparison logic after every step as well.

Accepting a new baseline from run detail

When your UI intentionally changes, the baseline needs updating. There are two ways to do this from the run detail page:

Per step — in the step timeline, a step with a warn or fail visual badge shows an Accept as baseline link next to the badge. Clicking it replaces only that step’s baseline with the screenshot from this run. Use when drift is localized.
Whole run (all steps) — the Set as Baseline button at the bottom of the run detail page (shown only when the run passed) opens an Update Baseline confirmation. Confirming replaces every baseline for this test case with screenshots from this run. Use when most of the test’s screens have changed, for example after a UI redesign.

Both actions are developer-role scoped and audit-logged. The underlying endpoints:

POST /api/v1/tests/runs/{run_id}/set-baseline                # bulk
POST /api/v1/tests/runs/{run_id}/set-baseline/drifted        # drifted steps only
POST /api/v1/tests/runs/{run_id}/set-baseline/{step_index}   # per step

The /drifted variant updates baselines only for the steps whose visual verdict was warn or fail, leaving clean steps untouched.
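
When scripting these calls, the three variants differ only in the path suffix. The helper below is a hypothetical convenience that builds the path; its name and the scope keywords are assumptions, and it does not send the request.

```python
def set_baseline_path(run_id, scope="all", step_index=None):
    """Build the set-baseline path for a run: all steps, drifted steps, or one step."""
    base = f"/api/v1/tests/runs/{run_id}/set-baseline"
    if scope == "all":
        return base
    if scope == "drifted":
        return f"{base}/drifted"
    if scope == "step":
        if step_index is None:
            raise ValueError("step_index is required for scope='step'")
        return f"{base}/{step_index}"
    raise ValueError(f"unknown scope: {scope!r}")
```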

The run detail page shows a banner above the step timeline based on the run’s aggregated visual_regression_status:

fail (red banner) — Visual regression badge, subtitle Run failed due to a step differing beyond the fail threshold. The failing step is visible in the timeline below.
warn (yellow banner) — Visual regression badge, subtitle Some steps had visual drift above the warn threshold. The run completed; review the flagged steps and accept new baselines if the drift is expected.
ok / null — no banner shown.

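From either state, use the per-step Accept as baseline links to update baselines. The whole-run Set as Baseline action is shown only when the run passed, so it is available for runs that completed with warnings but not for runs that failed.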

Runs-list status dot

Run tables on the Suite Detail, Test Case Detail, and Standalone Test Detail pages show a small colored dot inline in the run’s Status cell:

Red dot — tooltip Visual regression failed. The run’s aggregated status is fail.
Yellow dot — tooltip Visual drift detected. The run’s aggregated status is warn.
No dot — the run is ok or visual regression didn’t run.

The top-level Tests Dashboard displays suite-level summaries only; the per-run dot appears once you drill into a specific suite or test.

Per-step result fields

Every step in result_json carries these fields (all null when the comparison didn’t run):

| Field | Description |
| --- | --- |
| `visual_verdict` | `"ok"` \| `"warn"` \| `"fail"` \| `null` |
| `visual_score` | Changed-pixel percentage (0–100) |
| `visual_ssim_score` | SSIM similarity (0.0–1.0, higher = more similar) |
| `visual_hash_distance` | Perceptual-hash hamming distance (0 = identical) |
| `visual_diff_image_url` | URL to the red-overlay diff image (warn or fail only) |
| `visual_baseline_image_url` | URL to the reference image |
| `visual_current_image_url` | URL to the step’s actual screenshot |
| `visual_reference_source` | `"baseline"` (stored) or `"recording"` (first-run fallback) |
| `visual_reference_image_url` | URL to the frozen reference copy used for this run’s comparison (warn or fail only; see The reference image on run pages) |
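
A consumer of these fields will often want just the flagged steps. The sketch below assumes the steps have already been extracted from result_json as a list of dicts; how the list is nested inside result_json is not specified here, and the function name is illustrative.

```python
def drifted_steps(steps):
    """Return (index, verdict, score, diff_url) for steps flagged warn or fail."""
    return [
        (i, s["visual_verdict"], s.get("visual_score"), s.get("visual_diff_image_url"))
        for i, s in enumerate(steps)
        if s.get("visual_verdict") in ("warn", "fail")
    ]
```

Steps whose comparison did not run carry a null verdict and are left out, as are steps that came back ok.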

Cost

Suites with both thresholds disabled: no added cost at execution time.
Suites with at least one threshold enabled: roughly 100–200 ms of comparison per step that reaches tier 2. Hash-identical steps short-circuit in single-digit ms. A 10-step test typically adds 1–2 seconds per run.
Fail-threshold aborts save time — the run stops at the divergent step instead of continuing through the remainder.
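
The per-run figure follows from the per-step figure by multiplication. A back-of-the-envelope estimator, assuming 5 ms as a stand-in for the "single-digit ms" hash short-circuit (that value and the function name are assumptions):

```python
def estimated_overhead_ms(total_steps, hash_identical_steps=0,
                          per_compare_ms=(100, 200), hash_ms=5):
    """Return a (low, high) estimate in ms of comparison time added to one run."""
    compared = total_steps - hash_identical_steps
    fast = hash_identical_steps * hash_ms
    return (compared * per_compare_ms[0] + fast,
            compared * per_compare_ms[1] + fast)
```

For a 10-step test where every step reaches a full comparison, this gives (1000, 2000), which matches the 1–2 seconds quoted above.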

Per-device baselines

Baselines are tracked per device. The first time a test runs on a device that has no baseline yet, the comparison falls back to the recording. After a clean-passing run — or after you manually accept — that device gets its own baseline. Other devices keep their existing baselines untouched, so a test recorded on iPhone 15 can accumulate an iPhone 17 Pro Max baseline without overwriting the iPhone 15 one.

Browsing baselines per device

The Baselines section on a test case page has a device selector at the top. Pick a device to see its baselines step-by-step. The Recording tab shows the fallback the system uses on devices that do not have a baseline yet.

Clearing a device’s baselines

When a device’s baselines no longer match (for example, after a UI redesign), select the device tab and click Clear. The rows and image files for that device are deleted; other devices’ baselines are unaffected. The next run on that device falls back to the recording until you accept a new baseline.

Accept buttons on the run detail page

The full-run and per-step accept buttons both label themselves with the run’s device — for example, Set as iPhone 16 Pro Max baseline. Accepting only writes the baseline for that run’s device.

Note: Thresholds can be set suite-wide or overridden per step, but not per device — there is no per-device threshold override today.

Device identity under pooling

When several identically-named devices are available (a “pool” of replicas), two different identity rules are at work — and they answer two different questions:

Pinning / reconnect resolves to the device’s UDID. When you pin a session to a specific device, or reconnect to one, VibeView matches on the device’s stable hardware identifier (its UDID, stored as device_id) — not its display name. Two phones that both show as “iPhone 15” have distinct UDIDs, so a pin always targets exactly one physical device even when the names collide.
A baseline resolves to (model name, OS version). Visual-regression baselines are not tied to a single physical device — they are keyed on the device model name together with its OS version. Two identically-named replicas on the same OS version share one baseline (they render the same screens, so one reference is correct for both). Replicas on different OS versions get separate baselines, because OS-level rendering differences (status bar, system fonts, dialogs) would otherwise cause false diffs.

So a pin and a baseline are distinct concepts: a pin targets one physical device by UDID, while a baseline applies to a whole (name, OS version) pool. In the Baselines section, a device that has baselines on more than one OS version appears as separate entries — for example iPhone 15 (iOS 17.4) and iPhone 15 (iOS 18.0) — so clearing or promoting one version never touches the other.
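
The two identity rules can be contrasted in a few lines. In this sketch device_id is the UDID field named above, while the name and os_version keys and both function names are assumptions about how a device record might look.

```python
from collections import defaultdict


def baseline_key(device):
    """Baselines are shared by every replica with the same model name and OS version."""
    return (device["name"], device["os_version"])


def group_pool(devices):
    """Map each baseline key to the UDIDs (pin targets) that share it."""
    groups = defaultdict(list)
    for d in devices:
        groups[baseline_key(d)].append(d["device_id"])
    return dict(groups)
```

Three replicas named iPhone 15, two on iOS 17.4 and one on iOS 18.0, produce two baseline keys but three distinct pin targets.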

Cross-version promotion is not supported. Because OS versions render differently, baselines cannot be promoted from one OS version to another. When a device moves to a new OS version, it starts with no baseline and the next run falls back to the recording until you accept a fresh baseline. (Same-version promotion across two different models of the same form factor is still available.)

Webhook integration

Visual regressions can fire webhook events to your own backend or a Slack channel. Three event types are available:

visual.regression_warning — fires mid-run when a step’s diff crosses the warn threshold.
visual.regression_detected — fires mid-run when a step’s diff crosses the fail threshold (the step also fails).
visual.run_completed_with_regressions — fires once after a run that had at least one regression, with a summary count.

The two mid-run events fire whether the reference is an accepted baseline or the original recording (first-run fallback). Each payload includes a reference_source field ("baseline" or "recording") so consumers can filter out first-run noise if they want — see Webhooks and Slack for the full payload schema.
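
A receiving service might filter first-run noise along these lines. This is a sketch under assumptions: the event type is assumed to arrive in an "event" key, which should be checked against the payload schema in Webhooks and Slack. Only the three event names and the reference_source field come from the description above.

```python
MID_RUN_EVENTS = {"visual.regression_warning", "visual.regression_detected"}


def should_notify(payload, ignore_first_run=True):
    """Decide whether a visual-regression webhook payload deserves an alert."""
    event = payload.get("event")  # assumed field name
    if event == "visual.run_completed_with_regressions":
        return True
    if event in MID_RUN_EVENTS:
        first_run = payload.get("reference_source") == "recording"
        return not (ignore_first_run and first_run)
    return False  # not a visual-regression event
```

With the default setting, mid-run events compared against the original recording are dropped, while events compared against an accepted baseline and the end-of-run summary still pass through.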

Subscribe an endpoint to these events under Settings → Integrations.