Test Analytics

Track test reliability and performance over time. Identify flaky tests, slow steps, and failure patterns to keep your test suite healthy.

For an overview of the Tests Dashboard and basic suite management, see Tests Dashboard.

Quick Start

Run a test suite multiple times to build up run history.
Open the suite detail page and scroll to the Analytics section.
Review pass rates, duration trends, flakiness scores, and failure groupings.

Where to Find Analytics

Analytics are available on the Suite Detail page (/tests/:suiteId). The page shows:

Stats cards at the top with high-level metrics.
Charts below the stats with visual trends.
Run history at the bottom with detailed per-run data.

Analytics require multiple completed runs to show meaningful data. A suite with only one run will show basic stats but no trend information.

Metrics

Pass Rate

The fraction of completed test runs that passed, reported as a value between 0.0 and 1.0. Calculated as:

pass_rate = passed_runs / (passed_runs + failed_runs + error_runs)

Pending, running, and cancelled runs are excluded from the calculation. A pass rate of 1.0 means every completed run passed.
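As a rough illustration of the rule above, the following Python sketch computes a pass rate from a list of run statuses. The status strings other than "passed" are assumptions based on the status names used in this section, and the function name is hypothetical; this is not VibeView's implementation.

```python
from collections import Counter

# Statuses that count as "completed" for the pass rate (assumed string values).
COMPLETED = {"passed", "failed", "error"}


def pass_rate(statuses):
    """Return passed / (passed + failed + error), or None with no completed runs."""
    counts = Counter(s for s in statuses if s in COMPLETED)
    completed = sum(counts.values())
    if completed == 0:
        return None  # nothing to divide by yet
    return counts["passed"] / completed


# Running and cancelled runs are ignored: 3 passed out of 5 completed.
statuses = ["passed", "passed", "failed", "error", "running", "cancelled", "passed"]
print(pass_rate(statuses))  # 0.6
```

Returning None for a suite with no completed runs is a design choice of this sketch; it avoids reporting a misleading 0.0.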

The Case Reliability section breaks this down per test case, sorted by worst pass rate first. This helps you identify which specific test cases are dragging down suite reliability.

Average Duration

Mean execution time across all completed runs, shown in the stats card as a human-readable value (e.g., “45s” or “2m 15s”). Use this to detect performance regressions: if average duration suddenly increases, a recent app change may have introduced slowness.

Flakiness Score

The flakiness score is the fraction of test cases that produce inconsistent results, reported as a value between 0.0 and 1.0:

flakiness_score = cases_with_both_pass_and_fail / total_cases_with_runs

A test case is considered flaky if its last 10 completed runs include both passes and failures. A flakiness score of 0.0 means every test case consistently passes or consistently fails. A score of 0.5 means half your test cases are flaky.

VibeView counts only passed and failed runs when computing flakiness. Runs with error status (infrastructure issues such as simulator crashes or session timeouts) are excluded because they do not indicate test flakiness.
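The definition can be sketched in Python as follows. Two details are assumptions of this sketch rather than documented behavior: the 10-run window is applied after error runs are dropped, and a case counts toward the denominator only if it has at least one passed or failed run. Function names and the sample data are hypothetical.

```python
# Only these outcomes count toward flakiness; error runs are ignored.
DECISIVE = ("passed", "failed")


def is_flaky(statuses, window=10):
    """statuses: one case's run statuses, oldest first."""
    recent = [s for s in statuses if s in DECISIVE][-window:]
    return "passed" in recent and "failed" in recent


def flakiness_score(history, window=10):
    """history: dict mapping case name -> list of statuses (oldest first)."""
    with_runs = [s for s in history.values() if any(x in DECISIVE for x in s)]
    if not with_runs:
        return None  # no case has a passed or failed run yet
    flaky = sum(1 for s in with_runs if is_flaky(s, window))
    return flaky / len(with_runs)


# Sample data: only "Apply Promo Code" mixes passes and failures.
history = {
    "Standard Checkout": ["passed"] * 5,
    "Apply Promo Code": ["passed", "failed", "passed"],
    "Full Checkout": ["failed", "failed"],
    "Guest Checkout": ["passed", "error", "passed"],
}
print(flakiness_score(history))  # 0.25
```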

Total Cost

Accumulated cost across all test runs in the suite, in micro-credits (1,000,000 = $1.00). The cost is split into two figures: the real cost, which is the usage credit actually deducted (and billed) for VibeView-provided runs, and an estimated cost for BYOK (Bring Your Own Key) runs. VibeView deducts nothing for BYOK runs; the estimate is computed from public list pricing, and your provider bills you directly. The two figures are therefore shown separately and never combined into a single “spend” figure.
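For readers converting these values by hand, here is a small Python sketch of the micro-credit conversion. It uses the real_cost_micro and est_cost_micro field names from the API response shown later on this page; the helper names are hypothetical.

```python
MICRO_PER_DOLLAR = 1_000_000  # 1,000,000 micro-credits = $1.00


def micro_to_usd(micro):
    """Convert micro-credits to dollars."""
    return micro / MICRO_PER_DOLLAR


def format_costs(stats):
    """Render the billed and BYOK-estimated figures separately."""
    real = micro_to_usd(stats["real_cost_micro"])
    est = micro_to_usd(stats["est_cost_micro"])
    return f"billed ${real:.2f}, BYOK estimate ${est:.2f}"


stats = {"real_cost_micro": 3_100_000, "est_cost_micro": 300_000}
print(format_costs(stats))  # billed $3.10, BYOK estimate $0.30
```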

Slowest Tests

The top 10 slowest test cases ranked by average duration. Each entry shows:

Case name — the test case.
Average duration — mean execution time in milliseconds across all completed runs.
Run count — how many runs contributed to the average.

Use this to identify test cases that take disproportionately long and may benefit from optimization.
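The ranking described above could be reproduced from raw per-case durations roughly like this. The input shape (pairs of case name and duration in milliseconds) and the function name are assumptions made for illustration; the output keys mirror the slowest_tests entries in the API response.

```python
from collections import defaultdict


def slowest_cases(runs, limit=10):
    """runs: iterable of (case_name, duration_ms) for completed runs."""
    totals = defaultdict(lambda: [0, 0])  # case name -> [total_ms, run_count]
    for case_name, duration_ms in runs:
        totals[case_name][0] += duration_ms
        totals[case_name][1] += 1
    rows = [
        {"case_name": name, "avg_duration_ms": total / count, "run_count": count}
        for name, (total, count) in totals.items()
    ]
    # Slowest average first, truncated to the top `limit` entries.
    rows.sort(key=lambda row: row["avg_duration_ms"], reverse=True)
    return rows[:limit]


runs = [("Full Checkout", 60000), ("Full Checkout", 70000), ("Standard Checkout", 22000)]
for row in slowest_cases(runs):
    print(row["case_name"], row["avg_duration_ms"], row["run_count"])
```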

Failure Grouping

Failed and errored test runs are classified into categories based on their error messages:

Category	Description
Element Not Found	The AI agent could not locate a UI element.
Timeout	The operation timed out waiting for a condition.
Screen Unchanged	The screen did not change after an action.
Screen Mismatch	The screen did not match the expected state.
Assertion Failed	An explicit assertion did not pass.
Other	Errors that do not match a known category.

The failure grouping chart shows how many failures fall into each category, helping you prioritize fixes. For example, if most failures are “Element Not Found,” your test steps may reference UI elements that changed in a recent app update.
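To decide which category to tackle first, you might rank the failure_groups entries from the API response by count and share. The sketch below assumes the reason and count fields shown in the response schema on this page; the function name is hypothetical.

```python
def rank_failure_groups(failure_groups):
    """Return (reason, count, share_of_all_failures), largest group first."""
    total = sum(group["count"] for group in failure_groups)
    ranked = sorted(failure_groups, key=lambda group: group["count"], reverse=True)
    return [
        (group["reason"], group["count"], group["count"] / total if total else 0.0)
        for group in ranked
    ]


failure_groups = [
    {"reason": "timeout", "count": 3},
    {"reason": "element_not_found", "count": 8},
]
for reason, count, share in rank_failure_groups(failure_groups):
    print(f"{reason}: {count} ({share:.0%})")
```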

Charts

The analytics section displays four interactive charts:

Slowest Test Cases

A horizontal bar chart showing the top 10 test cases by average execution time. Bars are color-coded:

Coral — under 30 seconds (healthy).
Yellow — 30 to 60 seconds (worth investigating).
Red — over 60 seconds (likely needs optimization).

Hover over a bar to see the exact duration and number of runs.
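The color bands above amount to a simple threshold function. In this sketch the handling of the exact 30-second and 60-second boundaries is an assumption (both are treated as yellow), and the function name is hypothetical.

```python
def bar_color(avg_duration_ms):
    """Map an average duration in milliseconds to the chart's color band."""
    seconds = avg_duration_ms / 1000
    if seconds < 30:
        return "coral"   # healthy
    if seconds <= 60:
        return "yellow"  # worth investigating
    return "red"         # likely needs optimization


for duration_ms in (22000, 45200, 65000):
    print(duration_ms, bar_color(duration_ms))
```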

Failures by Reason

A horizontal bar chart grouping all failures by their classified reason. It shows at a glance whether failures are concentrated in one category or spread across many.

Duration Percentiles (p50 / p95)

An area chart showing duration trends over time, with two lines:

p50 (median) — the typical run duration. A solid line.
p95 — the worst-case duration (95th percentile). A dashed line.

Data is aggregated by day. A widening gap between p50 and p95 indicates inconsistent performance — some runs are much slower than others.
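To make the daily aggregation concrete, here is a Python sketch that groups run durations by day and computes p50 and p95. It uses the nearest-rank percentile method, which is an assumption: this page does not say which percentile method VibeView uses, so values may differ slightly from the chart. The input shape and function names are hypothetical.

```python
import math
from collections import defaultdict


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of a non-empty, ascending list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def daily_percentiles(runs):
    """runs: iterable of (date_string, duration_ms) pairs."""
    by_day = defaultdict(list)
    for day, duration_ms in runs:
        by_day[day].append(duration_ms)
    result = []
    for day in sorted(by_day):
        values = sorted(by_day[day])
        result.append({
            "date": day,
            "p50": nearest_rank(values, 50),
            "p95": nearest_rank(values, 95),
        })
    return result


runs = [
    ("2026-03-18", 40000),
    ("2026-03-18", 42000),
    ("2026-03-18", 44000),
    ("2026-03-18", 68000),
]
print(daily_percentiles(runs))
# [{'date': '2026-03-18', 'p50': 42000, 'p95': 68000}]
```

In this sample a single slow run (68 seconds) sets p95 while p50 stays near 42 seconds, which is the widening gap described above.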

Flaky Test Count Over Time

A step chart showing how many test cases were flaky on each day. A test case counts as flaky on a given day if it had both pass and fail results on that day.

Use this to track whether flakiness is improving or worsening over time.
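The per-day rule can be sketched as follows: a case is counted for a day only if that day contains both a passed and a failed result for it. Ignoring error results here mirrors the flakiness score definition and is an assumption for this chart. The input shape and function name are hypothetical; the output keys mirror the flaky_trend entries in the API response.

```python
from collections import defaultdict


def flaky_trend(results):
    """results: iterable of (date_string, case_name, status) tuples."""
    outcomes = defaultdict(set)  # (day, case) -> set of decisive statuses seen
    days = set()
    for day, case_name, status in results:
        days.add(day)
        if status in ("passed", "failed"):
            outcomes[(day, case_name)].add(status)
    counts = dict.fromkeys(days, 0)
    for (day, _case), seen in outcomes.items():
        if len(seen) == 2:  # both "passed" and "failed" on the same day
            counts[day] += 1
    return [{"date": day, "flaky_count": counts[day]} for day in sorted(counts)]


results = [
    ("2026-03-17", "Standard Checkout", "passed"),
    ("2026-03-17", "Apply Promo Code", "failed"),
    ("2026-03-18", "Apply Promo Code", "passed"),
    ("2026-03-18", "Apply Promo Code", "failed"),
    ("2026-03-18", "Standard Checkout", "passed"),
    ("2026-03-18", "Standard Checkout", "error"),
]
print(flaky_trend(results))
```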

Time Filtering

You can filter analytics data by time range using the days query parameter on the API:

GET /api/v1/tests/suites/{suite_id}/analytics?days=30

This returns analytics based only on test runs from the last 30 days. Omit the days parameter to include all historical data.
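A minimal Python sketch of calling this endpoint with the standard library is shown below. The base URL, the bearer-token Authorization header, and both function names are assumptions made for illustration; check your VibeView account settings for the actual host and authentication scheme.

```python
import json
import urllib.parse
import urllib.request


def analytics_url(base_url, suite_id, days=None):
    """Build the analytics URL, adding ?days=N only when days is given."""
    suite = urllib.parse.quote(str(suite_id), safe="")
    path = f"/api/v1/tests/suites/{suite}/analytics"
    query = "?" + urllib.parse.urlencode({"days": days}) if days is not None else ""
    return base_url.rstrip("/") + path + query


def fetch_analytics(base_url, suite_id, token, days=None):
    """Fetch and decode the analytics JSON (auth scheme is an assumption)."""
    request = urllib.request.Request(
        analytics_url(base_url, suite_id, days),
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Hypothetical host; omitting `days` requests all historical data.
print(analytics_url("https://vibeview.example", "checkout-flow-a1b2c3", days=30))
print(analytics_url("https://vibeview.example", "checkout-flow-a1b2c3"))
```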

Analytics API

GET /api/v1/tests/suites/{suite_id}/analytics

Query Parameters:

Parameter	Type	Description
`days`	integer (optional)	Limit data to the last N days.

Response Schema:

{
  "suite": {
    "id": 1,
    "public_id": "checkout-flow-a1b2c3",
    "name": "Checkout Flow",
    "description": "...",
    "app_id": 5,
    "app_public_id": "shop-app-d4e5f6",
    "app_name": "Shop App",
    "platform": "ios",
    "case_count": 4,
    "visual_regression_warn_pct": 10,
    "visual_regression_fail_pct": 25,
    "created_at": "2026-01-15T10:30:00"
  },
  "stats": {
    "total_runs": 120,
    "pass_rate": 0.875,
    "avg_duration_ms": 45200,
    "total_cost_micro": 3400000,
    "real_cost_micro": 3100000,
    "est_cost_micro": 300000,
    "case_count": 4,
    "flakiness_score": 0.25
  },
  "case_pass_rates": [
    {
      "case_name": "Apply Promo Code",
      "pass_rate": 0.6,
      "total_runs": 30,
      "passed": 18
    }
  ],
  "runs": [
    {
      "id": "run-abc123",
      "type": "suite_run",
      "status": "passed",
      "started_at": "2026-03-19T14:20:00",
      "finished_at": "2026-03-19T14:21:30",
      "duration_ms": 90000,
      "cost_micro": 120000,
      "model_used": "claude-sonnet-4-20250514",
      "cases_passed": 4,
      "cases_total": 4,
      "test_runs": [
        {
          "run_id": "tr-xyz789",
          "case_name": "Standard Checkout",
          "status": "passed",
          "duration_ms": 22000,
          "tokens_used": 4500,
          "cost_micro": 30000,
          "error_message": null
        }
      ]
    }
  ],
  "slowest_tests": [
    {
      "case_name": "Full Checkout",
      "avg_duration_ms": 65000,
      "run_count": 25
    }
  ],
  "failure_groups": [
    {
      "reason": "element_not_found",
      "count": 8
    },
    {
      "reason": "timeout",
      "count": 3
    }
  ],
  "duration_percentiles": [
    {
      "date": "2026-03-18",
      "p50": 42000,
      "p95": 68000
    }
  ],
  "flaky_trend": [
    {
      "date": "2026-03-18",
      "flaky_count": 1
    }
  ]
}

Response Fields

Field	Description
`suite`	Suite metadata including app info and visual regression thresholds (`visual_regression_warn_pct`, `visual_regression_fail_pct`; either may be null when the tier is disabled).
`stats`	Aggregate statistics: total runs, pass rate, average duration, cost, case count, flakiness.
`case_pass_rates`	Per-case pass rates sorted by worst first. Includes total runs and pass count.
`runs`	Run history (last 50 suite runs + last 50 standalone runs), sorted by date descending. Suite runs include nested `test_runs` with per-case results.
`slowest_tests`	Top 10 slowest test cases by average duration.
`failure_groups`	Failure count by classified reason (element_not_found, timeout, etc.).
`duration_percentiles`	Daily p50 and p95 duration values for trend analysis.
`flaky_trend`	Daily count of test cases that had both pass and fail results.
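Once the response is decoded, the fields above can be read directly. As a small example, this sketch lists the cases whose pass rate falls below a chosen threshold, using only the case_pass_rates fields documented here. The threshold value and function name are hypothetical.

```python
def unreliable_cases(analytics, threshold=0.9):
    """Return (case_name, pass_rate) for cases below the threshold."""
    return [
        (case["case_name"], case["pass_rate"])
        for case in analytics["case_pass_rates"]
        if case["pass_rate"] < threshold
    ]


# Trimmed-down response containing only the field this sketch reads.
analytics = {
    "case_pass_rates": [
        {"case_name": "Apply Promo Code", "pass_rate": 0.6, "total_runs": 30, "passed": 18},
        {"case_name": "Standard Checkout", "pass_rate": 1.0, "total_runs": 30, "passed": 30},
    ]
}
print(unreliable_cases(analytics))  # [('Apply Promo Code', 0.6)]
```

Because case_pass_rates is already sorted worst first, the result keeps that order.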

Tips

Run suites consistently to build reliable trend data. Sporadic runs make trends harder to interpret.
Use flakiness scores to prioritize test maintenance. A flaky test erodes confidence in your test suite more than a consistently failing one.
Compare duration percentiles (p50 vs p95) to catch performance regressions. If p95 spikes while p50 stays flat, you have an intermittent slowness issue.
Check failure groupings after app updates. A sudden spike in “Element Not Found” failures usually means UI element identifiers changed.
Filter by time range (?days=7) to focus on recent trends when investigating a regression.
Combine with the Maintenance tab on the Tests Dashboard for a cross-suite view of flaky and failing tests across your entire organization.
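The p50 versus p95 tip can be automated over the duration_percentiles array. The sketch below flags days where p95 jumped relative to the previous entry while p50 stayed roughly flat. The spike and flat ratios are arbitrary illustrative thresholds, not VibeView defaults, and the function name is hypothetical.

```python
def intermittent_slowness_days(percentiles, spike=1.5, flat=1.1):
    """percentiles: list of {"date", "p50", "p95"} dicts in date order."""
    flagged = []
    for previous, current in zip(percentiles, percentiles[1:]):
        if previous["p50"] <= 0 or previous["p95"] <= 0:
            continue  # cannot compute a ratio against a zero baseline
        p95_ratio = current["p95"] / previous["p95"]
        p50_ratio = current["p50"] / previous["p50"]
        # p95 grew sharply while p50 moved by less than the `flat` factor.
        if p95_ratio >= spike and 1 / flat <= p50_ratio <= flat:
            flagged.append(current["date"])
    return flagged


percentiles = [
    {"date": "2026-03-17", "p50": 42000, "p95": 60000},
    {"date": "2026-03-18", "p50": 43000, "p95": 120000},
    {"date": "2026-03-19", "p50": 80000, "p95": 130000},
]
print(intermittent_slowness_days(percentiles))  # ['2026-03-18']
```

On 2026-03-19 both percentiles are high, so the day is not flagged: that pattern suggests a general slowdown rather than intermittent slowness.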