
We benchmarked 6 AI models
on mobile tap precision.
GPT-5.1 scored 21%.

Spatial accuracy is the most common failure point for mobile AI agents. UI-TapBench measures exactly this — across 570 annotated screenshots from 20 real production apps.

94.51%
Drizz accuracy
21.72%
GPT-5.1 accuracy
97.18
Drizz F1 score
570
Annotated screenshots

Six models. One task.
A clear hierarchy.

All models evaluated zero-shot on identical prompts. A tap is correct if predicted coordinates fall within the ground-truth bounding box. Sorted by accuracy.

94.51%

Drizz tap accuracy — highest across all six models evaluated

Drizz · Accuracy

21.72%

GPT-5.1 accuracy on the same dataset — a model expected to perform far higher

GPT-5.1 · Accuracy

2.7×

Gap between the best and worst F1 scores: spatial precision is the differentiator

97.18 vs 35.68 · F1
| Model | Accuracy | Precision | Recall | F1 Score | Latency p90 |
|---|---|---|---|---|---|
| Drizz (#1) | 94.51% | 96.22% | 98.16% | 97.18 | 4.81s |
| Qwen 3.5-27b | 92.98% | 94.98% | 97.61% | 96.28 | 5.92s |
| Gemini Pro | 89.84% | 91.28% | 98.28% | 94.65 | 12.27s |
| Gemini Flash | 81.44% | 83.78% | 96.67% | 89.77 | 8.19s |
| GPT-5.2 Low | 44.83% | 45.71% | 95.88% | 61.91 | 5.0s |
| GPT-5.1 Low | 21.72% | 23.35% | 75.61% | 35.68 | 5.55s |

p90 = 90th percentile inference time. Zero-shot. Apache 2.0 dataset.
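The F1 column can be checked against the precision and recall columns, since F1 is the harmonic mean of the two. The following is a minimal sketch in Python with the table values copied in by hand; `ROWS` and `f1_score` are illustrative names and are not part of the UI-TapBench release.

```python
# Precision and recall (in percent) copied by hand from the results table.
ROWS = {
    "Drizz": (96.22, 98.16),
    "Qwen 3.5-27b": (94.98, 97.61),
    "Gemini Pro": (91.28, 98.28),
    "Gemini Flash": (83.78, 96.67),
    "GPT-5.2 Low": (45.71, 95.88),
    "GPT-5.1 Low": (23.35, 75.61),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for model, (p, r) in ROWS.items():
        print(f"{model:<14} F1 = {f1_score(p, r):.2f}")
```

Because the published precision and recall are already rounded, a recomputed F1 can differ from the table in the last decimal place.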


Why 1.5 points of accuracy is not a small number

Drizz leads Qwen by 1.53 percentage points. In a CI/CD pipeline running 50 test cases with 20 taps each, 20 times a day, that gap creates roughly 300 extra failed taps every day.

| Metric | Drizz · 94.51% accuracy | Qwen 3.5-27b · 92.98% accuracy |
|---|---|---|
| Errors per 1,000 taps | 55 | 70 |
| Test cases affected / run | ~5–6 | ~7 |
| Daily failures (20 CI runs) | 1,100 | 1,400 |
| 15-step flow success rate | 42.9% | 33.6% |
| Latency p90 | 4.81s | 5.92s |

Cascading failure math
Mobile test automation is sequential. A wrong tap on step 3 of a 15-step flow doesn't just fail step 3; it corrupts app state through steps 4–15. If each tap succeeds independently at 94.51% per-tap accuracy, Drizz completes a 15-step flow without error 0.9451^15 ≈ 42.9% of the time. Qwen at 92.98% drops to 0.9298^15 ≈ 33.6%. A 1.53-point accuracy gap translates to a roughly 28% higher probability of end-to-end success.
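The production figures in this section come from two pieces of arithmetic: expected wrong taps scale linearly with tap volume, while flow success compounds per step. The sketch below reproduces both under the assumption, not stated in the benchmark itself, that every tap succeeds independently with the model's measured accuracy. The function names and constants are illustrative.

```python
def tap_errors(accuracy: float, taps: int) -> float:
    """Expected number of wrong taps out of `taps` attempts."""
    return (1.0 - accuracy) * taps


def flow_success_rate(accuracy: float, steps: int) -> float:
    """Probability that every tap in a flow of `steps` taps is correct.

    Assumes each tap succeeds independently with probability `accuracy`.
    """
    return accuracy ** steps


if __name__ == "__main__":
    # Pipeline shape used in the text: 50 test cases x 20 taps, 20 runs a day.
    TEST_CASES, TAPS_PER_CASE, RUNS_PER_DAY, FLOW_STEPS = 50, 20, 20, 15
    taps_per_run = TEST_CASES * TAPS_PER_CASE  # 1,000 taps per CI run

    for model, acc in {"Drizz": 0.9451, "Qwen 3.5-27b": 0.9298}.items():
        per_run = tap_errors(acc, taps_per_run)
        print(
            f"{model}: {per_run:.0f} wrong taps per run, "
            f"{per_run * RUNS_PER_DAY:.0f} per day, "
            f"{flow_success_rate(acc, FLOW_STEPS):.1%} 15-step flow success"
        )
```

Real flows are unlikely to be fully independent, since some screens are harder than others, so these numbers are best read as an order-of-magnitude estimate.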

Why GPT-5.1 scores 21%
on a task it should pass

This is not a language failure. GPT-5.1 understands "tap the second option in the list" perfectly. It fails to map that instruction to the correct pixel on a dense mobile screen.

Dense layouts require precise counting, not comprehension

Mobile UIs pack dozens of tappable elements into small screens. "The third item" requires iterating over visually similar rows and landing on the exact bounding box, not recognising a region.

General training doesn't calibrate for pixel-level spatial tasks

LMMs trained on diverse image understanding tasks lack fine-grained spatial calibration for mobile UI interaction. Domain-specific fine-tuning on production app layouts is required.

High recall, catastrophic precision

GPT-5.1 scores 75.61% recall but only 23.35% precision. It taps something — just the wrong thing. A false positive in mobile automation is worse than a miss: it corrupts state and cascades failures forward.
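The recall-versus-precision pattern is easier to see with counts. This page does not spell out how true positives, false positives, and misses are tallied per sample, so the sketch below simply applies the standard definitions to made-up counts. The numbers are hypothetical and are not drawn from the dataset.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Standard precision and recall from raw counts.

    tp: taps that landed on the intended element
    fp: taps that landed on something else
    fn: cases where the intended element was not hit and no tap was counted
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical counts: the model almost always taps, but usually on the wrong thing.
    precision, recall = precision_recall(tp=3, fp=9, fn=1)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

With many wrong taps and few misses, recall stays high while precision collapses, which is the shape of the GPT-5.1 result.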

Outside Drizz, the fastest models failed hardest

GPT-5.2 runs at 5.0s p90 and GPT-5.1 at 5.55s. Both are faster than Qwen and both Gemini models, though still slower than Drizz's 4.81s. A fast wrong tap fails the test, moves on, and leaves corrupt app state poisoning every downstream step.

How we ran it. Reproduce it yourself.

We are both benchmark creator and top scorer. Full reproducibility — open dataset, disclosed constants, identical conditions for every model — is our only credibility.

Evaluation conditions
Each model received: one mobile screenshot + one natural language instruction. Prompted to return (x, y) coordinates. A tap is marked correct if predicted coordinates fall within the ground-truth bounding box. Zero-shot. No few-shot examples, no chain-of-thought, no model-specific prompt engineering. Identical prompts across all six models.
Scope
Tap-level spatial precision only. Not end-to-end task completion, multi-step reasoning, or stateful interaction.
Sample size
570 samples. Sufficient to differentiate production-viable from non-viable. Dataset expansion planned for v2.
Platform
Android screenshots only. iOS, iPadOS, and web mobile layouts are not included in this release.
Language
English-language interfaces only. RTL layouts and CJK interfaces not evaluated.
Self-evaluation
Drizz is benchmark creator and top scorer. Mitigated by open dataset + full annotation release under Apache 2.0.
Prompting
Zero-shot only. Performance may differ with few-shot examples or model-specific optimisations.
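The scoring rule described above (a tap is correct when the predicted point falls inside the ground-truth bounding box) can be sketched in a few lines. The box layout (left, top, right, bottom in pixels), the inclusive edges, and the names `BoundingBox` and `tap_accuracy` are assumptions made for illustration. The released annotations may use a different schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in screen pixels (assumed layout, origin at top-left)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Edges are treated as inside the box; this is an assumption.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def tap_accuracy(predictions: list[tuple[float, float]],
                 boxes: list[BoundingBox]) -> float:
    """Fraction of predicted (x, y) taps that land inside their target box."""
    if len(predictions) != len(boxes):
        raise ValueError("each prediction needs exactly one ground-truth box")
    if not boxes:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, boxes))
    return hits / len(boxes)


if __name__ == "__main__":
    # Two toy samples: the first tap hits its target, the second misses.
    boxes = [BoundingBox(40, 300, 680, 380), BoundingBox(40, 400, 680, 480)]
    taps = [(360.0, 340.0), (360.0, 520.0)]
    print(f"accuracy = {tap_accuracy(taps, boxes):.0%}")
```

Scoring each model with the same function on the same samples keeps the comparison limited to tap placement, which matches the stated scope.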

570 screenshots. 20 apps. Apache 2.0.

Run your own models. Challenge the scores. Publish your results. That's why we open-sourced it.

Food & Travel

Uber Eats, DoorDash, Instacart, Uber, Airbnb, Booking.com

6 Apps

Social & Entertainment

WhatsApp, Telegram, Reddit, Spotify, YouTube, Netflix

6 Apps

Finance, Health & Work

PayPal, Robinhood, Notion, Todoist, Duolingo, Coursera, MyFitnessPal, Strava

8 Apps

The formatted PDF. Forward-ready.

Charts, tables, production math, and citation — in a single file built for sharing with your VP or engineering lead. The full report is freely available on this page. The PDF is the shareable version.

The AI-driven stability and ease of execution have helped us move faster while maintaining confidence in our releases. Drizz has become a valuable addition to our QA toolkit for scaling automation efficiently. Kudos to the Drizz team for building a practical and effective solution for modern QA teams.

Write tests in plain English. Drizz executes them on real iOS & Android devices using Vision AI. Tests that self-heal, don’t break, and require zero maintenance.

10x
Faster than Appium
<5%
Flakiness rate
20mins
To your first test
With Drizz, we've simplified automation and ensured quality with less effort - no more broken tests or tedious setups.