
Spatial accuracy is the most common failure point for mobile AI agents. UI-TapBench measures exactly this — across 570 annotated screenshots from 20 real production apps.
All models were evaluated zero-shot on identical prompts. A tap counts as correct if the predicted coordinates fall within the ground-truth bounding box. Results are sorted by accuracy.
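The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scorer: the `BoundingBox` field names, the pixel coordinate convention, and the choice to treat box edges as inside are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    # Hypothetical layout: pixel coordinates with the origin at the top-left.
    left: float
    top: float
    right: float
    bottom: float


def is_tap_correct(x: float, y: float, box: BoundingBox) -> bool:
    """Return True if the predicted tap lands inside the ground-truth box.

    Edges are treated as inside the box (an assumption of this sketch).
    """
    return box.left <= x <= box.right and box.top <= y <= box.bottom


def tap_accuracy(predictions, boxes) -> float:
    """Fraction of predicted (x, y) taps that fall inside their target box."""
    if not boxes or len(predictions) != len(boxes):
        raise ValueError("need one prediction per ground-truth box")
    hits = sum(is_tap_correct(x, y, box) for (x, y), box in zip(predictions, boxes))
    return hits / len(boxes)


# Two made-up targets: the first tap hits, the second lands below its box.
boxes = [BoundingBox(100, 300, 200, 380), BoundingBox(100, 400, 200, 480)]
predictions = [(120, 340), (150, 500)]
accuracy = tap_accuracy(predictions, boxes)
print(f"accuracy = {accuracy:.2f}")
```

Under this rule a tap that lands one pixel outside the target scores the same as one that misses by half a screen, which is why the metric rewards exact localisation rather than approximate region matching.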
Drizz tap accuracy — highest across all six models evaluated
GPT-5.1 accuracy on the same dataset — a model expected to perform far higher
Gap between the best and worst F1 scores: spatial precision is the differentiator
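Drizz leads Qwen by 1.53 percentage points. In a CI/CD pipeline running 50 test cases with 20 taps each (1,000 taps per run), that gap means about 15 extra failed taps per run, which reaches roughly 300 extra failures a day once the pipeline runs about 20 times daily.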
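The arithmetic behind this estimate is easy to check. The sketch below assumes every tap fails independently at a rate equal to the accuracy gap, and it treats the number of pipeline runs per day as a free parameter because the text does not state it. The function name is hypothetical.

```python
def extra_failed_taps(gap_pct_points: float, test_cases: int,
                      taps_per_case: int, runs_per_day: int = 1) -> float:
    """Expected extra failed taps per day caused by an accuracy gap.

    gap_pct_points is the accuracy difference in percentage points.
    """
    taps_per_run = test_cases * taps_per_case
    return gap_pct_points / 100 * taps_per_run * runs_per_day


# 50 test cases x 20 taps = 1,000 taps per run.
per_run = extra_failed_taps(1.53, test_cases=50, taps_per_case=20)

# Assumed schedule of 20 runs per day (not stated in the text).
per_day_20 = extra_failed_taps(1.53, test_cases=50, taps_per_case=20, runs_per_day=20)

print(f"{per_run:.1f} extra failed taps per run")
print(f"{per_day_20:.0f} extra failed taps per day at 20 runs")
```

A single run yields about 15 extra failed taps, so a daily figure near 300 corresponds to a schedule of about 20 runs a day.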
This is not a language failure. GPT-5.1 understands "tap the second option in the list" perfectly. It fails to map that instruction to the correct pixel on a dense mobile screen.
Mobile UIs pack dozens of tappable elements into small screens. "The third item" requires iterating over visually similar rows and landing on the exact bounding box, not just recognising a region.
LMMs trained on diverse image understanding tasks lack fine-grained spatial calibration for mobile UI interaction. Domain-specific fine-tuning on production app layouts is required.
GPT-5.1 scores 75.61% recall but only 23.35% precision. It taps something — just the wrong thing. A false positive in mobile automation is worse than a miss: it corrupts state and cascades failures forward.
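To see how a precision/recall imbalance like this drags down F1, here is a minimal sketch using the standard definition of F1 as the harmonic mean of precision and recall. Whether the benchmark computes its F1 in exactly this way is an assumption, and the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted above for GPT-5.1, converted from percentages to fractions.
gpt51_f1 = f1_score(precision=0.2335, recall=0.7561)
print(f"F1 = {gpt51_f1:.4f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, the quoted figures give an F1 of roughly 0.357 under this definition, much closer to the 23.35% precision than to the 75.61% recall.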
GPT-5.2 runs at 5.0s p90 and GPT-5.1 at 5.55s, against Drizz's 4.81s. Speed is not the deciding factor either way: a fast wrong tap fails the test, moves on, and leaves corrupt app state poisoning every downstream step.
We are both benchmark creator and top scorer. Full reproducibility — open dataset, disclosed constants, identical conditions for every model — is our only credibility.
Run your own models. Challenge the scores. Publish your results. That's why we open-sourced it.
Uber Eats, DoorDash, Instacart, Uber, Airbnb, Booking.com
WhatsApp, Telegram, Reddit, Spotify, YouTube, Netflix
PayPal, Robinhood, Notion, Todoist, Duolingo, Coursera, MyFitnessPal, Strava

Charts, tables, production math, and citation — in a single file built for sharing with your VP or engineering lead. The full report is freely available on this page. The PDF is the shareable version.