How to Fix Flaky Mobile Tests: 4 Root Causes and How to Reduce Them in CI/CD (2026)

TL;DR

Flaky mobile tests are caused by four root issues: timing/sync problems, selector fragility, device inconsistency, and environment drift.
The studies cited below support this breakdown: async timing accounts for roughly 45% of flaky failures, selector and DOM instability for about 28%, and the rest splits between device variability and environment differences.

Why do timing and sync problems cause flaky mobile tests?

This is the #1 cause. A study analyzing 201 fixes across 51 Apache projects found that 45% of flaky test fixes addressed async timing issues. A 2026 benchmark report confirmed the same range across mobile and web.

What happens: your test clicks a button before the screen finishes loading. The element exists in the DOM or view hierarchy but isn't interactive yet. The test fails. You re-run it 30 seconds later and it passes. Nothing changed except the timing.

The bad fix: Thread.sleep() or time.sleep().

A hardcoded sleep pauses the test for a fixed duration regardless of what the app is doing. If you set 3 seconds and the app loads in 1 second, you've wasted 2 seconds. If the app takes 4 seconds on a slower device, the test fails.
Multiply by hundreds of tests. A suite with 200 sleep calls at 3 seconds each adds 10 minutes of dead wait time per run.
Sleep statements mask performance regressions. If a page starts loading slower, you won't know until the sleep isn't long enough.
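
The dead-time figure is simple arithmetic, and it is worth running against your own suite's numbers. A minimal sketch (the function name is made up for illustration):

```python
def sleep_overhead_minutes(sleep_calls: int, seconds_per_sleep: float) -> float:
    """Total time per run spent inside fixed sleeps, in minutes."""
    return sleep_calls * seconds_per_sleep / 60


# The example from the text: 200 sleeps at 3 seconds each.
print(sleep_overhead_minutes(200, 3))  # 10.0
```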

The right fix: explicit, condition-based waits.

Wait for a specific condition: element is visible, element is clickable, text has changed, network request has completed.
In Appium (Java): new WebDriverWait(driver, Duration.ofSeconds(10)).until(ExpectedConditions.visibilityOfElementLocated(By.id("login-btn")));
In Espresso: the framework synchronizes with the UI thread automatically, and IdlingResource lets you register other background work it should wait for. This is why Espresso tends to have lower flakiness than Appium.
In XCUITest: app.buttons["Login"].waitForExistence(timeout: 10)
In Maestro: built-in auto-waiting. No explicit wait commands needed.
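
All of these framework helpers share one idea: poll a condition until it holds or a deadline passes. The sketch below shows that idea in plain Python, independent of any framework. The names wait_until and login_button_ready are hypothetical, and the "app" is simulated with a timer so the example runs on its own.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(
    condition: Callable[[], Optional[T]],
    timeout: float = 10.0,
    poll_interval: float = 0.25,
) -> T:
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # Stop as soon as the app is ready.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)  # Short pause between checks, not a fixed wait.


# Stand-in for an app screen: the button becomes tappable 0.3 s after launch.
launched_at = time.monotonic()


def login_button_ready() -> bool:
    return time.monotonic() - launched_at >= 0.3


wait_until(login_button_ready, timeout=2.0, poll_interval=0.05)
```

Unlike a fixed sleep, the wait returns as soon as the condition holds, and the timeout only matters on the slow path. When the timeout does fire, it points at a real slowdown instead of hiding it.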

The mobile-specific wrinkle: animations. A button might be "visible" but still animating into position, so the test taps the wrong coordinates. Disable animations in your test environment (adb shell settings put global window_animation_scale 0 on Android) or use a framework that waits for animation completion.
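
The text names only window_animation_scale. Android has two sibling global settings, transition_animation_scale and animator_duration_scale, and test setups commonly zero all three. The sketch below builds those adb commands. The function names and the optional serial parameter are illustrative assumptions, and actually running the commands requires adb on your PATH and a connected device or emulator.

```python
import subprocess
from typing import List

# Android's three global animation scale settings.
ANIMATION_SETTINGS = (
    "window_animation_scale",
    "transition_animation_scale",
    "animator_duration_scale",
)


def animation_commands(scale: float = 0, serial: str = "") -> List[List[str]]:
    """Build one `adb shell settings put global ...` command per setting."""
    base = ["adb"] + (["-s", serial] if serial else [])
    value = format(scale, "g")  # 0 -> "0", 0.5 -> "0.5"
    return [
        base + ["shell", "settings", "put", "global", name, value]
        for name in ANIMATION_SETTINGS
    ]


def disable_animations(serial: str = "") -> None:
    """Run the commands; needs adb on PATH and a connected device."""
    for command in animation_commands(0, serial):
        subprocess.run(command, check=True)
```

Call disable_animations() once in your CI setup step, and restore the scales to 1 afterwards if the device is shared with manual testers.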

On r/Playwright, a developer who rewrote a flaky suite summarized it cleanly: "The real lesson: flakiness is usually a waiting problem, not a selector problem." In the same thread, another commenter added a practical warning for SPAs: "waitForResponse is the right call but SPAs will still bite you when the response lands before the component finishes rendering." The fix is to combine network-level waits (API response received) with DOM-level checks (element is visible and interactive). And on the topic of hardcoded timeouts, one reply put it bluntly: "You don't actually win until you go back and delete the page.waitForTimeout calls someone added in a panic six months ago, because those are the ones that mask real bugs."

How does selector fragility make tests flaky?

QA Wolf's analysis of production test suite failures found that DOM changes and brittle selectors account for about 28% of test failures. Not the majority, but a consistent share.

What happens: a developer renames a component, changes a CSS class, restructures the view hierarchy, or removes an accessibility ID during refactoring. The test can't find the element. It fails. The QA team investigates, discovers the selector broke, updates it, and re-runs. That cycle repeats every sprint.

Selectors from most fragile to most stable:

XPath (most fragile): depends on the element's exact position in the hierarchy. One parent node change breaks it.
CSS class: changes during refactoring and style updates.
Text-based matching: works until someone changes the button copy.
Resource ID / testID: stable if developers maintain them, but they're often missing or inconsistent.
Accessibility ID (most stable): works across platforms, but requires developer discipline to assign consistently.

How to reduce selector fragility (within selector-based frameworks):

Use accessibility IDs as your primary selector strategy. On r/QualityAssurance, testers recommend having developers assign an accessibility_id to each element you need to interact with.
Avoid XPath entirely. It's the most brittle selector type and the hardest to debug.
Use data-testid attributes for elements that don't have natural accessibility labels.
Build a selector review process: when a developer changes a component, the PR template includes a checkbox for "updated test selectors."
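
One way to enforce these rules in a page-object layer is to pick locators by a fixed preference order and refuse to fall back to XPath. The sketch below is an illustration under assumed names: pick_locator, the strategy keys, and the sample values are hypothetical and not tied to any framework's API.

```python
from typing import Dict, Optional, Tuple

# Most preferred first. XPath is deliberately absent, per the advice above.
PREFERRED_STRATEGIES = ("accessibility_id", "test_id", "text")


def pick_locator(candidates: Dict[str, Optional[str]]) -> Tuple[str, str]:
    """Return the (strategy, value) pair for the most stable locator available."""
    for strategy in PREFERRED_STRATEGIES:
        value = candidates.get(strategy)
        if value:
            return strategy, value
    raise LookupError(
        "no stable locator available; ask for an accessibility ID or test ID"
    )


login = {"accessibility_id": "login-btn", "text": "Log in"}
banner = {"accessibility_id": None, "text": "Summer sale"}

print(pick_locator(login))   # ('accessibility_id', 'login-btn')
print(pick_locator(banner))  # ('text', 'Summer sale')
```

An element that offers only an XPath raises LookupError, which turns a silent brittle selector into an explicit request for a stable ID.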

The fundamental limit: no matter how disciplined your selector strategy, selectors are a coupling between your test code and your app code. When the app changes, the coupling can break. The question is how often it breaks and how quickly you can fix it. For teams where selector maintenance consumes 30-50% of QA time (common with Appium), the structural problem is the coupling itself.

On r/Playwright, a commenter gave the most practical advice for any flaky test: "Find out why the test is failing. That will help you figure out the solution." It sounds obvious, but most teams skip the investigation and jump to retries. Retries mask the root cause. On r/PracticalTesting, a post made the case that "Flaky tests need owners, not just retries." Assigning ownership to specific team members forces investigation instead of suppression.

Why do tests pass on one device and fail on another?

Device inconsistency is the third root cause. It's especially painful on Android, where fragmentation means your test can pass on a Pixel 8 emulator and fail on a real Samsung Galaxy S23.

What causes it:

Screen resolution and density differences. A button that's visible on a 6.1" screen might require scrolling on a 5.5" screen. The test doesn't scroll. It fails.
OS version behavior. Android 13 handles permissions differently than Android 14. An implicit permission grant in your test works on one version and fails on the other.
Emulator vs real device. Emulators skip hardware-level interactions: GPS, biometrics, camera, NFC, carrier network conditions. Tests that work on emulators fail on real devices because that hardware interaction doesn't exist in the emulated environment.
Animation and rendering speed. A $1,200 flagship renders screens faster than a $200 budget phone. Tests with tight timing assumptions pass on fast hardware and fail on slow hardware.

How to fix it:

Test on real devices for critical flows. Emulators for speed during development, real devices for release validation.
Use a device matrix that matches your actual user base. If 40% of your users are on Samsung devices, your test matrix should include Samsung devices.
Normalize test expectations. Don't assert exact pixel positions. Assert element visibility and content.
Run flaky-test detection across multiple devices. A test that passes on 3/4 devices is a device-dependent flake, not a real failure.
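
Cross-device detection can start as a small classifier over run history. The sketch below assumes you already collect pass/fail outcomes per device for repeated runs on the same commit. The function name, labels, and device names are illustrative, and every device is assumed to have at least one recorded run.

```python
from typing import Dict, List


def classify_across_devices(results: Dict[str, List[bool]]) -> str:
    """Classify one test from its pass/fail history on each device.

    `results` maps a device name to the outcomes of repeated runs
    (True = pass) on the same commit.
    """
    always_pass = [d for d, runs in results.items() if all(runs)]
    always_fail = [d for d, runs in results.items() if not any(runs)]
    mixed = [d for d in results if d not in always_pass and d not in always_fail]

    if mixed:
        return "intermittent"        # Same device, same commit, different outcomes.
    if always_pass and always_fail:
        return "device-dependent"    # Stable per device, but devices disagree.
    if always_fail:
        return "consistent-failure"  # Fails everywhere: likely a real regression.
    return "stable"


history = {
    "Pixel 8": [True, True, True],
    "Galaxy S23": [True, True, True],
    "Galaxy A14": [False, False, False],
    "Moto G": [True, True, True],
}
print(classify_across_devices(history))  # device-dependent
```

A "device-dependent" result is a prompt to look at that device's screen size, OS version, or speed, not a verdict that the failure is harmless.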

On r/QualityAssurance, a tester recommended a pragmatic first step: "Disable specific tests until you have time to fix and stabilize them." Quarantining device-dependent flakes keeps your CI signal clean while you investigate.

On r/devops, another commenter pointed to the test pyramid as the structural fix: "The textbook solution is to have majority tests as unit test, maybe 20% of tests should be integration tests and lastly perhaps 5-10% system level tests." Fewer E2E tests means fewer opportunities for device-specific flakes to block your pipeline.

How does CI environment drift cause flaky results?

Environment drift is the fourth root cause. A test passes locally, fails in CI, and nobody can reproduce it.

What causes it:

Local vs CI resource differences. Your laptop has 16GB RAM and an SSD. The CI runner has 4GB RAM and shared CPU. Tests that depend on fast rendering or quick network responses fail when resources are constrained.
Stale caches and dependencies. The CI environment installs a different version of a dependency than your local machine. The app behaves differently. The test catches the difference, but the failure gets written off as flaky instead of being investigated as a real issue.
Network variability. Tests that call real APIs depend on network speed and API availability. A 500ms API delay on CI that doesn't happen locally causes a timeout. The test fails.
Parallel test interference. Two tests running in parallel modify the same test data. One test reads stale data and fails.

How to fix it:

Pin all dependency versions in CI. No floating versions, no "latest" tags.
Use test data isolation. Each test creates and tears down its own data. No shared state between tests.
Mock external APIs in CI (or use a stable staging environment with predictable response times).
Match CI resource allocation to your test suite's needs. If tests need 8GB RAM, don't run them on a 4GB runner.
Track flaky tests separately from real failures. Slack built an automated flaky test detection system that dropped false failures from 80% to under 4%.
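
Test data isolation is the fix on this list that maps most directly to code. The sketch below gives each test its own uniquely named record and always tears it down. FAKE_BACKEND, isolated_user, and the cart example are stand-ins invented for illustration; in a real suite the setup and teardown would call your own API or database.

```python
import uuid
from contextlib import contextmanager
from typing import Dict, Iterator

# Stand-in for a shared backend (database, staging API) that tests write to.
FAKE_BACKEND: Dict[str, dict] = {}


@contextmanager
def isolated_user(prefix: str = "test-user") -> Iterator[str]:
    """Create a uniquely named user for one test and remove it afterwards."""
    user_id = f"{prefix}-{uuid.uuid4().hex[:8]}"
    FAKE_BACKEND[user_id] = {"cart": []}
    try:
        yield user_id
    finally:
        # Teardown runs even if the test body raises.
        FAKE_BACKEND.pop(user_id, None)


def test_add_to_cart() -> None:
    with isolated_user() as user_id:
        FAKE_BACKEND[user_id]["cart"].append("sku-123")
        assert FAKE_BACKEND[user_id]["cart"] == ["sku-123"]


test_add_to_cart()
```

Because every test works on a record no other test knows about, two tests running in parallel cannot read or overwrite each other's data.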

On r/Playwright, a commenter described the CI-specific pattern: "This comes up all the time and usually is a mix of an infrastructure (bottlenecks that get revealed when you increase workers in your CI environment) and poor test code (writing tests that are not parallel friendly or brittle)." On r/devops, another pointed to shared state: "With 'flaky' tests you will likely have some of the following - global state being used between tests that are not being accounted for correctly, such as a global logger, a global tracing provider, etc." Both tend to surface in CI, where tests run in parallel, and stay hidden when you run tests locally one at a time.

What is the selector-free approach to eliminating flakiness?

Three of the four root causes (timing, devices, environments) exist in every testing framework. The remaining one, selector fragility, is structural. It exists because selector-based frameworks couple your tests to your app's internal element identifiers.

Drizz's Vision AI removes that coupling.

How it works:

You write a test step in plain English: "Tap on Login."
The Vision AI captures a screenshot of the current screen.
It identifies the login button visually: by appearance, label, position, and context.
It executes the tap. No XPath generated. No accessibility ID queried. No selector stored.
When the UI changes (button moves, text changes, layout shifts), the AI re-reads the screen and finds the element again. There is no stored selector to break.

What this fixes:

Selector fragility (the 28% category) drops to near zero. There's no selector to become fragile.
Refactoring no longer breaks tests. Developers can rename components, restructure views, and change IDs without notifying QA.
Cross-platform tests work from one suite. The Vision AI reads the Android screen and the iOS screen the same way. No platform-specific selectors.

What this doesn't fix:

Timing issues still exist. Drizz uses adaptive wait logic (state detection instead of static timers), which reduces timing flakes, but async-heavy apps can still have edge cases.
Device inconsistency still exists. A button hidden behind a fold on a small screen is a real problem regardless of detection method. Drizz's real-device execution helps, but device fragmentation is a testing reality, not a framework problem.
Environment drift still exists. CI resource constraints affect Drizz the same way they affect any tool.

The numbers: teams using Vision AI report a 90%+ reduction in flaky test failures. The reduction comes primarily from eliminating selector fragility and reducing timing flakes through adaptive waits. The remaining flakiness is device- and environment-related, which requires infrastructure fixes, not framework changes.

For teams where selector maintenance eats sprint time, removing the selector layer is the highest-leverage fix. For teams where timing or environment drift is the dominant problem, the fixes in sections 1 and 4 of this guide apply regardless of which framework you use.

The pattern across every Reddit thread on flaky tests is the same: teams retry instead of investigating, patch instead of fixing, and add sleep commands instead of understanding what the test should wait for. The four root causes listed above are what the investigation should target.

FAQ

What percentage of flaky tests are caused by selectors?

About 28%, based on QA Wolf's analysis of production test suite failures. The larger category is async timing at roughly 45%.

Does Vision AI eliminate all flaky tests?

No. It eliminates selector fragility. Timing, device, and environment issues still need separate fixes. Teams report a 90%+ overall reduction because selector flakes compound with other causes.

Should I use Thread.sleep() to fix timing issues?

No. Use explicit, condition-based waits. Sleep commands waste time when the app is fast and cause failures when the app is slow.

How do I know if a test failure is flaky or a real bug?

Re-run it on the same commit without code changes. Passes on retry = flaky. Fails consistently = real regression. Track both categories separately.
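
That rule of thumb fits in a few lines. The sketch below assumes you have the outcomes of several runs of one test on a single commit. The function names and labels are illustrative, not part of any CI tool.

```python
from typing import List


def classify_reruns(outcomes: List[bool]) -> str:
    """Classify a test from repeated runs on one commit (True = pass)."""
    if not outcomes:
        raise ValueError("need at least one run")
    if all(outcomes):
        return "passing"
    if not any(outcomes):
        return "real-failure"  # Fails every time: treat as a regression.
    return "flaky"             # Mixed results with no code change.


def flake_rate(outcomes: List[bool]) -> float:
    """Fraction of runs that failed."""
    return outcomes.count(False) / len(outcomes)


runs = [False, True, True, True]
print(classify_reruns(runs))  # flaky
print(flake_rate(runs))       # 0.25
```

Storing the flake rate per test over time gives you the data to decide which flaky tests to fix first.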

What's the fastest way to reduce flaky tests right now?

Quarantine known flaky tests so they don't block deploys. Gather execution data over 1-2 sprints. Then fix or delete them. This stops the bleeding while you address root causes.

Does Drizz handle timing issues too?

Yes. Drizz uses adaptive wait logic that detects the screen state before executing the next step, instead of static timers. It doesn't eliminate all timing issues, but it handles the common cases without explicit wait commands.

About the Author:

Asad Abrar

Co-founder & CEO, Drizz

Ex-Coinbase PM and IIT Kharagpur grad killing flaky mobile tests by day, and obsessing over F1 lap timings by night.