If a Playwright test only fails in CI, it is tempting to treat the pipeline as the problem. Sometimes CI is the first place where an underlying issue becomes visible, but most flaky tests are already unstable locally; they just have enough timing slack, warm caches, or manual attention to look healthy. The real goal is not to make CI more forgiving. It is to stop flaky Playwright tests before they reach CI at all.

I approach flake as a debugging problem, not a retry problem. Retries can hide signal, mask genuine defects, and make the suite feel healthier than it is. A better workflow is to isolate the failure mode, classify the source of instability, and fix the test or product behavior with the right level of precision.

What actually makes a Playwright test flaky

A flaky test is one that passes and fails under the same code, with no meaningful change in the feature under test. In Playwright, flakes usually come from one of a few buckets:

  • timing assumptions, such as acting on an element before the UI is ready
  • brittle locators, especially text-based or structure-based selectors that shift with minor DOM changes
  • asynchronous state changes, such as network requests, animations, or client-side rendering
  • environment dependence, including viewport, CPU contention, browser differences, and test data collisions
  • shared state, where one test leaks storage, cookies, or backend data into another

If a test needs luck to pass, it is not a stable test. It is a probabilistic signal.

Before you start tuning retries, ask one question: is the failure in the test, the product, or the environment? Flaky test debugging is mostly about answering that question quickly and with evidence.

Start with a reproducible failure loop

The fastest path to stable tests is to make a flaky failure repeatable outside CI. If you cannot reproduce it locally, you are debugging a ghost. Reproduction does not need to be perfect, but it should be systematic.

A good loop looks like this:

  1. run the suspect test in isolation
  2. run it repeatedly
  3. vary the execution context one axis at a time
  4. collect artifacts when it fails

Playwright makes this easier because it provides traces, screenshots, and videos when enabled. You can also run a single spec or a single test case repeatedly.

```bash
npx playwright test tests/login.spec.ts --grep "should sign in" --repeat-each=20
```

If a test fails only once every 20 runs, that is still a stability problem. The failure rate across repeated runs gives you a number you can measure and track, without pretending the issue is random.
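As a rough guide to how many repetitions are enough, the arithmetic can be sketched as below. It assumes every run fails independently with the same probability, which real flakes only approximate, and the function names are illustrative rather than part of Playwright.

```typescript
// Probability that `runs` consecutive runs all pass, assuming each run
// fails independently with probability `failureRate` (an idealization).
function probabilityAllPass(failureRate: number, runs: number): number {
  return Math.pow(1 - failureRate, runs);
}

// Smallest number of runs needed to see at least one failure with the
// given confidence, under the same independence assumption.
function runsNeeded(failureRate: number, confidence: number): number {
  if (failureRate <= 0 || failureRate >= 1) {
    throw new RangeError('failureRate must be between 0 and 1 (exclusive)');
  }
  if (confidence <= 0 || confidence >= 1) {
    throw new RangeError('confidence must be between 0 and 1 (exclusive)');
  }
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - failureRate));
}

// A 1-in-20 flake survives 20 repeats roughly a third of the time.
console.log(probabilityAllPass(0.05, 20).toFixed(2)); // prints "0.36"
// About 59 repeats are needed to catch it with 95% confidence.
console.log(runsNeeded(0.05, 0.95)); // prints 59
```

In other words, a clean run of 20 repeats is weak evidence that a rare flake is gone; raising the repeat count for a suspect test is cheap compared with rediscovering the flake in CI.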

When I see intermittent failures, I usually try a matrix of small variations:

  • headed and headless
  • Chromium, Firefox, and WebKit if the app supports them
  • local laptop versus CI-like container
  • normal and throttled CPU
  • fresh user context versus reused auth state

The point is not to test every possible combination. The point is to discover whether the flake is browser-specific, timing-related, or data-related.
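One way to make the browser axis of that matrix cheap to vary is to define one project per engine. The sketch below assumes the file is `playwright.config.ts` and uses Playwright's `defineConfig` and built-in `devices` presets; the project names are arbitrary, and in a real suite these entries would be merged into the existing config.

```typescript
// playwright.config.ts (sketch): one project per browser engine.
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
});
```

With that in place, a single axis can be changed per run, for example `npx playwright test tests/login.spec.ts --project=firefox --repeat-each=20`, adding `--headed` to compare headed and headless behavior.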

Use Playwright retries carefully, and mostly as a diagnostic tool

Playwright retries are useful, but they should not be your first line of defense against flake. Retries are best used to surface patterns, not to excuse instability. A test that needs a retry every now and then is telling you something about synchronization or state management.

In playwright.config.ts, retries can be enabled for the whole suite or by project. During debugging, keep them low and observable.

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: 1,
  use: {
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
});
```

That configuration is not a solution by itself. It is a way to capture evidence (a trace recorded on the first retry, plus a screenshot and video for failing tests) without turning the entire run into a noisy retry loop.

I like to separate three cases:

  • the test passes on retry, which often indicates timing or readiness issues
  • the test fails consistently, which usually indicates a real defect or a broken locator
  • the test alternates between different failures, which often indicates test pollution or an unstable environment

That last case is especially important. If one run fails on a missing button and the next fails on a network timeout, you may have more than one issue.
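The three cases can be sketched as a small classifier over the attempts of a single test. The `Attempt` shape, the verdict names, and `classifyAttempts` are hypothetical; the attempt records would come from wherever you collect results, for example a custom reporter.

```typescript
// Hypothetical record of one attempt at a single test.
type Attempt = { status: 'passed' | 'failed'; error?: string };

type Verdict =
  | 'stable-pass'
  | 'passes-on-retry'     // often timing or readiness
  | 'consistent-failure'  // often a real defect or a broken locator
  | 'varying-failures';   // often test pollution or an unstable environment

function classifyAttempts(attempts: Attempt[]): Verdict {
  if (attempts.length === 0) throw new Error('no attempts recorded');
  const failures = attempts.filter(a => a.status === 'failed');
  if (failures.length === 0) return 'stable-pass';
  // Different error messages across failures suggest more than one issue.
  const distinctErrors = new Set(failures.map(f => f.error ?? 'unknown'));
  if (distinctErrors.size > 1) return 'varying-failures';
  // At least one attempt passed, and every failure looked the same.
  if (failures.length < attempts.length) return 'passes-on-retry';
  return 'consistent-failure';
}
```

Comparing raw error messages is deliberately crude; in practice you would probably normalize them first (strip timings and generated ids) so that one failure mode does not look like several.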

Check the locator before you check the wait

A lot of flakiness blamed on timing is actually locator brittleness. If your selector points at something that changes shape, text, or position, the test will fail when the product evolves. Playwright’s locator model helps, but only if you use it intentionally.

Prefer selectors that reflect user intent and app semantics:

```typescript
await page.getByRole('button', { name: 'Save changes' }).click();
await expect(page.getByTestId('toast-success')).toBeVisible();
```

Avoid selectors that depend on incidental structure:

```typescript
await page.locator('div > div:nth-child(2) > button').click();
```

A good locator is stable, unique, and easy to explain. If you cannot describe why a selector should survive a minor UI refactor, it probably will not.

When I debug a flaky locator, I ask:

  • does it target a user-visible role, label, or test id?
  • could localization change the text?
  • could multiple matching elements exist after a rendering update?
  • is the element inside a shadow root, iframe, or virtualized list?

That last point matters. Virtualized lists and repeated components can create selectors that look correct but point to the wrong instance when the DOM changes under load.
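For repeated components, one approach is to scope the locator to the row you mean before looking for the control inside it. The sketch below declares minimal `PageLike` and `LocatorLike` interfaces so it stays self-contained; they mirror the `getByRole` and `filter({ hasText })` methods of Playwright's `Page` and `Locator`, which a real suite would import from `@playwright/test` instead. The helper name `rowAction` and the row and button labels are assumptions.

```typescript
// Minimal stand-ins for Playwright's Locator and Page (illustration only).
interface LocatorLike {
  filter(options: { hasText: string }): LocatorLike;
  getByRole(role: string, options?: { name?: string }): LocatorLike;
}
interface PageLike {
  getByRole(role: string, options?: { name?: string }): LocatorLike;
}

// Find a button inside the one row that contains `rowText`, instead of
// relying on the row's position in the DOM.
function rowAction(page: PageLike, rowText: string, actionName: string): LocatorLike {
  return page
    .getByRole('row')
    .filter({ hasText: rowText })
    .getByRole('button', { name: actionName });
}

// Usage inside a test (assumes a Playwright `page`):
//   await rowAction(page, 'Invoice 42', 'Delete').click();
```

Scoping by content keeps the locator pointing at the same logical item when the list re-renders or reorders, although it still depends on the row text being unique.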

Replace fixed sleeps with state-based waits

If your test still uses waitForTimeout, that is often the first place to look. Fixed sleeps are simple, but they are also the easiest way to encode timing assumptions into your suite. They can hide race conditions on a fast laptop and still fail in CI.

Instead of waiting for time to pass, wait for a state that proves the app is ready.

```typescript
await page.goto('/dashboard');
await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
await expect(page.getByTestId('loading-spinner')).toBeHidden();
```

Good waits are tied to observable behavior, not internal implementation details. Sometimes the correct wait is a UI condition. Sometimes it is a network response. Sometimes it is a specific value appearing in the DOM.

Playwright also has built-in auto-waiting for many actions, which helps, but it is not magic. Auto-waiting handles readiness for the specific action it performs. It does not solve business logic timing, background fetches, or animations that block the target state.

Inspect network behavior when UI timing looks suspicious

Many UI flakes are really network or data availability issues in disguise. A test that clicks a button and expects a result might be failing because the API response is slow, because the request is duplicated, or because the page renders a transient error state before recovering.

When I suspect that, I inspect network calls directly.

```typescript
page.on('response', response => {
  if (response.url().includes('/api/profile')) {
    console.log(response.status());
  }
});
```

This is not something to leave in the suite permanently, but it is useful during diagnosis. If the API occasionally returns 204, 500, or a delayed 200, the test failure might be a symptom of backend instability rather than a bad assertion.
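When a single log line per response is too noisy across repeated runs, a small tally can make the pattern easier to see. This is a diagnostic sketch; `createStatusTally` and `ResponseLike` are hypothetical names, and the interface only mirrors the `url()` and `status()` methods of Playwright's `Response`.

```typescript
// Minimal shape of a response; Playwright's Response has these methods.
interface ResponseLike {
  url(): string;
  status(): number;
}

// Counts status codes for responses whose URL contains `fragment`.
function createStatusTally(fragment: string) {
  const counts = new Map<number, number>();
  return {
    record(response: ResponseLike): void {
      if (!response.url().includes(fragment)) return;
      const status = response.status();
      counts.set(status, (counts.get(status) ?? 0) + 1);
    },
    summary(): Record<string, number> {
      const out: Record<string, number> = {};
      counts.forEach((count, status) => {
        out[String(status)] = count;
      });
      return out;
    },
  };
}

// Usage inside a test (assumes a Playwright `page`):
//   const tally = createStatusTally('/api/profile');
//   page.on('response', tally.record);
//   ... exercise the page ...
//   console.log(tally.summary());
```

A summary that shows even one unexpected status across many runs is a strong hint that the instability sits behind the UI rather than in the test.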

You can also wait for a specific response when the action directly depends on it.

```typescript
await Promise.all([
  page.waitForResponse(res => res.url().includes('/api/save') && res.ok()),
  page.getByRole('button', { name: 'Save' }).click()
]);
```

That pattern reduces race conditions between the click and the network event. It also makes the test’s dependency explicit, which helps during future maintenance.
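A substring match on the URL can also match neighbouring endpoints, such as a hypothetical `/api/save-draft`. A stricter predicate might compare the exact path and the HTTP method, as in the sketch below. `isSuccessfulCall` and `ApiResponseLike` are illustrative names, the interface mirrors `url()`, `ok()`, and `request().method()` on Playwright's `Response`, and the assumption that saving uses `POST` is mine.

```typescript
// Minimal shape of a response; Playwright's Response has these methods.
interface ApiResponseLike {
  url(): string;
  ok(): boolean;
  request(): { method(): string };
}

// Builds a predicate that accepts only a successful response for the exact
// method and path, ignoring the query string.
function isSuccessfulCall(method: string, pathname: string) {
  return (response: ApiResponseLike): boolean =>
    response.ok() &&
    response.request().method() === method &&
    new URL(response.url()).pathname === pathname;
}

// Usage inside a test (assumes a Playwright `page`):
//   const saved = page.waitForResponse(isSuccessfulCall('POST', '/api/save'));
//   await page.getByRole('button', { name: 'Save' }).click();
//   await saved;
```

Creating the wait before the click and awaiting it afterwards has the same effect as the `Promise.all` form, and some teams find it easier to read.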

Use traces as a debugging artifact, not just a failure souvenir

Playwright trace files are one of the best tools for flaky test debugging because they capture action order, DOM snapshots, console logs, network requests, and timing details. The mistake is to treat traces as something to open only after a failure. In practice, you want traces to help you compare a failing run with a passing run.

When a flaky test fails locally, I usually inspect:

  • the exact action that failed
  • whether the element existed but was not actionable
  • whether the DOM changed between the last successful step and the failure
  • whether there were console errors or network failures before the assertion

A common pattern is this: the test clicks a button, the page starts navigation, and the next assertion runs before the new view stabilizes. In a trace, you can often see the click succeed but the post-click state still in transition.

That tells you the fix is not to add a random wait. It is to wait on the navigation outcome, a route change, or a specific post-action marker.

Distinguish product bugs from test bugs

Not every flaky test should be fixed in the test. Sometimes the test is exposing a real product issue, such as duplicate requests, state drift, or unstable rendering.

Examples of product problems that often show up as flaky tests:

  • a save button can be clicked twice before it disables
  • a loading state disappears before data is fully rendered
  • a toast is shown and removed before the assertion can observe it
  • an item order changes because the backend returns unsorted data

In those cases, the test is doing its job. The product behavior is not deterministic enough to support reliable automation.

A helpful rule is this: if a user would experience the same ambiguity, the test is probably not the problem. If the test is depending on an internal race that users do not care about, then the test needs to be rewritten to follow real user-observable state.

Eliminate shared state and test data collisions

CI test instability often comes from tests stepping on each other. Parallel execution speeds things up, but it also makes hidden data coupling much more visible.

Common sources of shared-state flake include:

  • reusing the same account across tests
  • creating the same resource name in multiple parallel workers
  • depending on a record that another test deletes or mutates
  • leaving local storage, session storage, or cookies behind

The fix is to make the test data unique, isolated, and disposable. A simple pattern is to namespace resources by worker and run.

```typescript
const id = `${test.info().workerIndex}-${Date.now()}`;
const email = `user-${id}@example.com`;
```

For backend-dependent tests, make sure your setup and teardown are deterministic. If your suite creates users, projects, or orders, clean them up in a way that tolerates partial failures. A bad cleanup step can create the next flake.

Parallelism deserves special attention. A test that passes reliably with one worker may fail with four because two cases are using the same server-side fixture or the same browser storage state. If a flaky test only appears under parallel load, look at isolation before looking at timing.
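A timestamp alone can repeat when one worker creates several resources within the same millisecond, so a slightly more defensive sketch of the namespacing idea adds a random suffix. The helper names `uniqueTestId` and `uniqueEmail` are hypothetical; `randomUUID` comes from Node's built-in `crypto` module.

```typescript
import { randomUUID } from 'node:crypto';

// Builds an id that is unique per worker, per moment, and per call.
function uniqueTestId(workerIndex: number): string {
  const randomPart = randomUUID().slice(0, 8); // first 8 hex characters
  return `w${workerIndex}-${Date.now()}-${randomPart}`;
}

// Derives a disposable email address from the unique id.
function uniqueEmail(workerIndex: number): string {
  return `user-${uniqueTestId(workerIndex)}@example.com`;
}

// Usage inside a test (assumes `test` from '@playwright/test'):
//   const email = uniqueEmail(test.info().workerIndex);
```

Keeping the worker index in the id also makes leftover records easier to trace back to the worker that created them when cleanup fails.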

Make your assertions less brittle

Assertions can be flaky even when the app is fine. A test that checks for exact text, exact count, or immediate visibility is often too strict for asynchronous interfaces.

Prefer assertions that reflect user-facing outcomes and tolerate legitimate transition states.

```typescript
await expect(page.getByText('Profile saved')).toBeVisible();
await expect(page.getByRole('button', { name: 'Save' })).toBeDisabled();
```

Avoid asserting on unstable details like transient CSS classes, DOM order, or text that is likely to change due to localization or content updates.

If the UI is intentionally dynamic, assert on the stable invariant. For example, instead of checking that a list contains exactly 10 items at a moment in time, check that the item you created eventually appears and remains visible after the relevant workflow completes.

The best assertion is often the one that proves a user outcome, not the one that mirrors the implementation.

Debug locally with CI-like conditions

Many teams say, “It only fails in CI,” then debug on a powerful laptop with warm dependencies, a persistent browser cache, and a different viewport. That comparison is not fair.

To reduce CI test instability, make local runs look more like CI. That means:

  • run headless when CI runs headless
  • use the same browser channel and version where possible
  • clear storage and caches between runs
  • use the same viewport and timezone
  • run inside a container if your pipeline does

You do not need to match every detail, but you do need to remove the biggest mismatches.

If your CI job uses Docker, keep the local debug container close to that setup. If the app behaves differently under low CPU or in a smaller viewport, reproduce that locally before you guess at a fix.
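Several of those mismatches can be pinned in configuration so that local and CI runs share the same defaults. The values below (viewport size, timezone, locale) are placeholders; the point is to copy whatever your pipeline actually uses, and in a real suite these options would be merged into the existing `playwright.config.ts`.

```typescript
// playwright.config.ts (sketch): pin the settings that most often differ
// between a laptop and a CI runner. All values here are assumptions.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    headless: true,                         // match a headless CI run
    viewport: { width: 1280, height: 720 }, // same viewport everywhere
    timezoneId: 'UTC',                      // avoid date and time drift
    locale: 'en-US',                        // avoid locale-dependent text
  },
});
```

Each Playwright test already gets a fresh browser context by default, so storage and cookies usually carry over only when the suite reuses saved auth state or a persistent profile.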

A practical triage workflow I use for flake

When a Playwright test starts acting unstable, I follow a simple order:

1. Reproduce the failure repeatedly

Run the test in isolation, in a loop, with traces enabled.

2. Check whether the failure is deterministic

If it fails every time, it is probably not flake. Treat it like a real defect or broken test.

3. Inspect the locator and assertion

Look for brittle selectors, exact text matches, and immediate expectations after asynchronous actions.

4. Check product state transitions

Confirm whether the app is actually ready when the assertion runs.

5. Look for data coupling

Verify that the test does not depend on state created by another test, previous run, or shared account.

6. Compare passing and failing traces

The difference between a clean run and a flaky run is often visible in a few frames.

7. Fix the smallest real cause

Do not stack retries, waits, and workarounds. Remove the specific instability source.

This workflow usually finds the issue faster than jumping straight into framework configuration.

What I change in the suite after fixing the bug

After a flaky test is fixed, I do not just merge the patch and move on. I harden the suite so the same class of bug does not return.

That can mean:

  • replacing text selectors with roles or test ids
  • removing fixed delays
  • using explicit waits for responses or visible state
  • isolating test data per worker
  • reducing hidden dependencies between tests
  • enabling trace artifacts for failures in CI

I also watch for overcorrection. It is easy to make a test more stable by making it less valuable. For example, adding too much mocking can hide integration failures. Adding too many retries can disguise the difference between a timing bug and a true product regression.

The goal is a suite that is both stable and honest.

A small example of a more reliable Playwright test

Here is a simple before-and-after shape that shows the difference between a brittle test and a more disciplined one.

```typescript
// brittle
await page.goto('/settings');
await page.waitForTimeout(2000);
await page.locator('button').nth(0).click();
await expect(page.locator('.toast')).toContainText('Saved');
```

A more stable version:

```typescript
await page.goto('/settings');
await expect(page.getByRole('heading', { name: 'Settings' })).toBeVisible();

await Promise.all([
  page.waitForResponse(res => res.url().includes('/api/settings') && res.ok()),
  page.getByRole('button', { name: 'Save changes' }).click(),
]);

await expect(page.getByTestId('toast-success')).toHaveText('Saved');
```

The second version is not just cleaner; it encodes the actual contract of the interaction. It waits for the page to be ready, ties the click to the network event, and checks a stable success marker.

When to keep a retry, and when to delete the test

Sometimes the right answer is not to repair the test, but to retire it.

Keep the test if it still validates a meaningful user flow and can be stabilized with a clear fix. Delete or rewrite it if:

  • it duplicates coverage from another test without adding value
  • it depends on an unstable third-party integration that you do not control
  • it is asserting implementation details instead of outcomes
  • it costs more maintenance than the risk it covers

Retries should not be used to keep low-value tests alive. If a test is both flaky and weak, it is usually a good candidate for deletion or redesign.

Conclusion

To stop flaky Playwright tests before they reach CI, think like a debugger, not a scheduler. Make the failure repeatable, inspect the locator, watch the network, compare traces, and remove the hidden assumptions about timing and state. CI will always be less forgiving than a developer laptop, but that is not the real problem. The real problem is any test that only works when the environment is convenient.

If you tighten your selectors, wait on real state, isolate data, and treat retries as evidence rather than a fix, your suite gets easier to trust. That matters because a reliable test suite does more than reduce noise. It protects your team from false confidence and makes real regressions easier to catch before they spread through the pipeline.
