Flaky Playwright tests are one of those problems that can waste an entire morning, especially when the test passes locally, fails in CI, and then passes again on rerun. The temptation is to blame the framework or start rewriting the whole suite, but that usually hides the real issue. In most cases, the failure is giving you clues, just not in the first place you looked.

When I debug flaky Playwright tests, I try to answer three questions in order: what happened, when did it happen, and what changed between passing and failing runs. Playwright gives us good tools for this, especially the trace viewer, but the trace alone is not the whole story. You also need logs, screenshots, network context, and a disciplined look at timing.

This article walks through a practical workflow I use when a Playwright test starts behaving inconsistently. The goal is not to make every test perfect by force. The goal is to narrow the failure mode quickly, so you can fix the root cause instead of masking it with retries and longer sleeps.

First, classify the flake before changing anything

Not every flaky failure is caused by the same thing. Before you edit selectors or add waits, classify the failure mode.

Common categories of Playwright flakiness

  • Timing issues: the app was not ready when the test acted
  • Selector instability: the locator points to a changing element
  • State leakage: data from a previous test affected the current one
  • Environment drift: CI is slower, different, or less stable than local
  • Network dependence: a request or backend response is delayed or inconsistent
  • Animation or transition interference: the element exists but is not ready to interact with

A flaky test is usually a deterministic bug with nondeterministic timing.

That framing matters because it changes how you debug. If you assume every flake is a locator issue, you will overfit your fix to the symptom. If you assume it is always a wait issue, you may miss state contamination or backend instability.

Reproduce the failure with the same conditions as CI

If the test only fails in CI, do not start by tuning it locally under ideal conditions. Try to reproduce the actual environment first.

What I try to match

  • Browser version and channel
  • Headless versus headed mode
  • CPU and memory constraints
  • Parallelism level
  • Test ordering
  • Environment variables and base URL
  • Network latency, where possible

A test run in local debug mode (headed, paused between steps, one worker) does not behave like the same test in CI. That difference is often enough to hide a race condition. If the failure only shows up in CI, treat the CI job as the source of truth for reproduction.
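One way to match several of those conditions is a config switch. The following is a minimal sketch, not a recommended setup: the CI_LIKE environment variable is a hypothetical name, and the worker count of 2 is an assumption you should replace with whatever your CI job actually uses.

```typescript
// playwright.config.ts: a sketch for reproducing CI-like conditions locally.
import { defineConfig, devices } from '@playwright/test';

// Hypothetical switch; set it yourself when you want CI-like behavior.
const ciLike = !!process.env.CI_LIKE;

export default defineConfig({
  // Assumed CI worker count. Leaving it undefined keeps Playwright's default.
  workers: ciLike ? 2 : undefined,
  use: {
    // CI jobs typically run headless, so do the same here.
    headless: true,
    // Same base URL variable the CI job is assumed to read.
    baseURL: process.env.BASE_URL,
  },
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
  ],
});
```

Running the failing spec with CI_LIKE=1 set then narrows the gap between your machine and the CI job without changing the default local experience.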

A useful pattern is to run the failing spec repeatedly in a loop.

for i in {1..20}; do npx playwright test tests/login.spec.ts --project=chromium || break; done

If the failure appears only once in ten or twenty runs, you are likely dealing with a timing or ordering problem. If it fails consistently under load or only in a specific browser, the issue is more deterministic than it first looked.
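If you record the outcome of every run instead of stopping at the first failure, the interpretation above can be made explicit. This is a small sketch with hypothetical names (summarizeRuns, FlakeVerdict); it only tallies results you collected yourself and does not call Playwright.

```typescript
type FlakeVerdict = 'stable' | 'intermittent' | 'consistent';

interface RunSummary {
  runs: number;
  failures: number;
  failureRate: number;
  verdict: FlakeVerdict;
}

// `results` holds one boolean per run: true for a pass, false for a failure.
function summarizeRuns(results: boolean[]): RunSummary {
  const runs = results.length;
  const failures = results.filter(passed => !passed).length;
  const failureRate = runs === 0 ? 0 : failures / runs;

  let verdict: FlakeVerdict = 'intermittent';
  if (failures === 0) {
    verdict = 'stable';      // never failed in this sample
  } else if (failures === runs) {
    verdict = 'consistent';  // failed every time: more deterministic than it looked
  }
  return { runs, failures, failureRate, verdict };
}
```

An "intermittent" verdict points toward timing or ordering, while "consistent" under a specific browser or load level suggests a deterministic bug.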

Turn on traces for the failing test only

The Playwright trace viewer is the fastest way to understand what the browser actually did. It gives you a timeline of actions, snapshots, network events, console messages, and DOM state. The trick is to collect it in a way that does not slow down your whole suite unnecessarily.

A practical trace setup

In Playwright config, I usually enable traces on first retry when debugging flaky tests:

import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: 1, // 'on-first-retry' records nothing unless retries are enabled
  use: {
    trace: 'on-first-retry',
  },
});

This keeps normal runs fast, but preserves enough evidence when a test fails and retries. If a test is especially unstable, I will temporarily switch to always collecting traces for that spec while investigating.
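For the temporary per-spec override, one option is a file-level test.use call. This is a sketch under the assumption that the override sits at the top level of the unstable spec file, outside any describe block, and that you remove it once the investigation is done.

```typescript
// At the top of the unstable spec file, for example tests/login.spec.ts.
import { test } from '@playwright/test';

// Temporary: record a trace for every test in this file, pass or fail.
// Remove this line once the flake is understood.
test.use({ trace: 'on' });
```

The rest of the suite keeps the cheaper on-first-retry setting from the config.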

What to inspect in trace viewer

Open the trace and look for these things first:

  • The exact action that failed
  • Whether the element was visible, attached, enabled, and stable
  • Whether the page was still navigating or rendering
  • Whether the previous step was slower than expected
  • Whether a network request was still pending
  • Whether a modal, toast, spinner, or overlay was blocking the action

Do not just replay the trace and hope the problem jumps out. Compare the failing run to a passing run if you can. The difference is often tiny, such as a click happening 200 ms too early or a rerender replacing the DOM node between locator resolution and action.

Read the trace like a timeline, not a video

The useful question is not “what did I see on screen?” It is “what state was the app in when the test acted?”

For example:

  • The button exists, but the request that enables it has not resolved yet
  • The list renders, but one item is still animating into place
  • The locator resolves, but the node is detached by a React rerender before click
  • The page loaded, but hydration is still finishing and event handlers are not wired

That kind of timing clue is exactly why trace viewer is valuable. It helps you identify whether you need a better locator, a better assertion, or a better synchronization point.

Add logs that explain state, not noise

A flaky test often fails because the browser state and test intent drift apart. Logs help bridge that gap, but only if they are specific.

Good logs answer concrete questions

Instead of logging every line of test code, log state transitions that matter:

  • Which page or route the test is on
  • Which account or test data it used
  • Which API response the test is waiting for
  • Which feature flag was active
  • Whether a retry happened
  • Which locator or assertion failed

Here is a small example using Playwright test steps and explicit logging.

import { test, expect } from '@playwright/test';

test('creates an invoice', async ({ page }, testInfo) => {
  // Record which environment this run targeted.
  console.log(`[${testInfo.title}] starting on ${process.env.BASE_URL}`);

  await test.step('open invoice page', async () => {
    await page.goto('/invoices/new');
  });

  await test.step('submit form', async () => {
    await page.getByLabel('Customer').fill('Acme Inc');
    await page.getByRole('button', { name: 'Create invoice' }).click();
  });

  await expect(page.getByText('Invoice created')).toBeVisible();
});

The test.step boundaries make traces and logs easier to correlate. If a failure happens during form submission, you immediately know where to focus.

Capture browser-side logs too

Some flakes live in the browser console, not in the test runner output. When I debug stubborn failures, I often collect console messages and page errors.

page.on('console', msg => console.log(`browser:${msg.type()}: ${msg.text()}`));
page.on('pageerror', err => console.log(`pageerror: ${err.message}`));

This is especially useful for hidden frontend errors, failed chunk loads, or warnings that correlate with a race condition. If the UI logic silently fails before the button becomes clickable, the console may be the only place that explains why.
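If you want those messages collected in one place rather than interleaved with runner output, a small helper can buffer them. The sketch below is typed against a minimal structural interface (PageLike, ConsoleMessageLike, both hypothetical names) that covers only the two events used here; a real Playwright page is expected to fit that shape.

```typescript
// Minimal shapes of the objects this helper touches (an assumed subset).
interface ConsoleMessageLike {
  type(): string;
  text(): string;
}

interface PageLike {
  on(event: 'console', handler: (msg: ConsoleMessageLike) => void): unknown;
  on(event: 'pageerror', handler: (err: Error) => void): unknown;
}

// Subscribes to browser console output and uncaught page errors,
// and returns the array that the handlers keep appending to.
function collectBrowserLogs(page: PageLike): string[] {
  const lines: string[] = [];
  page.on('console', msg => {
    lines.push(`browser:${msg.type()}: ${msg.text()}`);
  });
  page.on('pageerror', err => {
    lines.push(`pageerror: ${err.message}`);
  });
  return lines;
}
```

In a test you would call collectBrowserLogs(page) before the first navigation, then attach the joined lines to the report with testInfo.attach and a text/plain content type when the test ends.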

Use timing clues to separate slow from unstable

Timing issues in Playwright are easy to misread. A slow test is not always a flaky test. A flaky test is often a test that assumes a fixed timeline the app does not guarantee.

Look for these timing smells

  • The test uses waitForTimeout
  • The failure disappears when rerun manually in debug mode
  • The UI is waiting on data, but the test clicks immediately after navigation
  • A request sometimes takes long enough to expose a race
  • A modal or toast closes before the assertion checks it
  • A list is virtualized, so the element is not in the DOM yet

If your test only needs more time, the trace usually shows a consistent delay. If your test is truly flaky, the delay pattern changes from run to run.
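That distinction can be checked with numbers read off the traces. Here is a rough sketch: classifyTiming is a hypothetical helper, and the 250 ms tolerance is an arbitrary assumption that you would tune to your app.

```typescript
interface TimingVerdict {
  min: number;
  max: number;
  spread: number;
  kind: 'consistent' | 'variable';
}

// durationsMs: how long the same step took in each run, in milliseconds.
// toleranceMs: assumed cutoff for "roughly the same delay every time".
function classifyTiming(durationsMs: number[], toleranceMs = 250): TimingVerdict {
  if (durationsMs.length === 0) {
    throw new Error('need at least one duration');
  }
  const min = Math.min(...durationsMs);
  const max = Math.max(...durationsMs);
  const spread = max - min;
  return {
    min,
    max,
    spread,
    // A small spread means slow but predictable; a large one means the timeline moves.
    kind: spread <= toleranceMs ? 'consistent' : 'variable',
  };
}
```

A "consistent" result suggests the test simply needs to wait for the right signal. A "variable" result suggests a race that a longer wait will not remove.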

Measure the real wait points

When a test interacts with the app too early, I want to know which event the test should have waited for instead. That might be:

  • load or domcontentloaded on navigation
  • A specific API response
  • A UI state like a spinner disappearing
  • A button becoming enabled
  • An element entering the viewport

Example:

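// Start listening before navigating, otherwise a fast response can be missed.
const cartResponse = page.waitForResponse(resp => resp.url().includes('/cart') && resp.ok());
await page.goto('/checkout');
await cartResponse;
await expect(page.getByRole('button', { name: 'Place order' })).toBeEnabled();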

This is better than waiting an arbitrary 2 seconds, because it expresses the actual contract the test depends on.

Check locators for stability, not just uniqueness

Many teams hear “use better locators” and stop there. Uniqueness is necessary, but not sufficient. A locator must also be stable across UI changes, rerenders, and user states.

Safer locator patterns

  • getByRole with accessible names
  • getByLabel for form fields
  • getByTestId when semantics are not enough
  • Scoped locators within a stable container

Less stable patterns

  • Deep CSS chains
  • Text that changes based on data or localization
  • Positional selectors like nth() without context
  • Matching on transient toast text or loading states

A locator can be unique and still flaky if the element it resolves to is not the right semantic target. For example, clicking a duplicated button in a responsive layout may pass on desktop but fail on smaller viewports if the DOM changes.

Stable locator choice reduces flakiness, but it does not fix bad synchronization by itself.

If the trace shows the locator resolved correctly but the action still failed, the issue may be timing, visibility, or overlay interference rather than selector quality.

Inspect the actionability checks

One advantage of Playwright is that it performs actionability checks before interacting with elements. That helps prevent obvious failures, but it can also reveal what state the app was in when the action failed.

If a click times out, ask whether the element was:

  • Visible
  • Stable
  • Receiving events (not covered by another element)
  • Enabled
  • In view

A failure on receives events often means an overlay, tooltip, spinner, or sticky header is blocking the element. A failure on stable often points to animation or rerender churn. A failure on enabled usually means the app had not finished a prerequisite state change.

This is one reason not to overuse force-clicks. If you click through actionability problems, you may make the test pass while hiding a real user-facing issue.
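As an alternative to forcing the click while investigating, Playwright's click accepts a trial option that runs the actionability checks without performing the action. The helper below is a sketch with hypothetical names (canClick, ClickableLike), typed against a minimal interface rather than the full Locator type; the 2000 ms default timeout is an assumption.

```typescript
// The small part of a locator that this helper relies on.
interface ClickableLike {
  click(options?: { trial?: boolean; timeout?: number }): Promise<void>;
}

// Resolves to true if the element passes the actionability checks within
// the timeout, and false if the trial click is rejected.
async function canClick(target: ClickableLike, timeoutMs = 2000): Promise<boolean> {
  try {
    // trial: true performs the checks only; nothing is actually clicked.
    await target.click({ trial: true, timeout: timeoutMs });
    return true;
  } catch {
    return false;
  }
}
```

Logging the result just before the real click tells you whether the element was actionable at that moment, without hiding the problem the way a forced click would.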

Compare failing and passing traces side by side

When a flaky test passes after a rerun, save both runs. The contrast often exposes the underlying pattern faster than inspecting the failure alone.

What to compare

  • Navigation timing
  • API response timing
  • DOM snapshots before the failure
  • Any console errors or warnings
  • Whether a spinner or overlay was present
  • Whether the failure happened before or after a rerender

If the failing run shows a click happening before a response that enables the UI, you have a synchronization bug. If the passing run has a different route state or feature flag, you may have test data leakage. If the DOM differs between runs, your app may be rendering differently than the test expects.
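When you copy the ordered events from each trace into a list, finding the first point where the two runs disagree is mechanical. A minimal sketch follows; firstDivergence is a hypothetical helper, and the event labels are whatever strings you write down while reading the traces.

```typescript
interface Divergence {
  index: number;
  passing: string | undefined;
  failing: string | undefined;
}

// Returns the first position where the two event sequences differ,
// or null when they are identical. A missing event shows up as undefined.
function firstDivergence(passing: string[], failing: string[]): Divergence | null {
  const length = Math.max(passing.length, failing.length);
  for (let i = 0; i < length; i++) {
    if (passing[i] !== failing[i]) {
      return { index: i, passing: passing[i], failing: failing[i] };
    }
  }
  return null;
}
```

For example, if the passing run lists the cart response before the click and the failing run lists the click first, the divergence lands exactly on the synchronization bug.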

Reduce the problem with small experiments

Once you have a likely cause, validate it with a minimal change. Do not rewrite the whole spec yet.

Good experiments

  • Add a single assertion before the failing action
  • Wait for the specific network response the UI depends on
  • Replace a brittle locator with a role-based locator
  • Remove parallel execution for that file to test for shared state
  • Rerun with tracing and browser console capture enabled

For example, if a submit button is intermittently disabled, I would check the app state directly rather than adding a sleep:

await expect(page.getByRole('button', { name: 'Submit' })).toBeEnabled();
await page.getByRole('button', { name: 'Submit' }).click();

If that assertion itself becomes flaky, the problem is not the click. It is the state transition that makes the button enabled.

Watch for hidden shared state in CI

Some flaky tests are not timing issues at all. They are isolation issues that only show up when tests run together.

Common sources of shared state

  • Reused accounts or records
  • Database cleanup that is incomplete or eventually consistent
  • Session storage or cookies persisting between tests
  • Parallel tests touching the same resource
  • Seed data that changes during the suite

If a test fails only after other tests run, inspect the state it inherits. This is especially important in CI pipelines, where tests run in parallel and the order may vary more than you expect.

A good isolation check is to run the single failing test against a fresh environment and then run the same test after a subset of related specs. If the second case fails more often, the suite is leaking state.
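One cheap way to rule out reused accounts or records is to give each test its own data. The sketch below uses a hypothetical uniqueEmail helper and the reserved example.test domain; it assumes your backend accepts arbitrary addresses for test users.

```typescript
// Builds an address that differs per worker and per call time, so parallel
// tests do not share the same account. `now` is injectable for testing.
function uniqueEmail(prefix: string, workerIndex: number, now: number = Date.now()): string {
  // Base-36 keeps the timestamp part short while staying unique per millisecond.
  return `${prefix}+w${workerIndex}-${now.toString(36)}@example.test`;
}
```

In a test you could pass testInfo.workerIndex as the second argument. Two calls in the same worker within the same millisecond would still collide, so add a counter or random suffix if your tests create several accounts back to back.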

Use retries for signal, not as a cure

Retries can be useful, but they are not a fix. They are a detector. A retry that passes tells you the failure is nondeterministic, but it does not tell you why.

I use retries in two ways:

  1. To capture traces and logs from the failing attempt
  2. To prove whether the flake is environment-sensitive or app-sensitive

If a test only passes on retry, do not let the retry hide the problem indefinitely. Track it, debug it, and fix it. Otherwise your suite becomes a lottery with a green dashboard.
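To keep retried passes visible, you can list the tests that failed at least once before passing. This is a sketch over a hypothetical input shape (TestAttempts); you would need to map your reporter's output into it yourself.

```typescript
type AttemptStatus = 'passed' | 'failed';

interface TestAttempts {
  title: string;
  attempts: AttemptStatus[]; // in execution order, first attempt first
}

// Returns the titles of tests that needed a retry to pass:
// the last attempt passed and at least one earlier attempt failed.
function findFlaky(results: TestAttempts[]): string[] {
  return results
    .filter(result => {
      const last = result.attempts[result.attempts.length - 1];
      const failedBefore = result.attempts.some(status => status === 'failed');
      return result.attempts.length > 1 && last === 'passed' && failedBefore;
    })
    .map(result => result.title);
}
```

Feeding that list into a ticket or a dashboard gives every retried pass an owner instead of letting it vanish into a green build.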

A practical checklist for the next flaky failure

When a Playwright test flakes, I usually work through this order:

  1. Reproduce it under CI-like conditions
  2. Capture a trace on failure
  3. Inspect the failing action and surrounding timeline
  4. Add focused logs around state transitions
  5. Compare passing and failing runs
  6. Check for timing, overlay, rerender, or network delays
  7. Validate locators and actionability checks
  8. Test for shared state or ordering problems
  9. Make the smallest fix that expresses the real contract

This sequence keeps the debugging process grounded. It also prevents the common trap of piling on waits until the test becomes slower and still flaky.

When a fix is actually the wrong fix

Sometimes the first working fix is not the right one. Be careful with these patterns:

  • Replacing a locator with nth() because it passed once
  • Adding a sleep after every navigation
  • Forcing clicks through overlays
  • Disabling parallelism for the entire suite because one test is unstable
  • Adding retries without investigating the root cause

These changes can reduce noise temporarily, but they often shift the flake elsewhere. A good fix should make the test more aligned with user-visible behavior, not less.

Final thoughts

To debug flaky Playwright tests well, think like a timeline investigator, not just a test writer. Trace viewer shows you what happened in the browser, logs tell you what the test runner knew, and timing clues reveal where your assumptions do not match the app’s actual behavior.

The fastest path is usually not to rewrite the suite. It is to find the smallest point where the test is making a claim too early, too broadly, or against the wrong signal. Once you can see that clearly, the fix usually becomes obvious.

If you work with Playwright regularly, it is worth getting comfortable with the official docs and building a repeatable debugging habit. Flaky tests do not disappear by accident, but they do become manageable when you can read traces, logs, and timing as one story.