How to Debug Flaky Playwright Tests with Trace Viewer, Logs, and Timing Clues

Flaky Playwright tests are one of those problems that can waste an entire morning, especially when the test passes locally, fails in CI, and then passes again on rerun. The temptation is to blame the framework or start rewriting the whole suite, but that usually hides the real issue. In most cases, the failure is giving you clues, just not in the first place you looked.

When I debug flaky Playwright tests, I try to answer three questions in order: what happened, when did it happen, and what changed between passing and failing runs. Playwright gives us good tools for this, especially trace viewer, but the trace alone is not the whole story. You also need logs, screenshots, network context, and a disciplined look at timing.

This article walks through a practical workflow I use when a Playwright test starts behaving inconsistently. The goal is not to make every test perfect by force. The goal is to narrow the failure mode quickly, so you can fix the root cause instead of masking it with retries and longer sleeps.

First, classify the flake before changing anything

Not every flaky failure is caused by the same thing. Before you edit selectors or add waits, classify the failure mode.

Common categories of Playwright flakiness

Timing issues: the app was not ready when the test acted
Selector instability: the locator points to a changing element
State leakage: data from a previous test affected the current one
Environment drift: CI is slower, different, or less stable than local
Network dependence: a request or backend response is delayed or inconsistent
Animation or transition interference: the element exists but is not ready to interact with

A flaky test is usually a deterministic bug with nondeterministic timing.

That framing matters because it changes how you debug. If you assume every flake is a locator issue, you will overfit your fix to the symptom. If you assume it is always a wait issue, you may miss state contamination or backend instability.

Reproduce the failure with the same conditions as CI

If the test only fails in CI, do not start by tuning it locally under ideal conditions. Try to reproduce the actual environment first.

What I try to match

Browser version and channel
Headless versus headed mode
CPU and memory constraints
Parallelism level
Test ordering
Environment variables and base URL
Network latency, where possible

Playwright runs differently in local debug mode than it does in CI. That difference is often enough to hide a race condition. If the failure only shows up in continuous integration, treat the CI job as the source of truth for reproduction.
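As a rough sketch of matching those conditions, a separate local config can mirror the CI job's settings. The worker count, the project list, and the BASE_URL variable name are assumptions for illustration; copy the real values from your CI job rather than from this example.

```typescript
import { defineConfig, devices } from '@playwright/test';

// A sketch of a "CI-like" local config. Every value here is an assumption;
// replace it with whatever your CI job actually uses.
export default defineConfig({
  workers: 2,          // match the CI parallelism level (assumed value)
  fullyParallel: true, // only if CI also parallelizes tests within a file
  retries: 0,          // no retries while reproducing, so every failure stays visible
  use: {
    headless: true,                // CI normally runs headless
    baseURL: process.env.BASE_URL, // same base URL variable the CI job sets
  },
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
  ],
});
```

Saved under a hypothetical name such as playwright.ci-like.config.ts, it can be selected with the --config option of npx playwright test, which keeps the everyday local config untouched.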

A useful pattern is to run the failing spec repeatedly in a loop.

for i in {1..20}; do npx playwright test tests/login.spec.ts --project=chromium || break; done

If the failure appears only once in ten or twenty runs, you are likely dealing with a timing or ordering problem. If it fails consistently under load or only in a specific browser, the issue is more deterministic than it first looked.
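The loop above stops at the first failure, which is enough to catch a flake but not to estimate how often it happens. If you record the outcome of every run instead (for example from a run using Playwright's --repeat-each option), a small helper can summarize the pattern. This is a minimal sketch; the function and type names are made up for illustration.

```typescript
// Summarize repeated runs of one spec. `true` means the run passed.
type RunPattern = 'always-passes' | 'always-fails' | 'intermittent';

interface RunSummary {
  runs: number;
  failures: number;
  failureRate: number; // between 0 and 1
  pattern: RunPattern;
}

function summarizeRuns(results: boolean[]): RunSummary {
  if (results.length === 0) {
    throw new Error('summarizeRuns needs at least one run result');
  }
  const failures = results.filter((passed) => !passed).length;
  const failureRate = failures / results.length;
  const pattern: RunPattern =
    failures === 0
      ? 'always-passes'
      : failures === results.length
        ? 'always-fails'
        : 'intermittent';
  return { runs: results.length, failures, failureRate, pattern };
}

// Example: 1 failure in 20 runs points toward a timing or ordering problem.
const example = summarizeRuns([...Array<boolean>(19).fill(true), false]);
console.log(example.pattern, example.failureRate); // prints: intermittent 0.05
```

An "always-fails" result under a specific browser or load level is the more deterministic case described above, and usually easier to chase.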

Turn on traces for the failing test only

The Playwright trace viewer is the fastest way to understand what the browser actually did. It gives you a timeline of actions, snapshots, network events, console messages, and DOM state. The trick is to collect it in a way that does not slow down your whole suite unnecessarily.

A practical trace setup

In Playwright config, I usually enable traces on first retry when debugging flaky tests:

import { defineConfig } from '@playwright/test';

export default defineConfig({
  // 'on-first-retry' only records when a retry happens, so retries must be at least 1.
  retries: 1,
  use: {
    trace: 'on-first-retry',
  },
});

This keeps normal runs fast, but preserves enough evidence when a test fails and retries. If a test is especially unstable, I will temporarily switch to always collecting traces for that spec while investigating.
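For the temporary per-spec switch, one option is to override the trace setting at the top of the unstable file with test.use. The sketch below assumes a login spec; the route and heading name are hypothetical.

```typescript
import { test, expect } from '@playwright/test';

// Temporary while investigating: record a trace for every test in this file,
// regardless of the global `trace` setting. Remove it once the flake is understood.
test.use({ trace: 'on' });

test('logs in', async ({ page }) => {
  await page.goto('/login'); // hypothetical route, relies on a configured baseURL
  await expect(page.getByRole('heading', { name: 'Sign in' })).toBeVisible(); // hypothetical heading
});
```

The recorded trace can then be opened with npx playwright show-trace followed by the path to the trace.zip file in the test output directory.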

What to inspect in trace viewer

Open the trace and look for these things first:

The exact action that failed
Whether the element was visible, attached, enabled, and stable
Whether the page was still navigating or rendering
Whether the previous step was slower than expected
Whether a network request was still pending
Whether a modal, toast, spinner, or overlay was blocking the action

Do not just replay the trace and hope the problem jumps out. Compare the failing run to a passing run if you can. The difference is often tiny, such as a click happening 200 ms too early or a rerender replacing the DOM node between locator resolution and action.

Read the trace like a timeline, not a video

The useful question is not “what did I see on screen?” It is “what state was the app in when the test acted?”

For example:

The button exists, but the request that enables it has not resolved yet
The list renders, but one item is still animating into place
The locator resolves, but the node is detached by a React rerender before click
The page loaded, but hydration is still finishing and event handlers are not wired

That kind of timing clue is exactly why trace viewer is valuable. It helps you identify whether you need a better locator, a better assertion, or a better synchronization point.

Add logs that explain state, not noise

A flaky test often fails because the browser state and test intent drift apart. Logs help bridge that gap, but only if they are specific.

Good logs answer concrete questions

Instead of logging every line of test code, log state transitions that matter:

Which page or route the test is on
Which account or test data it used
Which API response the test is waiting for
Which feature flag was active
Whether a retry happened
Which locator or assertion failed

Here is a small example using Playwright test steps and explicit logging.

import { test, expect } from '@playwright/test';

test('creates an invoice', async ({ page }, testInfo) => {
  console.log(`[${testInfo.title}] starting on ${process.env.BASE_URL}`);

  await test.step('open invoice page', async () => {
    await page.goto('/invoices/new');
  });

  await test.step('submit form', async () => {
    await page.getByLabel('Customer').fill('Acme Inc');
    await page.getByRole('button', { name: 'Create invoice' }).click();
  });

  await expect(page.getByText('Invoice created')).toBeVisible();
});

The test.step boundaries make traces and logs easier to correlate. If a failure happens during form submission, you immediately know where to focus.

Capture browser-side logs too

Some flakes live in the browser console, not in the test runner output. When I debug stubborn failures, I often collect console messages and page errors.

page.on('console', msg => console.log(`browser:${msg.type()}: ${msg.text()}`));
page.on('pageerror', err => console.log(`pageerror: ${err.message}`));

This is especially useful for hidden frontend errors, failed chunk loads, or warnings that correlate with a race condition. If the UI logic silently fails before the button becomes clickable, the console may be the only place that explains why.
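Printing every browser message can get noisy in a large suite. One alternative is to buffer the messages and only surface them when a test fails. The sketch below uses small structural interfaces (PageLike and ConsoleMessageLike, both invented here) so it stands alone; Playwright's Page is expected to satisfy them structurally, but treat that as an assumption and swap in the real types in your suite.

```typescript
// Minimal structural types so the sketch is self-contained.
// In a real suite, Playwright's `Page` would be passed in instead.
interface ConsoleMessageLike {
  type(): string;
  text(): string;
}

interface PageLike {
  on(event: 'console', handler: (msg: ConsoleMessageLike) => void): unknown;
  on(event: 'pageerror', handler: (err: Error) => void): unknown;
}

// Buffers browser-side output so it can be printed or attached only on failure.
function collectBrowserLogs(page: PageLike): string[] {
  const lines: string[] = [];
  page.on('console', (msg) => {
    lines.push(`browser:${msg.type()}: ${msg.text()}`);
  });
  page.on('pageerror', (err) => {
    lines.push(`pageerror: ${err.message}`);
  });
  return lines;
}
```

In a test, you would call collectBrowserLogs(page) before the first navigation and, if the test fails, attach the joined lines to the report, for example with testInfo.attach and a text/plain content type.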

Use timing clues to separate slow from unstable

Timing issues in Playwright are easy to misread. A slow test is not always a flaky test. A flaky test is often a test that assumes a fixed timeline the app does not guarantee.

Look for these timing smells

The test uses waitForTimeout
The failure disappears when rerun manually in debug mode
The UI is waiting on data, but the test clicks immediately after navigation
A request sometimes takes long enough to expose a race
A modal or toast closes before the assertion checks it
A list is virtualized, so the element is not in the DOM yet

If your test only needs more time, the trace usually shows a consistent delay. If your test is truly flaky, the delay pattern changes from run to run.
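To make that distinction concrete, you can note the duration of the same step across several runs and look at the spread rather than the average. This is a minimal sketch with made-up numbers; the helper name and the sample durations are assumptions, not measurements.

```typescript
interface TimingSpread {
  mean: number;  // milliseconds
  min: number;
  max: number;
  range: number; // max minus min
}

function timingSpread(durationsMs: number[]): TimingSpread {
  if (durationsMs.length === 0) {
    throw new Error('timingSpread needs at least one duration');
  }
  const sum = durationsMs.reduce((total, value) => total + value, 0);
  const min = Math.min(...durationsMs);
  const max = Math.max(...durationsMs);
  return { mean: sum / durationsMs.length, min, max, range: max - min };
}

// Hypothetical durations of the same step across five runs.
const slowButSteady = timingSpread([2100, 2150, 2080, 2120, 2100]);
const unstable = timingSpread([300, 2900, 450, 5200, 380]);
console.log(slowButSteady.range, unstable.range); // prints: 70 4900
```

A narrow range with a high mean suggests a step that is simply slow. A wide range suggests the test depends on a timeline the app does not guarantee.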

Measure the real wait points

When a test interacts with the app too early, I want to know which event the test should have waited for instead. That might be:

load or domcontentloaded on navigation
A specific API response
A UI state like a spinner disappearing
A button becoming enabled
An element entering the viewport

Example:

// Start waiting before navigating so a fast response cannot be missed.
const cartResponse = page.waitForResponse(resp => resp.url().includes('/cart') && resp.ok());
await page.goto('/checkout');
await cartResponse;
await expect(page.getByRole('button', { name: 'Place order' })).toBeEnabled();

This is better than waiting an arbitrary 2 seconds, because it expresses the actual contract the test depends on.

Check locators for stability, not just uniqueness

Many teams hear “use better locators” and stop there. Uniqueness is necessary, but not sufficient. A locator must also be stable across UI changes, rerenders, and user states.

Safer locator patterns

getByRole with accessible names
getByLabel for form fields
getByTestId when semantics are not enough
Scoped locators within a stable container

Less stable patterns

Deep CSS chains
Text that changes based on data or localization
Positional selectors like nth() without context
Matching on transient toast text or loading states

A locator can be unique and still flaky if the element it resolves to is not the right semantic target. For example, clicking a duplicated button in a responsive layout may pass on desktop but fail on smaller viewports if the DOM changes.

Stable locator choice reduces flakiness, but it does not fix bad synchronization by itself.

If the trace shows the locator resolved correctly but the action still failed, the issue may be timing, visibility, or overlay interference rather than selector quality.

Inspect the actionability checks

One advantage of Playwright is that it performs actionability checks before interacting with elements. That helps prevent obvious failures, but it can also reveal what state the app was in when the action failed.

If a click times out, ask which of these conditions the element failed to meet:

Visible
Stable
Receives events
Enabled
In view

A failure on receives events often means an overlay, tooltip, spinner, or sticky header is blocking the element. A failure on stable often points to animation or rerender churn. A failure on enabled usually means the app had not finished a prerequisite state change.

This is one reason not to overuse force-clicks. If you click through actionability problems, you may make the test pass while hiding a real user-facing issue.
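The rules of thumb above can be kept next to the suite as a small triage table. This is only a sketch: the check names follow the list in this section, and the hints restate this article's heuristics rather than anything Playwright itself reports.

```typescript
// The hints are this article's rules of thumb, not output produced by Playwright.
type ActionabilityCheck = 'visible' | 'stable' | 'receives events' | 'enabled';

const likelyCauses: Record<ActionabilityCheck, string> = {
  visible: 'the element is hidden, not rendered yet, or has no size',
  stable: 'animation or rerender churn is still moving the element',
  'receives events': 'an overlay, tooltip, spinner, or sticky header covers the element',
  enabled: 'a prerequisite state change has not finished yet',
};

function triageHint(check: ActionabilityCheck): string {
  return `${check}: ${likelyCauses[check]}`;
}

console.log(triageHint('receives events'));
```

When you want to see which check fails without performing the action, Playwright's click accepts a trial option that runs the actionability checks and skips the click itself.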

Compare failing and passing traces side by side

When a flaky test passes after a rerun, save both runs. The contrast often exposes the underlying pattern faster than inspecting the failure alone.

What to compare

Navigation timing
API response timing
DOM snapshots before the failure
Any console errors or warnings
Whether a spinner or overlay was present
Whether the failure happened before or after a rerender

If the failing run shows a click happening before a response that enables the UI, you have a synchronization bug. If the passing run has a different route state or feature flag, you may have test data leakage. If the DOM differs between runs, your app may be rendering differently than the test expects.
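If you copy step durations out of both traces, a small diff makes the slowest-moving step obvious. The sketch below assumes a hand-built map of step name to duration in milliseconds; that shape and the helper name are hypothetical, not a trace export format.

```typescript
// Hypothetical shape: step name -> duration in milliseconds, copied by hand
// from the trace viewer or produced by your own logging.
type StepTimings = Record<string, number>;

interface StepDelta {
  step: string;
  passingMs: number;
  failingMs: number;
  deltaMs: number; // failing minus passing
}

function compareStepTimings(passing: StepTimings, failing: StepTimings): StepDelta[] {
  const deltas: StepDelta[] = [];
  for (const [step, passingMs] of Object.entries(passing)) {
    const failingMs = failing[step];
    if (failingMs === undefined) continue; // step never ran in the failing attempt
    deltas.push({ step, passingMs, failingMs, deltaMs: failingMs - passingMs });
  }
  // Largest absolute difference first.
  return deltas.sort((a, b) => Math.abs(b.deltaMs) - Math.abs(a.deltaMs));
}
```

The first entry in the result is the step whose timing changed the most between runs, which is usually the best place to start reading the failing trace.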

Reduce the problem with small experiments

Once you have a likely cause, validate it with a minimal change. Do not rewrite the whole spec yet.

Good experiments

Add a single assertion before the failing action
Wait for the specific network response the UI depends on
Replace a brittle locator with a role-based locator
Remove parallel execution for that file to test for shared state
Rerun with tracing and browser console capture enabled

For example, if a submit button is intermittently disabled, I would check the app state directly rather than adding a sleep:

await expect(page.getByRole('button', { name: 'Submit' })).toBeEnabled();
await page.getByRole('button', { name: 'Submit' }).click();

If that assertion itself becomes flaky, the problem is not the click. It is the state transition that makes the button enabled.

Watch for hidden shared state in CI

Some flaky tests are not timing issues at all. They are isolation issues that only show up when tests run together.

Common sources of shared state

Reused accounts or records
Database cleanup that is incomplete or eventually consistent
Session storage or cookies persisting between tests
Parallel tests touching the same resource
Seed data that changes during the suite

If a test fails only after other tests run, inspect the state it inherits. This matters most in CI, where tests run in parallel across workers and the execution order can vary from run to run.

A good isolation check is to run the single failing test against a fresh environment and then run the same test after a subset of related specs. If the second case fails more often, the suite is leaking state.
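If the leak turns out to be reused accounts or records, one possible mitigation is to derive record names from the test's own identity. The sketch below takes a small subset of the fields Playwright exposes on testInfo (workerIndex, retry, title); the helper name and naming scheme are assumptions for illustration.

```typescript
// Subset of Playwright's TestInfo fields used here.
interface TestIdentity {
  workerIndex: number;
  retry: number;
  title: string;
}

// Builds a record name that differs per worker, per retry, and per test title.
function uniqueRecordName(prefix: string, info: TestIdentity): string {
  const slug = info.title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse anything non-alphanumeric into a dash
    .replace(/^-+|-+$/g, '');    // trim leading and trailing dashes
  return `${prefix}-w${info.workerIndex}-r${info.retry}-${slug}`;
}

console.log(uniqueRecordName('customer', { workerIndex: 2, retry: 0, title: 'creates an invoice' }));
// prints: customer-w2-r0-creates-an-invoice
```

This only separates tests within one run. If several CI runs share a database, the name would also need a run identifier, which is left out of this sketch.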

Use retries for signal, not as a cure

Retries can be useful, but they are not a fix. They are a detector. A retry that passes tells you the failure is nondeterministic, but it does not tell you why.

I use retries in two ways:

To capture traces and logs from the failing attempt
To prove whether the flake is environment-sensitive or app-sensitive

If a test only passes on retry, do not let the retry hide the problem indefinitely. Track it, debug it, and fix it. Otherwise your suite becomes a lottery with a green dashboard.
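Tracking starts with counting. Playwright's reporters already label a test that fails and then passes on retry as flaky; the sketch below applies the same idea to a hand-built history of attempts so you can list those tests explicitly. The history shape and function names are assumptions for illustration.

```typescript
type AttemptStatus = 'passed' | 'failed';
type Outcome = 'passed' | 'flaky' | 'failed';

// Attempts are in order: the first run, then each retry.
function classifyOutcome(attempts: AttemptStatus[]): Outcome {
  if (attempts.length === 0) {
    throw new Error('no attempts recorded');
  }
  const last = attempts[attempts.length - 1];
  if (last !== 'passed') return 'failed';
  // Passed in the end, but only after at least one failed attempt.
  return attempts.slice(0, -1).includes('failed') ? 'flaky' : 'passed';
}

// Returns the names of tests that only went green on a retry.
function flakyTests(history: Record<string, AttemptStatus[]>): string[] {
  return Object.entries(history)
    .filter(([, attempts]) => classifyOutcome(attempts) === 'flaky')
    .map(([name]) => name);
}
```

Feeding this list into a ticket or dashboard keeps retried-green tests visible instead of letting them blend into a passing build.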

A practical checklist for the next flaky failure

When a Playwright test flakes, I usually work through this order:

Reproduce it under CI-like conditions
Capture a trace on failure
Inspect the failing action and surrounding timeline
Add focused logs around state transitions
Compare passing and failing runs
Check for timing, overlay, rerender, or network delays
Validate locators and actionability checks
Test for shared state or ordering problems
Make the smallest fix that expresses the real contract

This sequence keeps the debugging process grounded. It also prevents the common trap of piling on waits until the test becomes slower and still flaky.

When a fix is actually the wrong fix

Sometimes the first working fix is not the right one. Be careful with these patterns:

Replacing a locator with nth() because it passed once
Adding a sleep after every navigation
Forcing clicks through overlays
Disabling parallelism for the entire suite because one test is unstable
Adding retries without investigating the root cause

These changes can reduce noise temporarily, but they often shift the flake elsewhere. A good fix should make the test more aligned with user-visible behavior, not less.

Final thoughts

To debug flaky Playwright tests well, think like a timeline investigator, not just a test writer. Trace viewer shows you what happened in the browser, logs tell you what the test runner knew, and timing clues reveal where your assumptions do not match the app’s actual behavior.

The fastest path is usually not to rewrite the suite. It is to find the smallest point where the test is making a claim too early, too broadly, or against the wrong signal. Once you can see that clearly, the fix usually becomes obvious.

If you work with Playwright regularly, it is worth getting comfortable with the official docs and building a repeatable debugging habit. Flaky tests do not disappear by accident, but they do become manageable when you can read traces, logs, and timing as one story.