How I Decide Whether a Flaky Test Is a Product Bug, a Test Bug, or a CI Bug

The hardest part of dealing with flaky tests is not the rerun button. It is deciding what you are actually looking at. A red build can mean the product is broken, the test is wrong, the environment is unstable, or the CI pipeline itself is introducing noise. If you classify the failure incorrectly, you usually make the problem worse. You either patch a test that was correctly catching a real defect, or you spend hours triaging a product issue that only exists because your test setup is lying to you.

I try to answer one question first: what is the most likely root cause of this flaky failure? That sounds obvious, but in practice teams often skip straight to the easiest fix. The framework below is the one I use when I need to tell product bugs, test bugs, and CI flakiness apart.

My default assumption is not “the test is flaky.” My default assumption is “something changed, and I do not yet know where the fault line is.”

The three buckets I care about

When a test fails intermittently, I usually classify it into one of three buckets.

1. Product bug

The system under test is genuinely broken, but the symptoms are timing-dependent or state-dependent, so the issue only shows up occasionally. This is common in distributed systems, race conditions, async UI updates, stale reads, cache invalidation, and event ordering problems.

Examples:

A save button returns success before the backend transaction commits.
A checkout flow succeeds only when the inventory service responds quickly.
A notification appears sometimes because the UI reads a stale API response.

These are not test problems. They are real defects that the test happened to expose.

2. Test bug

The test is making an invalid assumption, using brittle locators, racing the UI, depending on shared state, or not isolating its own data. In other words, the test is not modeling the product correctly.

Examples:

Using nth() or CSS selectors that break when layout changes.
Clicking before the element is actually actionable.
Reusing a user account that previous tests have polluted.
Asserting on animation timing instead of the final state.

This is a classic test automation failure, and it is a test bug even if it only fails once in fifty runs.

3. CI bug

The code and the test may both be fine, but the pipeline, runner, container, clock, network, browser, resource limits, or dependency service is introducing instability.

Examples:

Tests pass locally but fail in CI because the runner is CPU-constrained.
Browser startup is slow enough that a hardcoded timeout expires.
A shared staging database is being throttled.
Parallel jobs contend for ports, temp directories, or test data.

CI flakiness can look like a test bug, but the fix belongs in the pipeline, infrastructure, or environment contract.

Start with reproducibility, but do not worship it

A flaky failure that reproduces only in CI and not locally is still useful. It is not proof of a test bug, and it is not proof of a CI bug. It is just evidence that the failure is environment-sensitive.

My first pass is simple:

Re-run the exact test in the same environment.
Re-run the test locally with as much of the same configuration as possible.
Compare what changes between passing and failing runs.

That sounds basic, but it quickly narrows the search space. I look at:

browser version
test data state
build agent type
CPU and memory pressure
parallelization level
feature flags
API stubs or real dependencies
seed data and timestamps
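
One way to make that comparison mechanical is to record the same metadata for every run and diff a passing run against a failing one. The sketch below is a minimal illustration, not a prescribed tool: the `RunContext` shape, the field names, and the sample values are all assumptions made up for the example.

```typescript
// Hypothetical shape for metadata captured on each run. The field names
// are illustrative and not taken from any specific CI system.
type ContextValue = string | number | boolean;
type RunContext = Record<string, ContextValue>;

interface ContextDiff {
  key: string;
  passing: ContextValue | undefined;
  failing: ContextValue | undefined;
}

// Return every key whose value differs between a passing and a failing run,
// including keys that are present in only one of them.
function diffRunContext(passing: RunContext, failing: RunContext): ContextDiff[] {
  const allKeys = Object.keys(passing)
    .concat(Object.keys(failing))
    .filter((key, index, all) => all.indexOf(key) === index)
    .sort();
  const diffs: ContextDiff[] = [];
  for (const key of allKeys) {
    if (passing[key] !== failing[key]) {
      diffs.push({ key, passing: passing[key], failing: failing[key] });
    }
  }
  return diffs;
}

// Sample values are invented for the example.
const passingRun: RunContext = {
  browserVersion: "124.0",
  agentType: "large",
  parallelWorkers: 1,
  featureFlagNewCheckout: false,
};
const failingRun: RunContext = {
  browserVersion: "124.0",
  agentType: "small",
  parallelWorkers: 4,
  featureFlagNewCheckout: false,
};

const contextDiffs = diffRunContext(passingRun, failingRun);
for (const d of contextDiffs) {
  console.log(`${d.key}: passing=${d.passing} failing=${d.failing}`);
}
```

With this sample data the diff leaves only the agent type and the parallelization level to investigate, because the browser version and the feature flag are identical in both runs.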

If the failure is reproducible on demand, it is usually not “flaky” in the strict sense. It is a deterministic bug with intermittent visibility. That is a useful distinction because deterministic bugs are easier to root-cause than genuinely non-deterministic ones.

The fastest diagnostic question: what changed?

When I get a red test, I ask what changed since the last green run.

If the application changed

If the product code changed, product bug triage moves higher on my list. I look for:

altered business logic
new async behavior
changed API contract
schema migrations
front-end rendering changes
permission or auth changes

If the test started failing right after a release and the failure message maps to business behavior, I treat the product as guilty until proven otherwise.

If the test changed

If someone edited locators, waits, fixtures, mocks, or page objects, I lean toward test bug diagnosis. Test-only changes often create new timing assumptions or break abstractions.

If neither changed

Then I examine the environment, CI pipeline, and shared dependencies. A failure with no code change often points to CI flakiness, environment drift, or a hidden dependency that is not stable enough for automation.

A practical decision tree I use

I do not need a fancy diagram to triage most failures, just a disciplined sequence.

Step 1, classify the failure surface

Ask where it failed:

assertion failure in the test body
timeout waiting for a UI or API condition
setup or teardown failure
browser crash or runner crash
dependency timeout

Assertion failures often point to product or test bugs. Setup and teardown failures often point to CI or environment issues. Crashes and timeouts need more context.
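
The mapping in the paragraph above can be written down as a small lookup table so that triage notes use the same vocabulary every time. This is only a sketch of the heuristic as stated here: the type names and the `firstSuspects` function are hypothetical, and an empty list simply means "needs more context before leaning either way".

```typescript
type Bucket = "product" | "test" | "ci";

type FailureSurface =
  | "assertion"
  | "condition-timeout"
  | "setup-teardown"
  | "crash"
  | "dependency-timeout";

// Starting suspects per failure surface. This encodes a first guess only,
// not a verdict: crashes and timeouts deliberately map to no suspects.
const FIRST_SUSPECTS: Record<FailureSurface, Bucket[]> = {
  "assertion": ["product", "test"],
  "setup-teardown": ["ci"],
  "condition-timeout": [],
  "crash": [],
  "dependency-timeout": [],
};

function firstSuspects(surface: FailureSurface): Bucket[] {
  return FIRST_SUSPECTS[surface];
}

console.log(firstSuspects("assertion"));
console.log(firstSuspects("crash"));
```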

Step 2, inspect the failing artifact

I want:

screenshot or video
trace or HAR file if available
logs from app, test runner, and backend
browser console errors
network failures
timestamps

The clearer the failure is in the artifacts, the less guesswork the classification needs. In Playwright, traces are especially useful because they show actionability, timing, network requests, and DOM snapshots. Selenium gives less out of the box, so I rely more on explicit logging and screenshots.

Step 3, ask whether the test is observing a user-visible contract

If the test asserts on a state a real user can see and rely on, I am more suspicious of the product. If the test asserts on an implementation detail, I am more suspicious of the test.

A good user-facing assertion might be:

order confirmation is visible
saved record appears in the list
error message is shown on invalid input

A brittle implementation assertion might be:

specific DOM order without a business reason
exact number of network calls when caching may legitimately vary
internal CSS class names

Step 4, check for shared state

If two test runs can affect each other, I assume contamination until disproven. Shared state causes some of the ugliest flaky failures to root-cause:

reused accounts
reused emails
reused database records
cached auth tokens
background jobs still running from previous tests
stale browser profiles

Isolation problems are often test bugs, but they can also expose missing environment guarantees. If the pipeline cannot guarantee a clean slate, that is a CI bug or at least a CI design flaw.

How I tell a product bug from a test bug

This is the most important distinction, because teams waste the most time here.

I suspect a product bug when:

the failure correlates with a real user workflow
the same assertion fails in multiple test layers, not just one UI test
the backend logs show inconsistent state, race conditions, or timeout errors
a manual reproduction is possible with the same sequence
the test fails only when a certain backend condition exists

For example, if a Playwright test waits for a success toast after saving a profile, and the API returns 200 but the record sometimes does not persist, I do not blame the test first. The test is probably exposing a real consistency problem.

I suspect a test bug when:

the failure disappears when I wait for the actual app state instead of a fixed timeout
a locator is too broad or too specific
the test depends on animation, ordering, or rendering implementation
rerunning only the failed test in a clean environment makes it pass consistently
the product logs show no corresponding issue

A common example is a test that clicks a button and immediately checks for a modal. If the modal opens through async state updates, the assertion is too eager. The product may be fine; the test is just racing the UI.

Here is the kind of Playwright fix I often make when the test is the problem:

typescript

await page.getByRole('button', { name: 'Save' }).click();
await expect(page.getByText('Profile saved')).toBeVisible();

And here is the kind of anti-pattern I try to remove:

typescript

await page.click('#save');
await page.waitForTimeout(2000);
expect(await page.textContent('.toast')).toContain('Profile saved');

The second snippet may pass often enough to fool people, but it is not a stable contract. It is just a delayed guess.

How I tell a CI bug from a test bug

CI flakiness is tricky because the test can be correct and still fail in the pipeline. I look for signs that the environment is part of the problem.

Strong CI bug signals

failures happen only on one runner type
failures correlate with high CPU, low memory, or noisy neighbors
local runs are stable with the same test code and data
browser launch, network connectivity, or container startup is slow or inconsistent
test failures vanish when the job is serialized instead of parallelized
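
To check the first signal in that list, it can help to group recent results by runner type and compare failure rates. The following is a minimal sketch under assumptions: the `RunnerRun` record, the runner names, and the sample history are invented, and real data would come from whatever result store or CI API is already available.

```typescript
// Hypothetical record of one test execution on a given runner type.
interface RunnerRun {
  runner: string;
  passed: boolean;
}

interface RunnerStats {
  runner: string;
  total: number;
  failures: number;
  failureRate: number;
}

// Count runs and failures per runner, then compute the failure rate.
function failureRateByRunner(runs: RunnerRun[]): RunnerStats[] {
  const byRunner: { [runner: string]: { total: number; failures: number } } = {};
  for (const run of runs) {
    const stats = byRunner[run.runner] || { total: 0, failures: 0 };
    stats.total += 1;
    if (!run.passed) stats.failures += 1;
    byRunner[run.runner] = stats;
  }
  return Object.keys(byRunner)
    .sort()
    .map((runner) => {
      const { total, failures } = byRunner[runner];
      return { runner, total, failures, failureRate: failures / total };
    });
}

// Invented sample history for one test.
const recentRuns: RunnerRun[] = [
  { runner: "linux-small", passed: false },
  { runner: "linux-small", passed: true },
  { runner: "linux-small", passed: false },
  { runner: "linux-small", passed: true },
  { runner: "linux-large", passed: true },
  { runner: "linux-large", passed: true },
  { runner: "linux-large", passed: true },
  { runner: "linux-large", passed: true },
];

for (const s of failureRateByRunner(recentRuns)) {
  console.log(`${s.runner}: ${s.failures}/${s.total} failed`);
}
```

A large gap between runner types is not proof of a CI bug on its own, but it says where to look next. Small sample sizes can mislead, so the counts matter as much as the rate.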

A lot of CI flakiness is caused by hidden assumptions. The test may assume a browser starts within 10 seconds, a service is always reachable, or a port is always free. Those are not product assumptions; they are infrastructure assumptions.

Test bug or CI bug?

Sometimes the answer is both. A test with a realistic but too-tight timeout is not purely a test bug or a CI bug. It is a contract mismatch. The test should wait on the actual state change, while the pipeline should provide a stable enough environment for reasonable timing.

This is where test automation strategy matters. If your suite is full of hardcoded sleeps, you have created a timing debt that CI eventually collects.

For background reading on the broader concepts, the software testing, test automation, and continuous integration pages are useful starting points, but the real work is in how you model system behavior and environment stability.

The evidence I trust most

When people ask me how I know which bucket a failure belongs in, I usually say I trust correlated evidence more than a single symptom.

1. Logs from the application

If app logs show an exception, timeout, stale read, or validation failure, that is strong product evidence.

2. Logs from the test runner

If the runner tells me an element never became visible, a promise never resolved, or a browser context crashed, that points me toward test or CI issues.

3. A clean rerun with the same commit

A rerun that passes does not exonerate the test or the product. But if the rerun is clean and the artifacts show a different execution path, I know the issue is timing- or environment-related.
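
If run history is available, tests that both passed and failed on the same commit can be picked out automatically, since their outcome changed without any code change. The sketch below assumes a hypothetical `CommitRun` record and invented commit IDs and test names. It only flags candidates and says nothing about which bucket they belong in.

```typescript
// Hypothetical record of one test result on one commit.
interface CommitRun {
  commit: string;
  test: string;
  passed: boolean;
}

// Return "commit::test" keys that have at least one pass and one failure
// on the same commit.
function mixedOutcomeTests(runs: CommitRun[]): string[] {
  const seen: { [key: string]: { pass: boolean; fail: boolean } } = {};
  for (const run of runs) {
    const key = run.commit + "::" + run.test;
    const entry = seen[key] || { pass: false, fail: false };
    if (run.passed) {
      entry.pass = true;
    } else {
      entry.fail = true;
    }
    seen[key] = entry;
  }
  return Object.keys(seen)
    .filter((key) => seen[key].pass && seen[key].fail)
    .sort();
}

// Invented sample history.
const history: CommitRun[] = [
  { commit: "abc123", test: "saves profile", passed: false },
  { commit: "abc123", test: "saves profile", passed: true },
  { commit: "abc123", test: "checkout", passed: true },
  { commit: "abc123", test: "checkout", passed: true },
  { commit: "def456", test: "checkout", passed: false },
];

console.log(mixedOutcomeTests(history));
```

In this sample only "saves profile" is flagged. The "checkout" failure happened on a different commit, so it may just as well be a plain regression.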

4. Parallel vs serial behavior

If a test only fails in parallel, I immediately look for shared state, port collisions, file collisions, or order dependence. That is usually not a product defect.

5. Local vs CI parity

If it only fails in CI, I ask what is different about CI:

headless browser mode
containerized network
smaller resources
different fonts or rendering engine support
mock services not available locally
different secrets or permissions

The more your local environment differs from CI, the less useful local green tests are as evidence.

A few failure patterns I have learned to recognize

Pattern 1, hidden asynchronous work

A test clicks Save and immediately checks the database or UI. The backend accepts the request, queues a job, and processes later. Sometimes the job finishes in time, sometimes it does not.

This can be a product bug if the user contract says the save is complete when the action completes. It can be a test bug if the test assumed immediate consistency without waiting for the actual completion condition.
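
On the test side, "waiting for the actual completion condition" usually means polling for the state you care about up to a deadline. Here is a framework-free sketch of that idea. The `waitFor` helper and the simulated job are assumptions for illustration; a real suite would normally lean on the waiting primitives its test framework already provides.

```typescript
// Poll `check` until it returns a non-null value or the deadline passes.
async function waitFor<T>(
  check: () => Promise<T | null> | T | null,
  timeoutMs: number,
  intervalMs: number = 50,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const result = await check();
    if (result !== null) return result;
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Simulated backend: the job is queued on save and finishes some time later.
// The returned function reports "completed" once the job is done, else null.
function startFakeSaveJob(durationMs: number): () => string | null {
  const doneAt = Date.now() + durationMs;
  return () => (Date.now() >= doneAt ? "completed" : null);
}

async function demo(): Promise<string> {
  const jobStatus = startFakeSaveJob(120);
  // Wait on the completion condition itself instead of sleeping a fixed time.
  return waitFor(jobStatus, 2000);
}

demo().then((status) => console.log(status));
```

The timeout here is an upper bound, not a guess at how long the job takes: the helper returns as soon as the condition holds and fails loudly if it never does.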

Pattern 2, brittle selectors

A selector that depends on layout structure, text fragments, or incidental CSS classes often fails when the UI changes. That is almost always a test bug.

In Selenium, I prefer stable locators and explicit waits over indexing and sleeps:

python

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Assumes `driver` is an already-initialized WebDriver instance.
# Wait up to 10 seconds for the button to become clickable, then click it.
save = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-testid='save-profile']"))
)
save.click()

The point is not Selenium versus Playwright. The point is whether the test expresses the business behavior clearly and waits on the right condition.

Pattern 3, shared test data

Two tests using the same email address or customer ID can interfere with each other. The failure may show up as a duplicate key error, a missing row, or a login failure.

This is usually a test bug, but if the system has no safe way to create isolated test entities, it is also a product or platform design issue. Good testability is part of the system contract.

Pattern 4, unstable environments

A container that runs fine on one agent and fails on another is often a CI issue. I have seen this with resource pressure, browser startup, networking, file permissions, and dependency initialization order.

If your CI environment is non-deterministic, your test suite will inherit that nondeterminism even if every assertion is perfect.

My triage checklist

When I need to decide quickly, I run through this checklist.

Ask these questions

Did the product, test, or environment change?
Is the failure reproducible locally, in CI, or both?
Does the failure happen in serial, in parallel, or only on one runner?
Do logs show a real application error or just a timeout?
Is the assertion user-facing or implementation-specific?
Is any shared state involved?
Would a longer wait hide the issue, or would it actually fix the race?

Then classify

If the product behavior is wrong, file a product bug.
If the test made an incorrect assumption, fix the test.
If the pipeline or environment is unstable, treat it as CI work, not test cleanup.

A longer timeout is not a diagnosis. It is sometimes a temporary workaround, but if it is the only fix you have, you probably have not found the root cause.

What I do after I classify the issue

For product bugs

I keep the failing test. I may improve the assertion, but I do not delete a test that caught a real issue just because the issue was intermittent. If the bug is real, the test is valuable.

I usually add:

clearer logging around the affected flow
a more precise assertion
a regression test at the right layer

For test bugs

I remove timing assumptions, fix locators, improve isolation, and make the test communicate intent better. If the test is too coupled to implementation details, I rewrite it at a higher level.

I also look for patterns. One flaky test bug often indicates a family of brittle tests.

For CI bugs

I treat the pipeline like production infrastructure. That means:

increase observability
reduce resource contention
make environment setup explicit
pin versions where needed
isolate jobs that interfere with each other
avoid depending on undocumented agent behavior

If you can reproduce the problem only on a specific runner image or only under load, the CI system needs engineering attention, not just another test retry.

How I prevent the same debate from happening again

The best way to cut down on root cause debates about flaky tests is to make the failures less ambiguous.

Make assertions meaningful

Prefer assertions that reflect user value, not implementation details. If the user would not notice a temporary state, your test probably should not assert on it either.

Use deterministic test data

Create unique data per run when possible. Remove hidden shared state. Clean up aggressively.
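
A small example of per-run data: derive identifiers from the CI run, the worker, and the test name, so that no two tests share an account and a failing run can still be traced back to its data. The helper name, the `qa+` prefix, and the `example.test` domain are assumptions for this sketch.

```typescript
// Build an email address that is unique to one CI run, one worker, and one
// test, yet fully determined by its inputs, so the data can be traced back.
function uniqueTestEmail(runId: string, workerIndex: number, testName: string): string {
  const slug = testName
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
  return `qa+${runId}-w${workerIndex}-${slug}@example.test`;
}

console.log(uniqueTestEmail("build-4821", 2, "Saves profile"));
```

Because the value is derived rather than random, a rerun with the same run ID reuses the same address. If retries within one run must not collide, an attempt number would need to be part of the key as well.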

Prefer explicit waits over sleeps

Wait for the condition you care about, not a guessed delay.

Make CI failures observable

Collect screenshots, traces, console logs, API responses, and timestamps. A failure without artifacts is expensive to classify.

Track flaky tests as incidents, not annoyances

If a test fails often enough to affect trust, it deserves the same seriousness as a production defect. Flakiness burns engineering time, slows releases, and creates bad incentives to ignore red builds.

A simple rule I keep in mind

If the test is faithfully describing the product contract, and the contract fails intermittently, suspect the product first. If the test is making assumptions about timing, structure, or environment, suspect the test first. If the behavior changes between CI and local without a code difference, suspect the pipeline first.
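
Written as code, the rule looks roughly like the sketch below. The `Observations` fields and the `suspectsFromRule` name are hypothetical, and the function returns every suspect whose condition holds rather than picking a winner, because the rule does not say which one takes precedence when several apply.

```typescript
type Suspect = "product" | "test" | "pipeline";

// Hypothetical summary of what triage has established so far.
interface Observations {
  // The assertion reflects a contract a real user can see and rely on.
  testDescribesProductContract: boolean;
  // The test assumes something about timing, structure, or environment.
  testAssumesTimingStructureOrEnv: boolean;
  // Behavior differs between CI and local runs.
  behaviorDiffersBetweenCiAndLocal: boolean;
  // The code under test differs between those CI and local runs.
  codeDiffersBetweenCiAndLocal: boolean;
}

// Return all suspects that the rule points at, in the order it states them.
// An empty list means the rule gives no starting point yet.
function suspectsFromRule(o: Observations): Suspect[] {
  const suspects: Suspect[] = [];
  if (o.testDescribesProductContract) suspects.push("product");
  if (o.testAssumesTimingStructureOrEnv) suspects.push("test");
  if (o.behaviorDiffersBetweenCiAndLocal && !o.codeDiffersBetweenCiAndLocal) {
    suspects.push("pipeline");
  }
  return suspects;
}

console.log(
  suspectsFromRule({
    testDescribesProductContract: true,
    testAssumesTimingStructureOrEnv: false,
    behaviorDiffersBetweenCiAndLocal: false,
    codeDiffersBetweenCiAndLocal: false,
  }),
);
```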

That sounds almost too simple, but it is the fastest way I know to avoid fixing the wrong layer.

The point is not to be philosophically pure about classification. The point is to stop wasting time. A flaky test should lead you to the weakest part of the system, whether that is the product, the test, or the CI environment. If you do this well, you get faster triage, fewer false fixes, and a healthier relationship with your automation suite.

Final thought

A flaky test is not just a nuisance. It is a signal about system design. Sometimes it reveals a product race condition. Sometimes it exposes a brittle test. Sometimes it tells you your CI pipeline is too noisy to trust. The job is not to label the failure quickly but to label it correctly.

If you build the habit of asking what changed, where the evidence points, and whether the test is observing a real user contract, root cause analysis of flaky tests gets a lot easier. More importantly, your team stops normalizing red builds as background noise, which is one of the fastest ways to lose trust in automation.