When browser tests pass in a preview environment and then fail after merge, the first instinct is usually to blame flakiness. Sometimes that is correct. More often, the failure is telling you something more specific: the preview and post-merge environments are not equivalent in a way your test depends on.

That difference might be obvious, like a different base URL or feature flag. It might be subtle, like cached assets, slower API responses, a different build artifact, or a branch-specific mock that never existed in main. The tricky part is that preview success can create false confidence. A test that only passes before merge is not stable; it is partially coupled to the conditions of the preview system.

If you have ever seen browser tests pass in preview but fail after merge, the fastest way to debug is to stop treating it as one problem. Separate it into four categories: environment drift, hidden dependencies, deployment differences, and timing issues. That separation is what this guide is built around.

What usually changes between preview and post-merge

Most teams think of preview and production-like pipelines as the same software at different moments. In practice, they often differ in small ways that matter to browser automation.

Common differences include:

  • different build artifacts or image tags
  • branch-specific environment variables
  • different API stubs or mock data sets
  • CDN or asset caching behavior
  • feature flags that are enabled only in one environment
  • browser container versions or screen sizes
  • slower or more contended shared infrastructure after merge
  • authentication state, cookies, or session lifetimes

If your test only passes when the branch is isolated, the test is probably relying on a branch-only assumption, even if the UI looks identical.

This is why the term environment drift matters. In software testing terms, drift is any divergence between what you think the system is and what it actually is in the environment where the test runs. Test automation literature has long treated execution context as part of the test result, not a detail to ignore; see the general references on software testing, test automation, and continuous integration.

Start with a question, not with a retry

Before adding retries or waitForTimeout, ask one simple question:

What changed when the branch was merged?

If you can answer that, you usually know where to look first. If you cannot, do not guess. Pull together the exact differences between preview and post-merge execution:

  • environment variables
  • app version or commit SHA
  • Docker image digest
  • browser version
  • test runner version
  • API backend version
  • data seed or migration version
  • feature flag state
  • network route, proxy, or auth layer

For teams running Playwright or Selenium in CI, I like to compare those values in the test report itself. It is not enough to know that a test failed; I want to know where it ran and what it ran against.

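import { test } from '@playwright/test';

test('capture environment details', async ({ page, browserName }) => {
  // Log where this run executed and what it ran against.
  console.log({
    commit: process.env.GITHUB_SHA,
    branch: process.env.GITHUB_REF_NAME,
    baseUrl: process.env.BASE_URL,
    project: test.info().project.name, // Playwright project from the config
    browser: browserName,              // chromium, firefox, or webkit
  });

  await page.goto(process.env.BASE_URL!);
});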

That little bit of metadata often reveals the first clue. If preview is using a branch-specific service URL and merge is using shared staging, the browser test is not comparing like with like.

Checklist 1: rule out environment drift first

Environment drift is the highest-probability category when preview passes and merge fails. It tends to appear as a browser symptom, but the root cause is usually configuration.

1. Compare runtime configuration

Look at both places, preview and post-merge, and compare the values that can change application behavior:

  • API_BASE_URL
  • AUTH_PROVIDER_URL
  • FEATURE_X_ENABLED
  • ASSET_CDN_URL
  • APP_ENV
  • NODE_ENV
  • TIME_ZONE
  • LOCALE

A feature flag set to true in preview and false after merge can change DOM structure, button labels, or timing. That makes locators fail in a way that looks like a flaky test, but the real issue is a branch-specific failure path.
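One way to make that comparison mechanical is to diff two configuration snapshots. The following is a minimal sketch, assuming each pipeline can dump its effective configuration as a flat key-value map. The `diffConfig` helper and the sample values are hypothetical, not part of any tool mentioned here.

```typescript
// Hypothetical shape: a flat map of effective configuration values.
type ConfigSnapshot = Record<string, string | undefined>;

interface ConfigDifference {
  key: string;
  preview: string | undefined;
  postMerge: string | undefined;
}

// Returns every key whose value differs between the two snapshots,
// including keys that exist in only one of them.
function diffConfig(preview: ConfigSnapshot, postMerge: ConfigSnapshot): ConfigDifference[] {
  const keys = Object.keys({ ...preview, ...postMerge }).sort();
  return keys
    .filter((key) => preview[key] !== postMerge[key])
    .map((key) => ({ key, preview: preview[key], postMerge: postMerge[key] }));
}

// Illustrative values only.
const previewConfig: ConfigSnapshot = {
  API_BASE_URL: "https://pr-123.api.example.com",
  FEATURE_X_ENABLED: "true",
  LOCALE: "en-US",
};
const postMergeConfig: ConfigSnapshot = {
  API_BASE_URL: "https://staging.api.example.com",
  FEATURE_X_ENABLED: "false",
  LOCALE: "en-US",
  TIME_ZONE: "UTC",
};

console.log(diffConfig(previewConfig, postMergeConfig));
```

With these sample values the diff reports API_BASE_URL, FEATURE_X_ENABLED, and TIME_ZONE, which is a much shorter list to investigate than the full configuration.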

2. Check the artifact, not just the commit

Merged code may be identical, but the deployed artifact may not be. Different minification, caching, or image layering can change behavior. In containerized pipelines, capture the image digest and record it in the test logs. In serverless or frontend-only pipelines, compare the built output hash.

If the preview environment uses a fresh build and the post-merge environment reuses a cached artifact, you can end up debugging yesterday’s code.
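Comparing built output can be as simple as hashing it the same way in both pipelines. This is a rough sketch, assuming Node's built-in `crypto` module is available and that the build output can be read into a map of path to content. The `fingerprintBuild` name and the sample files are made up for illustration.

```typescript
import { createHash } from "crypto";

// Hypothetical input: built file paths mapped to their contents.
type BuildOutput = Record<string, string>;

// Produces one SHA-256 fingerprint for a whole build output.
// Paths are sorted so the result does not depend on insertion order.
function fingerprintBuild(output: BuildOutput): string {
  const hash = createHash("sha256");
  for (const path of Object.keys(output).sort()) {
    hash.update(path);
    hash.update("\0"); // separator so path and content cannot run together
    hash.update(output[path]);
    hash.update("\0");
  }
  return hash.digest("hex");
}

// Illustrative data: the deployed build still contains an older app.js.
const previewBuild: BuildOutput = {
  "index.html": "<html>v2</html>",
  "app.js": "console.log('v2')",
};
const deployedBuild: BuildOutput = {
  "index.html": "<html>v2</html>",
  "app.js": "console.log('v1')",
};

const sameArtifact = fingerprintBuild(previewBuild) === fingerprintBuild(deployedBuild);
console.log(sameArtifact ? "artifacts match" : "artifacts differ: check for a cached or stale build");
```

Logging the fingerprint next to the commit SHA makes it visible when the same commit produced two different artifacts.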

3. Verify browser and runner parity

Browser tests depend on the browser engine more than many teams expect. A change from Chromium 123 to 124 can alter focus behavior, accessibility tree ordering, or timing of network events. A change in headless mode defaults can also affect rendering and viewport calculation.

If your preview job uses one image and your merge job uses another, the test is not proving portability; it is proving that one specific stack works.

4. Confirm data seeding and migrations

Preview environments often get fresh test data. Post-merge environments sometimes sit on longer-lived databases. That difference can surface as missing entities, unexpected sorting order, old schema fields, or records created by another branch.

If the test creates data through the UI, it may work in preview because the database is empty enough to be predictable. After merge, the same flow may collide with existing rows.
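One common way to avoid those collisions is to namespace everything a test creates with a per-run identifier. The sketch below assumes the CI system exposes some run identifier. `createNamer` and the placeholder run ID are hypothetical.

```typescript
// Builds names that are unique per test run, so records created through the UI
// cannot collide with rows left behind by other runs or branches.
function createNamer(runId: string): (label: string) => string {
  let counter = 0;
  return (label) => {
    counter += 1;
    return `e2e-${runId}-${label}-${counter}`;
  };
}

// Assumption: "a1b2c3d-1" stands in for a real identifier such as
// a short commit SHA plus the CI attempt number.
const nameFor = createNamer("a1b2c3d-1");

console.log(nameFor("customer")); // e2e-a1b2c3d-1-customer-1
console.log(nameFor("customer")); // e2e-a1b2c3d-1-customer-2
```

The test can then assert on the record it created by name, instead of assuming it is the first or only row in a list.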

Checklist 2: look for hidden dependencies the test was quietly using

A hidden dependency is anything the browser test relies on without making that dependency explicit in the test setup.

Typical examples:

  • a specific seed user already exists
  • a mock service returns hard-coded data in preview
  • the test assumes a clean localStorage or cookie jar
  • a marketing banner is disabled only in preview
  • the app assumes a particular timezone or locale
  • the test reaches a dependency that is only available from a branch network

This is where branch-specific failures often show up. A branch may wire a mock API or injected fixture that does not exist once the code lands in the shared environment.

Make dependencies explicit

I prefer tests to state their requirements up front. For example, if a test assumes a user exists, create that user in the test setup or through an API call before visiting the page.

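from selenium import webdriver

browser = webdriver.Chrome()
# Navigate first: cookies and localStorage are scoped to the current origin.
browser.get("https://staging.example.com/login")
browser.delete_all_cookies()
browser.execute_script("window.localStorage.clear();")
# Reload so the app starts from the cleared state.
browser.refresh()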

That example is simple, but the principle matters. If the test only passes because the browser carried state from a previous run, it is not isolated.
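The other half of the idea, creating the user a test needs instead of assuming it exists, might look like the sketch below. The `UserApi` interface is hypothetical and stands in for whatever API or admin client your application exposes. The in-memory fake is only there so the example runs without a backend.

```typescript
// Hypothetical interface for an API or admin client of the app under test.
interface UserApi {
  findByEmail(email: string): Promise<{ id: string } | null>;
  create(user: { email: string; role: string }): Promise<{ id: string }>;
}

// Makes the "user exists" dependency explicit: look the user up, create it if missing.
// Calling it at the start of every test is safe because it is idempotent.
async function ensureUser(api: UserApi, user: { email: string; role: string }): Promise<string> {
  const existing = await api.findByEmail(user.email);
  if (existing) {
    return existing.id;
  }
  const created = await api.create(user);
  return created.id;
}

// In-memory stand-in so the sketch runs without a backend.
function createFakeUserApi(): UserApi {
  const users: Record<string, { id: string }> = {};
  let nextId = 1;
  return {
    async findByEmail(email) {
      return users[email] ?? null;
    },
    async create(user) {
      const record = { id: `user-${nextId++}` };
      users[user.email] = record;
      return record;
    },
  };
}

const api = createFakeUserApi();
ensureUser(api, { email: "buyer@example.com", role: "customer" })
  .then((id) => console.log("test user ready:", id));
```

In a real suite the same call would sit in a setup hook, so the test states its requirement in both preview and shared environments.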

Watch for mock-to-real drift

Preview environments often use service mocks that are more forgiving than real services. They return faster, they accept more formats, and they rarely reproduce production latency. After merge, the app may talk to the real service, and a selector, timeout, or form submission that once felt instant now falls apart.

When this happens, inspect whether the UI is actually depending on response timing. A button might become enabled only after the first API response. If the mock returns in 20 ms and the real service returns in 2 seconds, your test may race the UI state machine.

Checklist 3: compare deployment differences, not just code differences

One of the biggest traps is assuming the merged code is the same as the preview branch code, only more integrated. In practice, deployment steps often change behavior.

Asset caching

Browser tests may pass in preview if the preview environment serves fresh JS bundles and fail after merge because a CDN or reverse proxy keeps old assets around longer.

Symptoms include:

  • a button exists but triggers old logic
  • the DOM renders with an old component version
  • tests pass only after a hard refresh
  • screenshots show mismatched labels or layout changes

If this sounds familiar, check cache headers and asset fingerprints. Make sure the browser is not mixing an old HTML shell with new JavaScript.
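A quick way to spot a mixed shell-and-bundle situation is to compare the assets an HTML response references with the assets the current deployment actually shipped. This is a simplified sketch: it assumes fingerprinted file names and uses a regular expression rather than a real HTML parser. The paths and hashes are invented.

```typescript
// Extracts script and stylesheet URLs referenced by an HTML shell.
function referencedAssets(html: string): string[] {
  const pattern = /(?:src|href)="([^"]+\.(?:js|css))"/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    found.push(match[1]);
  }
  return found;
}

// Returns references that are not part of the current deployment's asset list,
// which suggests the HTML shell and the bundles come from different builds.
function staleAssetRefs(html: string, deployedAssets: string[]): string[] {
  return referencedAssets(html).filter((asset) => deployedAssets.indexOf(asset) === -1);
}

// Illustrative data: the cached HTML still points at an older JS bundle.
const cachedHtml =
  '<script src="/assets/app.3f9a1c.js"></script>' +
  '<link rel="stylesheet" href="/assets/app.77b2e0.css">';
const currentDeploy = ["/assets/app.8d41b7.js", "/assets/app.77b2e0.css"];

console.log(staleAssetRefs(cachedHtml, currentDeploy));
```

A non-empty result here points at caching, not at the test.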

Build-time versus runtime configuration

Some apps bake environment values at build time. Others read them at runtime. If preview builds with one set of values and merge deploys the same artifact into a different runtime configuration, the behavior can diverge without any code change.

A common example is a feature flag checked during build. The branch preview uses the new path, but the merged deployment builds under a different flag state and ships different code.
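The difference is easier to see in a toy model. In the sketch below, `buildApp` imitates a bundler that bakes a flag into the output, while `renderAtRuntime` reads the flag from the environment it runs in. Both functions and the flag values are illustrative assumptions, not a description of any specific build tool.

```typescript
type Env = Record<string, string | undefined>;

// Build-time style: the flag is read once when the bundle is produced and baked in.
// A bundler that replaces an env reference with a literal has the same effect.
function buildApp(buildEnv: Env): () => string {
  const featureX = buildEnv["FEATURE_X_ENABLED"] === "true"; // frozen at build
  return () => (featureX ? "new checkout" : "old checkout");
}

// Runtime style: the flag is read from the environment the app is running in.
function renderAtRuntime(runtimeEnv: Env): string {
  return runtimeEnv["FEATURE_X_ENABLED"] === "true" ? "new checkout" : "old checkout";
}

// The artifact was built for preview, then deployed next to a staging config.
const previewBuilt = buildApp({ FEATURE_X_ENABLED: "true" });
const stagingEnv: Env = { FEATURE_X_ENABLED: "false" };

console.log(previewBuilt());              // baked value wins: "new checkout"
console.log(renderAtRuntime(stagingEnv)); // runtime value wins: "old checkout"
```

Knowing which of the two styles each flag uses tells you whether to compare build logs or runtime configuration first.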

Deployment ordering

Preview systems often deploy one app and one backend. Merge pipelines may roll out service by service. During that window, the browser test can hit a mixed-version environment where the frontend expects a field or endpoint the backend has not exposed yet.

That is not always a product bug; it can be a deployment sequencing problem. But the browser test is still doing its job by exposing the mismatch.
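If services can report which release they were deployed from, the test setup can check for a mixed-version window before running. This sketch assumes such a report exists, for example from a version endpoint. The `ServiceVersion` shape and the release names are hypothetical.

```typescript
// Hypothetical: each service reports the release it was deployed from.
interface ServiceVersion {
  service: string;
  release: string;
}

// Groups services by release. More than one group means the rollout is still
// in progress and a browser test would hit a mixed-version environment.
function releasesInFlight(versions: ServiceVersion[]): Record<string, string[]> {
  const byRelease: Record<string, string[]> = {};
  for (const { service, release } of versions) {
    (byRelease[release] = byRelease[release] || []).push(service);
  }
  return byRelease;
}

function isRolloutComplete(versions: ServiceVersion[]): boolean {
  return Object.keys(releasesInFlight(versions)).length <= 1;
}

// Illustrative snapshot taken mid-rollout.
const midRollout: ServiceVersion[] = [
  { service: "frontend", release: "r42" },
  { service: "orders-api", release: "r41" },
];

console.log(releasesInFlight(midRollout));
console.log(isRolloutComplete(midRollout) ? "safe to test" : "rollout in progress: wait or tag the run");
```

Whether to wait, skip, or run and tag the result is a team decision. The point is that the report should say the environment was mixed.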

Checklist 4: treat timing issues as evidence, not noise

Timing is where many teams reach for retries first. I use retries, but only after I know what they are hiding.

Search for race conditions in the UI flow

If a browser test fails after merge, ask whether the merged environment is slower or more contended. Some UI flows are only stable when the system is quick.

Typical race patterns:

  • the test clicks before the button is enabled
  • the page reads from the DOM before an async render completes
  • a toast disappears before the assertion runs
  • navigation starts before the expected state is visible
  • the test targets an element that is replaced during hydration

Playwright makes it easier to wait on specific states, which is better than sleeping. Selenium can do the same if you use explicit waits.

import { expect, test } from '@playwright/test';

test('wait for the real UI state', async ({ page }) => {
  await page.goto('https://staging.example.com/orders');

  // Assert the precondition explicitly instead of sleeping.
  const submit = page.getByRole('button', { name: 'Submit order' });
  await expect(submit).toBeEnabled();
  await submit.click();
});

That pattern is more durable than waiting a fixed number of seconds. It also makes the test document the actual contract, which is usually, “the button must be enabled before action,” not “wait 5 seconds and hope.”

Distinguish slowness from instability

A slow test and a flaky test are not the same thing. A slow test might always pass on the second attempt, but that only proves there is a race. It does not tell you whether the race is due to code, environment, or deployment.

Look at:

  • request timing in the browser network panel or trace
  • long tasks and hydration delays
  • API response times in the merge environment
  • CPU or memory contention in shared CI runners

If the merge environment is noisier, you might be exposing an unhandled latency assumption. That is useful signal, even if preview was fast enough to hide it.
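To compare timing between environments, I find percentiles more useful than averages. The sketch below uses a nearest-rank percentile over response times that you would export from traces or network logs. The sample numbers are invented for illustration.

```typescript
// Nearest-rank percentile over a list of durations in milliseconds.
function percentile(durationsMs: number[], p: number): number {
  if (durationsMs.length === 0) {
    throw new Error("no samples");
  }
  const sorted = durationsMs.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Invented samples: response times for the same request in each environment.
const previewMs = [20, 25, 22, 30, 24];
const mergeMs = [180, 2100, 240, 1900, 260];

console.log({
  previewP50: percentile(previewMs, 50), // 24
  previewP95: percentile(previewMs, 95), // 30
  mergeP50: percentile(mergeMs, 50),     // 260
  mergeP95: percentile(mergeMs, 95),     // 2100
});
```

A higher median suggests the environment is simply slower. A large gap between the median and the 95th percentile suggests contention, which is where races tend to surface.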

A practical debugging sequence I use

When I have a browser test that passes in preview but fails after merge, I follow this sequence:

  1. Re-run the exact same commit in both environments.
  2. Capture environment variables, browser version, image digest, and commit SHA.
  3. Compare network calls, response payloads, and status codes.
  4. Disable caching temporarily, if possible, to see whether the failure disappears.
  5. Run with tracing or video to spot timing and state changes.
  6. Recreate the merged environment locally, as closely as possible.
  7. Remove any retry or sleep until the root cause is understood.

That last step matters. Retries can turn a clear bug into an intermittent mystery. They should be a symptom-management tool, not the diagnosis.

A useful example of hidden dependency failure

Suppose a checkout test passes in preview because the branch environment has a seeded promo code and a mock payment provider that returns instantly. After merge, the code goes to shared staging.

Now the test fails because:

  • the promo code no longer exists in shared data
  • the payment service requires a real tokenized flow
  • the UI waits for a slower redirect to complete
  • the post-merge deployment uses a different locale, changing button text

From the outside, this looks like one flaky test. In reality, there are four dependencies the test never named.

The fix is not to make the test more forgiving. The fix is to define the test contract. If the test depends on a promo code, create it in setup. If it depends on a payment stub, inject a stub in both environments. If text is localized, use stable roles or test IDs instead of locale-sensitive labels.

Locators are often the first thing to break, but not the first thing to blame

A failing selector can be a genuine sign of UI drift, but it can also be a downstream symptom. If a feature flag removes a menu item, the test might fail on the locator, while the actual issue is a deployment configuration mismatch.

That is why resilient locators help, but they do not solve environment drift.

Prefer locators that reflect user intent, not implementation detail:

  • role and accessible name
  • stable data attributes
  • text only when text is part of the contract

Avoid brittle selectors that assume exact layout structure. But do not use better locators as an excuse to ignore root causes like build parity or stale assets.

How I separate preview-only behavior from merge-time changes

A simple mental model helps:

  • If preview and merge run the same code, then a failure likely comes from environment or deployment differences.
  • If preview and merge run different code paths, then the branch is exercising a path not available after merge.
  • If both run the same path but only merge fails, timing or contention is the likely culprit.

You can turn that into a decision tree:

If the DOM differs

Check flags, backend responses, cached assets, and deploy version.

If the DOM is the same but the click fails

Check timing, overlays, disabled states, and navigation events.

If the browser sees stale content

Check CDN caching, service worker behavior, and cache invalidation.

If the failure happens only on main or staging

Check merged data, permissions, auth state, and shared environment contention.
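The decision tree above is small enough to encode directly, which can be handy in a triage script or a failure-report template. This is just one possible encoding. The `Observation` labels are my own shorthand for the four branches.

```typescript
// Shorthand labels for the four branches of the decision tree.
type Observation = "dom-differs" | "click-fails" | "stale-content" | "only-on-main-or-staging";

const checksFor: Record<Observation, string[]> = {
  "dom-differs": ["feature flags", "backend responses", "cached assets", "deploy version"],
  "click-fails": ["timing", "overlays", "disabled states", "navigation events"],
  "stale-content": ["CDN caching", "service worker behavior", "cache invalidation"],
  "only-on-main-or-staging": ["merged data", "permissions", "auth state", "shared environment contention"],
};

// Returns the checks to run for a set of observations, in order, without duplicates.
function triage(observations: Observation[]): string[] {
  const checks: string[] = [];
  for (const observation of observations) {
    for (const check of checksFor[observation]) {
      if (checks.indexOf(check) === -1) {
        checks.push(check);
      }
    }
  }
  return checks;
}

console.log(triage(["stale-content", "only-on-main-or-staging"]));
```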

CI/CD patterns that reduce this class of failure

The best fix is to make preview and post-merge environments more similar. In practice, that means tightening your pipeline design.

Build once, deploy the same artifact

If possible, build one artifact and promote it through preview and post-merge environments. That reduces artifact drift and makes failures easier to interpret.

Seed data deterministically

Use repeatable setup scripts, not ad hoc manual state. Browser tests are much easier to trust when the initial state is known.

Expose environment metadata in test output

Log commit SHA, branch name, image digest, browser version, and flag snapshot. This gives you something to compare when a failure only happens after merge.

Use targeted waits, not global sleeps

Explicit waits for a visible UI state or a specific network response are usually more reliable than fixed delays.

Treat shared staging like production, not like a playground

If staging is where merge-time failures appear, it needs the same rigor as production-like systems. Otherwise, you are training the test suite to accept a false environment.

name: browser-tests
on:
  pull_request:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Print build metadata
        run: |
          echo "SHA=${GITHUB_SHA}"
          echo "REF=${GITHUB_REF_NAME}"
      - name: Run browser tests
        run: npm test
        env:
          BASE_URL: $

That pipeline snippet is intentionally simple, but the pattern is useful. Capture metadata, then run the test in a predictable way.

When retries are acceptable

Retries are not evil. They are appropriate when the system exhibits transient conditions that do not change the test contract, for example:

  • a known network hiccup in a noncritical environment
  • a rare browser startup failure in shared CI
  • temporary contention on a remote grid

Retries are not appropriate when they hide a real difference between preview and merge. If the first run fails because the app served an old bundle or the flag state changed, retrying only wastes time.

A good rule is this:

Retry transient infrastructure noise; do not retry around product or deployment mismatches.
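That rule can be written down as a small policy function so it is applied consistently. The sketch assumes failures have already been classified, which is the hard part in practice. The `FailureKind` labels and the default of two attempts are illustrative choices.

```typescript
// Illustrative labels; classifying a failure into one of these is up to the team.
type FailureKind =
  | "network-hiccup"
  | "browser-startup"
  | "grid-contention"
  | "stale-bundle"
  | "flag-mismatch";

// Only infrastructure noise that leaves the test contract unchanged is retryable.
const transientKinds: FailureKind[] = ["network-hiccup", "browser-startup", "grid-contention"];

// attempt is 1-based: the first run is attempt 1.
function shouldRetry(kind: FailureKind, attempt: number, maxAttempts: number = 2): boolean {
  return transientKinds.indexOf(kind) !== -1 && attempt < maxAttempts;
}

console.log(shouldRetry("browser-startup", 1)); // true
console.log(shouldRetry("stale-bundle", 1));    // false
```

Anything that falls outside the transient list fails immediately, which keeps deployment mismatches visible instead of intermittently green.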

The final question to ask before closing the bug

If a test passes in preview but fails after merge, ask whether the test is validating the application or validating a particular environment shape.

That distinction is the heart of the problem. Browser tests should tell you whether the user flow works under defined conditions. If preview-only success is masking branch-specific failures, hidden dependencies, or deployment differences, the test is not yet telling you the truth.

When you make the environment explicit, reduce drift, and treat timing as a signal instead of noise, these failures become much easier to isolate. More importantly, they become preventable.

The next time browser tests pass in preview but fail after merge, do not start by adding another retry. Start by comparing the two worlds the test is actually running in. That is usually where the answer lives.