How to Store Playwright Traces, Videos, and Screenshots in CI So Triage Stops Being Slow

If you have ever opened a failed CI run and found only a stack trace, a timeout, and a pile of logs with no clear clue, you already know why artifact storage matters. A Playwright failure is often not really a code failure; it is an evidence problem. The test saw something unexpected, but without a trace, screenshot, or video, triage turns into reruns and guesswork.

I like Playwright because it gives us the right debugging primitives out of the box, but those artifacts only help if they are captured consistently and stored where the team can actually use them. This is where many pipelines fall apart. The tests are already running in CI, the failures are already expensive, and then the evidence gets lost because the job container is ephemeral, the artifact retention is too short, or the upload rule is too broad.

This article is a practical setup guide for people who want to store Playwright traces in CI, keep videos and screenshots with the right failure context, and make triage faster without drowning the pipeline in noise.

What you are trying to capture, and why each artifact matters

Playwright gives you three main types of debugging evidence that are worth storing in CI:

Traces, which include the page timeline, actions, DOM snapshots, console messages, network information, and screenshots along the way
Videos, which show the test run visually, especially useful for timing issues or UI motion
Screenshots, which give a fast visual checkpoint at the moment of failure or expectation mismatch

Playwright’s official docs are the right place to start if you want the complete feature set and current syntax, especially the Playwright introduction.

The goal is not to save everything all the time; it is to save enough evidence that one failure does not require three reruns.

Here is how I think about the tradeoff:

Trace is the highest value artifact for most UI failures
Video is useful when timing, animation, or state transitions matter
Screenshot is lightweight and good for quick visual confirmation, but it rarely explains the full story alone

If your team only stores screenshots, you will eventually get stuck on issues like selectors that matched the wrong element, hidden overlays, delayed API responses, or a modal that opened and closed before the screenshot was taken. The trace makes those problems visible.

The CI problem, in plain terms

A local run and a CI run are not the same environment. CI adds several sources of variability:

fresh containers or VMs
slower or more contended CPUs
different fonts and browser dependencies
parallel execution
network latency to the app under test or dependencies
cleanup between jobs

That means the artifact strategy should assume failure will be intermittent. If a test fails once in fifty runs, the value of the evidence is highest at the exact failure moment. If you only capture artifacts on rerun, you may never reproduce the same state.

Continuous integration, as a practice, is about integrating changes frequently and validating them automatically, but test evidence is part of that system too. A CI pipeline without useful artifacts is really only half an observability system.

A sane default policy for Playwright artifacts

If you want a starting point, I recommend this default policy:

Traces: capture on first retry or failure, depending on how noisy the suite is
Videos: capture on failure only, or on first retry for unstable suites
Screenshots: capture on failure, and optionally on assertion-heavy tests

For most teams, the sweet spot is to store traces only when something goes wrong, rather than for every passing run. That keeps storage costs and upload time under control.

For teams with heavy flakiness, I sometimes start with retain-on-failure for both traces and videos, then tighten later once the suite is more stable.
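One way to make the "start loose, tighten later" idea explicit is a small helper that returns artifact options per suite. This is a minimal sketch: `SuiteStability` and `artifactPolicyFor` are hypothetical names, and only the string values are Playwright's documented option values for `trace`, `video`, and `screenshot`.

```typescript
// Hypothetical helper: maps how stable a suite is to Playwright artifact options.
// The string literals are Playwright option values; the helper itself is illustrative.
type SuiteStability = 'stable' | 'flaky';

interface ArtifactPolicy {
  trace: 'off' | 'on' | 'retain-on-failure' | 'on-first-retry';
  video: 'off' | 'on' | 'retain-on-failure' | 'on-first-retry';
  screenshot: 'off' | 'on' | 'only-on-failure';
}

function artifactPolicyFor(stability: SuiteStability): ArtifactPolicy {
  if (stability === 'flaky') {
    // Heavy flakiness: keep traces and videos from every failing attempt.
    return { trace: 'retain-on-failure', video: 'retain-on-failure', screenshot: 'only-on-failure' };
  }
  // Stable suites: record only on the first retry (this needs retries > 0).
  return { trace: 'on-first-retry', video: 'on-first-retry', screenshot: 'only-on-failure' };
}
```

You could spread the result into `use` in `playwright.config.ts` and flip a suite from `'flaky'` to `'stable'` once it settles down.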

Configure Playwright to keep the right evidence

Playwright Test supports artifact configuration in the test config. A common pattern is to keep traces and videos only when a test fails, and to capture screenshots on failure.

import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Keep traces and videos only for failing tests; screenshot only on failure.
    trace: 'retain-on-failure',
    video: 'retain-on-failure',
    screenshot: 'only-on-failure',
  },
});

This is usually enough to get started. If you need even more control, you can tune behavior per project, per test type, or per environment.

For example, I often make the artifact policy stricter in CI than locally, because local runs are interactive and CI failures are the ones that need durable evidence.

import { defineConfig } from '@playwright/test';

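export default defineConfig({
  // One retry in CI gives intermittent failures a second attempt; none locally.
  retries: process.env.CI ? 1 : 0,
  use: {
    // CI: keep the trace from every failing attempt.
    // Locally: only on a first retry, which never happens with 0 local retries.
    trace: process.env.CI ? 'retain-on-failure' : 'on-first-retry',
    video: process.env.CI ? 'retain-on-failure' : 'off',
    screenshot: 'only-on-failure',
  },
});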
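Tuning per project works the same way: each entry in Playwright's `projects` array can override `use`, so a stable smoke suite and a flakier cross-browser suite can keep different evidence. A sketch, assuming hypothetical `smoke/` and `e2e/` test folders and hypothetical project names:

```typescript
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  retries: process.env.CI ? 1 : 0,
  // Defaults shared by every project unless a project overrides them.
  use: { screenshot: 'only-on-failure' },
  projects: [
    {
      // Hypothetical stable smoke suite: traces only on the first retry, no video.
      name: 'smoke',
      testMatch: /smoke\/.*\.spec\.ts/,
      use: { ...devices['Desktop Chrome'], trace: 'on-first-retry', video: 'off' },
    },
    {
      // Hypothetical flaky cross-browser suite: keep traces and video from failing attempts.
      name: 'e2e-firefox',
      testMatch: /e2e\/.*\.spec\.ts/,
      use: { ...devices['Desktop Firefox'], trace: 'retain-on-failure', video: 'retain-on-failure' },
    },
  ],
});
```

Top-level `use` values apply to every project unless the project sets its own, so the screenshot policy is shared here while trace and video differ.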

Why retries matter here

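Retries are not just about smoothing over transient issues. They also give you a useful boundary for when to keep artifacts. If a test fails once and passes on retry, the first failure is often the one you want to inspect, because it may contain the original broken state.

That is why I prefer retain-on-failure or on-first-retry over blanket always-on trace collection. The two differ in which attempt they capture: retain-on-failure records every attempt and keeps the trace from each failing one, including the original failure, while on-first-retry only starts recording when the test is retried, so you get the retry rather than the first failure. Always-on traces are great for small suites, but they create more storage and upload overhead than many teams actually need.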
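To see which attempt each mode leaves you with, here is a simplified model of the trace modes. It is a sketch based on the documented meaning of each mode, not Playwright's implementation; `keptTraceAttempts` is a hypothetical function, and attempt 0 is the original run.

```typescript
// Simplified model of Playwright's trace modes (not the real implementation).
// Attempt 0 is the original run; attempts 1..n are retries.
type TraceMode = 'off' | 'on' | 'retain-on-failure' | 'on-first-retry' | 'on-all-retries';
type Outcome = 'passed' | 'failed';

function keptTraceAttempts(mode: TraceMode, outcomes: Outcome[]): number[] {
  const kept: number[] = [];
  outcomes.forEach((outcome, attempt) => {
    const keep =
      mode === 'on' ||
      // Recorded on every attempt, discarded when the attempt passes.
      (mode === 'retain-on-failure' && outcome === 'failed') ||
      // Recorded only on the first retry, whatever its outcome.
      (mode === 'on-first-retry' && attempt === 1) ||
      // Recorded on every retry, never on the original run.
      (mode === 'on-all-retries' && attempt >= 1);
    if (keep) kept.push(attempt);
  });
  return kept;
}

// A flaky test: fails on the original run, passes on the first retry.
const flakyRun: Outcome[] = ['failed', 'passed'];
console.log(keptTraceAttempts('retain-on-failure', flakyRun)); // [ 0 ]
console.log(keptTraceAttempts('on-first-retry', flakyRun)); // [ 1 ]
```

Under this model, only retain-on-failure preserves the original failing attempt; on-first-retry gives you the retry, which may or may not reproduce the broken state.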

Store artifacts where the job cannot lose them

The most common mistake I see is assuming the CI job workspace is enough. It is not. The workspace usually disappears when the job ends, the runner is recycled, or the container gets destroyed.

You want artifact storage to be explicit:

Save the Playwright output to a stable directory
Upload that directory as a CI artifact
Retain it long enough for triage and investigation
Make it easy to download from the job summary or test report
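In Playwright, the first step maps to two settings: `outputDir`, where per-test artifacts such as traces, videos, and failure screenshots are written (`test-results` by default), and the HTML reporter's `outputFolder` (`playwright-report` by default). A minimal sketch that pins both explicitly so the CI upload paths never drift:

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Per-test artifacts (trace.zip, videos, failure screenshots) land here.
  outputDir: 'test-results',
  // Fixed HTML report folder; never try to open a browser after the run in CI.
  reporter: [['html', { outputFolder: 'playwright-report', open: 'never' }]],
  use: {
    trace: 'retain-on-failure',
    video: 'retain-on-failure',
    screenshot: 'only-on-failure',
  },
});
```

Keeping the report folder outside `outputDir` matters, because Playwright cleans `outputDir` at the start of a run.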

A simple directory layout helps a lot:

playwright-report/
artifacts/
  traces/
  videos/
  screenshots/

You can also keep the default Playwright output directory and upload it directly, but separating failure artifacts from the HTML report often makes CI jobs easier to scan.
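If you do want the separate `artifacts/` layout, one approach is a small post-test script that copies files out of `test-results/`. Below is a minimal sketch of the planning step, assuming POSIX-style paths and the usual file types (`.zip` for traces, `.webm` for videos, `.png` for screenshots). `artifactKind` and `planLayout` are hypothetical names, and the actual copy (for example with Node's `fs.copyFileSync`) is left out.

```typescript
// Hypothetical helper: plan how to sort Playwright artifacts from test-results/
// into <destRoot>/{traces,videos,screenshots}/. Assumes '/'-separated paths.
type ArtifactKind = 'traces' | 'videos' | 'screenshots';

interface CopyStep {
  from: string;
  to: string;
}

function artifactKind(file: string): ArtifactKind | null {
  const base = file.split('/').pop() ?? '';
  const dot = base.lastIndexOf('.');
  const ext = dot >= 0 ? base.slice(dot).toLowerCase() : '';
  if (ext === '.zip') return 'traces';
  if (ext === '.webm') return 'videos';
  if (ext === '.png') return 'screenshots';
  return null; // Ignore everything else (JSON, logs, other attachments).
}

function planLayout(files: string[], destRoot: string): CopyStep[] {
  const steps: CopyStep[] = [];
  for (const file of files) {
    const kind = artifactKind(file);
    if (kind === null) continue;
    const parts = file.split('/');
    const base = parts[parts.length - 1];
    // Every trace is named trace.zip, so prefix with the per-test folder name
    // to avoid collisions and keep the test title visible.
    const testFolder = parts.length >= 2 ? parts[parts.length - 2] : 'root';
    steps.push({ from: file, to: `${destRoot}/${kind}/${testFolder}-${base}` });
  }
  return steps;
}
```

Prefixing with the per-test folder is the important design choice: flattening every `trace.zip` into one directory would otherwise overwrite all but the last one.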

A GitHub Actions example that actually preserves evidence

Here is a realistic GitHub Actions setup for storing Playwright traces, videos, screenshots, and the HTML report.

name: e2e

on:
  push:
  pull_request:

jobs:
  playwright:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test
        env:
          CI: true
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: playwright-artifacts
          path: |
            playwright-report/
            test-results/
          retention-days: 7

A few details matter here:

if: always() ensures you upload artifacts even if the test step fails
retention-days should match your triage window, not just a default
test-results/ is where Playwright stores many failure artifacts by default
playwright-report/ keeps the HTML report available for later inspection

If you use matrix builds, include the OS and browser in the artifact name so the output does not overwrite itself.

- uses: actions/upload-artifact@v4
  if: always()
  with:
    name: playwright-artifacts-${{ matrix.os }}-${{ matrix.browser }}
    path: |
      playwright-report/
      test-results/

Make the failure output easy to navigate

The raw files are useful, but triage gets faster when the artifacts are organized in a predictable way. I like the following conventions:

one test run, one artifact bundle
include the CI run number or commit SHA in the artifact name
keep the browser name in the path or artifact name
preserve the test title or file name in the artifact structure

For example, a failure bundle might look like this:

playwright-artifacts-chromium-1042/
  playwright-report/
  test-results/
    checkout.spec.ts-payment-flow-fails/
      trace.zip
      screenshot.png
      video.webm

That structure means a developer can move from a failed job to the exact test evidence without scrolling through unrelated files.
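If you generate artifact names from a script rather than inline in the workflow, a tiny helper keeps the convention consistent. This sketch is illustrative: `artifactBundleName` and `BundleInfo` are hypothetical, and the character filter is a conservative choice rather than any CI provider's exact rule.

```typescript
// Hypothetical helper: build a unique, readable artifact bundle name.
interface BundleInfo {
  browser: string;
  os: string;
  runId: string; // For example, a CI run number or a short commit SHA.
}

function artifactBundleName({ browser, os, runId }: BundleInfo): string {
  return ['playwright-artifacts', browser, os, runId]
    .map((part) => part.trim().toLowerCase())
    // Collapse anything outside a safe character set into a single dash.
    .map((part) => part.replace(/[^a-z0-9._-]+/g, '-'))
    .join('-');
}
```

In GitHub Actions, `runId` could come from the `GITHUB_RUN_NUMBER` environment variable, which gives names like `playwright-artifacts-chromium-ubuntu-latest-1042`.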

When to store traces on every run

There are cases where storing traces on every run makes sense:

a tiny suite where artifact volume is trivial
a critical flow with a high business cost of failure
short-lived debugging during a release stabilization period
compliance requirements that demand a durable audit trail

But be careful, because always-on traces can become expensive in three ways:

Storage growth, especially for large suites or long-running jobs
Upload latency, which slows the pipeline
Signal dilution, because passing artifacts can hide the failures that matter

If you do decide to store traces always, I would still separate long-term retention from immediate triage. For example, keep full artifacts for 7 days, but keep a compact report or summary for longer if you need trend analysis.

Screenshots are useful, but they are not enough by themselves

Screenshots are fast, cheap, and often good enough for a visual assertion failure. They are also easy to attach to a CI report. But they have a narrow field of view.

A screenshot tells you what was on the page at a single instant. It does not tell you:

whether the page loaded slowly
whether the wrong request returned data
whether an overlay was present briefly
whether the click happened before the page became actionable
whether an assertion failed because the DOM changed milliseconds earlier

That is why I treat screenshots as a companion artifact, not the primary debugging record.

In Playwright, this still works well for specific checks:

import { test, expect } from '@playwright/test';

test('checkout summary stays visible', async ({ page }) => {
  await page.goto('/checkout');
  await expect(page.getByRole('heading', { name: 'Order summary' })).toBeVisible();
  await expect(page).toHaveScreenshot('checkout-summary.png');
});

When the screenshot fails, the trace is usually what explains why.

Videos help when timing is the bug

I like video artifacts when a failure seems related to animation, drag and drop, modal timing, focus management, or delayed rendering. A video is often the quickest way to answer questions like these:

Did the button exist when the click happened?
Did the spinner disappear too early?
Did the page scroll unexpectedly?
Did a toast cover a control?

Videos are especially useful when a trace tells you what happened, but not quite how the user perceived it. That visual layer matters for UI problems with motion or state transitions.

The downside is size. Videos can become the heaviest artifact in the pipeline, so I usually keep them only on failure. If your suite is large, storing every video for every passing run is often not worth it.
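When only some tests need video, Playwright lets you override options for a file or a `describe` group with `test.use`. A sketch, assuming a hypothetical drag-and-drop board at `/board` with a "Done" region and a card titled "Write release notes":

```typescript
import { test, expect } from '@playwright/test';

test.describe('board drag and drop', () => {
  // Override the global config only for this group: keep video from failing runs.
  test.use({ video: 'retain-on-failure' });

  test('card moves to the Done column', async ({ page }) => {
    await page.goto('/board');
    const done = page.getByRole('region', { name: 'Done' });
    await page.getByText('Write release notes').dragTo(done);
    await expect(done.getByText('Write release notes')).toBeVisible();
  });
});
```

The rest of the suite can keep `video: 'off'` globally, so only the motion-heavy tests pay the storage cost.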

A better triage loop starts in the test itself

The fastest way to make artifact storage useful is to label tests well. Good test names and scoped steps are a triage multiplier.

If the test output says only “should work”, the artifact may still be useful, but the person investigating has extra work. If the test name says exactly what user journey failed, the screenshot or trace can be interpreted much faster.

I also like to use test.step for critical interactions, because those step labels appear in the trace and make it easier to identify where the failure occurred.

import { test, expect } from '@playwright/test';

test('user can place an order', async ({ page }) => {
  await test.step('open checkout page', async () => {
    await page.goto('/checkout');
  });

  await test.step('submit payment details', async () => {
    await page.getByLabel('Card number').fill('4242424242424242');
    await page.getByRole('button', { name: 'Pay now' }).click();
  });

  await expect(page.getByText('Order confirmed')).toBeVisible();
});

That kind of structure turns the trace into a readable story instead of a pile of actions.

Keep the artifacts close to the failure, but not too close to the code

One mistake I see in teams is attaching files only to the test code repository, while the actual CI run data lives somewhere else and expires quickly. Another mistake is the opposite, keeping everything in generic blob storage with no easy link back to the build that generated it.

A good setup usually has three layers:

CI job summary, which gives immediate access to the current run’s artifacts
longer-term storage, which holds the last few runs or failure bundles
test reporting system, which indexes failed tests, reruns, and evidence links

That way, if a developer sees a flaky test in the morning, they can inspect the latest failure without reconstructing the pipeline by hand.

How much retention do you really need?

Artifact retention is a policy decision, not just a storage setting. Ask these questions:

How long does it usually take someone to investigate a failure?
Do failures get discussed in the same day, or do they sit until the next sprint?
Do you need artifact history for compliance or postmortems?
Does the team rerun failures often, making older evidence less useful?

For many teams, 7 to 14 days is enough for CI failure artifacts. For regulated systems or release-heavy teams, you may need longer. What matters is being deliberate. If you keep everything forever, people stop trusting the artifact system because it becomes too noisy and too expensive.

Common pitfalls that slow triage down

Here are the mistakes I see most often when teams try to store Playwright artifacts in CI:

1. Uploading artifacts only on success

This is the fastest way to make evidence disappear exactly when you need it.

2. Forgetting `if: always()` or equivalent logic

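By default in GitHub Actions, a step runs only if the previous steps succeeded. Without `if: always()`, a failing test step causes the upload step to be skipped, and the job ends before the evidence is saved.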

3. Capturing too much

Always-on videos and traces for huge suites can make the pipeline slower and more expensive than necessary.

4. Capturing too little

A single screenshot without trace or test step context often forces a rerun.

5. Using unclear artifact names

If the artifact is called results.zip, nobody wants to open it.

6. Not matching artifact policy to suite behavior

A stable smoke suite and a flaky cross-browser suite should not necessarily share the same artifact retention rules.

When artifact storage and flake analysis work together

Flaky tests are often treated like a binary problem, pass or fail, but the evidence tells a richer story. If a test fails sporadically in CI, the trace can help answer whether the issue is:

a bad locator
a timing problem
an environment-specific rendering issue
a backend dependency that responded slowly
a genuine product regression

That is why I like pairing artifact capture with a small amount of metadata, such as browser, branch, commit SHA, and retry count. Once you can correlate failures across runs, debugging becomes pattern recognition instead of detective work.
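A minimal sketch of collecting that metadata, assuming GitHub Actions, which sets `GITHUB_REF_NAME` and `GITHUB_SHA`; other CI providers use different variable names. `collectFailureMetadata` and `FailureMetadata` are hypothetical.

```typescript
// Hypothetical helper: collect the run metadata worth attaching to a failure.
interface FailureMetadata {
  browser: string;
  branch: string;
  commitSha: string;
  retry: number;
}

function collectFailureMetadata(
  env: Record<string, string | undefined>,
  browser: string,
  retry: number,
): FailureMetadata {
  return {
    browser,
    // GitHub Actions variables; fall back to placeholders for local runs.
    branch: env.GITHUB_REF_NAME ?? 'local',
    commitSha: (env.GITHUB_SHA ?? 'unknown').slice(0, 7),
    retry,
  };
}
```

Inside a Playwright test or `afterEach` hook, `testInfo.project.name` and `testInfo.retry` could supply the last two arguments. Calling `testInfo.attach('run-metadata', { body: JSON.stringify(meta), contentType: 'application/json' })` would then store the result alongside the trace in the report.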

The more deterministic the artifact naming and retention policy, the less time people spend arguing about whether a failure is reproducible.

A practical checklist for your pipeline

If you want a quick implementation checklist, this is the one I use most often:

enable trace capture on failure or first retry
enable screenshots on failure
enable videos on failure for the tests that need them
upload playwright-report/ and test-results/ as CI artifacts
make the upload step run even when the test job fails
include browser, OS, and run identifier in artifact names
set a retention window that matches your triage process
keep artifact paths predictable so developers do not hunt for files
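Some of these items can fail silently. For example, `on-first-retry` records nothing when `retries` is 0, which is exactly the local setting in the CI-versus-local config earlier. Below is a minimal sketch of a check you could run against your settings. `auditArtifactSettings` is a hypothetical helper, not a Playwright API.

```typescript
// Hypothetical pre-flight check: flags artifact settings that capture nothing.
interface ArtifactSettings {
  trace: string;
  video: string;
  screenshot: string;
  retries: number;
}

function auditArtifactSettings(s: ArtifactSettings): string[] {
  const warnings: string[] = [];
  // Modes that only record on a retry attempt.
  const retryOnly = ['on-first-retry', 'on-all-retries'];
  if (retryOnly.includes(s.trace) && s.retries === 0) {
    warnings.push(`trace '${s.trace}' never records because retries is 0`);
  }
  if (retryOnly.includes(s.video) && s.retries === 0) {
    warnings.push(`video '${s.video}' never records because retries is 0`);
  }
  if (s.trace === 'off') warnings.push('trace is off, so failures have no timeline');
  if (s.screenshot === 'off') warnings.push('screenshot is off, so failures have no visual checkpoint');
  return warnings;
}
```

An empty array means the settings at least produce evidence when a test fails; it says nothing about upload or retention, which still live in the CI config.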

If your team uses multiple CI providers, the same principles still apply. The syntax changes, but the core idea stays the same: preserve evidence automatically and make the failure bundle easy to find.

My default recommendation

If I were setting this up for a team today, I would start with this policy:

traces: retain-on-failure
videos: retain-on-failure
screenshots: only-on-failure
upload artifacts on every failed job
keep a 7-day retention window initially
name bundles by browser and run ID

That setup gives you enough debugging evidence to stop rerunning failures blindly, without turning the pipeline into an artifact warehouse.

Once the team starts using the traces consistently, you can tune the policy based on real behavior. If most failures are visual, keep the screenshots and videos. If most failures are network or selector problems, the trace will do most of the work. If storage is growing too fast, tighten retention or reduce artifact capture on stable suites.

Closing thought

The real value of Playwright artifacts is not just that they exist; it is that they shorten the path from failure to explanation. When you store Playwright traces in CI, keep videos and screenshots alongside them, and upload them in a predictable way, triage stops feeling like archaeology.

You do not need a huge observability stack to get there. You need a consistent policy, a reliable upload step, and enough discipline to keep the evidence attached to the failure that produced it.

That is usually the difference between a flaky test that sits unresolved for days and a test that gets fixed before lunch.