How to Store Playwright Test Artifacts in CI So Failure Triage Is Actually Fast

When a Playwright test fails in CI, the failure itself is only half the problem. The real cost usually comes later, when someone has to answer a simple question: what actually happened? If the only evidence is a red job name and a stack trace, triage turns into guesswork.

That is why I care so much about how to store Playwright test artifacts in CI. The goal is not to collect everything just because the runner can. The goal is to capture the right evidence, organize it so the failure is easy to inspect, and make sure the artifacts are still available when someone needs them, not just when the pipeline is fresh in memory.

In this article, I will show how I think about artifacts for Playwright test runs, what to store, when to store it, and how to wire it into GitHub Actions. I will also cover a few tradeoffs that matter in real teams, because storing too much can be almost as painful as storing too little.

What counts as a useful Playwright artifact?

A useful artifact is anything that helps you answer one of these questions quickly:

What did the test see?
What page state led to the failure?
Was the problem in the test, the app, or the environment?
Can I reproduce this without guessing?

For Playwright, the most valuable artifact types are usually:

Trace files for step-by-step debugging in trace viewer
Screenshots on failure, and sometimes on assertion checkpoints
Videos for visual context and timing issues
Console logs from the browser context
Network logs or request failures, when API behavior matters
Test runner logs, including retry counts, annotations, and timing

If you only store screenshots, you often know what failed. If you store traces, you often know why.

That distinction matters. A screenshot can show that a login button was missing. A trace can show the element was never rendered because an upstream request returned 500, a modal intercepted the click, or the test navigated too early.

Not every test should produce the same artifacts. If you capture screenshots, videos, and traces on every single run, storage costs and upload times can grow quickly, especially in large suites.

A good default is:

Always keep traces for failed tests
Capture screenshots for failed tests
Capture video only for failed tests or retries
Keep logs for every run, but compress them
Store successful run artifacts selectively, only when needed for debugging or change detection

Playwright already gives you options that make this easier through its config. A solid baseline looks like this:

import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Record during every test, but keep the artifacts only when a test fails.
    trace: 'retain-on-failure',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
});

This is a practical starting point because it balances triage value with storage volume. For many teams, this is enough to turn a flaky build from opaque into debuggable.
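If local runs should stay lightweight while CI applies the policy above, one option is to derive the `use` options from the environment. The sketch below is only an illustration: `artifactOptions` is a hypothetical helper, the local defaults are my assumption, and it assumes your CI provider sets a `CI` environment variable (GitHub Actions does). The returned strings are standard Playwright modes for `trace`, `screenshot`, and `video`.

```typescript
// Hypothetical helper: chooses artifact modes for Playwright's `use` block.
type ArtifactOptions = {
  trace: 'off' | 'on' | 'retain-on-failure' | 'on-first-retry';
  screenshot: 'off' | 'on' | 'only-on-failure';
  video: 'off' | 'on' | 'retain-on-failure' | 'on-first-retry';
};

function artifactOptions(env: Record<string, string | undefined>): ArtifactOptions {
  // Any non-empty CI value counts as "running in CI" in this sketch.
  const inCI = Boolean(env.CI);
  return {
    trace: inCI ? 'retain-on-failure' : 'off',
    screenshot: 'only-on-failure',
    video: inCI ? 'retain-on-failure' : 'off',
  };
}
```

In a config file this could be wired in as `use: artifactOptions(process.env)`, so the policy lives in one place instead of being repeated per project.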

Why trace files should be your default debugging artifact

If I had to choose one artifact for CI triage, I would choose the trace file.

Playwright trace files preserve a rich timeline of the test run, including actions, locators, DOM snapshots, network activity, console output, and timings. That makes them ideal for failure analysis, especially when the issue is intermittent.

The trace viewer lets you step through the test as it ran, which is often much faster than reading a long CI log. It answers common debugging questions such as:

Did the locator match the right element?
Was the page still loading?
Did the click fail because of an overlay?
Did an API request return the wrong data?

You can enable tracing manually in a test, or let the config handle it automatically for failures.

import { test } from '@playwright/test';

test('example flow', async ({ page }) => {
  await page.goto('https://example.com');
  await page.getByRole('button', { name: 'Submit' }).click();
});

The key is not the test code itself but the policy around traces. In CI, I prefer a rule that says, “always keep the trace when a test fails or is retried.” That means every failure has a forensic trail.
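For the manual route, a small wrapper can guarantee that the trace is written even when the test body throws. This is a minimal sketch, not the only way to do it: `withTrace` and `TracingLike` are hypothetical names, and the interface lists only the `start` and `stop` calls that Playwright's `context.tracing` provides. Manual tracing like this is usually meant for setups where the config-level `trace` option is off, since the two may conflict.

```typescript
// Minimal structural type: only the two tracing calls this sketch relies on.
// Playwright's `context.tracing` is expected to satisfy it.
interface TracingLike {
  start(options: { screenshots?: boolean; snapshots?: boolean }): Promise<void>;
  stop(options?: { path?: string }): Promise<void>;
}

// Hypothetical helper: records a trace around `body` and always writes it
// to `tracePath`, whether `body` resolves or throws.
async function withTrace<T>(
  tracing: TracingLike,
  tracePath: string,
  body: () => Promise<T>,
): Promise<T> {
  await tracing.start({ screenshots: true, snapshots: true });
  try {
    return await body();
  } finally {
    await tracing.stop({ path: tracePath });
  }
}
```

Inside a test, this could be called as `await withTrace(context.tracing, testInfo.outputPath('manual-trace.zip'), async () => { ... })`, which keeps the trace next to the other outputs of that attempt.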

Store artifacts with the failure, not in a separate mystery bucket

The fastest triage flow is the one where a failing test and its evidence sit together.

A common mistake is dumping every screenshot and trace into one generic folder or one artifact zip for the whole job. That sounds organized at first, but it becomes annoying when one test fails and you need to search through dozens of similarly named files.

A better structure is to organize by:

CI run id
browser name
test file or test title
retry attempt

For example:

artifacts/
  run-1842/
    chromium/
      checkout.spec.ts/
        retry-0/
          trace.zip
          screenshot.png
          console.log
        retry-1/
          trace.zip
          screenshot.png

This kind of layout makes it obvious which attempt produced which artifact. That matters a lot when a test passes on retry. You want to inspect the failed attempt, not the successful retry that hides the problem.
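The layout above can be produced by a small path builder. The following is a sketch under assumptions: `AttemptId`, `safeSegment`, and `artifactDir` are hypothetical names, and the sanitizing rule is just one reasonable choice.

```typescript
// Hypothetical shape for the values that identify one test attempt.
interface AttemptId {
  runId: string;
  browser: string;
  testFile: string;
  retry: number;
}

// Replace runs of characters that are awkward in file paths with '-'.
function safeSegment(value: string): string {
  return value.replace(/[^A-Za-z0-9._-]+/g, '-');
}

// Build "artifacts/run-<id>/<browser>/<test file>/retry-<n>".
function artifactDir(id: AttemptId): string {
  return [
    'artifacts',
    `run-${safeSegment(id.runId)}`,
    safeSegment(id.browser),
    safeSegment(id.testFile),
    `retry-${id.retry}`,
  ].join('/');
}
```

In a Playwright run, the inputs could plausibly come from `testInfo.project.name`, the base name of `testInfo.file`, `testInfo.retry`, and the `GITHUB_RUN_ID` environment variable.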

GitHub Actions artifact upload pattern

If you use GitHub Actions, the built-in artifact upload step is usually enough for most teams. The important part is making sure uploads happen even when tests fail, and making the artifact names descriptive enough to match the run.

A practical workflow often looks like this:

name: e2e

on: [push, pull_request]

jobs:
  playwright:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test
      - uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: playwright-artifacts-${{ github.run_id }}
          path: |
            playwright-report/
            test-results/
          retention-days: 14

A few things to notice here:

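if: failure() runs the upload only when an earlier step failed, which controls volume. The tradeoff is that a test that fails and then passes on retry leaves the job green, so its artifacts are not uploaded. If you want evidence from those flaky attempts too, a condition such as if: ${{ !cancelled() }} uploads on every completed run instead.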
The artifact name includes github.run_id, which makes it easier to correlate with the job.
playwright-report/ (written by the HTML reporter, if you have it enabled) and test-results/ usually contain enough data to start triage.
retention-days should match how long your team typically needs to investigate failures.

If your team spends time on slow-moving flakes, 7 days may be too short. If your failure rate is low and your debug cycle is fast, 30 days may be unnecessary. Retention should match your real triage window, not a guess.

Capture logs that help explain the failure, not just the stack trace

Stack traces are useful, but they rarely explain browser context. For that, I like to add structured logging around the test and a browser console hook in Playwright.

import { test } from '@playwright/test';

test.beforeEach(async ({ page }) => {
  // Mirror browser console messages into the CI job log.
  page.on('console', msg => {
    console.log(`[browser:${msg.type()}] ${msg.text()}`);
  });
});

This gives you browser-side messages in the CI logs, which can be incredibly helpful when a frontend app logs warnings or errors before the test fails.
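If the job log gets too noisy, the same messages can be buffered and attached to the failing test instead. Here is a minimal sketch, assuming a hypothetical `ConsoleBuffer` class. `ConsoleMessageLike` lists only the `type()` and `text()` methods used by the hook above, and the filter assumes Playwright reports warnings with the type `warning`.

```typescript
// Only the two methods this sketch needs from a browser console message.
interface ConsoleMessageLike {
  type(): string;
  text(): string;
}

// Hypothetical collector: stores formatted console lines per test.
class ConsoleBuffer {
  private lines: string[] = [];

  // Arrow property so it can be passed directly as an event listener.
  record = (msg: ConsoleMessageLike): void => {
    this.lines.push(`[browser:${msg.type()}] ${msg.text()}`);
  };

  // Only the lines that usually matter for triage.
  problems(): string[] {
    return this.lines.filter((line) => /^\[browser:(error|warning)\]/.test(line));
  }

  // Everything recorded so far, one message per line.
  dump(): string {
    return this.lines.join('\n');
  }
}
```

A test could register it with `page.on('console', buffer.record)` and, on failure, pass `buffer.dump()` as the `body` of a `testInfo.attach` call with `contentType: 'text/plain'`.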

I also recommend logging the key app state at the point of failure, when it is safe to do so. That can mean:

current URL
authenticated user or role
feature flag state
API response status for the relevant request
the test retry count

The goal is not verbose logging everywhere. The goal is to make the failure self-describing.
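One way to make a failure self-describing is to emit the state listed above as a single greppable line. This is a sketch under assumptions: `FailureContext` and `describeFailure` are hypothetical names, and how you obtain the role, flags, and response status depends entirely on your app.

```typescript
// Hypothetical snapshot of app state at the moment a test fails.
interface FailureContext {
  url: string;
  role: string;
  featureFlags: Record<string, boolean>;
  apiStatus: number | null; // null when no relevant request was observed
  retry: number;
}

// Render the context as one key=value line with flags in a stable order.
function describeFailure(ctx: FailureContext): string {
  const flags = Object.keys(ctx.featureFlags)
    .sort()
    .map((name) => `${name}=${ctx.featureFlags[name]}`)
    .join(',');
  return [
    `url=${ctx.url}`,
    `role=${ctx.role}`,
    `flags=${flags || 'none'}`,
    `apiStatus=${ctx.apiStatus === null ? 'n/a' : ctx.apiStatus}`,
    `retry=${ctx.retry}`,
  ].join(' ');
}
```

Sorting the flag names keeps the line stable between runs, which makes it easier to diff the context of a failing attempt against a passing one.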

Decide which artifacts deserve screenshots and videos

Screenshots and videos are not interchangeable.

Use screenshots when:

you want a fast visual snapshot of the failure
an assertion compares expected text or layout
a modal, toast, or missing component is involved
you need a simple artifact for bug reports or PR comments

Use videos when:

timing or animation is involved
the issue is hard to capture in a single frame
drag-and-drop, hover, or scrolling behavior matters
the team needs to see the sequence of events

That said, video can become expensive if you store it for every test. If your suite is large, a smart policy is to only retain video for failures and retries, or for a tagged subset of high-risk tests.

A single screenshot can be enough to confirm a layout regression. A video is more useful when the failure is about motion, timing, or interaction order.
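The "tagged subset" policy could be expressed as two Playwright projects that differ only in their video mode. The sketch below only builds plain objects: `videoProjects` and `ProjectSketch` are hypothetical names, the `@high-risk` tag is an assumption, and the fields mirror the `name`, `grep`, `grepInvert`, and `use` project options.

```typescript
// Subset of a Playwright project definition used by this sketch.
interface ProjectSketch {
  name: string;
  grep?: RegExp;
  grepInvert?: RegExp;
  use: { video: 'off' | 'on' | 'retain-on-failure' | 'on-first-retry' };
}

// Tests carrying `tag` get video on failure; every other test records none.
function videoProjects(tag: string): ProjectSketch[] {
  // Escape regex metacharacters so the tag is matched literally.
  const pattern = new RegExp(tag.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
  return [
    { name: 'high-risk', grep: pattern, use: { video: 'retain-on-failure' } },
    { name: 'default', grepInvert: pattern, use: { video: 'off' } },
  ];
}
```

The result of `videoProjects('@high-risk')` could be spread into the `projects` array of the config, alongside whatever browser settings each project needs.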

Make retries visible in artifact naming

Retries are one of the places where teams accidentally lose the truth.

Imagine a test that fails on the first attempt, then passes on retry. Your job ends green, but the bug is still real. If you only keep the final successful state, you have no clue what happened.

That is why artifact naming should include the retry attempt number. Playwright exposes retry information through the test info object, so you can use it in filenames or directories.

import { test } from '@playwright/test';
import fs from 'fs';

// Example only, keep artifact naming consistent with your runner setup.
test.afterEach(async ({}, testInfo) => {
  // Only act when the attempt did not end in its expected status.
  if (testInfo.status !== testInfo.expectedStatus) {
    fs.mkdirSync(testInfo.outputDir, { recursive: true });
    await testInfo.attach('failure-log', {
      body: Buffer.from(`failed on retry ${testInfo.retry}`),
      contentType: 'text/plain',
    });
  }
});

The exact implementation can vary, but the principle is consistent: preserve the failed attempt, not just the final status.
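To make that principle concrete, here is a small sketch that decides which attempts to keep evidence for and flags a test as flaky when it failed before passing. The names `Attempt`, `failedAttempts`, and `isFlaky` are hypothetical. The status strings match the values Playwright reports on `testInfo.status`.

```typescript
type AttemptStatus = 'passed' | 'failed' | 'timedOut' | 'skipped' | 'interrupted';

// One entry per attempt of the same test, ordered by retry number.
interface Attempt {
  retry: number;
  status: AttemptStatus;
}

// Retry numbers whose artifacts should be preserved.
function failedAttempts(attempts: Attempt[]): number[] {
  return attempts
    .filter((a) => a.status !== 'passed' && a.status !== 'skipped')
    .map((a) => a.retry);
}

// A test is flaky when its last attempt passed but an earlier one did not.
function isFlaky(attempts: Attempt[]): boolean {
  const last = attempts[attempts.length - 1];
  return last !== undefined && last.status === 'passed' && failedAttempts(attempts).length > 0;
}
```

A custom reporter or a post-processing script could use a check like this to surface green-but-flaky tests together with the retry numbers worth inspecting.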

Keep the artifact pipeline boring

The best artifact pipeline is the one no one has to think about during a bad day.

That means:

artifact collection is automatic
uploads happen even when the job fails
naming is predictable
retention is defined
the team knows where to find the evidence

If your current setup requires a developer to manually download a zip, rename files, and ask around for the right build, it is too fragile.

I also like to keep the output paths stable across environments. If local test runs and CI runs write artifacts to different directories, troubleshooting gets slower. Try to use the same conventions in both places, even if the upload step differs.

Common mistakes that slow down triage

Here are the mistakes I see most often when teams try to store Playwright test artifacts in CI.

1. Uploading everything on every run

This creates storage noise and makes it hard to find the failures that matter.

2. Saving only screenshots

Screenshots help, but they rarely tell the full story.

3. Not keeping failed retries

This erases the exact evidence you need to understand flakiness.

4. Using generic artifact names

Names like output.zip or results are not enough once the team grows.

5. Ignoring logs and console output

A browser console error or failed network request is often the real clue.

6. Short retention windows

If your team triages flaky tests after the PR closes, you need artifacts to stick around long enough.

A practical storage policy for most teams

If I were setting this up for a team starting from scratch, I would use this policy:

Trace: always retain on failure
Screenshot: retain on failure
Video: retain on failure and retry
Console and runner logs: always keep in job output, and optionally attach as text artifacts
Artifact retention: 7 to 14 days for most teams, longer for slower triage cycles
Folder structure: grouped by run id, browser, test file, and retry

This policy gives you enough detail to debug flaky failures without turning your CI into an archive server.

When to go beyond Playwright defaults

Playwright’s built-in artifact handling is good, but some teams eventually need more control.

You might need custom handling if:

your CI system has artifact size limits
you want to redact sensitive data before upload
you need to send artifacts to object storage instead of CI storage
you want to enrich artifacts with metadata, like PR number or feature flag state
you need to correlate failures across multiple services or jobs

At that point, you can write a thin layer around Playwright that collects extra context and uploads it to your storage backend of choice. The important part stays the same: keep the failure evidence close to the failure and easy to navigate.
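As an example of such a layer, redaction of text logs before upload might look like the sketch below. The patterns are illustrative assumptions, not a complete list, and `redact` is a hypothetical helper. It covers plain-text logs only, so trace archives and videos would need separate handling.

```typescript
// Hypothetical redaction rules; extend these for your own secrets.
const REDACTIONS: Array<[RegExp, string]> = [
  // Authorization header values such as "Bearer abc.def"
  [/(Bearer\s+)[A-Za-z0-9._~+\/=-]+/g, '$1[REDACTED]'],
  // Query-string or form values for a few sensitive-looking keys
  [/((?:password|token|api_key)=)[^&\s]+/gi, '$1[REDACTED]'],
];

// Apply every rule in order and return the cleaned text.
function redact(text: string): string {
  return REDACTIONS.reduce(
    (out, [pattern, replacement]) => out.replace(pattern, replacement),
    text,
  );
}
```

Running the log text through a function like this right before it is attached or uploaded keeps the rest of the pipeline unchanged.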

A note on flakiness and reproducibility

Artifact storage does not fix flaky tests by itself, but it shortens the feedback loop so you can actually fix them.

A trace can reveal that the app was not ready, a selector was too brittle, or a dependency returned inconsistent data. A screenshot can reveal layout overlap. Logs can show that a request timed out only on one browser. Once you see the pattern, you can usually decide whether the right fix is a better wait, a stronger locator, a test data cleanup, or an application change.

That is why I treat artifact strategy as part of test design, not just CI plumbing.

Where this fits if you are evaluating other platforms

If your team is still deciding whether to keep building on Playwright or move some workflows into a managed platform, the artifact question is a useful lens. Some teams want full control over CI and debugging, while others prefer a platform that combines execution, reporting, and maintenance features in one place. For example, Endtest is one alternative that uses agentic AI and includes self-healing behavior plus reporting workflows, which can reduce some of the manual maintenance burden if your team is looking for a broader platform rather than a library-first approach.

Final checklist

Before you call your CI setup done, make sure you can answer these questions quickly:

Can I find the failed test’s trace in one click?
Do I know which retry failed?
Are screenshots and logs stored with the same run?
Does the artifact name include enough context to identify the browser and test?
Are failure artifacts retained long enough for real triage?
Can someone on the team inspect the issue without asking the original author?

If the answer to any of those is no, you still have work to do.

Good artifact handling is one of those unglamorous engineering habits that pays off every week. Once you can reliably store Playwright test artifacts in CI, your failures stop being mysteries and start being tickets you can actually resolve.