If you maintain a real Playwright suite, you already know the pattern. The tests start clean, then a few months later they accumulate repeated login helpers, brittle selectors, inconsistent waits, and one-off workarounds for edge cases that no one remembers. That is where Claude can be genuinely useful, not as a magic replacement for engineering judgment, but as a refactoring assistant that can speed up the boring parts of test maintenance.

This article walks through how I use Claude to refactor Playwright tests, what it is good at, where it tends to produce unsafe changes, and how I validate the result before merging. I will also call out when the better answer is not more Playwright code at all, but a simpler approach such as Endtest, which uses agentic AI and a managed platform model to avoid the cycle of generating, refactoring, and repairing test code.

Why Playwright tests need refactoring in the first place

Playwright is a strong choice for browser automation, especially for modern web apps, and the official Playwright docs are clear about its capabilities and cross-browser coverage. But like any code-heavy test framework, it creates maintenance obligations.

Common refactoring triggers include:

  • Repeated setup and login code across many specs
  • Fragile selectors based on CSS structure or auto-generated classes
  • Long tests that mix setup, assertion, and business flow in one file
  • Custom utilities that drift from actual app behavior
  • Heavy use of waitForTimeout, which often signals uncertainty rather than synchronization
  • Assertions that are too weak, such as only checking that a button exists
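Some of these triggers leave textual traces, so a rough first pass can be automated before anything goes to Claude. The sketch below is a plain string scanner, not a parser. The rule names and regular expressions are my own assumptions, and it will miss selectors built dynamically or split across lines.

```typescript
// One finding per rule that matches a line.
interface SmellFinding {
  line: number;  // 1-based line number in the scanned source
  smell: string; // short label for the rule that matched
  text: string;  // the matching line, trimmed
}

// Hypothetical rule set; tune the patterns to your own suite.
const SMELL_RULES: { smell: string; pattern: RegExp }[] = [
  { smell: 'fixed sleep', pattern: /\.waitForTimeout\s*\(/ },
  { smell: 'ordinal selector', pattern: /:nth-child\(|:nth-of-type\(/ },
  // A locator string with three or more class/id segments separated by spaces
  { smell: 'deep CSS chain', pattern: /locator\(\s*['"`][^'"`]*(\s+[.#][\w-]+){2,}/ },
];

// Scans test source line by line and reports every rule that matches.
function findSmells(source: string): SmellFinding[] {
  const findings: SmellFinding[] = [];
  source.split('\n').forEach((text, index) => {
    SMELL_RULES.forEach((rule) => {
      if (rule.pattern.test(text)) {
        findings.push({ line: index + 1, smell: rule.smell, text: text.trim() });
      }
    });
  });
  return findings;
}
```

The output is only a list of places to look. Whether a flagged line is actually a problem still needs a human call.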

At scale, test maintenance becomes a product of two things: application churn and test design. Claude can help with both, but only if you feed it enough structure and keep a tight review loop.

AI is most useful for refactoring tests when the desired change is already clear. It is much less reliable when you want it to infer missing intent from a flaky or poorly designed test.

What Claude is good at in Playwright maintenance

When I ask Claude to refactor Playwright tests, I typically want one or more of these outcomes:

  1. Extract duplicate setup into fixtures or helpers
  2. Replace brittle selectors with role-based or text-based locators
  3. Simplify repeated assertions into reusable functions
  4. Reorganize a spec into smaller, more readable units
  5. Normalize test naming, structure, and conventions
  6. Update obsolete Playwright APIs or patterns

Claude is especially good at recognizing code smells across a test file or across a small group of files. For example, it can spot that three tests all perform the same authentication flow and suggest a shared beforeEach or custom fixture.
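The same kind of duplication can be confirmed mechanically before you ask for the extraction. This is a minimal sketch that compares trimmed lines across test bodies. The function name, the input shape, and the 10-character cutoff are assumptions for illustration, and it does not understand statements that span several lines.

```typescript
// Returns the trimmed statements that occur in at least `minTests` of the given test bodies.
function findSharedStatements(
  testBodies: { [testName: string]: string },
  minTests: number
): { statement: string; tests: string[] }[] {
  // Prototype-free objects so a line can never collide with built-in keys
  const seenIn: { [statement: string]: string[] } = Object.create(null);

  Object.keys(testBodies).forEach((name) => {
    const countedHere: { [statement: string]: boolean } = Object.create(null);
    testBodies[name].split('\n').forEach((raw) => {
      const statement = raw.trim();
      // Skip blanks, closing braces, and other lines too short to be meaningful,
      // and count each statement at most once per test
      if (statement.length < 10 || countedHere[statement]) return;
      countedHere[statement] = true;
      if (!seenIn[statement]) seenIn[statement] = [];
      seenIn[statement].push(name);
    });
  });

  return Object.keys(seenIn)
    .filter((statement) => seenIn[statement].length >= minTests)
    .map((statement) => ({ statement: statement, tests: seenIn[statement] }));
}
```

If the login lines show up in every test, that is a reasonable signal that a shared beforeEach or fixture is worth requesting.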

It is less trustworthy when the refactor depends on nuanced application behavior, hidden state, or test data dependencies that are not obvious from the code alone. That is why I treat Claude as a pair programmer, not an autonomous maintainer.

Start with a clean refactor target

Before you send code to Claude, decide what kind of refactor you actually want. Do not ask for “make this better.” That is how you get generic cleanup or a style pass that misses the real issue.

Use a concrete goal such as:

  • “Extract the login flow into a reusable fixture”
  • “Replace brittle CSS selectors with semantic locators”
  • “Split this end-to-end flow into setup, action, and assertion helpers”
  • “Reduce duplicated waits and remove unnecessary timeouts”

A good prompt includes:

  • The relevant Playwright file or files
  • The app context, if the test names are not obvious
  • The refactor goal
  • Any constraints, such as “do not change behavior” or “keep assertions equivalent”
  • Any team conventions, such as fixture style or naming rules

Here is the kind of prompt I use:

Refactor this Playwright test for readability and maintainability.

Goals:

  • Extract repeated login setup into a reusable helper or fixture
  • Replace brittle selectors with semantic locators when possible
  • Keep the test behavior the same
  • Do not introduce new dependencies
  • Preserve assertions, but feel free to reorganize them

After refactoring, explain the changes and point out any places where behavior may have changed.

That last sentence matters. If Claude changes behavior, I want it to say so explicitly.
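If you send this kind of prompt often, a small template function keeps the ingredients consistent. The sketch below is one possible shape. The `RefactorRequest` interface and `buildRefactorPrompt` are hypothetical names, and nothing here calls Claude; it only assembles the text.

```typescript
// Hypothetical input shape that mirrors the prompt ingredients listed earlier.
interface RefactorRequest {
  source: string;         // contents of the Playwright file
  goals: string[];        // concrete refactor goals
  constraints: string[];  // e.g. "Keep the test behavior the same"
  appContext?: string;    // optional background when test names are not obvious
  conventions?: string[]; // optional team rules, such as fixture style
}

// Assembles a plain-text prompt; sections without content are left out.
function buildRefactorPrompt(req: RefactorRequest): string {
  const bullets = (items: string[]): string =>
    items.map((item) => '- ' + item).join('\n');

  const sections: string[] = [
    'Refactor this Playwright test for readability and maintainability.',
    'Goals:\n' + bullets(req.goals),
    'Constraints:\n' + bullets(req.constraints),
  ];
  if (req.appContext) {
    sections.push('App context:\n' + req.appContext);
  }
  if (req.conventions && req.conventions.length > 0) {
    sections.push('Team conventions:\n' + bullets(req.conventions));
  }
  // Always ask for behavior changes to be called out explicitly
  sections.push(
    'After refactoring, explain the changes and point out any places where behavior may have changed.'
  );
  sections.push('Test file:\n' + req.source);
  return sections.join('\n\n');
}
```

Keeping the final instruction hard-coded means nobody on the team can forget to ask for behavior changes to be flagged.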

Example: refactoring a brittle Playwright test

Here is a simplified test that works, but is not pleasant to maintain:

import { test, expect } from '@playwright/test';

test('user can upgrade plan', async ({ page }) => {
  await page.goto('https://example-app.com/login');
  await page.locator('#email').fill('demo@example.com');
  await page.locator('#password').fill('secret123');
  await page.locator('button[type="submit"]').click();

  await page.goto('https://example-app.com/billing');
  await page.locator('.plan-card:nth-child(2) .select-plan').click();
  await page.locator('.modal .confirm').click();

  await expect(page.locator('.toast-success')).toContainText('Plan upgraded');
});

Problems I would flag immediately:

  • Login steps are embedded in the test body
  • Selectors rely on CSS structure and ordinal position
  • The flow mixes navigation, action, and assertion without structure
  • The test is fragile if the DOM changes

A better version might look like this:

import { test, expect, type Page } from '@playwright/test';

// Shared login flow; assumes the form fields expose accessible labels
async function login(page: Page) {
  await page.goto('https://example-app.com/login');
  await page.getByLabel('Email').fill('demo@example.com');
  await page.getByLabel('Password').fill('secret123');
  await page.getByRole('button', { name: 'Sign in' }).click();
}

test('user can upgrade plan', async ({ page }) => {
  await login(page);
  await page.goto('https://example-app.com/billing');

  await page.getByRole('button', { name: 'Select Pro plan' }).click();
  await page.getByRole('button', { name: 'Confirm upgrade' }).click();

  await expect(page.getByRole('status')).toHaveText(/plan upgraded/i);
});

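This is not just prettier. It is easier to reason about, easier to reuse, and more resistant to DOM churn. The key is that the refactor keeps behavior stable while improving the test surface. Note what it assumes, though: the form fields have accessible labels, the buttons carry those accessible names, and the toast is exposed with a status role. The assertion also moved from toContainText to a case-insensitive toHaveText pattern, which is a small change worth confirming rather than a pure rename.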

How I prompt Claude for a safer refactor

Claude is much more reliable when you constrain the output format. I usually ask for one of these:

  • A full rewritten file
  • A diff-style explanation with code blocks
  • A step-by-step refactor plan before code
  • A list of risky assumptions before the final rewrite

If I suspect the test may hide dependencies, I ask Claude to identify them first.

For example:

Before rewriting the test, list any assumptions about app behavior, selector stability, or test data that you are making. Then provide the refactored code.

This helps surface issues like:

  • The test depends on seeded data
  • A modal only appears after async backend work
  • A locator change may alter which element is clicked
  • A helper function may need to be shared across multiple specs

That prompt pattern is especially useful when you are doing AI-assisted test refactoring in a large suite, because hidden assumptions are where regressions begin.

Refactoring patterns Claude handles well

1. Extracting repeated flows into helpers

If three tests log in the same way, Claude can usually extract a helper without much trouble. I prefer a helper when the flow is common but still test-local.

import type { Page } from '@playwright/test';

// Relative URL: assumes baseURL is set in the Playwright config
async function signInAsDemo(page: Page) {
  await page.goto('/login');
  await page.getByLabel('Email').fill('demo@example.com');
  await page.getByLabel('Password').fill('secret123');
  await page.getByRole('button', { name: 'Sign in' }).click();
}

If the flow is broader, a fixture may be better than a helper, because it gives you lifecycle control and cleaner test signatures.

2. Replacing brittle locators

Claude is good at suggesting semantic locators when the page structure makes the intention clear.

Prefer:

  • getByRole
  • getByLabel
  • getByText
  • getByPlaceholder when appropriate

Avoid, when possible:

  • Deep CSS chains
  • nth-child selectors
  • Class names generated by build tools or component libraries

Playwright’s locator model is already designed to support resilient test code, so this is a natural refactor target.

3. Removing waitForTimeout

If your tests use arbitrary sleep calls, Claude can usually replace them with better synchronization, but this is one area where you need careful review. A sleep often hides a missing assertion or a state condition.

Instead of:

await page.waitForTimeout(2000);

look for a state-based wait, such as:

await expect(page.getByRole('dialog')).toBeVisible();

or:

await expect(page.getByText('Upload complete')).toBeVisible();

Claude can recommend these substitutions, but it cannot always infer the true readiness signal from code alone.

4. Consolidating assertions

Sometimes tests are full of low-value assertions that do not tell you whether the user journey succeeded. Claude can help move toward assertions that match user-visible outcomes.

A useful refactor often shifts from internal details to visible results, such as confirming a status message, a route change, or a rendered record in a table.

Where human review is non-negotiable

There are several situations where I never merge Claude’s refactor without a careful manual pass.

Selector changes can alter semantics

If Claude swaps a locator, I verify that it still targets the same user-facing element. For example, a button may have the same visible text as another button in a different area of the page. The code may still run, but the test now checks the wrong behavior.
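One way to make that check less manual is to compare what the old and new locators resolve to while both still exist in the code. The sketch below is a heuristic, not an identity test: it treats two locators as equivalent when each matches exactly one element and their bounding boxes coincide. `LocatorLike` is a structural stand-in I defined so the snippet compiles without Playwright installed. It uses only `count()` and `boundingBox()`, which Playwright locators provide.

```typescript
// Stand-in types (assumptions for this sketch); a Playwright Locator fits this shape.
interface Box { x: number; y: number; width: number; height: number }
interface LocatorLike {
  count(): Promise<number>;
  boundingBox(): Promise<Box | null>;
}

// Heuristic: true when both locators match exactly one element
// and those elements occupy the same rectangle on the page.
async function targetsSameElement(
  oldLocator: LocatorLike,
  newLocator: LocatorLike
): Promise<boolean> {
  // An ambiguous or empty match on either side is already a red flag
  if ((await oldLocator.count()) !== 1) return false;
  if ((await newLocator.count()) !== 1) return false;

  const oldBox = await oldLocator.boundingBox();
  const newBox = await newLocator.boundingBox();
  // boundingBox() resolves to null when the element is not visible
  if (oldBox === null || newBox === null) return false;

  return (
    oldBox.x === newBox.x &&
    oldBox.y === newBox.y &&
    oldBox.width === newBox.width &&
    oldBox.height === newBox.height
  );
}
```

During a refactor I would call this temporarily with the old CSS locator and the proposed role-based locator, then delete the check once the swap is confirmed. Nested elements can share a rectangle, so a true result is supporting evidence rather than proof.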

Shared helpers can create hidden coupling

A helper that looks elegant can accidentally bake in assumptions about auth state, viewport, feature flags, or seeded data. That coupling is not always obvious until another test starts failing.

Refactors can hide flaky behavior instead of fixing it

If a test was flaky because the app is genuinely unstable, moving code into a helper does not solve the problem. It only makes the file look cleaner. I still need to identify whether the failure is caused by timing, locator instability, backend dependency, or test data collision.

Coverage can silently shrink

Sometimes Claude removes a step that looked redundant, but that step was actually important for coverage. The code still reads well, but the test no longer exercises the user journey fully.

A refactor that improves readability but weakens coverage is a regression, even if the test still passes.
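A cheap guard against this is to compare which matchers appear before and after the refactor. The following is a rough text-based sketch, assuming matchers follow Playwright's `toXxx(...)` naming. The function names are mine, and it will also count unrelated methods that happen to start with `to` followed by a capital letter, such as `toString`.

```typescript
// Counts calls that look like matchers: ".toSomething(" anywhere in the source.
function countMatchers(source: string): { [matcher: string]: number } {
  const counts: { [matcher: string]: number } = Object.create(null);
  const pattern = /\.(to[A-Z]\w*)\s*\(/g;
  let match: RegExpExecArray | null = pattern.exec(source);
  while (match !== null) {
    const name = match[1];
    counts[name] = (counts[name] || 0) + 1;
    match = pattern.exec(source);
  }
  return counts;
}

// Lists matchers that occur fewer times after the refactor than before, sorted by name.
function droppedMatchers(before: string, after: string): string[] {
  const beforeCounts = countMatchers(before);
  const afterCounts = countMatchers(after);
  return Object.keys(beforeCounts)
    .filter((name) => (afterCounts[name] || 0) < beforeCounts[name])
    .sort();
}
```

A swapped matcher shows up as a drop as well, for example toContainText replaced by toHaveText. That is intentional: each entry is a prompt to confirm the change was deliberate, not a verdict that coverage was lost.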

A practical review checklist after Claude refactors Playwright tests

When Claude gives me a rewritten test, I check the following before merging:

  • Does the test still reflect the original behavior?
  • Are the new locators semantically correct and unique enough?
  • Did any assertions disappear, and if so, was that intentional?
  • Did shared helpers increase coupling or reduce clarity?
  • Did the refactor introduce hidden test data requirements?
  • Are waits based on application state instead of time?
  • Will this change help the next person debug a failure faster?

I also run the test in the same CI context where it normally lives, not just locally. A refactor that passes on my machine but fails in pipeline timing conditions is not done.

Using Claude for broader Playwright maintenance

Claude is not only useful for one-off refactors. It can help with maintenance tasks across a suite:

  • Standardizing fixture names
  • Turning repeated setup into shared utilities
  • Renaming test descriptions for consistency
  • Migrating older Playwright syntax to current patterns
  • Cleaning up inconsistent assertion styles
  • Identifying tests that should be split or deleted

This is where AI-assisted refactoring becomes valuable at scale. The time savings come from reducing repetitive mechanical edits, not from outsourcing judgment.

A good workflow is:

  1. Ask Claude to summarize the test smell
  2. Ask for a refactor plan
  3. Request the code change
  4. Review the behavioral risks
  5. Run the updated tests locally and in CI
  6. Commit only after the suite proves the refactor is safe

Claude is helpful, but Playwright can still be the wrong maintenance model

If your team has strong engineering ownership and you want code-level control, Playwright plus Claude can be a productive combination. But there is a deeper question worth asking, especially for QA leads and founders: do you actually want to keep maintaining browser automation code at all?

That is where a managed platform can be simpler. Endtest is positioned as a Playwright alternative for teams that do not want to own the full code, framework, and maintenance cycle. Its AI Test Creation Agent creates editable Endtest tests from plain-English scenarios, and its agentic approach extends into maintenance, not just creation.

The difference matters because Playwright plus Claude still leaves you with:

  • Framework decisions
  • Selector strategy
  • Test data management
  • CI wiring
  • Browser execution setup
  • Refactoring and repair cycles as the app changes

Endtest changes the shape of that work. Instead of generating code, refactoring it, and repairing it later, the platform gives you editable test steps in its own environment, with self-healing tests that can recover when locators break due to UI changes. That is a very different maintenance story.

For teams that want direct code ownership, Claude is a useful accelerator. For teams that want less framework overhead, a platform like Endtest can be the simpler path because it reduces the number of times you need to touch the automation implementation at all.

When I would choose Claude plus Playwright

I would lean into Claude-assisted refactoring if:

  • The team already owns a mature Playwright suite
  • Developers are comfortable reviewing test code changes
  • You need flexibility that a code framework provides
  • Your main pain is maintenance, not framework adoption
  • You want to migrate toward cleaner test architecture over time

This setup works well for teams that treat test code like production code, with reviews, conventions, and CI discipline.

When I would choose a simpler platform instead

I would seriously consider Endtest if:

  • The team does not want to own test framework infrastructure
  • QA and product people need to author tests without code
  • Locator churn is the main source of wasted time
  • You want less dependency on TypeScript skills for every change
  • You prefer a managed platform over a codebase that must be refactored continuously

In other words, if you find yourself repeatedly asking Claude to repair brittle Playwright tests, the deeper problem may not be prompt quality. It may be that your team is using a code-heavy model where a lower-maintenance platform would fit better.

A balanced recommendation

Claude is a strong assistant for refactoring Playwright tests, especially when the change is structural, repetitive, and well-scoped. It can help you clean up selectors, extract helpers, remove duplication, and improve readability. But it does not replace a human who understands the app, the product risk, and the meaning of the assertions.

My rule of thumb is simple:

  • Use Claude to speed up mechanical refactoring
  • Use human review to protect semantics and coverage
  • Use Playwright when code ownership is a feature, not a burden
  • Use a platform like Endtest when you want to reduce the maintenance loop itself

If your suite is growing messy, start with one test file and one specific refactor goal. Ask Claude to improve the structure, then verify that every original behavior is still exercised. That disciplined workflow gives you the benefit of AI without turning test quality into guesswork.

And if the real problem is that your team is tired of living inside the refactor-repair cycle, then the better fix may be to step back from code-first automation altogether.