How to Review AI-Generated Playwright Code

When AI writes a Playwright test, the first reaction is often relief. A page flow that would have taken 20 minutes to scaffold is suddenly on the screen, complete with selectors, assertions, and a happy path. The second reaction, at least in my experience as an SDET, should be skepticism.

AI can generate useful Playwright code quickly, but speed is not the same as correctness, and correctness is not the same as maintainability. A generated test may pass once and still be a bad test. It may exercise the right UI but assert the wrong thing. It may use locators that look sensible and still fail the moment the DOM shifts. It may be fine locally, then become noisy in CI.

That is why I treat reviewing AI-generated Playwright code as a distinct engineering skill, not just a code review variant. You are not only checking style. You are checking whether the test is stable, meaningful, and cheap to maintain six months from now.

What makes AI-generated Playwright code different

Tools like Playwright are already expressive enough to produce readable tests quickly. When a model such as Claude or another coding assistant writes Playwright for you, it usually does one of two things:

It gives you a decent first draft with obvious gaps.
It gives you something that looks polished but encodes assumptions the model cannot verify.

The review problem is subtle because AI-generated test code is often syntactically correct and semantically plausible. That makes it easier to trust than it should be.

A test is not good because it runs. A test is good because it fails for the right reasons and stays readable when the app evolves.

For a reviewer, the key question is not, “Did the model understand the request?” It is, “Did this code encode the right test intent using stable, maintainable mechanisms?”

My review checklist for AI-generated Playwright tests

When I review AI-generated Playwright code, I check the same areas every time:

Locators
Waits and synchronization
Assertions
Fixtures and setup
Cleanup and isolation
Data handling
CI behavior and flake resistance
Maintainability, especially around page objects and helpers

If the test is long or the app is complex, I also look at what the model did not include. Missing negative assertions, missing cleanup, and missing state resets are common weak spots in Playwright tests generated by Claude and similar assistants.

1) Start with the intent, not the code

Before I look at a single line, I ask what the test is trying to prove.

Examples:

Does the test verify that a user can sign up and reach the dashboard?
Does it prove that a checkout flow creates an order?
Does it confirm that a banner appears after a save?
Does it validate permissions or error handling?

A generated test often captures the flow but not the actual business rule. For example, a test might click “Submit”, wait for navigation, and assert that the URL changed. That is weaker than asserting that the confirmation state matches the business outcome.

If the test does not clearly map to a user-visible contract, I usually rewrite the assertion strategy before I accept the rest of the code.

2) Review locators for stability and user meaning

In my experience, locator quality is the strongest predictor of how much maintenance a Playwright test will need. It is also where AI-generated code often looks fine but ages poorly.

What I prefer

In order of preference, I generally want locators that reflect the user-facing structure:

getByRole
getByLabel
getByPlaceholder when appropriate
getByText only when text is stable and unique enough
test IDs as a last resort for genuinely hard-to-target elements

A good generated test might look like this:

typescript

await page.getByLabel('Email').fill('user@example.com');
await page.getByLabel('Password').fill('secret123');
await page.getByRole('button', { name: 'Sign in' }).click();

What I flag

I become suspicious when I see:

deep CSS chains like .page > div:nth-child(2) > ...
XPath that depends on structure rather than meaning
selectors tied to generated class names
text matches that are overly broad
nth() usage without a strong reason

This is where reviewing AI-generated Playwright code often becomes a maintenance exercise. The model may choose the first matching node, which is convenient for code generation but brittle in the real app.

If the locator requires a screenshot and a prayer to understand, it is probably not a good test locator.
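
Some of these patterns are mechanical enough to catch before a human reads the test. Here is a minimal sketch of a pre-review scan, assuming the test source is available as a string. The name findBrittleLocators and the rule list are my own illustration, not a Playwright API, and a regex pass like this will produce both misses and false positives, so I would treat its output as a prompt for review rather than a verdict.

```typescript
// Hypothetical pre-review helper: flags locator patterns that tend to age badly.
// The rules are heuristics chosen for illustration, not an exhaustive linter.
interface LocatorFinding {
  line: number; // 1-based line number in the scanned source
  rule: string; // short name of the heuristic that matched
  text: string; // the offending line, trimmed
}

const BRITTLE_LOCATOR_RULES: { rule: string; pattern: RegExp }[] = [
  { rule: 'nth-child chain', pattern: /:nth-child\(/ },
  { rule: 'xpath selector', pattern: /xpath=|locator\(\s*['"`]\/\// },
  { rule: 'positional nth()', pattern: /\.nth\(\s*\d+\s*\)/ },
  { rule: 'generated class name', pattern: /\.(css|sc)-[a-z0-9]{4,}/i },
];

function findBrittleLocators(source: string): LocatorFinding[] {
  const findings: LocatorFinding[] = [];
  source.split('\n').forEach((text, index) => {
    for (const { rule, pattern } of BRITTLE_LOCATOR_RULES) {
      // The patterns have no global flag, so test() keeps no state between lines.
      if (pattern.test(text)) {
        findings.push({ line: index + 1, rule, text: text.trim() });
      }
    }
  });
  return findings;
}
```

Running it over a generated file gives a short list of lines to look at first; role-based and label-based locators pass through untouched.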

Special note on dynamic UIs

For component libraries, virtualized tables, and dashboards with repeated controls, I look for row-level anchoring. A good pattern is to scope the search to a section, then locate by role inside that section.

typescript

const customerRow = page.getByRole('row', { name: /Acme Corp/ });
await customerRow.getByRole('button', { name: 'Edit' }).click();

That is much easier to maintain than selecting the third button on the page and hoping the layout stays still.

3) Check waits carefully, because AI loves to over-wait or under-wait

Playwright has strong auto-waiting, but it is not magic. AI-generated tests often add explicit waits in places where they are unnecessary, or skip them where they are actually needed.

Red flags

I usually remove or question:

waitForTimeout(...)
arbitrary sleeps between steps
waits for selectors that the next action already handles
repeated waits that hide a slow or unstable app state

A common anti-pattern looks like this:

typescript

await page.getByRole('button', { name: 'Save' }).click();
await page.waitForTimeout(3000);
await expect(page.getByText('Saved')).toBeVisible();

That test is slower than necessary, and it can still fail: if the "Saved" message is a toast that disappears within three seconds, the assertion never sees it. A better pattern is to wait on the actual outcome:

typescript

await page.getByRole('button', { name: 'Save' }).click();
await expect(page.getByRole('status')).toHaveText('Saved');
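
To put a number on the cost of guessed durations, a rough sketch like the following sums the literal waitForTimeout arguments in a test file. measureHardWaits is a hypothetical helper, not part of Playwright, and it only counts numeric literals, so waits passed through variables or constants are ignored.

```typescript
// Hypothetical helper: estimates how long a test file spends in fixed sleeps.
// Only literal numeric arguments to waitForTimeout(...) are counted.
interface HardWaitReport {
  count: number; // number of waitForTimeout calls with a literal duration
  totalMs: number; // sum of those durations in milliseconds
}

function measureHardWaits(source: string): HardWaitReport {
  // Created per call so the global flag's lastIndex always starts at zero.
  const pattern = /waitForTimeout\(\s*(\d[\d_]*)\s*\)/g;
  let count = 0;
  let totalMs = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    count += 1;
    // Strip numeric separators such as 3_000 before converting.
    totalMs += Number(match[1].replace(/_/g, ''));
  }
  return { count, totalMs };
}
```

A file that reports several seconds of fixed sleep is usually a file where the waits should be replaced with assertions on visible state.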

What I want instead

I prefer waits that reflect application state:

URL changes
dialog visibility
network-driven UI updates that resolve into visible state
meaningful DOM changes tied to the user action

If the test depends on a backend side effect, I sometimes add a dedicated assertion on the resulting state, or use expect.poll() for a bounded condition. The key is to wait for the thing that matters, not a guessed duration.

4) Assertions should prove business value, not just page presence

This is one of the most common weaknesses I see in generated tests. AI can create a flow that clicks through the app, but the assertions end up being shallow.

Weak assertions

These are technically valid but often not enough:

page title changed
URL contains a path fragment
element exists
toast appeared and disappeared

Better assertions

I look for assertions that answer the question, “What outcome did the user get?”

typescript

await expect(page.getByRole('heading', { name: 'Order confirmed' })).toBeVisible();
await expect(page.getByText('Thank you for your purchase')).toBeVisible();
await expect(page.getByRole('link', { name: 'View receipt' })).toBeVisible();

That is still UI-level, but it is meaningfully closer to the product behavior.

Negative assertions matter too

AI-generated tests often forget to assert the absence of errors. If the flow should succeed, I sometimes add checks like:

typescript

await expect(page.getByText('Invalid password')).not.toBeVisible();
await expect(page.getByRole('alert')).not.toContainText(/error/i);

Use negative assertions carefully, because they can become noisy if the app renders hidden elements or delayed notifications. But for critical paths, they are worth considering.

When text is not enough

If the UI is heavily visual or the assertion depends on broader context, I prefer a more semantic check. In some teams, this is a good place to use Endtest AI Assertions because the validation can be written in plain English instead of hard-coding a brittle string match. That is not a replacement for good Playwright assertions in code, but it can be a cleaner way to express intent when the exact UI text is less important than the behavior.

5) Fixtures and setup should be intentional

AI-generated Playwright code often includes setup that works locally but is awkward in a real suite. I pay close attention to fixtures because they determine how the test scales.

Questions I ask

Is the test using beforeEach when it should be isolated inside the test?
Does it share state across tests in a way that hides dependencies?
Is authentication handled through the UI every time, or can it use a stable authenticated state?
Are helper functions doing too much?

A better pattern for login state

For many suites, I prefer a setup project or storage state file rather than a UI login on every test run. That speeds up the suite and reduces noise from login page changes.

typescript

import { test } from '@playwright/test';

test.use({ storageState: 'auth.json' });

test('can create a project', async ({ page }) => {
  await page.goto('/projects/new');
  // test steps here
});

If an AI-generated test repeats login steps in every file, I usually refactor that out unless login itself is the subject of the test.
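
The setup project mentioned above is wired up in the Playwright config. Here is a minimal sketch, assuming a setup file such as auth.setup.ts that logs in once and saves the session with page.context().storageState({ path: 'auth.json' }). The file names and project names are assumptions. The config is written as a plain object so the sketch stands alone; in a real project it would usually be wrapped in defineConfig from @playwright/test.

```typescript
// playwright.config.ts (sketch). File names and project names are assumptions.
const config = {
  projects: [
    // Runs first: a file such as auth.setup.ts logs in once and saves the session.
    { name: 'setup', testMatch: /.*\.setup\.ts/ },
    {
      name: 'chromium',
      // Every test in this project starts from the saved, logged-in state.
      use: { storageState: 'auth.json' },
      // Makes sure the setup project has finished before these tests start.
      dependencies: ['setup'],
    },
  ],
};

export default config;
```

With this in place, individual test files no longer need a test.use call or a UI login, and a change to the login page breaks one setup file instead of every test.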

Be careful with hidden dependencies

A model may create a fixture that seeds data, logs in as a superuser, and assumes the app starts in a clean state. That can be fine for a single test, but in a suite it becomes fragile. Review the assumptions explicitly:

does the fixture create unique test data?
does it clean up after itself?
is the data scoped to the test, not the user account?

6) Cleanup is not optional

This is where generated tests often drift from “demo” into “maintenance trap”. If the test creates data, I want to know how that data is removed or isolated.

What to look for

API cleanup after UI setup
unique test data per run
data reset in test environment
teardown hooks that run reliably

If the test creates an account, order, ticket, or record, I ask how the next run avoids collisions. A generated test that uses the same email address every time is a classic red flag.

typescript

const email = `test-${Date.now()}@example.com`;

That is better than a hard-coded address, but timestamps can still collide in parallel runs. A unique ID or test-run prefix is usually safer.

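typescript

import { test } from '@playwright/test';

test('signs up with a unique email', async ({ page }) => {
  // test.info() is only available inside a running test; workerIndex separates parallel workers
  const email = `test-${test.info().workerIndex}-${Date.now()}@example.com`;
});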

Even better, if the app supports it, create and delete via API in a helper so the UI test focuses on the behavior under test.

7) Watch for over-abstracted page objects

AI assistants often generate page objects because they look like “good structure.” Sometimes they are. Sometimes they are just indirection for its own sake.

I like page objects when they:

centralize high-churn locators
provide a stable API for repeated workflows
keep tests readable without hiding important actions

I dislike them when they:

wrap every single click and assertion in a method
make the test harder to understand than the raw Playwright steps
hide business meaning behind generic method names like submitForm()

A test should still read like a user journey. If the page object turns the test into a puzzle, the abstraction is too heavy.

8) Audit the CI behavior, not just local execution

A test that passes locally can still be a bad citizen in CI. This is especially important when reviewing AI-generated Playwright code because models do not know your pipeline constraints unless you make them explicit.

Things I check for CI friendliness

Does the test assume headed mode?
Does it rely on a fixed viewport without reason?
Is tracing or video enabled where debugging will matter?
Does it use hard-coded timeouts that hide slowness?
Is the suite safe to run in parallel?
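
Several of these questions correspond to settings in the Playwright config. The values below are a sketch of what I would look for, not a recommendation for every pipeline; the retry count in particular is an assumption, and it is there only so that a trace gets recorded when a test fails once. As before, this is a plain object that a real playwright.config.ts would typically wrap in defineConfig from @playwright/test.

```typescript
// playwright.config.ts (sketch). Values are illustrative assumptions.
const config = {
  // Runs tests within a file in parallel too, which tends to expose shared state early.
  fullyParallel: true,
  // One retry so that 'on-first-retry' tracing has something to record.
  retries: 1,
  use: {
    // CI runners have no display, so tests must not assume headed mode.
    headless: true,
    // Record a trace only when a test is retried, to keep artifacts small.
    trace: 'on-first-retry',
    // Keep video and screenshots only for failures.
    video: 'retain-on-failure',
    screenshot: 'only-on-failure',
  },
};

export default config;
```

If a generated test only passes with these settings relaxed, for example with parallelism off, that is a finding about the test rather than about the config.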

Example CI setup

yaml

name: playwright
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test

That is fine as a starting point, but review the test code for assumptions about shared state, environment data, and the order of execution.

Flakiness signals in generated code

I get nervous when I see these patterns:

tests depend on previous tests
test.describe.serial is used to paper over shared state
errors are swallowed with try/catch and no rethrow
assertions are placed too early, before the UI is ready
retries are added without addressing the cause

Retries are a debugging aid, not a cure. If the code is flaky because of poor locators or bad synchronization, the review should fix the root cause.
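
One way to keep retries honest is to list which tests only passed on a later attempt, so they get investigated instead of quietly absorbed. The sketch below assumes a simplified result shape: TestRun and its attempts array are hypothetical and do not match any Playwright reporter format. It returns the titles of tests that failed at least once before passing.

```typescript
// Hypothetical result shape for illustration; real reporters use their own formats.
type AttemptStatus = 'passed' | 'failed';

interface TestRun {
  title: string;
  attempts: AttemptStatus[]; // one entry per attempt, in execution order
}

// A test is treated as flaky when it failed at least once and then passed.
function findFlakyTests(runs: TestRun[]): string[] {
  return runs
    .filter((run) => {
      const last = run.attempts[run.attempts.length - 1];
      return last === 'passed' && run.attempts.indexOf('failed') !== -1;
    })
    .map((run) => run.title);
}
```

Playwright's own reporters already label a test that passes on retry as flaky. The point of the sketch is that this list is a review queue, not a success count.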

9) Ask whether the generated code is actually maintainable

Maintenance is where AI-generated Playwright code either pays off or costs you later. When I review, I look for clues that the test will be easy to update.

Good maintainability signs

short tests with obvious intent
reusable helpers for repeated workflows
locators based on roles and labels
assertions close to the action they validate
minimal duplication

Bad maintainability signs

long linear scripts with many unrelated steps
many copied selectors
unclear helper names
assertions buried far from the action
magic constants without explanation

A short test with a clear name is often better than a clever abstraction. The goal is not code elegance for its own sake; it is survivable test code.

10) Decide when to rewrite versus accept

Not every AI-generated test needs a full rewrite. My rule of thumb is simple:

Accept it if the intent is correct, the locators are stable, and the assertions are meaningful.
Edit it if the structure is good but the details are brittle.
Rewrite it if the model misunderstood the user journey or embedded too much fragility.
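
Written out as a decision function, the rule of thumb looks roughly like this. The field names are my own shorthand, and modelling "too much fragility" as a single boolean is an assumption made to keep the sketch small; in practice that call is a judgment, not a flag.

```typescript
type ReviewDecision = 'accept' | 'edit' | 'rewrite';

// Hypothetical review signals; each one is a reviewer's judgment, not a measurement.
interface ReviewSignals {
  intentCorrect: boolean; // the test targets the right user journey
  locatorsStable: boolean; // locators rely on roles, labels, or other stable signals
  assertionsMeaningful: boolean; // assertions prove a user-visible outcome
  pervasiveFragility: boolean; // brittleness runs through the whole test
}

function decide(signals: ReviewSignals): ReviewDecision {
  // A misunderstood journey or fragility throughout is not worth patching.
  if (!signals.intentCorrect || signals.pervasiveFragility) return 'rewrite';
  // Right intent, stable locators, meaningful assertions: accept as is.
  if (signals.locatorsStable && signals.assertionsMeaningful) return 'accept';
  // Right intent and sound structure, but brittle details: edit in place.
  return 'edit';
}
```

The ordering matters: the rewrite check comes first, so a test with the wrong intent is never accepted on the strength of tidy locators.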

I am especially likely to rewrite when the generated code:

hard-codes timing assumptions
uses brittle selectors throughout
makes shallow assertions
duplicates setup in every test
mixes multiple scenarios into one long flow

If you are reviewing Playwright tests generated by Claude or another LLM, it helps to treat them like PRs from a junior engineer who is fast, helpful, and occasionally overconfident. The code deserves respect, but not blind trust.

A practical review rubric I use

Here is the quick rubric I use before approving AI-generated Playwright code:

Does the test validate a real user outcome?
Are locators based on accessible, stable signals?
Does the test avoid arbitrary sleeps?
Are assertions specific enough to fail for the right reason?
Is setup isolated and reusable?
Is cleanup handled or unnecessary by design?
Will this still be readable in six months?
Is it safe for CI and parallel runs?

If I cannot answer yes to most of these, the test is not ready, even if it passes.

Where Endtest fits better for some teams

I like Playwright when a team wants code-first control, custom logic, and close integration with a TypeScript or JavaScript stack. But not every team wants to review and maintain full generated test code. In many organizations, the real bottleneck is not authoring a test; it is reviewing, maintaining, and keeping everyone aligned on what the test actually does.

That is why I think Endtest is a stronger fit for teams that want AI-assisted authoring without inheriting as much code maintenance overhead. Its agentic AI Test Creation Agent generates editable, platform-native test steps instead of leaving the team to inspect a large block of source code. You can review the steps, adjust them, and keep the suite in a shared surface that is easier for testers, developers, and managers to reason about.

In other words, if your team is spending too much time reviewing AI-generated Playwright code line by line, a low-code workflow can be the more efficient path. That is also where features like self-healing locators and AI-driven validation become useful, because they reduce the amount of hand-edited maintenance work you carry forward. For teams already comparing options, Endtest’s Playwright comparison is worth a look because it frames the tradeoff clearly: code-first flexibility versus a more reviewable, editable test authoring model.

Final thoughts

To review AI-generated Playwright code well, you have to think like both a tester and a maintainer. The test must express the right user behavior, but it also has to survive real application change. That means paying attention to locators, synchronization, assertions, fixtures, cleanup, and CI behavior, not just syntax.

My default stance is simple: AI can draft the test, but humans should approve the intent, the resilience, and the maintenance cost. When the code is clean, that review is quick. When it is not, the review is the difference between a useful test and another flaky file that gets rerun until the team stops trusting it.

If your team prefers reviewing behavior over reading generated code, agentic AI test platforms can reduce a lot of that friction. For code-first teams, the same review discipline still applies. The only difference is that with Playwright, you own every line.
