June 7, 2026
How to Review AI-Generated Playwright Code
A practical SDET guide to reviewing AI-generated Playwright code, covering locators, waits, assertions, fixtures, cleanup, and CI behavior.
When AI writes a Playwright test, the first reaction is often relief. A page flow that would have taken 20 minutes to scaffold is suddenly on the screen, complete with selectors, assertions, and a happy path. The second reaction, at least in my experience as an SDET, should be skepticism.
AI can generate useful Playwright code quickly, but speed is not the same as correctness, and correctness is not the same as maintainability. A generated test may pass once and still be a bad test. It may exercise the right UI but assert the wrong thing. It may use locators that look sensible and still fail the moment the DOM shifts. It may be fine locally, then become noisy in CI.
That is why I treat reviewing AI-generated Playwright code as a distinct engineering skill, not just a code review variant. You are not only checking style. You are checking whether the test is stable, meaningful, and cheap to maintain six months from now.
What makes AI-generated Playwright code different
Tools like Playwright are already expressive enough to produce readable tests quickly. When a model such as Claude or another coding assistant writes Playwright for you, it usually does one of two things:
- It gives you a decent first draft with obvious gaps.
- It gives you something that looks polished but encodes assumptions the model cannot verify.
The review problem is subtle because AI-generated test code is often syntactically correct and semantically plausible. That makes it easier to trust than it should be.
A test is not good because it runs. A test is good because it fails for the right reasons and stays readable when the app evolves.
For a reviewer, the key question is not, “Did the model understand the request?” It is, “Did this code encode the right test intent using stable, maintainable mechanisms?”
My review checklist for AI-generated Playwright tests
When I review AI-generated Playwright code, I check the same areas every time:
- Locators
- Waits and synchronization
- Assertions
- Fixtures and setup
- Cleanup and isolation
- Data handling
- CI behavior and flake resistance
- Maintainability, especially around page objects and helpers
If the test is long or the app is complex, I also look at what the model left out. Missing negative assertions, missing cleanup, and missing state resets are common weak spots in Playwright tests generated by Claude and similar assistants.
1) Start with the intent, not the code
Before I look at a single line, I ask what the test is trying to prove.
Examples:
- Does the test verify that a user can sign up and reach the dashboard?
- Does it prove that a checkout flow creates an order?
- Does it confirm that a banner appears after a save?
- Does it validate permissions or error handling?
A generated test often captures the flow but not the actual business rule. For example, a test might click “Submit”, wait for navigation, and assert that the URL changed. That is weaker than asserting that the confirmation state matches the business outcome.
If the test does not clearly map to a user-visible contract, I usually rewrite the assertion strategy before I accept the rest of the code.
2) Review locators for stability and user meaning
Locator quality is, in my experience, the biggest predictor of how much maintenance a Playwright suite will need. It is also where AI-generated code often looks fine but ages poorly.
What I prefer
In order of preference, I generally want locators that reflect the user-facing structure:
- getByRole
- getByLabel
- getByPlaceholder when appropriate
- getByText only when text is stable and unique enough
- test IDs as a last resort for genuinely hard-to-target elements
A good generated test might look like this:
typescript
await page.getByLabel('Email').fill('user@example.com');
await page.getByLabel('Password').fill('secret123');
await page.getByRole('button', { name: 'Sign in' }).click();
What I flag
I become suspicious when I see:
- deep CSS chains like .page > div:nth-child(2) > ...
- XPath that depends on structure rather than meaning
- selectors tied to generated class names
- text matches that are overly broad
- nth() usage without a strong reason
This is where reviewing AI-generated Playwright code often turns into a maintenance exercise. The model may pick the first matching node, which is convenient during generation but brittle in the real app.
If the locator requires a screenshot and a prayer to understand, it is probably not a good test locator.
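The mechanical part of this check can be partly automated. The sketch below is a hypothetical helper, flagBrittleLocators, that scans test source for some of the patterns I flag, using plain regular expressions. The function name, the rule list, and the regexes are my own assumptions, not part of Playwright. A regex scan will miss cases and raise false positives, so I would treat it as a prompt for human review rather than a gate.

```typescript
// Hypothetical review aid: flags locator patterns that tend to age badly.
// The rules are illustrative assumptions, not an official Playwright lint.
type Finding = { line: number; rule: string; text: string };

const BRITTLE_RULES: { rule: string; pattern: RegExp }[] = [
  { rule: 'positional CSS (nth-child)', pattern: /:nth-child\(/ },
  { rule: 'XPath selector', pattern: /['"`](xpath=|\/\/)/ },
  { rule: 'nth() usage', pattern: /\.nth\(/ },
  { rule: 'generated-looking class name', pattern: /\.(css|sc|jss)-[a-z0-9]{4,}/i },
];

function flagBrittleLocators(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split('\n').forEach((text, index) => {
    for (const { rule, pattern } of BRITTLE_RULES) {
      // Patterns have no global flag, so test() is stateless between lines.
      if (pattern.test(text)) {
        findings.push({ line: index + 1, rule, text: text.trim() });
      }
    }
  });
  return findings;
}

// Example input: the role-based locator on the second line is the only one not flagged.
const sample = [
  "await page.locator('.page > div:nth-child(2) > button').click();",
  "await page.getByRole('button', { name: 'Save' }).click();",
  "await page.locator('//div[@id=\"main\"]//a').click();",
  "await page.getByRole('button').nth(2).click();",
].join('\n');

const findings = flagBrittleLocators(sample);
console.log(findings.map((f) => `${f.line}: ${f.rule}`));
```

A finding is not automatically a defect. An nth() call scoped to a well-named row can be reasonable, which is why the output lists line numbers for a reviewer instead of failing anything.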
Special note on dynamic UIs
For component libraries, virtualized tables, and dashboards with repeated controls, I look for row-level anchoring. A good pattern is to scope the search to a section, then locate by role inside that section.
typescript
const customerRow = page.getByRole('row', { name: /Acme Corp/ });
await customerRow.getByRole('button', { name: 'Edit' }).click();
That is much easier to maintain than selecting the third button on the page and hoping the layout stays still.
3) Check waits carefully, because AI loves to over-wait or under-wait
Playwright has strong auto-waiting, but it is not magic. AI-generated tests often add explicit waits in places where they are unnecessary, or skip them where they are actually needed.
Red flags
I usually remove or question:
- waitForTimeout(...)
- arbitrary sleeps between steps
- waits for selectors that the next action already handles
- repeated waits that hide a slow or unstable app state
A common anti-pattern looks like this:
typescript
await page.getByRole('button', { name: 'Save' }).click();
await page.waitForTimeout(3000);
await expect(page.getByText('Saved')).toBeVisible();
That test always pays the full three seconds, and it still fails if the save takes longer than that. A better pattern is to wait on the actual outcome:
typescript
await page.getByRole('button', { name: 'Save' }).click();
await expect(page.getByRole('status')).toHaveText('Saved');
What I want instead
I prefer waits that reflect application state:
- URL changes
- dialog visibility
- network-driven UI updates that resolve into visible state
- meaningful DOM changes tied to the user action
If the test depends on a backend side effect, I sometimes add a dedicated assertion on the resulting state, or use expect.poll() for a bounded condition. The key is to wait for the thing that matters, not a guessed duration.
4) Assertions should prove business value, not just page presence
This is one of the most common weaknesses I see in generated tests. AI can create a flow that clicks through the app, but the assertions end up being shallow.
Weak assertions
These are technically valid but often not enough:
- page title changed
- URL contains a path fragment
- element exists
- toast appeared and disappeared
Better assertions
I look for assertions that answer the question, “What outcome did the user get?”
typescript
await expect(page.getByRole('heading', { name: 'Order confirmed' })).toBeVisible();
await expect(page.getByText('Thank you for your purchase')).toBeVisible();
await expect(page.getByRole('link', { name: 'View receipt' })).toBeVisible();
That is still UI-level, but it is meaningfully closer to the product behavior.
Negative assertions matter too
AI-generated tests often forget to assert the absence of errors. If the flow should succeed, I sometimes add checks like:
typescript
await expect(page.getByText('Invalid password')).not.toBeVisible();
await expect(page.getByRole('alert')).not.toContainText(/error/i);
Use negative assertions carefully, because they can become noisy if the app renders hidden elements or delayed notifications. But for critical paths, they are worth considering.
When text is not enough
If the UI is heavily visual or the assertion depends on broader context, I prefer a more semantic check. In some teams, this is a good place to use Endtest AI Assertions because the validation can be written in plain English instead of hard-coding a brittle string match. That is not a replacement for good Playwright assertions in code, but it can be a cleaner way to express intent when the exact UI text is less important than the behavior.
5) Fixtures and setup should be intentional
AI-generated Playwright code often includes setup that works locally but is awkward in a real suite. I pay close attention to fixtures because they determine how the test scales.
Questions I ask
- Is the test using beforeEach when it should be isolated inside the test?
- Does it share state across tests in a way that hides dependencies?
- Is authentication handled through the UI every time, or can it use a stable authenticated state?
- Are helper functions doing too much?
A better pattern for login state
For many suites, I prefer a setup project or storage state file rather than a UI login on every test run. That speeds up the suite and reduces noise from login page changes.
typescript
import { test } from '@playwright/test';

test.use({ storageState: 'auth.json' });

test('can create a project', async ({ page }) => {
  await page.goto('/projects/new');
  // test steps here
});
If an AI-generated test repeats login steps in every file, I usually refactor that out unless login itself is the subject of the test.
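For context on where a file like auth.json comes from, here is a minimal sketch of the setup-project wiring, assuming a setup test whose filename ends in .setup.ts signs in once and then calls page.context().storageState({ path: 'auth.json' }). The project names and the file pattern are my assumptions. In a real playwright.config.ts this object would be the default export, typically wrapped in defineConfig.

```typescript
// playwright.config.ts (sketch). Project names and file names are assumptions.
const AUTH_FILE = 'auth.json';

const config = {
  projects: [
    // Runs first: a setup test that logs in once and saves the session to AUTH_FILE.
    { name: 'setup', testMatch: /.*\.setup\.ts/ },
    {
      name: 'chromium',
      // Every test in this project starts with the saved signed-in state.
      use: { storageState: AUTH_FILE },
      // The setup project must finish before these tests start.
      dependencies: ['setup'],
    },
  ],
};
```

The point for a reviewer is that login happens in one place. If a generated test file still walks through the login form, that is duplication to remove unless login is the behavior under test.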
Be careful with hidden dependencies
A model may create a fixture that seeds data, logs in as a superuser, and assumes the app starts in a clean state. That can be fine for a single test, but in a suite it becomes fragile. Review the assumptions explicitly:
- does the fixture create unique test data?
- does it clean up after itself?
- is the data scoped to the test, not the user account?
6) Cleanup is not optional
This is where generated tests often drift from “demo” into “maintenance trap”. If the test creates data, I want to know how that data is removed or isolated.
What to look for
- API cleanup after UI setup
- unique test data per run
- data reset in test environment
- teardown hooks that run reliably
If the test creates an account, order, ticket, or record, I ask how the next run avoids collisions. A generated test that uses the same email address every time is a classic red flag.
typescript
const email = `test-${Date.now()}@example.com`;
That is better than a hard-coded address, but timestamps can still collide in parallel runs. A unique ID or test-run prefix is usually safer.
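typescript
import { test } from '@playwright/test';
test('creates an account', async ({ page }) => {
  const email = `test-${test.info().workerIndex}-${Date.now()}@example.com`; // test.info() works only inside a running test
});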
Even better, if the app supports it, create and delete via API in a helper so the UI test focuses on the behavior under test.
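On the uniqueness point, a random suffix removes the remaining dependence on the clock. The helper below is a minimal sketch: uniqueEmail and its runId parameter are hypothetical names of mine, and it assumes a Node.js runtime where randomUUID is available from node:crypto.

```typescript
import { randomUUID } from 'node:crypto';

// Hypothetical helper: builds an address that is unique per call, not just per millisecond.
// runId is an assumed label for the current pipeline run, for example a CI build number.
function uniqueEmail(runId: string, workerIndex: number): string {
  // A random UUID keeps addresses distinct even when two workers start in the same millisecond.
  return `test-${runId}-w${workerIndex}-${randomUUID()}@example.com`;
}

console.log(uniqueEmail('local', 0));
```

Keeping the run label and worker index in the address is a design choice: leftovers from a failed run are easy to spot, and a cleanup job could select them by prefix if the environment allows it.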
7) Watch for over-abstracted page objects
AI assistants often generate page objects because they look like “good structure.” Sometimes they are. Sometimes they are just indirection for its own sake.
I like page objects when they:
- centralize high-churn locators
- provide a stable API for repeated workflows
- keep tests readable without hiding important actions
I dislike them when they:
- wrap every single click and assertion in a method
- make the test harder to understand than the raw Playwright steps
- hide business meaning behind generic method names like submitForm()
A test should still read like a user journey. If the page object turns the test into a puzzle, the abstraction is too heavy.
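One shape that tends to stay on the right side of that line is a page object that only centralizes locators and leaves clicks and assertions in the test. The sketch below is an illustration under assumptions: CustomersPage is a hypothetical class, and LocatorLike is a minimal stand-in I define so the example is self-contained. In a real suite you would type against Page and Locator from @playwright/test.

```typescript
// Minimal stand-in for the slice of Playwright's Page/Locator API used here.
// Assumption for this sketch only; real code would import the Playwright types.
interface LocatorLike {
  getByRole(role: string, options?: { name?: string | RegExp }): LocatorLike;
}

// Hypothetical page object: it names locators, nothing more.
// Actions and assertions stay in the test, so the user journey remains visible there.
class CustomersPage {
  private readonly page: LocatorLike;

  constructor(page: LocatorLike) {
    this.page = page;
  }

  // Anchor on the row first, so repeated controls are resolved per customer.
  row(customerName: string): LocatorLike {
    return this.page.getByRole('row', { name: customerName });
  }

  editButton(customerName: string): LocatorLike {
    return this.row(customerName).getByRole('button', { name: 'Edit' });
  }
}
```

With real Playwright types, a test would then read as await customers.editButton('Acme Corp').click(), which keeps the action explicit while the high-churn locator lives in one place.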
8) Audit the CI behavior, not just local execution
A test that passes locally can still be a bad citizen in CI. This is especially important when reviewing AI-generated Playwright code because models do not know your pipeline constraints unless you make them explicit.
Things I check for CI friendliness
- Does the test assume headed mode?
- Does it rely on a fixed viewport without reason?
- Is tracing or video enabled where debugging will matter?
- Does it use hard-coded timeouts that hide slowness?
- Is the suite safe to run in parallel?
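Several of these checks map onto Playwright Test configuration options. The sketch below shows one possible set of values. The specific choices are my assumptions, not a recommendation for every suite, and in a real playwright.config.ts the object would be the default export, typically wrapped in defineConfig from @playwright/test.

```typescript
// playwright.config.ts (sketch). Values are illustrative assumptions.
const isCI = Boolean(process.env.CI);

const config = {
  // Fail the CI run if a stray test.only was committed.
  forbidOnly: isCI,
  // Run tests in parallel so hidden order dependencies surface early.
  fullyParallel: true,
  // One retry in CI, mainly so a trace gets captured. Retries do not fix the cause.
  retries: isCI ? 1 : 0,
  use: {
    // Record a trace only when a test is retried, to keep artifacts small.
    trace: 'on-first-retry',
    // Keep video only for tests that fail.
    video: 'retain-on-failure',
  },
};
```

When a generated test only passes with more retries or a longer timeout than the shared config allows, I read that as a signal to fix the test, not the config.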
Example CI setup
yaml
name: playwright
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test
That is fine as a starting point, but review the test code for assumptions about shared state, environment data, and the order of execution.
Flakiness signals in generated code
I get nervous when I see these patterns:
- tests depend on previous tests
- test.describe.serial is used to paper over shared state
- errors are swallowed with try/catch and no rethrow
- assertions are placed too early, before the UI is ready
- retries are added without addressing the cause
Retries are a debugging aid, not a cure. If the code is flaky because of poor locators or bad synchronization, the review should fix the root cause.
9) Ask whether the generated code is actually maintainable
Maintenance is where AI-generated Playwright code either pays off or costs you later. When I review, I look for clues that the test will be easy to update.
Good maintainability signs
- short tests with obvious intent
- reusable helpers for repeated workflows
- locators based on roles and labels
- assertions close to the action they validate
- minimal duplication
Bad maintainability signs
- long linear scripts with many unrelated steps
- many copied selectors
- unclear helper names
- assertions buried far from the action
- magic constants without explanation
A short test with a clear name is often better than a clever abstraction. The goal is not code elegance for its own sake, it is survivable test code.
10) Decide when to rewrite versus accept
Not every AI-generated test needs a full rewrite. My rule of thumb is simple:
- Accept it if the intent is correct, the locators are stable, and the assertions are meaningful.
- Edit it if the structure is good but the details are brittle.
- Rewrite it if the model misunderstood the user journey or embedded too much fragility.
I am especially likely to rewrite when the generated code:
- hard-codes timing assumptions
- uses brittle selectors throughout
- makes shallow assertions
- duplicates setup in every test
- mixes multiple scenarios into one long flow
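Written down as code, the rule of thumb looks roughly like the sketch below. The type, the field names, and especially the threshold of three fragility flags are assumptions I am making for illustration. In practice this is a judgment call, not a formula.

```typescript
// Hypothetical encoding of the accept / edit / rewrite rule of thumb.
// Field names and the "three or more" threshold are assumptions for illustration.
type ReviewFindings = {
  intentCorrect: boolean;
  locatorsStable: boolean;
  assertionsMeaningful: boolean;
  // How many of these apply: hard-coded timing, brittle selectors throughout,
  // shallow assertions, duplicated setup, multiple scenarios in one flow.
  fragilityFlags: number;
};

type Verdict = 'accept' | 'edit' | 'rewrite';

function triage(f: ReviewFindings): Verdict {
  // A misunderstood journey or pervasive fragility is cheaper to redo than to patch.
  if (!f.intentCorrect || f.fragilityFlags >= 3) return 'rewrite';
  if (f.locatorsStable && f.assertionsMeaningful && f.fragilityFlags === 0) return 'accept';
  // Right structure, brittle details.
  return 'edit';
}
```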
If you are reviewing Claude Playwright tests or another LLM-generated test batch, it helps to treat them like PRs from a junior engineer who is fast, helpful, and occasionally overconfident. The code deserves respect, but not blind trust.
A practical review rubric I use
Here is the quick rubric I use before approving AI-generated Playwright code:
- Does the test validate a real user outcome?
- Are locators based on accessible, stable signals?
- Does the test avoid arbitrary sleeps?
- Are assertions specific enough to fail for the right reason?
- Is setup isolated and reusable?
- Is cleanup handled or unnecessary by design?
- Will this still be readable in six months?
- Is it safe for CI and parallel runs?
If I cannot answer yes to most of these, the test is not ready, even if it passes.
Where Endtest fits better for some teams
I like Playwright when a team wants code-first control, custom logic, and close integration with a TypeScript or JavaScript stack. But not every team wants to review and maintain full generated test code. In many organizations, the real bottleneck is not authoring a test, it is reviewing, maintaining, and keeping everyone aligned on what the test actually does.
That is why I think Endtest is a stronger fit for teams that want AI-assisted authoring without inheriting as much code maintenance overhead. Its agentic AI Test Creation Agent generates editable, platform-native test steps instead of leaving the team to inspect a large block of source code. You can review the steps, adjust them, and keep the suite in a shared surface that is easier for testers, developers, and managers to reason about.
In other words, if your team is spending too much time reviewing AI-generated Playwright code line by line, a low-code workflow can be the more efficient path. That is also where features like self-healing locators and AI-driven validation become useful, because they reduce the amount of hand-edited maintenance work you carry forward. For teams already comparing options, Endtest’s Playwright comparison is worth a look because it frames the tradeoff clearly: code-first flexibility versus a more reviewable, editable test authoring model.
Final thoughts
To review AI-generated Playwright code well, you have to think like both a tester and a maintainer. The test must express the right user behavior, but it also has to survive real application change. That means paying attention to locators, synchronization, assertions, fixtures, cleanup, and CI behavior, not just syntax.
My default stance is simple: AI can draft the test, but humans should approve the intent, the resilience, and the maintenance cost. When the code is clean, that review is quick. When it is not, the review is the difference between a useful test and another flaky file that gets rerun until the team stops trusting it.
If your team prefers reviewing behavior over reading generated code, agentic AI test platforms can reduce a lot of that friction. For code-first teams, the same review discipline still applies. The only difference is that with Playwright, you own every line.