If you have asked an AI tool for a Playwright test, you probably know the experience: the first draft looks surprisingly good, then you run it and immediately find the weak spots. Maybe the selectors are too brittle, maybe the assertions are shallow, maybe the test depends on timing that happens to work on your laptop but flakes in CI. That gap between a useful draft and a production-ready test is where most of the real work lives.

This article works through a practical example of AI-generated Playwright tests, using the kind of review process I apply as an SDET. I will start with a plausible generated test, then walk through the fixes needed to make it stable, maintainable, and suitable for CI/CD. I will also show where a platform like Endtest can be simpler, because the AI output lands as editable test steps inside a managed testing platform instead of raw code you have to wire into your own framework.

What AI usually gets right, and what it misses

AI-generated test code is often decent at describing the happy path. It can infer a workflow, produce Playwright syntax, and even add a few assertions. For a brand new test idea, that is useful. The problem is that AI does not know your app’s real stability constraints, release cadence, component behavior, or CI environment.

In practice, generated tests commonly miss these details:

  • selector strategy, especially when the app lacks stable test IDs
  • explicit synchronization around navigation, data loading, and SPA transitions
  • negative assertions, like confirming a success message disappears or a record is saved once
  • setup and teardown, especially seeded test data
  • environment differences, such as feature flags, auth flows, or third-party redirects
  • maintainability, like reusable helpers and test organization

That is why I treat AI-generated tests as a draft, not as an authority.

The best use of AI in test automation is often acceleration, not autonomy. It can save time on the first version, but you still need engineering judgment to make the test durable.

The scenario we will automate

Let’s use a realistic flow: a user signs in, adds an item to a cart, and checks out.

This path is simple enough for an AI tool to generate, but it contains enough moving parts to expose the usual problems: auth state, dynamic UI, network timing, and data dependencies.

For context, Playwright is a browser automation library maintained by Microsoft, and its official documentation is the best starting point if you are building tests directly in code.

A typical AI-generated Playwright test example

Here is the kind of output I often see from a generic prompt like, “Write a Playwright test for logging in and purchasing a product.”

import { test, expect } from '@playwright/test';

test('user can buy a product', async ({ page }) => {
  await page.goto('https://shop.example.com');
  await page.fill('#email', 'test@example.com');
  await page.fill('#password', 'Password123!');
  await page.click('button[type="submit"]');

  await page.click('text=Products');
  await page.click('text=Classic T-Shirt');
  await page.click('text=Add to Cart');
  await page.click('text=Checkout');

  await expect(page.locator('text=Order Confirmed')).toBeVisible();
});

This is not useless. In fact, it demonstrates the flow clearly. But as production test code, it has problems.

What is wrong with it

  1. Brittle selectors. #email, text=Products, and button[type="submit"] may work today, but they are not necessarily stable. Text selectors can break with copy changes, and CSS selectors tied to layout can break during redesigns.

  2. No explicit state verification. The test clicks through the flow, but it does not verify that login succeeded before continuing.

  3. No wait strategy. Playwright auto-waits for many actions, but it does not magically solve asynchronous app state. Navigation, API-backed UI updates, and SPA transitions can still need stronger synchronization.

  4. No test data control. A checkout flow often depends on inventory, payment configuration, user state, or seeded data. The test assumes everything is ready.

  5. No resilience for CI. There is no traceability, no retry strategy, and no helper structure for reuse.
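Part of this review can be made repeatable with a small script that scans generated test source for the most common smells before a human looks at it. The following is a rough sketch, not an established lint set: the rule names, the patterns, and the `findSmells` helper are all my own assumptions, and the credential rule only catches the `page.fill(selector, 'literal')` form.

```typescript
// Hypothetical review helper: flags common smells in generated Playwright test source.
// Rule names and patterns are illustrative assumptions, not an official rule set.
interface Smell {
  rule: string;
  line: number; // 1-based line number in the scanned source
}

const SMELL_RULES: { rule: string; pattern: RegExp }[] = [
  { rule: 'hard-sleep', pattern: /waitForTimeout\s*\(/ },
  { rule: 'text-selector', pattern: /['"`]text=/ },
  { rule: 'positional-css', pattern: /:nth-child\(/ },
  // Only matches a password selector followed by a string literal value.
  { rule: 'inline-credential', pattern: /fill\([^)]*password[^)]*,\s*['"`][^'"`]+['"`]\s*\)/i },
];

function findSmells(source: string): Smell[] {
  const smells: Smell[] = [];
  source.split('\n').forEach((text, index) => {
    for (const { rule, pattern } of SMELL_RULES) {
      if (pattern.test(text)) {
        smells.push({ rule, line: index + 1 });
      }
    }
  });
  return smells;
}
```

A scan like this does not replace reading the test. It only makes sure the obvious problems are caught every time, so the human review can focus on state verification and test data.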

Making the test production-ready

The goal is not to over-engineer one test. The goal is to make the test clear enough that it can run reliably in CI and survive normal UI change.

1) Use stable locators

If you own the application, add data-testid attributes where the test needs stable hooks. This is usually better than relying on visible copy or generated DOM structure.

await page.getByTestId('email-input').fill('test@example.com');
await page.getByTestId('password-input').fill('Password123!');
await page.getByTestId('sign-in-button').click();

If your team has not standardized test IDs, do that first. It will improve every framework you use, not just Playwright.
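As a rough sketch of what standardizing could look like, the IDs can live in one shared module that both the application and the tests import. The `TestIds` object, the `productLinkTestId` helper, and the slug convention below are assumptions for illustration; your team's naming rules may differ.

```typescript
// Hypothetical shared module: one source of truth for test IDs,
// imported by both the application components and the tests.
const TestIds = {
  emailInput: 'email-input',
  passwordInput: 'password-input',
  signInButton: 'sign-in-button',
} as const;

// Builds an ID for per-product links.
// Assumed convention: lowercase, drop punctuation, turn spaces into hyphens.
function productLinkTestId(productName: string): string {
  const slug = productName
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, '')
    .trim()
    .replace(/\s+/g, '-');
  return `product-link-${slug}`;
}
```

With this in place, a test would call `page.getByTestId(TestIds.emailInput)` instead of repeating string literals, and a renamed ID becomes a compile-time change rather than a silent test failure.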

2) Confirm the expected state after each major step

A flaky test often fails because it assumes the app navigated successfully. Make the state transition explicit.

await expect(page).toHaveURL(/dashboard/);
await expect(page.getByTestId('account-menu')).toBeVisible();

That tells you whether login succeeded before the test proceeds to commerce steps.

3) Factor out setup into helpers or fixtures

If many tests need an authenticated user, stop repeating the login flow in every test. Use a storage state file or fixture. That shortens the test and reduces the surface area for failures.

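import { test as base } from '@playwright/test';
import type { Page } from '@playwright/test';

// Fixture that signs in and hands the authenticated page to the test body.
export const test = base.extend<{ authenticatedPage: Page }>({
  authenticatedPage: async ({ page }, use) => {
    // Relative URL: requires baseURL to be set in the Playwright config.
    await page.goto('/login');
    await page.getByTestId('email-input').fill(process.env.E2E_EMAIL!);
    await page.getByTestId('password-input').fill(process.env.E2E_PASSWORD!);
    await page.getByTestId('sign-in-button').click();
    await page.waitForURL('**/dashboard');
    await use(page);
  },
});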

That fixture is still code, but now the test intent is more focused.

4) Add assertions that matter

A lot of generated tests only assert that a success page exists. That is weak. You want assertions that prove the system behaved correctly.

Examples:

  • confirmation number is present
  • cart total reflects the selected item
  • order appears in the user’s order history
  • inventory count changes where applicable

await expect(page.getByTestId('order-number')).toHaveText(/ORD-/);
await expect(page.getByTestId('order-status')).toHaveText('Confirmed');
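For the cart total check, I prefer deriving the expected value from the test data rather than hard-coding a string. Here is a minimal sketch under two assumptions: prices are available to the test in integer cents, and the UI renders totals as "$D.CC" without thousands separators. The `CartLine` shape and `expectedCartTotal` name are hypothetical.

```typescript
// Hypothetical test data shape; prices are integer cents to avoid float rounding.
interface CartLine {
  name: string;
  unitPriceCents: number;
  quantity: number;
}

function expectedCartTotal(lines: CartLine[]): string {
  let totalCents = 0;
  for (const line of lines) {
    totalCents += line.unitPriceCents * line.quantity;
  }
  const dollars = Math.floor(totalCents / 100);
  const cents = totalCents % 100;
  // Assumes the UI renders totals as "$D.CC" with no thousands separator.
  return '$' + dollars + '.' + (cents < 10 ? '0' + cents : String(cents));
}
```

The result can then be passed to `toHaveText` on whatever element shows the total, so the assertion follows the seeded price instead of a value someone typed months ago.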

5) Avoid arbitrary sleeps

If AI gives you waitForTimeout(3000), delete it unless you have a very specific reason. Hard sleeps are one of the fastest ways to create fragile tests.

Use semantic waiting instead:

await page.waitForLoadState('networkidle');
await expect(page.getByTestId('checkout-summary')).toBeVisible();

Even then, be careful. networkidle can be misleading in apps with background polling, and Playwright's own documentation discourages relying on it in tests. Prefer app-specific conditions when possible.
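To make "app-specific condition" concrete, here is a small polling helper that waits for a value the application itself reports, such as an order status returned by an API. This is only a sketch of the idea: `pollUntil` and its options are my own names, and in a real suite Playwright's web-first assertions or its built-in `expect.poll` would usually do this job.

```typescript
// Polls an app-specific condition until it holds or the time budget runs out.
async function pollUntil<T>(
  probe: () => Promise<T>,
  isReady: (value: T) => boolean,
  options: { timeoutMs: number; intervalMs: number },
): Promise<T> {
  const deadline = Date.now() + options.timeoutMs;
  while (true) {
    const value = await probe();
    if (isReady(value)) {
      return value;
    }
    // Stop if another sleep would take us past the deadline.
    if (Date.now() + options.intervalMs > deadline) {
      throw new Error(`Condition not met within ${options.timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, options.intervalMs));
  }
}
```

The difference from a hard sleep is that the wait ends as soon as the condition is true and fails with a clear message when it never becomes true.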

A more realistic version of the test

Here is a revised Playwright example that is cleaner and more production-oriented.

import { test, expect } from '@playwright/test';

test('user can buy a product', async ({ page }) => {
  await page.goto('https://shop.example.com/login');
  await page.getByTestId('email-input').fill(process.env.E2E_EMAIL!);
  await page.getByTestId('password-input').fill(process.env.E2E_PASSWORD!);
  await page.getByTestId('sign-in-button').click();

  await expect(page).toHaveURL(/dashboard/);
  await expect(page.getByTestId('account-menu')).toBeVisible();

  await page.getByTestId('product-link-classic-tshirt').click();
  await expect(page.getByTestId('product-title')).toHaveText('Classic T-Shirt');

  await page.getByTestId('add-to-cart-button').click();
  await expect(page.getByTestId('cart-badge')).toHaveText('1');

  await page.getByTestId('cart-link').click();
  await expect(page.getByTestId('checkout-summary')).toBeVisible();

  await page.getByTestId('checkout-button').click();
  await expect(page.getByTestId('order-confirmation')).toBeVisible();
  await expect(page.getByTestId('order-number')).toContainText('ORD-');
});

This is still a simple test, but now it reads like a real check:

  • log in
  • verify the authenticated state
  • select the product
  • verify the product page
  • add to cart and verify cart state
  • complete checkout
  • verify a meaningful confirmation

That is the difference between generated test code and something you can actually trust in CI.

Common failure modes when you use AI-generated test code

1) The app has no stable locators

If your UI lacks test IDs, the AI may fall back to text selectors, role selectors, or CSS chains. Some of those are fine, but brittle chains like div > div:nth-child(3) are a warning sign.

The fix is not “make the AI smarter”; the fix is to improve testability in the application itself.

2) The test depends on live data

If the test buys the last item in stock, it becomes a data management problem. If it uses a real payment provider, it becomes an environment management problem. If it depends on user-specific state, it becomes an isolation problem.

For stable E2E automation, seed data and test accounts matter as much as the test code.
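One simple isolation technique is to give every pipeline run and every parallel worker its own account. The sketch below only builds the data; how the user actually gets created (API call, database seed, admin UI) depends on your system and is left out. `SeedUser`, `buildSeedUser`, and the email format are assumptions.

```typescript
// Hypothetical seed-data builder: each run and worker gets its own user,
// so parallel tests do not compete for the same account or cart.
interface SeedUser {
  email: string;
  displayName: string;
}

function buildSeedUser(runId: string, workerIndex: number): SeedUser {
  // Keep only characters that are safe in an email local part.
  const safeRunId = runId.toLowerCase().replace(/[^a-z0-9]/g, '');
  return {
    email: `e2e+${safeRunId}-w${workerIndex}@example.com`,
    displayName: `E2E User ${safeRunId} #${workerIndex}`,
  };
}
```

In a Playwright suite, the worker number could come from `testInfo.workerIndex`, and the run identifier from whatever your CI exposes, for example the `GITHUB_RUN_ID` variable on GitHub Actions.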

3) AI-generated code misses cleanup

A generated test might create a user, an order, or a draft and never remove it. That is fine for a throwaway demo, but not for a suite that runs on every branch.

Use teardown hooks, API cleanup, or disposable test environments where appropriate.
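A pattern I find useful here is a cleanup registry: each time the test creates something, it registers how to undo it, and teardown runs those steps in reverse order. The `CleanupStack` class below is a minimal sketch of that idea, not a Playwright feature, and the actual delete calls are placeholders you would supply.

```typescript
// Minimal cleanup registry: tests register an undo step for each resource they create.
// Teardown runs the steps newest-first and keeps going if one of them fails.
type CleanupStep = () => Promise<void>;

class CleanupStack {
  private steps: { label: string; run: CleanupStep }[] = [];

  add(label: string, run: CleanupStep): void {
    this.steps.push({ label, run });
  }

  // Returns the labels of steps that failed so the caller can report them.
  async runAll(): Promise<string[]> {
    const failed: string[] = [];
    while (this.steps.length > 0) {
      const step = this.steps.pop()!;
      try {
        await step.run();
      } catch {
        failed.push(step.label);
      }
    }
    return failed;
  }
}
```

Calling `runAll()` from a `test.afterEach` hook means an order created mid-test is removed even when a later assertion fails, and a failing delete does not block the remaining cleanup.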

4) The structure is awkward to maintain

AI can generate one-off tests with duplicated setup. Once you have ten tests that all do the same login flow, the maintenance cost becomes obvious.

That is where page objects, fixtures, or higher-level helpers still matter. AI does not eliminate engineering discipline; it increases the need for it.
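As an illustration, a page object for the login flow might look like the sketch below. To keep it compilable on its own, it declares small structural stand-ins (`PageLike`, `LocatorLike`) for the few Playwright methods it uses; in a real suite you would import `Page` from `@playwright/test` instead. The `LoginPage` class and the test IDs are assumptions carried over from the earlier examples.

```typescript
// Structural stand-ins for the small part of Playwright's Page/Locator API used here.
interface LocatorLike {
  fill(value: string): Promise<void>;
  click(): Promise<void>;
}
interface PageLike {
  goto(url: string): Promise<unknown>;
  getByTestId(testId: string): LocatorLike;
}

// Hypothetical page object: the login flow lives in one place instead of ten tests.
class LoginPage {
  private readonly page: PageLike;

  constructor(page: PageLike) {
    this.page = page;
  }

  async signIn(email: string, password: string): Promise<void> {
    await this.page.goto('/login');
    await this.page.getByTestId('email-input').fill(email);
    await this.page.getByTestId('password-input').fill(password);
    await this.page.getByTestId('sign-in-button').click();
  }
}
```

When the login form changes, only `LoginPage` needs an update, and generated tests can be refactored to call `signIn` rather than carrying their own copy of the steps.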

A CI/CD checklist for generated Playwright tests

If you plan to run these tests in a pipeline, use this checklist:

  • store credentials in secrets, not inline in test files
  • run headed locally when debugging, and headless in CI
  • use trace/video on failure for diagnostics
  • keep tests independent, so one failure does not cascade
  • tag smoke tests separately from broader regression checks
  • seed test data as part of the pipeline or test environment setup
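Several of these items map directly onto Playwright configuration. The sketch below builds the relevant settings as a plain object so it can be read and tested on its own. The option names (`retries`, `forbidOnly`, `use.headless`, `use.trace`, `use.video`) are Playwright config options, while the `E2EConfig` type, the `buildE2EConfig` helper, and the chosen values are assumptions to adapt.

```typescript
// Plain-object sketch of Playwright settings that differ between local runs and CI.
interface E2EConfig {
  retries: number;
  forbidOnly: boolean;
  use: {
    headless: boolean;
    trace: 'off' | 'on-first-retry' | 'retain-on-failure';
    video: 'off' | 'retain-on-failure';
  };
}

function buildE2EConfig(isCI: boolean): E2EConfig {
  return {
    retries: isCI ? 2 : 0, // retry only in CI so local failures stay loud
    forbidOnly: isCI, // fail the run if a stray test.only is committed
    use: {
      headless: isCI, // headed locally for debugging, headless in CI
      trace: isCI ? 'on-first-retry' : 'retain-on-failure',
      video: isCI ? 'retain-on-failure' : 'off',
    },
  };
}
```

In `playwright.config.ts`, the result would typically be passed to `defineConfig`, with `isCI` derived from the `CI` environment variable that most CI providers set.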

A minimal GitHub Actions job might look like this:

name: e2e

on:
  push:
    branches: [main]
  pull_request:

jobs:
  playwright:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test
        env:
          # Assumes repository secrets with these names exist.
          E2E_EMAIL: ${{ secrets.E2E_EMAIL }}
          E2E_PASSWORD: ${{ secrets.E2E_PASSWORD }}

That is the point where code-generated tests become real engineering assets, or real maintenance debt.
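One small hardening step I would add for pipelines: check that the credentials the tests expect, E2E_EMAIL and E2E_PASSWORD in this article, are actually present before any browser starts. The `requireEnv` helper below is a hypothetical sketch. It takes the environment as a parameter so it can be exercised without touching the real process environment.

```typescript
// Fails fast with a readable message when a required variable is missing,
// instead of letting a non-null assertion hide an undefined value.
function requireEnv(
  env: Record<string, string | undefined>,
  names: string[],
): Record<string, string> {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
  const values: Record<string, string> = {};
  for (const name of names) {
    values[name] = env[name] as string;
  }
  return values;
}
```

Called once from a global setup step with `process.env`, this turns a confusing mid-test failure on the login form into a single clear error at the start of the run.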

Where Claude or other AI tools fit in

You may see examples labeled as a Claude Playwright example, ChatGPT example, or another AI model variant. The model matters less than the workflow. The useful pattern is consistent:

  1. ask for the test flow
  2. inspect the generated code critically
  3. fix selectors, waits, and assertions
  4. add CI-ready setup
  5. refactor repeated logic

The output quality is determined by your review process, not just the prompt.

When AI-generated Playwright tests are a good fit

They are useful when you need to:

  • bootstrap a new automation project
  • draft a test from a user story quickly
  • convert a manual scenario into code
  • explore a UI flow before refactoring into a framework
  • teach newer engineers how a flow maps to executable steps

They are less useful when you need to:

  • maintain a large cross-functional suite without strong coding ownership
  • allow non-developers to author tests directly
  • minimize framework and CI maintenance
  • standardize test authoring across QA, product, and design teams

That last point is where I think many teams underestimate the cost of code-first automation. Playwright is excellent, but it is still a library. You own the runner, the CI integration, the browser setup, the reporting, and the maintenance surface.

Why a platform like Endtest can be simpler

If your team wants the benefits of AI-assisted test creation without managing generated code, Endtest’s AI Test Creation Agent is worth looking at. The key difference is structural: the agent turns a plain-English scenario into a working Endtest test inside the platform, with steps, assertions, and stable locators already editable in the test editor.

That matters because it changes the maintenance model. Instead of starting with raw Playwright source and then translating it into a sustainable framework pattern, you get a platform-native test that is already organized as editable steps. For teams that want shared authoring across testers, developers, PMs, and designers, that can be a much lower-friction path.

I would summarize the tradeoff like this:

  • Playwright plus AI generation is great if you want code-level control and already have engineering ownership for the suite.
  • Endtest with an agentic AI workflow is simpler if you want the AI output to arrive as editable test steps in a managed platform, with less framework overhead.

If you want a broader comparison, Endtest also publishes a direct Endtest vs Playwright comparison that explains the framework ownership tradeoff in more detail.

My practical decision rule

When I evaluate an AI-generated Playwright test, I ask three questions:

  1. Will this test still be understandable six months from now?
  2. Can my team keep it stable without heroic debugging?
  3. Are we choosing code because we need code, or because it is what the AI happened to produce?

If the answer to the third question is “we are choosing it because the model gave it to us”, I slow down. AI output is a starting point. It is not the architecture.

Final takeaway

AI-generated Playwright tests are genuinely useful, but only if you treat them as a draft that needs engineering hardening. The biggest improvements usually come from boring fundamentals: stable locators, explicit assertions, isolated test data, and CI-friendly structure. Those are the things that make a generated test production-ready.

If your team wants raw code and has the discipline to maintain it, Playwright plus AI can work very well. If you want a simpler, more collaborative path where the AI output is already structured as editable test steps inside a platform, Endtest is often the easier option.

Either way, the rule is the same: do not measure the success of AI by how quickly it writes the first test. Measure it by how much reliable testing it helps you ship and keep alive.