When Codex Hit Its Limit in the Middle of Debugging Our Test Automation Framework

I like AI coding assistants for test automation, but I trust them less after the third retry, the fourth log paste, and the fifth time I need to explain why our framework wraps Playwright fixtures in a custom abstraction.

The moment that made this painfully clear was not a bad answer from Codex. It was hitting the usage limit in the middle of a debugging thread, while the test suite was still red and the context was finally getting useful.

The setup: a Playwright framework that had outgrown its first design

The framework was a fairly typical modern Playwright setup. It started cleanly, then slowly accumulated reality.

We had:

@playwright/test fixtures for authenticated users
Page objects for the most stable flows
API helpers for test data creation
A few visual checks
A CI matrix split by browser and shard
Retry logic configured differently locally and in CI
A custom reporter that posted failure summaries into our internal tooling
Environment-specific base URLs and feature flags
Several tests migrated from older Selenium coverage

Nothing unusual, and that was exactly the problem. Most real test automation frameworks are not unusual. They are layered, historical, and full of tradeoffs.

The issue we were debugging looked simple at first. A group of Playwright tests around account settings started failing intermittently in CI. Locally, they passed. In headed mode, they passed most of the time. In CI, they failed often enough to block releases but not consistently enough to give us one obvious root cause.

The failing assertion was boring:

typescript

await expect(page.getByRole('button', { name: 'Save changes' })).toBeEnabled();

The error was also boring:

Error: Timed out 10000ms waiting for expect(locator).toBeEnabled()
Locator: getByRole('button', { name: 'Save changes' })
Expected: enabled
Received: disabled

Anyone who has maintained browser tests knows this type of failure can mean many things:

The app is still loading account data.
A network request failed.
A validation rule is triggered unexpectedly.
The page object fills fields in the wrong order.
The test data is stale.
A feature flag changes the form behavior.
The selector is technically correct but points to a duplicate control.
The test is racing a debounce or autosave mechanism.

This is where I reached for Codex, not to write a new test from scratch, but to help debug the framework faster.

Why Codex seemed like the right tool at first

For code-level work, an AI coding assistant can be genuinely useful. I can paste a failing test, a fixture, a helper, and a Playwright trace excerpt, then ask for likely failure modes. That is a real productivity boost when the alternative is manually jumping between six files and three CI artifacts.

The first few interactions were useful. Codex helped identify that the test was relying on an implicit assumption: the form became editable only after a permissions request completed, but the helper method that navigated to the page did not wait for that request.

The helper looked roughly like this:

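import { expect, type Page } from '@playwright/test';

// Navigates to the settings page and waits only for the heading, not for editability.
export async function openAccountSettings(page: Page, accountId: string) {
  await page.goto(`/accounts/${accountId}/settings`);
  await expect(page.getByRole('heading', { name: 'Account settings' })).toBeVisible();
}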

The test then filled fields immediately:

typescript

test('updates billing contact', async ({ page, account }) => {
  await openAccountSettings(page, account.id);
  await page.getByLabel('Billing contact').fill('qa-billing@example.com');
  await expect(page.getByRole('button', { name: 'Save changes' })).toBeEnabled();
  await page.getByRole('button', { name: 'Save changes' }).click();
  await expect(page.getByText('Settings saved')).toBeVisible();
});

Codex suggested waiting for the permissions response. That was a reasonable hypothesis:

import { expect, type Page } from '@playwright/test';

export async function openAccountSettings(page: Page, accountId: string) {
  // Start listening before navigating so the response cannot be missed.
  const permissionsResponse = page.waitForResponse(response =>
    response.url().includes(`/accounts/${accountId}/permissions`) &&
    response.status() === 200
  );

  await page.goto(`/accounts/${accountId}/settings`);
  await permissionsResponse;
  await expect(page.getByRole('heading', { name: 'Account settings' })).toBeVisible();
}

I do not like blindly waiting for responses everywhere, because it can couple tests to implementation details. But in this case, permissions were a real prerequisite for the UI state. Waiting for the app to know whether the user can edit was more meaningful than adding a sleep or increasing the assertion timeout.

A good wait is not just a delay. It is a statement about the product state the test actually needs.

So far, good.

Then the failures continued.

The debugging loop got longer and more expensive

Once the obvious fix did not solve it, the conversation with Codex became less like code generation and more like pair-debugging a distributed system.

I needed to provide:

The failing test
The page object methods
The fixture that created the account
The API helper that assigned permissions
CI logs from multiple failed runs
A trace excerpt
Retry behavior
The Playwright config
The difference between local and CI environment variables
A screenshot description of the disabled button state
The recent diff that introduced the failure

This is where the Codex usage limit became practical, not theoretical. Test automation debugging often requires long context. The assistant needs to understand the test framework, the product behavior, the data setup, the selectors, the CI environment, and the failure artifacts. You do not get a reliable answer from one prompt unless the problem is trivial.

A normal debugging loop looked like this:

Paste the failing test and error.
Get a hypothesis.
Paste fixture code.
Get a refined hypothesis.
Paste CI logs from another run.
Discover a conflict with the first hypothesis.
Paste Playwright trace details.
Ask for a minimal instrumentation change.
Run CI again.
Paste new logs.

That is not abuse of an AI assistant. That is what real test automation debugging looks like.

The problem is that every iteration consumes context, tokens, tool calls, and whatever usage accounting the coding assistant applies. Codex usage limits are not just a billing or subscription detail when you are in the middle of release work. They become part of the engineering workflow.

In this case, the assistant hit its limit just when the thread had accumulated enough detail to be valuable.

The failure was not one bug; it was three small assumptions

The eventual root cause was not dramatic. It was a stack of small assumptions that made the suite fragile.

Assumption 1: the permission API was enough

Waiting for the permissions response helped but did not guarantee the UI had applied the permission state. The response returned before the client-side state update finished. The button was still disabled for a short window.

A better wait was user-visible and tied to the intended behavior:

typescript

await expect(page.getByLabel('Billing contact')).toBeEditable();

This expressed what the test actually needed: the form must be editable before filling it.

Assumption 2: the test data helper created a partially valid account

The test account factory created accounts quickly, but one optional field was omitted. In production, that field was populated during onboarding. In tests, it sometimes stayed blank. The account settings page treated that missing field as an incomplete profile state, which kept the save button disabled until another background request resolved.

The helper looked innocent:

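import type { APIRequestContext } from '@playwright/test';

// Creates an account with only plan and status; every other field is left unset.
export async function createAccount(api: APIRequestContext) {
  const response = await api.post('/test/accounts', {
    data: {
      plan: 'pro',
      status: 'active'
    }
  });

  return response.json();
}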

We changed it to make the default account closer to a real editable account:

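import type { APIRequestContext } from '@playwright/test';

// Defaults now describe a complete, editable account rather than a minimal one.
export async function createAccount(api: APIRequestContext) {
  const response = await api.post('/test/accounts', {
    data: {
      plan: 'pro',
      status: 'active',
      billingCountry: 'US',
      contactEmail: 'owner@example.com'
    }
  });

  return response.json();
}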

The important lesson was not the fields themselves. It was that test data defaults are framework behavior. They deserve the same review as selectors and assertions.
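One way to make those defaults reviewable is to pull them out of the request call into a named constant with explicit overrides. The following is a minimal sketch, not the factory described above: `AccountPayload`, `EDITABLE_ACCOUNT_DEFAULTS`, and `buildAccountPayload` are hypothetical names, and the payload shape is assumed to contain only the four fields shown in this article.

```typescript
// Hypothetical payload shape, limited to the fields mentioned in this article.
interface AccountPayload {
  plan: string;
  status: string;
  billingCountry: string;
  contactEmail: string;
}

// Defaults describe a complete, editable account. Changing one of these
// changes the behavior of every test that relies on the factory.
const EDITABLE_ACCOUNT_DEFAULTS: Readonly<AccountPayload> = {
  plan: 'pro',
  status: 'active',
  billingCountry: 'US',
  contactEmail: 'owner@example.com',
};

// Tests override only the fields they care about; the rest stay realistic.
function buildAccountPayload(overrides: Partial<AccountPayload> = {}): AccountPayload {
  return { ...EDITABLE_ACCOUNT_DEFAULTS, ...overrides };
}

// Example: a trial account that is otherwise complete.
const trialAccountPayload = buildAccountPayload({ plan: 'trial' });
```

A factory could then pass `buildAccountPayload()` as the `data` option of the request. The point of the design is that a diff to the defaults object is small and obvious in review, whereas a field quietly missing from an inline object literal is not.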

Assumption 3: retries were hiding separate failure modes

The Playwright config had retries enabled in CI:

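import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: process.env.CI ? 2 : 0,
  workers: process.env.CI ? 4 : undefined,
  use: {
    // A trace is recorded only on the first retry, so first attempts have none.
    trace: 'on-first-retry',
    screenshot: 'only-on-failure'
  }
});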

That is a common and reasonable setup. The problem was that first attempts failed for two different reasons:

Sometimes the form was not editable yet.
Sometimes the test account was incomplete.

On retry, the same account state occasionally changed enough to pass, depending on background processing. The retry made the suite look flaky instead of deterministically broken.

Codex was helpful in spotting individual pieces, but the full diagnosis required a lot of context and multiple runs.
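To keep retries from blurring distinct failure modes, it can help to summarize first attempts separately from final outcomes. The sketch below assumes a hypothetical `AttemptRecord` collected for a single CI run (for example by a custom reporter) and a hypothetical `diagnosis` label written by instrumentation at failure time. Neither is a Playwright type, and the error message alone would not be enough here because both failure modes produced the same disabled-button assertion.

```typescript
// Hypothetical record of one test attempt within a single CI run.
interface AttemptRecord {
  testTitle: string;
  retry: number;       // 0 is the first attempt
  passed: boolean;
  diagnosis?: string;  // hypothetical label recorded by instrumentation on failure
}

interface ModeSummary {
  firstAttemptFailures: number;
  maskedByRetry: number; // first-attempt failures whose test later passed on a retry
}

// Groups first-attempt failures by diagnosis and counts how many of them
// were later hidden by a passing retry.
function summarizeFailureModes(attempts: AttemptRecord[]): Map<string, ModeSummary> {
  const passedOnRetry = new Set(
    attempts.filter(a => a.retry > 0 && a.passed).map(a => a.testTitle),
  );
  const summary = new Map<string, ModeSummary>();
  for (const attempt of attempts) {
    if (attempt.retry !== 0 || attempt.passed) continue;
    const mode = attempt.diagnosis ?? 'undiagnosed';
    const entry = summary.get(mode) ?? { firstAttemptFailures: 0, maskedByRetry: 0 };
    entry.firstAttemptFailures += 1;
    if (passedOnRetry.has(attempt.testTitle)) entry.maskedByRetry += 1;
    summary.set(mode, entry);
  }
  return summary;
}
```

A mode with a high `maskedByRetry` count is one that retries are hiding rather than fixing, which is the situation described above.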

Where AI coding assistant limits hurt test automation specifically

AI coding assistant limits are annoying in any development work. They hurt test automation in a specific way because tests are context-heavy.

A unit test failure often fits in a small prompt: function, expected output, actual output. Browser automation failures usually do not.

For a Playwright or Selenium framework, the relevant context can include:

The test file
The page object
Shared fixtures
Test data setup
Authentication state
Network mocks or API helpers
Browser differences
CI resource constraints
Trace artifacts
Screenshots and videos
Locator strategy
Product state and permissions
Framework conventions

The assistant is not only answering, “Why did this assertion fail?” It is trying to infer how the whole automation stack behaves.

That is why Playwright tests can be fast to generate with Codex but still expensive to maintain through an AI coding loop. Creating the first version of a test is often the easy part. Maintaining it through UI changes, data changes, retries, and CI differences is where the cost shows up.

The limit I hit was not just, “I cannot ask more questions for a while.” It was:

The debugging context was trapped in a thread I could not continue effectively.
The next assistant session would require re-explaining the framework.
Some useful intermediate reasoning was no longer actionable.
The team still needed the build fixed.

That is the operational risk of depending too heavily on a general coding assistant for test maintenance.

A practical pattern: use AI for hypotheses, not ownership

After that experience, I changed how I use AI coding assistants for test automation debugging.

I still use them, but I avoid making the assistant the only place where the debugging state exists.

Keep a local debugging note

For non-trivial failures, I maintain a short note in the repo or issue tracker:

text

Failure: account settings save button disabled

Observed

CI only, chromium shard 2
Fails waiting for Save changes to be enabled
Retries sometimes pass

Ruled out

Button selector is correct
Heading is visible before fill
Permissions API returns 200

Suspects

Form state not applied after permissions response
Incomplete account fixture
Background validation request

Changes tested

Wait for permissions response, reduced but did not eliminate failures
Added editable assertion on Billing contact
Updated account factory with billingCountry and contactEmail

This prevents the AI chat from becoming the source of truth.
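If the note needs to be updated by scripts or pasted into more than one tool, it could also be kept as structured data and rendered to text on demand. This is only a sketch under that assumption: `DebugNote`, `ruleOut`, and `renderNote` are hypothetical names, and a plain text file works just as well.

```typescript
// Hypothetical structure mirroring the sections of the note above.
interface DebugNote {
  failure: string;
  observed: string[];
  ruledOut: string[];
  suspects: string[];
  changesTested: string[];
}

// Returns a new note with the suspect moved to the ruled-out list.
function ruleOut(note: DebugNote, suspect: string): DebugNote {
  const alreadyRuledOut = note.ruledOut.indexOf(suspect) !== -1;
  return {
    ...note,
    suspects: note.suspects.filter(s => s !== suspect),
    ruledOut: alreadyRuledOut ? note.ruledOut : [...note.ruledOut, suspect],
  };
}

// Renders the note as plain text that can be pasted into an issue or a prompt.
function renderNote(note: DebugNote): string {
  const section = (title: string, items: string[]): string =>
    [title, ...items.map(item => `- ${item}`)].join('\n');
  return [
    `Failure: ${note.failure}`,
    section('Observed', note.observed),
    section('Ruled out', note.ruledOut),
    section('Suspects', note.suspects),
    section('Changes tested', note.changesTested),
  ].join('\n\n');
}
```

Because `ruleOut` returns a new object instead of mutating the old one, earlier versions of the note stay available if a ruled-out suspect has to be reopened.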

Ask for small patches, not framework rewrites

When an assistant has partial context, big refactors are risky. I prefer prompts like:

“Given this helper and failure, suggest one additional observable wait.”
“Review this fixture for data assumptions that could affect form editability.”
“Suggest logging that would distinguish permission delay from validation delay.”

That usually produces better results than asking it to redesign the page object model.

Add instrumentation that survives the chat

Instead of repeatedly pasting logs, I try to improve the test output itself.

For example:

typescript

import type { Page } from '@playwright/test';

// Logs a snapshot of the form state; each probe falls back to false if it throws.
async function logSaveButtonState(page: Page) {
  const button = page.getByRole('button', { name: 'Save changes' });
  console.log('Save button state', {
    visible: await button.isVisible().catch(() => false),
    enabled: await button.isEnabled().catch(() => false),
    billingEditable: await page.getByLabel('Billing contact').isEditable().catch(() => false),
    url: page.url()
  });
}

This is not elegant framework code, and I would not leave all of it permanently. But targeted instrumentation can make the next failure easier for a human or an assistant to diagnose.

Prefer behavior waits over arbitrary network waits

Playwright gives you strong locator and assertion primitives. The official Playwright documentation is worth revisiting when a framework starts accumulating custom waiting logic.

In this case, the best final wait was not a timeout and not only a response wait. It was a user-observable readiness condition:

typescript

await expect(page.getByLabel('Billing contact')).toBeEditable();
await expect(page.getByRole('button', { name: 'Save changes' })).toBeDisabled();

Then after modifying the field:

typescript

await page.getByLabel('Billing contact').fill('qa-billing@example.com');
await expect(page.getByRole('button', { name: 'Save changes' })).toBeEnabled();

This made the test read more like the product behavior.

The more a test reads like user behavior, the less it depends on private timing details.
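The difference between waiting for time and waiting for state can be shown without a browser. Playwright's web-first assertions already retry internally, and it also provides `expect.poll` and `toPass` for custom conditions, so the helper below is not something to add to a suite. It is a framework-free sketch, with the hypothetical name `pollUntil`, of what a state-based wait does: it returns as soon as the condition holds and fails loudly if it never does.

```typescript
// Repeatedly checks a condition until it is true or the timeout elapses.
// Resolves with the number of checks performed; rejects on timeout.
async function pollUntil(
  condition: () => boolean | Promise<boolean>,
  options: { timeoutMs?: number; intervalMs?: number } = {},
): Promise<number> {
  const { timeoutMs = 5000, intervalMs = 50 } = options;
  const startedAt = Date.now();
  let checks = 0;
  while (true) {
    checks += 1;
    if (await condition()) return checks;
    if (Date.now() - startedAt >= timeoutMs) {
      throw new Error(`Condition not met within ${timeoutMs}ms after ${checks} checks`);
    }
    // Pause briefly between checks instead of sleeping for one fixed delay.
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs));
  }
}
```

A fixed sleep always costs its full duration and still fails when the app is slower than expected. A state-based wait costs only as long as the state takes to appear, and its timeout error names the condition that was never met.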

The commercial question: should a team pay for more AI coding capacity?

For a CTO or QA leader, the first reaction might be simple: upgrade the plan, increase the limit, or buy more seats. Sometimes that is the right move. If your team spends hours per week writing and reviewing test code, AI coding assistant capacity can pay for itself quickly.

But the better question is not only, “Do we need more Codex usage?”

The better question is, “Why does every test automation change require a long AI coding loop in the first place?”

If your test strategy depends on engineers repeatedly asking a general coding assistant to update selectors, rewrite fixtures, adjust waits, fix page objects, and interpret flaky traces, you may be optimizing the wrong layer.

There are three costs to consider.

1. The token cost is not the real cost

Usage limits are visible, but engineering interruption is more expensive. A blocked debugging session means someone has to rebuild context later. That context switch hits SDETs, developers, and release managers.

2. The framework knowledge is fragile

A coding assistant only knows what you provide in the current context. Your team knows that createAccount() must include billing defaults, that admin users are feature-flagged differently, and that Firefox runs slower in one CI pool. Unless that knowledge is encoded in the framework or documentation, the assistant has to rediscover it.

3. Generated code still becomes your code

If Codex generates a Playwright helper, your team owns it. You own the locator strategy, retry behavior, abstractions, naming, fixtures, and failure output. That is fine for teams with strong SDET ownership. It is risky for teams that wanted AI to reduce maintenance but ended up with more code to maintain.

Why purpose-built AI test creation is different

This is where tools like Endtest, an agentic AI test automation platform with low-code/no-code workflows, deserve serious evaluation, especially for teams that are hitting AI coding assistant limits while trying to maintain browser automation.

Endtest is not just another way to ask an assistant for Playwright code. Its AI Test Creation Agent is purpose-built for test automation. You describe a scenario in plain English, and it generates a working end-to-end test inside the Endtest platform, with editable platform-native steps, assertions, and locators. The important part is that the output is not a blob of Playwright, Selenium, JavaScript, Python, or TypeScript source code that your team must wire into a custom framework. The generated test lands as Endtest steps that testers and engineers can inspect, edit, and run.

That distinction matters.

With a general coding assistant, the workflow is often:

Ask for Playwright or Selenium code.
Paste it into your framework.
Fix imports, fixtures, auth, and selectors.
Run locally.
Debug failures.
Paste errors back into the assistant.
Repeat until the usage limit or patience runs out.

With a purpose-built test automation platform such as Endtest, the goal is different:

Describe the user behavior.
Let the AI Test Creation Agent build editable test steps.
Review and adjust the steps in the test editor.
Run the test in the platform.
Maintain the test at the step and locator level instead of constantly modifying framework code.

That does not mean every team should abandon Playwright. I like Playwright. It is powerful, developer-friendly, and excellent for teams that want full control. But if the business goal is reliable end-to-end coverage rather than owning a bespoke automation framework, it is fair to compare the maintenance model. Endtest has a direct comparison page for teams evaluating Endtest versus Playwright, and it is worth reading if your current suite requires constant engineering attention.

The same applies to older Selenium stacks. Selenium is still widely used and well documented in the official Selenium documentation, but many Selenium frameworks carry years of custom waits, wrappers, driver management, and page object conventions. If your team is using a coding assistant mostly to patch that maintenance burden, the Endtest documentation on migrating from Selenium is a useful place to start.

Endtest also addresses two common maintenance pain points directly. Self Healing Tests can help when UI changes affect locators, and Visual AI can support visual validation without turning every visual check into custom framework code. The corresponding documentation for Self Healing Tests and Visual AI is worth reviewing during evaluation.

A realistic comparison: Codex plus Playwright versus Endtest

Here is the way I would frame the tradeoff for an engineering manager.

Codex plus Playwright is strong when you need code-level control

Use this path when:

Your SDETs are comfortable owning TypeScript or JavaScript.
Your app needs custom test utilities and API-level setup.
You want tests versioned with application code.
You need tight integration with developer workflows.
You are willing to maintain fixtures, page objects, reporters, and CI behavior.

In this model, Codex can accelerate implementation. It can suggest locators, refactor helpers, explain errors, and generate boilerplate. But the framework remains yours. When the assistant reaches a limit, your team must still be able to debug without it.

Endtest is strong when the bottleneck is test maintenance and authoring

Use this path when:

Your team wants more people contributing to test coverage, not only SDETs.
You do not want every test change to become a code review.
Your test maintenance burden is dominated by UI flows, selectors, and assertions.
You want AI-created tests that remain editable in a dedicated test platform.
You want to reduce dependency on long AI coding conversations for routine test updates.

The strongest argument for Endtest is not that code is bad. It is that end-to-end testing has a lot of repetitive framework work that many teams do not actually want to own.

The key question is not whether your team can maintain a custom test framework. It is whether maintaining that framework is the best use of the team’s time.

What I would do differently next time

If I were approaching the same debugging session again, I would still use Codex, but with stricter boundaries.

First, reduce the problem before involving AI

I would isolate the failing test and run it with trace enabled:

bash

npx playwright test tests/account-settings.spec.ts --project=chromium --trace=on

Then I would confirm whether the failure is selector-related, data-related, or timing-related before pasting anything into an assistant.
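A rough first pass at that triage can be written down as a rule of thumb over the same fields that `logSaveButtonState()` prints. The sketch below is a heuristic under stated assumptions, not a diagnosis: `SaveButtonSnapshot`, `FailureCategory`, and `classifySnapshot` are hypothetical names, and the snapshot is assumed to be taken at the moment the enabled assertion fails.

```typescript
// Hypothetical snapshot with the same fields that logSaveButtonState() logs.
interface SaveButtonSnapshot {
  visible: boolean;
  enabled: boolean;
  billingEditable: boolean;
}

type FailureCategory = 'selector' | 'timing' | 'data' | 'none';

// Maps a snapshot to the area worth investigating first.
function classifySnapshot(snapshot: SaveButtonSnapshot): FailureCategory {
  // Button not visible: check the locator and the navigation first.
  if (!snapshot.visible) return 'selector';
  // Form not editable yet: the page has not reached its ready state.
  if (!snapshot.billingEditable) return 'timing';
  // Form editable but save still disabled: suspect validation or fixture data.
  if (!snapshot.enabled) return 'data';
  return 'none';
}
```

The order of the checks matters: each later category only makes sense once the earlier one has been excluded.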

Second, create a minimal context packet

Instead of feeding the assistant the whole framework gradually, I would prepare a compact packet:

text

Goal: debug why Save changes remains disabled in CI.

Relevant behavior:

Account settings page loads with form disabled.
Permissions request determines editability.
Save button enables only after a field changes and form is valid.

Relevant files:

test: account-settings.spec.ts
helper: openAccountSettings()
fixture: createAccount()
config: retries 2 in CI, 4 workers

Observed:

Local pass
CI intermittent fail
Retry sometimes passes

This reduces wasted iterations and makes it easier to restart if the assistant limit is reached.
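Since logs are usually the largest part of what gets pasted, a packet like this could be assembled by a small script that also trims the log to the lines around the failure. This is a sketch only: `ContextPacket`, `excerptAround`, and `renderPacket` are hypothetical names, and the marker string and radius are assumptions to adjust per project.

```typescript
// Hypothetical structure mirroring the packet above, plus an optional log excerpt.
interface ContextPacket {
  goal: string;
  behavior: string[];
  files: { role: string; name: string }[];
  observed: string[];
  logExcerpt?: string;
}

// Keeps only the lines around the first line containing `marker`.
// Falls back to the tail of the log when the marker is not found.
function excerptAround(log: string, marker: string, radius = 3): string {
  const lines = log.split('\n');
  const index = lines.findIndex(line => line.includes(marker));
  if (index === -1) return lines.slice(-radius * 2).join('\n');
  return lines.slice(Math.max(0, index - radius), index + radius + 1).join('\n');
}

// Renders the packet as plain text, ready to paste into a fresh session.
function renderPacket(packet: ContextPacket): string {
  const bullets = (items: string[]): string => items.map(item => `- ${item}`).join('\n');
  const parts = [
    `Goal: ${packet.goal}`,
    `Relevant behavior:\n${bullets(packet.behavior)}`,
    `Relevant files:\n${bullets(packet.files.map(file => `${file.role}: ${file.name}`))}`,
    `Observed:\n${bullets(packet.observed)}`,
  ];
  if (packet.logExcerpt) parts.push(`Log excerpt:\n${packet.logExcerpt}`);
  return parts.join('\n\n');
}
```

Keeping the packet in a file next to the failing test means a new session, with any assistant or with a colleague, starts from the same compact context instead of a replayed conversation.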

Third, turn discoveries into framework improvements

The final fix should not live only in one test. If account settings always requires editability before interaction, the page object or helper should encode that:

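import { expect, type Page } from '@playwright/test';

// Ready means the heading is shown and the form accepts input.
export async function waitForAccountSettingsReady(page: Page) {
  await expect(page.getByRole('heading', { name: 'Account settings' })).toBeVisible();
  await expect(page.getByLabel('Billing contact')).toBeEditable();
}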

Then tests can be boring:

await openAccountSettings(page, account.id);
await waitForAccountSettingsReady(page);

Boring tests are good tests.

Fourth, review whether the framework is still worth owning

This is the uncomfortable part. If the team repeatedly needs AI assistance to maintain the framework, it may be a signal that the framework has become a product of its own.

Owning a test automation framework means owning:

Architecture
Dependencies
CI scaling
Browser upgrades
Selector conventions
Test data lifecycle
Reporting
Flake management
Onboarding
Documentation

For some teams, that ownership is strategic. For others, it is accidental.

The lesson: AI limits expose automation design problems

The Codex usage limit did not create the flaky test. It exposed a workflow problem.

We had allowed too much debugging context to accumulate inside an AI conversation. We also had framework assumptions that were not encoded clearly enough in tests, helpers, or documentation. When the assistant became unavailable, the remaining work was harder than it needed to be.

That is the main lesson I took from the experience:

AI coding assistants are useful for test automation, but they are not a substitute for maintainable test design.
Playwright and Selenium frameworks need explicit conventions around waits, data, fixtures, and logging.
Usage limits matter most when your workflow requires long, iterative debugging sessions.
Purpose-built AI testing platforms can be a better fit when the goal is reliable coverage without constant framework maintenance.

For teams choosing a direction, I would not frame this as “AI coding assistant versus test platform” in a simplistic way. Many teams will use both. A developer may still use Codex for API helpers and test utilities, while QA and product teams use Endtest for core user journeys. The right mix depends on who owns quality, how often the UI changes, and how much framework code the organization wants to maintain.

But I am much more cautious now about putting critical debugging flow inside a limited AI coding loop. When the test suite is red, the release is waiting, and the assistant says you have hit the limit, you find out very quickly whether your automation strategy is resilient or just temporarily accelerated.