Playbook
Why most enterprise AI pilots stall in month two
May 6, 2026
A 30-day pilot that ships is the easy part. The client I talk to six weeks later is often holding a working tool that no one is using. The model is fine. The output quality is fine. The workflow owner has stopped opening the spreadsheet.
This is not a scaling problem. It is not a model problem. It is not even, usually, a budget problem. It is a pattern I see repeatedly in mid-market companies between roughly day 31 and day 75 of an internal AI rollout. The pilot was built on scaffolding that only works for 30 days, and nobody noticed until the scaffolding came down.
Three failures account for most of the stalled pilots I get called in to diagnose. Each has a specific fix. None of them are expensive. All of them need to be decided before the 30-day pilot ships, not after it has gone quiet.
Why month 2 is where pilots die, not month 1
The first month of a pilot has its own gravity. There is a project manager. There is a Slack channel. There is a clear deliverable and a date. The workflow owner pays attention because they were told to, and because the thing is new.
Month 2 has none of that. The consultant has moved on or has pulled back to an advisory role. The IT team that helped with the initial data pull has gone back to their normal backlog. The executive who championed the pilot is now focused on the next board update. The workflow owner is alone with the tool, the tool is sitting in a folder, and the reasons the tool was built have started to blur.
What breaks in month 2 is not the technology. It is the operating rhythm. A pilot that was useful under close supervision stops being useful the moment the supervision ends, because the conditions that made it useful were supplied by the pilot itself, not by the workflow. The tool works. The workflow around it was never rebuilt to use the tool.
The pilots I am asked to resuscitate share a pattern. They tend to fail for one or more of three specific reasons.
Pattern 1: nobody owns the data refresh
The 30-day pilot ran on a CSV export pulled once, at the start of sprint 2, from SAP or ServiceNow or a custom internal database. The export produced a clean file with 90 days of historical data. The tool worked well on that file.
The export was a one-time pull. Nobody wrote a job to refresh it. Nobody assigned an owner to the refresh. When the workflow owner wants to run the tool on last week's data, there is no last week's data, because last week's data lives in SAP and getting it out requires the IT ticket that took two weeks to schedule the first time.
This is the single most common stall pattern. The pilot shipped on static data. The workflow it was meant to support runs on live data. The gap between those two things was not part of anyone's job description, so it did not get filled.
The fix is to write the refresh job into the original pilot scope. Before the 30-day pilot starts, the workflow owner and the IT contact need to agree on two things. First, where the refreshed export will land and on what cadence. Daily is usually right. Weekly is acceptable for low-frequency workflows. Monthly is almost never acceptable because the workflow owner forgets about the tool between refreshes. Second, who gets paged when the refresh fails. A refresh with no owner fails silently, and a tool with silently failing inputs produces silently wrong outputs until somebody notices, which in a mid-market company is often never.
A refresh job is a 2 to 4 hour build for most source systems. It is not a separate project. Scope it into the pilot or plan for the pilot to stall at week 6.
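For concreteness, here is a minimal sketch of what that refresh job can look like, assuming the source system exposes a CSV export over HTTP. The URL, landing path, and webhook are placeholders for whatever your IT contact actually sets up; the point is the shape: pull, validate, land, and page a named owner when it fails.

```python
# refresh_export.py - minimal daily refresh sketch.
# EXPORT_URL, LANDING_DIR, and ALERT_WEBHOOK are hypothetical placeholders.
import datetime
import pathlib
import sys

import requests  # pip install requests

EXPORT_URL = "https://reports.example.internal/export/tickets.csv"  # hypothetical
LANDING_DIR = pathlib.Path("/shared/ai-pilot/exports")              # hypothetical
ALERT_WEBHOOK = "https://hooks.slack.com/services/XXX"              # hypothetical

def alert(message: str) -> None:
    # Page the named refresh owner; a silent failure is the failure mode.
    requests.post(ALERT_WEBHOOK, json={"text": f"[refresh] {message}"}, timeout=10)

def main() -> None:
    today = datetime.date.today().isoformat()
    dest = LANDING_DIR / f"tickets_{today}.csv"
    try:
        LANDING_DIR.mkdir(parents=True, exist_ok=True)
        resp = requests.get(EXPORT_URL, timeout=60)
        resp.raise_for_status()
        if not resp.content:
            raise ValueError("export returned an empty file")
        dest.write_bytes(resp.content)
    except Exception as exc:
        alert(f"refresh failed: {exc}")
        sys.exit(1)

if __name__ == "__main__":
    main()
```

Schedule it with whatever the IT team already runs, a cron entry or a scheduled task. Most of the 2 to 4 hours goes into source-system authentication, not a script like this.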
Pattern 2: there is no accuracy measurement process running
Month 1 output quality is verified by the consultant. The workflow owner reviews a sample of outputs with the consultant in the room, the consultant tunes the prompt, and by day 20 the outputs are good enough to ship. On day 31, the consultant leaves.
The question that nobody has a process for is: are the outputs still good enough? The workflow owner looks at a few outputs the first week, finds them reasonable, and stops checking. Two months later, the model provider ships a minor update that changes how the model handles tables, and the tool starts returning outputs that are subtly worse. Nobody notices for another month, because nobody is checking.
This failure mode is especially common when the workflow owner's previous process was manual synthesis, because manual work carries its own quality signal. They knew when their own output was good. They have no equivalent signal for the AI's output, so they assume it is fine until something breaks loudly.
The fix is to build a measurement routine into the handoff, not to write a measurement tool. A tool is overkill for most mid-market pilots. A routine looks like this. The workflow owner pulls five outputs a week. They review them for accuracy against the source material. They log a single row per review in a shared spreadsheet: date, input name, output quality rating on a 1 to 5 scale, and a one-sentence note about anything wrong. That is the measurement process.
It takes 20 minutes a week. It produces a paper trail that shows whether output quality is drifting. It forces the workflow owner to look at the outputs, which is the thing that keeps the tool alive in their head.
A pilot without a weekly review habit has no way to tell whether the tool is still working. A pilot with one does, and the habit itself is most of what keeps the tool in use.
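If a shared spreadsheet feels too loose, the same log can live in a plain CSV with a small script around it. This is a sketch, not a prescription: the file path, the ten-review window, and the half-point drift threshold are all illustrative choices.

```python
# review_log.py - append one row per spot check and flag possible drift.
import csv
import datetime
import pathlib
import statistics
import sys

LOG_PATH = pathlib.Path("review_log.csv")  # hypothetical shared location

def log_review(input_name: str, rating: int, note: str) -> None:
    if not 1 <= rating <= 5:
        raise ValueError("rating must be 1-5")
    new_file = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["date", "input", "rating", "note"])
        writer.writerow([datetime.date.today().isoformat(), input_name, rating, note])

def drift_check(window: int = 10) -> None:
    # Compare the mean of the last `window` ratings to the overall mean.
    with LOG_PATH.open() as f:
        ratings = [int(row["rating"]) for row in csv.DictReader(f)]
    if len(ratings) < window * 2:
        return  # not enough history yet
    recent, overall = statistics.mean(ratings[-window:]), statistics.mean(ratings)
    if recent < overall - 0.5:  # arbitrary threshold; tune to taste
        print(f"Quality may be drifting: recent {recent:.1f} vs overall {overall:.1f}")

if __name__ == "__main__":
    # usage: python review_log.py "report_2026-04-30" 4 "missed one table row"
    log_review(sys.argv[1], int(sys.argv[2]), sys.argv[3])
    drift_check()
```

The drift check is the only automation worth adding. The weekly human look at the outputs is the part that cannot be automated away.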
Pattern 3: the workflow owner never did a side-by-side with their own output
This pattern is less visible than the other two, and it does more damage. The 30-day pilot produces AI outputs. The workflow owner looks at the outputs. The workflow owner agrees the outputs look reasonable. The pilot ships.
What did not happen is a side-by-side comparison between the AI output and the workflow owner's own manual output on the same input. The workflow owner never sat down with an input they had previously processed by hand, ran it through the tool, and compared the two outputs line by line.
Without that comparison, the workflow owner does not know which parts of their own judgment the tool is replicating and which parts it is missing. They do not know where they still need to add their expertise. They do not know what to check. They approve the output in the abstract, send it to the next step, and lose track of whether the tool is actually helping or quietly removing the parts of the work they were best at.
Six weeks later, the outputs start to feel off in ways the workflow owner cannot articulate, so they stop using the tool. Not because it is wrong. Because they cannot tell when it is wrong.
The fix is a structured side-by-side before the pilot ships. During the final week of the 30-day pilot, the workflow owner picks three inputs they have previously processed manually and still have the outputs for. They run each through the tool and compare the two outputs side by side. They mark every place the AI output differs from their own, and for each difference they write one line answering a single question: does the AI version miss something I would have caught, add something I would have cut, or make a stylistic choice I am fine with?
This is a 90-minute exercise. It produces a personal calibration document that the workflow owner keeps. It tells them exactly where they still need to add judgment and where they can trust the tool. Pilots with this document in month 2 keep running. Pilots without it degrade into vague dissatisfaction within six weeks.
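The line-by-line comparison itself can be mechanized with a standard diff, assuming both outputs exist as plain text files. A rough sketch follows; the file names are hypothetical, and the one-line judgment per difference stays a human job.

```python
# side_by_side.py - diff a manual output against the AI output for the same input.
import difflib
import pathlib
import sys

def side_by_side(manual_path: str, ai_path: str) -> None:
    manual = pathlib.Path(manual_path).read_text().splitlines()
    ai = pathlib.Path(ai_path).read_text().splitlines()
    diff = difflib.unified_diff(manual, ai, fromfile="manual", tofile="ai", lineterm="")
    for line in diff:
        print(line)  # each -/+ pair is a difference to annotate by hand

if __name__ == "__main__":
    # usage: python side_by_side.py q1_summary_manual.txt q1_summary_ai.txt
    side_by_side(sys.argv[1], sys.argv[2])
```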
The month 2 diagnostic
If your pilot shipped on day 30 and is stalling on day 50, the question is usually not whether the tool works. The question is whether the three pieces above exist. A named owner on the data refresh. A weekly review habit on the output quality. A side-by-side calibration document in the workflow owner's hands.
If any of the three is missing, that is the thing to build next. Not a better model. Not a prettier UI. The scaffolding the pilot was running on needs to be rebuilt as standing structure.
Work with me
If you have a pilot that shipped and stopped, I run a 60-minute month 2 diagnostic that walks through the three patterns above against your specific tool and identifies which one is the active stall. You leave with a short list of what to fix and a rough estimate of the hours each fix takes. Book it at the link below.