How to evaluate an AI consultant in one phone call: 5 questions that filter for actual builders

May 20, 2026

The AI consulting market has a specific problem right now. There are a lot of people who have done impressive things with AI in personal projects, demos, and blog posts. There are far fewer people who have connected an AI model to production data, handled the edge cases that show up in real use, dealt with the security review, and left working code behind when they walked out.

The questions below are designed to tell those two groups apart in 20 to 30 minutes. They work whether you ever hire me or not. If you are evaluating any AI consultant right now, these are the questions I would ask.

Question 1: Can you describe a production AI system you shipped, including what broke on first real use?

Good answers name a specific product, a specific user base, and at least one thing that did not work correctly on initial deployment. Not a vague "we encountered some challenges." Something like: "The first version of the prompt returned inconsistent output format when the input document was shorter than 200 words; I added a length check and a fallback format instruction and that fixed 90% of the failures."
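To make that example answer concrete, here is a minimal sketch of a length check with a fallback format instruction. The 200-word threshold comes from the quoted answer; the prompt wording, the JSON keys, and the function name build_prompt are hypothetical and stand in for whatever a real project would use.

```python
# Sketch only: prompt text, JSON keys, and names are illustrative assumptions.
MIN_WORDS = 200  # threshold taken from the example answer above

BASE_INSTRUCTION = (
    "Summarize the document below as JSON with the keys 'title' and 'summary'."
)
FALLBACK_INSTRUCTION = (
    "The document is short. Still return valid JSON with the keys 'title' and "
    "'summary', and use an empty string for anything you cannot determine."
)


def build_prompt(document: str) -> str:
    """Build the prompt, adding the fallback format instruction for short inputs."""
    word_count = len(document.split())
    parts = [BASE_INSTRUCTION]
    if word_count < MIN_WORDS:
        # Short inputs were the ones that produced inconsistent output formats.
        parts.append(FALLBACK_INSTRUCTION)
    parts.append("Document:\n" + document)
    return "\n\n".join(parts)
```

A consultant who has actually shipped this kind of fix can usually tell you how they found the threshold, not just that one exists.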

Bad answers describe demos, prototypes, or tools that were built but never connected to real data. The consultant says things like "we validated the concept" or "the pilot showed strong results" without describing what the tool does today in production.

Why this question works: Every real production AI system has a story about what broke on first contact with real users or real data. If the consultant cannot tell that story, they have not shipped a production system. The story does not need to be a disaster; it just needs to be specific.

What to do if you get a bad answer: Ask directly whether the system is currently in production with real users. If the answer is no, ask what it would take to get it there. The quality of that answer tells you whether the consultant understands production deployment or has only built demos.

Question 2: How did you handle data that should not leave the client's network?

Good answers describe a specific decision about data handling made on a real project. The consultant might say: "The client was on Azure, so we used Azure OpenAI with their existing data processing agreement rather than the direct API. Setup took about a week longer, but all inference stayed within their tenant." Or: "The data was not PII and was already public-facing, so standard API with a DPA was fine. We verified this with the client's legal team before writing any code."

Bad answers either skip the question entirely ("we used the OpenAI API") or give a generic answer about "enterprise-grade security" without describing what was actually decided on an actual project.

Why this question works: Data handling is the first serious question a corporate security team will ask about any AI pilot. A consultant who has done real enterprise work has had this conversation and made a specific decision. A consultant who has only done demos has not.

What to do if you get a bad answer: Ask them to walk you through how they would handle your data specifically. If they cannot describe a decision path that ends with a concrete answer about where your data goes, you are likely talking to someone who will need to figure this out during your engagement.

Question 3: What does your handoff documentation look like?

Good answers describe something specific. A README with setup instructions and environment variable list. A document explaining what the prompt does and what edge cases were identified during testing. A data dictionary for any tables or schemas the AI integration touches. Ideally, the consultant offers to show you an example from a prior project (with the client's name redacted).

Bad answers are vague about handoff, describe handoff as "training the team," or treat documentation as an afterthought separate from the deliverable.

Why this question works: AI systems degrade in specific ways that differ from traditional software. Prompts that worked six months ago may produce different output when the underlying model is updated. Data distributions shift. The handoff documentation is what allows your internal team to diagnose and fix these issues without calling the consultant back. A consultant who does not think about handoff documentation is building dependency, not capability.

What to do if you get a bad answer: Ask whether the engagement fee covers documentation, or whether documentation is separate. If documentation is not included in the standard engagement, that is a negotiation point, not a disqualifier. But the consultant should know where handoff documentation fits in their process.

Question 4: How do you test whether the AI output is accurate enough to trust?

Good answers describe a process for evaluating output quality on real data before shipping. This might be a labeled test set the consultant built manually, a human review workflow during the first week of production, or a specific accuracy threshold the consultant and client agreed on before launch. The key is that there is a process, not just a vibe.
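One way such a process might look in code is sketched below: score the system against a manually labeled test set and compare the result to a threshold agreed on before launch. The function names, the stub predictor, the sample tickets, and the threshold value are all assumptions for illustration, not a standard.

```python
from typing import Callable, Iterable, Tuple


def accuracy(predict: Callable[[str], str],
             labeled: Iterable[Tuple[str, str]]) -> float:
    """Fraction of labeled examples where the prediction matches the label."""
    examples = list(labeled)
    if not examples:
        raise ValueError("labeled test set is empty")
    correct = sum(1 for text, expected in examples if predict(text) == expected)
    return correct / len(examples)


def ready_to_ship(predict: Callable[[str], str],
                  labeled: Iterable[Tuple[str, str]],
                  threshold: float) -> bool:
    """True when measured accuracy meets the threshold agreed on with the client."""
    return accuracy(predict, labeled) >= threshold


# Hypothetical stand-in for the real model call.
def stub_predict(text: str) -> str:
    return "billing" if "invoice" in text else "other"


# Hypothetical hand-labeled examples: (input text, expected label).
TEST_SET = [
    ("invoice overdue", "billing"),
    ("reset my password", "other"),
    ("invoice and login problem", "account"),
    ("hello", "other"),
]

print(ready_to_ship(stub_predict, TEST_SET, threshold=0.7))
```

The exact metric matters less than the fact that the number was written down before launch and the test set came from real data.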

Bad answers describe evaluation entirely in terms of "it looked good" or "the users liked it." Vague positive feedback is not a quality standard. It means nobody measured anything.

Why this question works: AI models are probabilistic. They can produce different outputs for the same or similar inputs, and they fail on kinds of input they were never tested on. A consultant who has shipped real AI systems knows this and has a process for managing it. Someone who has only built demos has not had to answer for wrong outputs in production.

What to do if you get a bad answer: Ask specifically how you would know, after the first month of production use, whether the AI output is worse than it was at launch. If the consultant cannot describe a measurement, they have not built a production system.
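A measurement of this kind can be very simple. The sketch below assumes a human reviews a sample of outputs each period and marks each one correct or incorrect, then compares that sample against the accuracy recorded at launch. The function names and the 5-point tolerance are hypothetical choices, not a recommendation.

```python
from typing import Sequence


def sample_accuracy(reviews: Sequence[bool]) -> float:
    """Share of human-reviewed outputs marked correct (True)."""
    if not reviews:
        raise ValueError("no reviewed outputs in this sample")
    return sum(reviews) / len(reviews)


def has_degraded(launch_accuracy: float,
                 reviews: Sequence[bool],
                 tolerance: float = 0.05) -> bool:
    """True when this period's accuracy is more than `tolerance` below launch."""
    return sample_accuracy(reviews) < launch_accuracy - tolerance


# Hypothetical review results from the fourth week: 8 correct, 2 incorrect.
week_4_reviews = [True] * 8 + [False] * 2
print(has_degraded(0.90, week_4_reviews))
```

With small samples the result is noisy, so a real version would want a larger review sample or a trend over several periods before raising an alarm.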

Question 5: What is the biggest limitation of the AI approach for my specific use case?

Good answers are honest about failure modes. The consultant might say: "RAG-based document search works well for your use case, but it will return plausible-sounding wrong answers when the question is not covered by the indexed documents. You need a human review step for anything high-stakes." Or: "An AI classifier works well for your ticket-routing use case, but it will struggle with tickets that span multiple categories. Plan for a 'needs human review' bucket that starts at around 15% of volume and shrinks as you tune the model."
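The "needs human review" bucket in the second example could be implemented roughly as follows. This sketch assumes the classifier returns a score per category, and it sends a ticket to review when the top score is low or when two categories score close together, which is one plausible signal that a ticket spans categories. The thresholds, labels, and function name are hypothetical, and the 15% figure in the quote is not derived from them.

```python
from typing import Dict

REVIEW_BUCKET = "needs_human_review"


def route_ticket(scores: Dict[str, float],
                 min_confidence: float = 0.7,
                 min_margin: float = 0.2) -> str:
    """Return the top category, or the review bucket when the call is uncertain."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return REVIEW_BUCKET
    top_label, top_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    too_uncertain = top_score < min_confidence
    # A small gap between the top two scores suggests a multi-category ticket.
    too_close = (top_score - runner_up_score) < min_margin
    if too_uncertain or too_close:
        return REVIEW_BUCKET
    return top_label


print(route_ticket({"billing": 0.9, "technical": 0.05}))
print(route_ticket({"billing": 0.5, "technical": 0.45}))
```

Tuning then means adjusting the two thresholds against reviewed tickets and watching the share of volume that lands in the review bucket.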

Bad answers try to close the sale by minimizing limitations. "AI is very capable for this" or "the technology is mature enough to handle this" without naming a specific limitation.

Why this question works: Every AI application has at least one specific failure mode. A consultant who will tell you what it is before you sign is more trustworthy than one who discovers it after your first production incident. How well they describe the limitation also tells you whether they understand the technology they are selling.

What to do if you get a bad answer: Push harder. "I understand AI is capable. What specifically will it get wrong in the first month of production use?" If the consultant still cannot name a specific failure mode, they are not ready to own the production outcome.

Why these questions work as a set

Any one of these questions can be deflected by a well-prepared consultant who has built demos but never shipped to production. The set of five is harder to fake because each question asks for specific detail about a specific situation. Demos do not involve production data-handling decisions. Demos do not have handoff documentation. Demos do not have accuracy measurement processes.

The questions also send a useful signal about you as a client. A company that asks these questions before signing is communicating that they expect production-grade work, not a demo. That attracts consultants who can deliver it and filters out the ones who cannot.

Work with me

If you want to run these questions on me, I will give you direct answers on all five with specific project examples. Book 30 minutes and bring your use case.