Is the model the problem when a chatbot fails?

Almost never in 2026. Models are now capable enough that they are rarely the bottleneck. Bad chatbots fail at retrieval, evals, escalation, observability, or scope. Swapping Claude for GPT, or vice versa, rarely fixes anything if the surrounding system is wrong.

Can I fix a bad chatbot or do I need to rebuild?

Depends on what's wrong. If retrieval is missing or bad, you can usually retrofit it. If there are no evals, you can add them (and you'll learn how bad the chatbot really is). If the scope is wrong, you may need to rebuild with a clearer focus. Most chatbots are fixable in place; some need a fresh start.

How long does it take to fix a chatbot?

An audit (one week) tells us what's wrong. Retrofitting retrieval, evals, and observability takes 2-4 weeks for a typical scope. A full rebuild takes 4-10 weeks, depending on knowledge base size and integration count. The audit is the cheap step that lets you decide whether to fix or rebuild.

Why your AI chatbot fails (and what to fix)

May 12, 2026 · updated May 21, 2026 · 5 min read

The six root causes

After auditing dozens of failed chatbots, almost all of them fail for one of these six reasons. The fix in each case is engineering, not prompt tuning.

1. No retrieval

The chatbot is just an LLM with a system prompt. No access to your data. So it hallucinates plausible-sounding garbage when asked about anything specific.

Symptom: it gets generic questions right ("what's the company's mission?") and specific ones wrong ("what's my account balance?").

Fix: add proper RAG. Vector search over your real docs / database. See RAG patterns that survive production.
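
A minimal sketch of the retrieval step, assuming a toy bag-of-words similarity in place of a real embedding model and vector store. The names `DOCS`, `retrieve`, and `build_prompt` are hypothetical, as is the sample content. The only point is that the model sees your data before it answers.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base; in practice these are chunks of your real docs.
DOCS = {
    "refunds": "Refunds are issued within 14 days of a return request.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
    "mission": "Our mission is to make home repairs simple.",
}

def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words; a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the IDs of the k documents most similar to the question."""
    q = tokenize(question)
    ranked = sorted(DOCS, key=lambda d: cosine(q, tokenize(DOCS[d])), reverse=True)
    return ranked[:k]

def build_prompt(question: str) -> str:
    """Put retrieved text in front of the question so the model can ground its answer."""
    context = "\n".join(f"[{d}] {DOCS[d]}" for d in retrieve(question))
    return f"Answer only from the context below.\n{context}\n\nQuestion: {question}"
```

Calling `build_prompt("How long do refunds take?")` puts the refunds document at the top of the context. A production system would swap `cosine` over word counts for a vector index, but the shape of the call stays the same.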

2. Bad retrieval

The chatbot has retrieval, but the chunks are wrong-sized and there's no reranking, no metadata filtering, no hybrid search. The top-3 chunks are irrelevant, so the model has nothing useful to work with.

Symptom: it sometimes answers correctly when the right chunk happens to surface, and sometimes invents answers when it doesn't.

Fix: tune chunk sizes per document type. Add BM25 hybrid search. Rerank with a cross-encoder. Filter by metadata. See the RAG patterns post.
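
One way to combine keyword and vector results is reciprocal rank fusion. The article does not say which merge method to use, so treat this as an illustrative choice. The sketch assumes you already have two ranked lists of chunk IDs (from a BM25 index and a vector index); the IDs, the `metadata` shape, and the function names are hypothetical. The fused shortlist is what you would then pass to a cross-encoder reranker, which is not shown here.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one.

    Each chunk scores 1 / (k + rank) for every list it appears in.
    k=60 is a commonly used default, not a tuned value.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda c: scores[c], reverse=True)

def filter_by_metadata(chunk_ids: list[str], metadata: dict, **required) -> list[str]:
    """Keep only chunks whose metadata matches every required key/value."""
    return [
        c for c in chunk_ids
        if all(metadata.get(c, {}).get(key) == val for key, val in required.items())
    ]

# Hypothetical outputs of a BM25 index and a vector index for the same query.
bm25_hits = ["pricing-2", "refund-1", "faq-7"]
vector_hits = ["refund-1", "refund-3", "pricing-2"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Chunks that rank well in both lists rise to the top, which is the behaviour you want when either search alone keeps surfacing irrelevant chunks.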

3. No evals

Every prompt change ships and someone hopes. The bot got better at the question they personally tested, and worse at three others nobody tested. Drift accumulates silently.

Symptom: support tickets about chatbot accuracy keep rising. Nobody can say whether a prompt change last week made things better or worse.

Fix: build an eval set of 50-200 representative questions with reference answers or rubrics. Run in CI. Surface scores on a dashboard.
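
A minimal sketch of what such an eval run could look like, assuming the simplest possible rubric: phrases a correct answer must contain. Real rubrics are often richer (reference answers, graded scoring). `EvalCase`, `run_evals`, `check_threshold`, and `fake_bot` are hypothetical names, and `fake_bot` stands in for the call to your actual chatbot.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    question: str
    must_contain: list[str]  # simple rubric: phrases a correct answer includes

def run_evals(bot: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose answer satisfies the rubric."""
    passed = 0
    for case in cases:
        answer = bot(case.question).lower()
        if all(phrase.lower() in answer for phrase in case.must_contain):
            passed += 1
    return passed / len(cases) if cases else 0.0

def check_threshold(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """CI gate: False if the score dropped more than `tolerance` below baseline."""
    return score >= baseline - tolerance

# Hypothetical stand-in for the real chatbot call.
def fake_bot(question: str) -> str:
    if "refund" in question.lower():
        return "Refunds are issued within 14 days."
    return "I am not sure."

CASES = [
    EvalCase("How long do refunds take?", ["14 days"]),
    EvalCase("How long does shipping take?", ["3 to 5 business days"]),
]
score = run_evals(fake_bot, CASES)
```

In CI, the job would fail when `check_threshold(score, baseline)` returns `False`, so a prompt change that helps one question and hurts three others is caught before it ships.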

4. No escalation

The bot tries to handle questions it shouldn't — billing disputes, legal advice, complex troubleshooting, complaints. It infuriates users who just want a human.

Symptom: customer complaints about "your chatbot wouldn't transfer me." High frustration even on otherwise-resolved interactions.

Fix: define explicit escalation intents. Warm transfer with conversation context attached so the human doesn't restart. Outside business hours, queue callbacks.
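
The routing rule above can be sketched as follows. The intent labels, the 09:00-17:00 weekday business hours, and the `Handoff` structure are all assumptions for illustration. A real system would get the intent from a classifier and hand the payload to its helpdesk integration.

```python
from dataclasses import dataclass, field
from datetime import datetime, time
from typing import Optional

# Hypothetical intent labels that should always go to a human.
ESCALATION_INTENTS = {"billing_dispute", "legal", "complaint", "human_request"}

@dataclass
class Handoff:
    mode: str    # "warm_transfer" or "callback"
    intent: str
    transcript: list[str] = field(default_factory=list)

def route(intent: str, transcript: list[str], now: datetime,
          open_at: time = time(9), close_at: time = time(17)) -> Optional[Handoff]:
    """Return a Handoff for escalation intents, or None to let the bot answer."""
    if intent not in ESCALATION_INTENTS:
        return None
    # Assumed business hours: Monday to Friday, open_at up to close_at.
    in_hours = now.weekday() < 5 and open_at <= now.time() < close_at
    mode = "warm_transfer" if in_hours else "callback"
    # The transcript travels with the handoff so the human does not restart.
    return Handoff(mode=mode, intent=intent, transcript=list(transcript))
```

For example, `route("complaint", history, datetime(2026, 5, 12, 10, 0))` yields a warm transfer, while the same call at 22:00 queues a callback.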

5. No observability

When something goes wrong, you have no way to know what happened. No per-conversation trace. No cost attribution. No ability to replay.

Symptom: you hear about failures from users, not from your monitoring. Investigations take days.

Fix: Langfuse or equivalent. Every conversation gets a trace. Errors surface in Sentry. Daily dashboard shows volume, success rate, cost, latency.
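
To make "every conversation gets a trace" concrete, here is a tool-agnostic sketch of the data a trace might hold. It deliberately does not use the Langfuse or Sentry APIs; `Span`, `Trace`, and the field names are hypothetical, and a real setup would send these records to your tracing backend instead of keeping them in memory.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from typing import Any, Callable, Optional

@dataclass
class Span:
    name: str                    # e.g. "retrieval", "llm_call"
    latency_ms: float
    cost: float = 0.0            # cost attributed to this step, in your currency
    error: Optional[str] = None

@dataclass
class Trace:
    conversation_id: str
    spans: list[Span] = field(default_factory=list)

    def record(self, name: str, fn: Callable[[], Any], cost: float = 0.0) -> Any:
        """Run fn(), timing it and recording any exception on the span."""
        start = time.perf_counter()
        error = None
        try:
            return fn()
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            latency = (time.perf_counter() - start) * 1000
            self.spans.append(Span(name, latency, cost, error))

    def summary(self) -> dict:
        """Per-conversation totals for a daily dashboard."""
        return {
            "conversation_id": self.conversation_id,
            "total_cost": round(sum(s.cost for s in self.spans), 6),
            "total_latency_ms": sum(s.latency_ms for s in self.spans),
            "failed": any(s.error for s in self.spans),
        }

    def to_json(self) -> str:
        """Serialize the full trace so the conversation can be replayed later."""
        return json.dumps(asdict(self))
```

Wrapping each step as `trace.record("retrieval", lambda: search(query))` (where `search` is your own function) gives you latency, cost, and errors per conversation without changing the step itself.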

6. No scope

The chatbot promises to "help with anything." Inevitably someone asks something off-domain (medical advice from an e-commerce bot, legal advice from a support bot) and the LLM helpfully obliges.

Symptom: occasional embarrassing screenshots on social media. Compliance complaints.

Fix: explicit scope in the system prompt. Refusal patterns ("I can help with X and Y. For Z, please contact a human."). Topic classification on user input; off-scope → polite refusal + escalation.
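
A minimal sketch of the input-side check, assuming keyword matching as a stand-in for a real topic classifier. The topics, keywords, and refusal wording are hypothetical examples for an e-commerce support bot.

```python
import re

# Hypothetical scope for an e-commerce support bot.
IN_SCOPE = {
    "orders": {"order", "delivery", "shipping", "tracking"},
    "returns": {"return", "refund", "exchange"},
}
OUT_OF_SCOPE = {
    "medical": {"symptom", "diagnosis", "medication", "dosage"},
    "legal": {"lawsuit", "sue", "contract", "liability"},
}
REFUSAL = "I can help with orders and returns. For anything else, please contact a human."

def classify(message: str) -> str:
    """Return the first matching topic; off-scope topics are checked first."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    for topic, keywords in {**OUT_OF_SCOPE, **IN_SCOPE}.items():
        if words & keywords:
            return topic
    return "unknown"

def guard(message: str) -> tuple[bool, str]:
    """Return (allowed, topic) for in-scope input, else (False, refusal text).

    Unknown topics are refused too; that is a conservative assumption
    and could instead trigger escalation.
    """
    topic = classify(message)
    if topic in IN_SCOPE:
        return True, topic
    return False, REFUSAL
```

The check runs before the LLM is called, so an off-domain question never reaches a model that would helpfully answer it.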

What "fix" looks like in practice

For a typical failed chatbot, the fix is a 4-6 week engagement:

Week	Work
1	Audit: identify which of the six causes apply. Build eval set against the current bot to quantify baseline.
2	Retrieval fix: chunking, hybrid search, reranking, metadata.
3	Evals + observability: CI integration, dashboard, alerting.
4	Scope + escalation: system prompt rewrite, refusal patterns, warm transfer integration.
5-6	Tuning + rollout: phased deploy, monitor metrics, iterate.

Most clients see meaningful improvement (10-30% on accuracy evals, 50%+ on user CSAT) within the first month.

What "rebuild" looks like

When the existing chatbot is unfixable in place — usually because it was built on a no-code platform with no extensibility or because the architecture has rotted past repair — we rebuild. Same principles, fresh start:

Week	Work
1-2	Discovery: knowledge base audit, eval set, scope definition.
3-4	Foundation: retrieval pipeline, observability, evals.
5-7	Conversation flow: tools, escalation, scope refusals.
8-10	Integration + production hardening + phased rollout.

A rebuild costs more (~€25-50k for a typical conversational agent) but you end up with a system you can actually operate.

Things that won't fix it

A short list of "fixes" that don't work and waste money:

Swapping the model. GPT vs Claude vs Gemini doesn't fix a missing retrieval layer.
More prompt engineering. A twentieth iteration of the system prompt doesn't compensate for having no evals.
Adding more knowledge. A bigger vector store doesn't help if retrieval is broken.
Custom UI redesign. A pretty chatbot that gives wrong answers is still a wrong chatbot.
Buying a new chatbot SaaS. New vendor, same failure modes, longer contract.

The diagnostic test

Want to know if your chatbot is one of the failed ones? Five questions:

Can you see per-conversation traces?
Do you have an eval suite that runs on every change?
Can you tell me the auto-resolution rate without guessing?
Does the chatbot cite its sources?
Does the system have explicit escalation paths to humans?

Three or more "no" answers → you have a failed chatbot. The good news: it's almost always fixable.
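
The scoring rule is simple enough to write down. A small sketch, with `diagnose` as a hypothetical helper name:

```python
DIAGNOSTIC_QUESTIONS = [
    "Can you see per-conversation traces?",
    "Do you have an eval suite that runs on every change?",
    "Can you tell me the auto-resolution rate without guessing?",
    "Does the chatbot cite its sources?",
    "Does the system have explicit escalation paths to humans?",
]

def diagnose(answers: list[bool]) -> str:
    """answers[i] is True for 'yes' to question i. Three or more 'no' means failed."""
    if len(answers) != len(DIAGNOSTIC_QUESTIONS):
        raise ValueError("expected one answer per question")
    no_count = answers.count(False)
    return "failed" if no_count >= 3 else "ok"
```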

Where to go next

For deeper retrieval engineering, see the RAG patterns post. For how production AI agents (including conversational ones) are architected, see How AI agents actually work. For the buyer's perspective on bringing in an agency, see 12 questions to ask.

If you have a chatbot that needs fixing, drop us a note. We'll come back within a business day with an honest take on whether it's a fix or a rebuild.
