Chatbots

Why your AI chatbot fails (and what to fix)

· updated May 21, 20265 min read

The six root causes

After auditing dozens of failed chatbots, almost all of them fail for one of these six reasons. The fix in each case is engineering, not prompt tuning.

1. No retrieval

The chatbot is just an LLM with a system prompt. No access to your data. So it hallucinates plausible-sounding garbage when asked about anything specific.

Symptom: it gets generic questions right ("what's the company's mission?") and specific ones wrong ("what's my account balance?").

Fix: add proper RAG. Vector search over your real docs / database. See RAG patterns that survive production.

2. Bad retrieval

The chatbot has retrieval but the chunks are wrong-sized, there's no reranking, no metadata filtering, no hybrid search. Top-3 chunks are irrelevant. The model has nothing useful to work with.

Symptom: it sometimes answers correctly when the right chunk happens to surface, and sometimes invents answers when it doesn't.

Fix: tune chunk sizes per document type. Add BM25 hybrid search. Rerank with a cross-encoder. Filter by metadata. See RAG patterns post.

3. No evals

Every prompt change ships and someone hopes. The bot got better at the question they personally tested, and worse at three others nobody tested. Drift accumulates silently.

Symptom: support tickets about chatbot accuracy keep rising. Nobody can say whether a prompt change last week made things better or worse.

Fix: build an eval set of 50-200 representative questions with reference answers or rubrics. Run in CI. Surface scores on a dashboard.

4. No escalation

The bot tries to handle questions it shouldn't — billing disputes, legal advice, complex troubleshooting, complaints. It infuriates users who just want a human.

Symptom: customer complaints about "your chatbot wouldn't transfer me." High frustration even on otherwise-resolved interactions.

Fix: define explicit escalation intents. Warm transfer with conversation context attached so the human doesn't restart. Outside business hours, queue callbacks.

5. No observability

When something goes wrong, you have no way to know what happened. No per-conversation trace. No cost attribution. No ability to replay.

Symptom: you hear about failures from users, not from your monitoring. Investigations take days.

Fix: Langfuse or equivalent. Every conversation gets a trace. Errors surface in Sentry. Daily dashboard shows volume, success rate, cost, latency.

6. No scope

The chatbot promises to "help with anything." Inevitably someone asks something off-domain (medical advice from an e-commerce bot, legal advice from a support bot) and the LLM helpfully obliges.

Symptom: occasional embarrassing screenshots on social media. Compliance complaints.

Fix: explicit scope in the system prompt. Refusal patterns ("I can help with X and Y. For Z, please contact a human."). Topic classification on user input; off-scope → polite refusal + escalation.

What "fix" looks like in practice

For a typical failed chatbot, the fix is a 4-6 week engagement:

WeekWork
1Audit: identify which of the six causes apply. Build eval set against the current bot to quantify baseline.
2Retrieval fix: chunking, hybrid search, reranking, metadata.
3Evals + observability: CI integration, dashboard, alerting.
4Scope + escalation: system prompt rewrite, refusal patterns, warm transfer integration.
5-6Tuning + rollout: phased deploy, monitor metrics, iterate.

Most clients see meaningful improvement (10-30% on accuracy evals, 50%+ on user CSAT) within the first month.

What "rebuild" looks like

When the existing chatbot is unfixable in place — usually because it was built on a no-code platform with no extensibility or because the architecture has rotted past repair — we rebuild. Same principles, fresh start:

WeekWork
1-2Discovery: knowledge base audit, eval set, scope definition.
3-4Foundation: retrieval pipeline, observability, evals.
5-7Conversation flow: tools, escalation, scope refusals.
8-10Integration + production hardening + phased rollout.

Rebuild costs more (~€25-50k for typical conversational agent) but you end up with a system you can actually operate.

Things that won't fix it

A short list of "fixes" that don't work and waste money:

  • Swapping the model. GPT vs Claude vs Gemini doesn't fix a missing retrieval layer.
  • More prompt engineering. Twentieth iteration of the system prompt doesn't compensate for no evals.
  • Adding more knowledge. Bigger vector store doesn't help if retrieval is broken.
  • Custom UI redesign. A pretty chatbot that gives wrong answers is still a wrong chatbot.
  • Buying a new chatbot SaaS. New vendor, same failure modes, longer contract.

The diagnostic test

Want to know if your chatbot is one of the failed ones? Five questions:

  1. Can you see per-conversation traces?
  2. Do you have an eval suite that runs on every change?
  3. Can you tell me the auto-resolution rate without guessing?
  4. Does the chatbot cite its sources?
  5. Does the system have explicit escalation paths to humans?

Three or more "no" answers → you have a failed chatbot. The good news: it's almost always fixable.

Where to go next

For the deeper retrieval engineering see RAG patterns post. For how production AI agents (including conversational ones) are architected see How AI agents actually work. For the buyer's perspective on bringing in an agency see 12 questions to ask.

If you have a chatbot that needs fixing, drop us a note. We'll come back within a business day with an honest take on whether it's a fix or a rebuild.

Frequently asked questions

Keep reading

Want this delivered in your stack?

If the article describes a workflow you'd like to ship, drop us a note. We reply within one business day.