How AI invoice processing actually works (and where it breaks)
The 90-second version
You receive a stack of invoices. Someone keys them into your ERP. We replace that with code.
A modern AI invoice pipeline:
- Ingest — email, portal, SFTP. Deduplicate via content hash.
- Extract — Claude / GPT-4o / Gemini vision against a typed schema (Zod or Pydantic).
- Validate — schema parses cleanly; line items sum to total within tolerance; tax rates valid.
- Match — query ERP for candidate POs; apply tolerance rules.
- Decide — high confidence → auto-post; medium → human review; low → reject with structured reason.
- Post — write to ERP with original PDF attached, audit log entry, source email marked.
The model is not the hard part. The hard parts are the schema, the reviewer UI, and the eval suite.
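To make the "typed schema" idea concrete, here is a minimal sketch using only the Python standard library as a stand-in for Zod or Pydantic. The field names and the `parse_invoice` helper are illustrative assumptions, not a fixed schema; the real one comes out of defining what an invoice is for your business.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LineItem:
    description: str
    quantity: Decimal
    unit_price: Decimal

    @property
    def amount(self) -> Decimal:
        return self.quantity * self.unit_price


@dataclass(frozen=True)
class Invoice:
    # Hypothetical field set, chosen for illustration only.
    vendor: str
    invoice_number: str
    currency: str  # ISO 4217 code such as "EUR"
    subtotal: Decimal
    tax: Decimal
    total: Decimal
    line_items: tuple


def _money(value) -> Decimal:
    # Go through str() so floats do not drag binary noise into Decimal.
    return Decimal(str(value))


def parse_invoice(payload: dict) -> Invoice:
    """Turn raw model output into a typed Invoice; raise ValueError if it does not fit."""
    try:
        items = tuple(
            LineItem(str(raw["description"]), _money(raw["quantity"]), _money(raw["unit_price"]))
            for raw in payload["line_items"]
        )
        return Invoice(
            vendor=str(payload["vendor"]),
            invoice_number=str(payload["invoice_number"]),
            currency=str(payload["currency"]),
            subtotal=_money(payload["subtotal"]),
            tax=_money(payload["tax"]),
            total=_money(payload["total"]),
            line_items=items,
        )
    except (KeyError, TypeError, ArithmeticError) as exc:
        # Missing keys, wrong shapes, and unparseable numbers all count as schema failures.
        raise ValueError(f"payload does not match the invoice schema: {exc!r}") from exc
```

A payload that fails to parse never reaches matching or posting; it can be retried or routed to review with the error attached.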
Where it breaks
Predictable failure modes, in roughly the order they bite:
Layout variation
Different vendors send different layouts. Some put line items in tables, others in free-form text. Some use abbreviations ("Qty" vs "Quantity"). Some have tax codes that need decoding.
How we handle it: general-purpose vision extraction with a strict Zod schema. The model figures out which field is which from semantic context, not position. For very high-volume vendors with consistent layouts, we layer per-vendor templates on top as an optimization.
Multi-page invoices
Long invoices, often with continuation lines or split totals.
How we handle it: pass all pages to the vision model in one call (most production models handle 50+ page documents reasonably). For very long documents, chunk and run a second-pass merge step.
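The chunk-and-merge fallback could look roughly like the sketch below. Everything here is an assumption made for illustration: `extract_chunk` stands for whatever function calls the vision model, the 50-page chunk size simply echoes the figure above, and the merge policy (concatenate line items, keep the first invoice number seen, keep the last total seen) is one plausible choice among several.

```python
from typing import Callable, Sequence


def chunk_pages(pages: Sequence, max_pages: int) -> list:
    """Split the page list into consecutive chunks of at most max_pages pages."""
    return [pages[i:i + max_pages] for i in range(0, len(pages), max_pages)]


def merge_partials(partials: list) -> dict:
    """Second-pass merge of per-chunk extractions into one invoice payload."""
    merged = {"invoice_number": None, "total": None, "line_items": []}
    for part in partials:
        merged["line_items"].extend(part.get("line_items", []))
        # Header fields usually sit on the first page: keep the first one found.
        if merged["invoice_number"] is None:
            merged["invoice_number"] = part.get("invoice_number")
        # Grand totals usually sit on the last page: let later chunks override.
        if part.get("total") is not None:
            merged["total"] = part["total"]
    return merged


def extract_long_document(pages: Sequence, extract_chunk: Callable, max_pages: int = 50) -> dict:
    """Run the (hypothetical) extractor once per chunk, then merge the results."""
    return merge_partials([extract_chunk(chunk) for chunk in chunk_pages(pages, max_pages)])
```

The merged payload should still go through the same validation as a single-call extraction, since a split total or a continuation line is exactly where a naive merge goes wrong.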
Tax math that doesn't add up
Vendor invoice says €100 subtotal, €19 tax, €119 total. The schema expects quantities × unit prices to sum to the subtotal, but the extracted line items don't add up to €100. Could be rounding. Could be a missing line. Could be a discount we missed.
How we handle it: tolerance rules (€1 or 0.5% mismatch acceptable for rounding). Beyond tolerance, the invoice goes to review with the discrepancy highlighted.
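A minimal sketch of such a tolerance check follows. The text gives the thresholds (€1 or 0.5%) but not how they combine, so this version assumes "whichever is larger"; the function names are hypothetical.

```python
from decimal import Decimal

ABS_TOLERANCE = Decimal("1.00")   # €1, from the rule above
REL_TOLERANCE = Decimal("0.005")  # 0.5%, from the rule above


def within_tolerance(expected: Decimal, actual: Decimal) -> bool:
    """True if the mismatch is at most €1 or 0.5% of expected, whichever is larger (assumption)."""
    allowed = max(ABS_TOLERANCE, abs(expected) * REL_TOLERANCE)
    return abs(expected - actual) <= allowed


def reconcile(line_amounts, subtotal: Decimal, tax: Decimal, total: Decimal) -> list:
    """Return human-readable discrepancies; an empty list means the math reconciles."""
    discrepancies = []
    line_sum = sum(line_amounts, Decimal("0"))
    if not within_tolerance(subtotal, line_sum):
        discrepancies.append(f"line items sum to {line_sum}, subtotal says {subtotal}")
    if not within_tolerance(total, subtotal + tax):
        discrepancies.append(f"subtotal + tax is {subtotal + tax}, total says {total}")
    return discrepancies
```

The returned strings are what a reviewer would see highlighted. Using `Decimal` rather than floats keeps the check itself from introducing rounding noise.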
PO matching across small differences
The invoice says "Wireless Mouse v2" and the PO says "Mouse, wireless, model X-200." Same item. Different description.
How we handle it: vector similarity between invoice line descriptions and PO line descriptions, plus a price match within tolerance. The agent picks the most likely PO and surfaces alternatives if confidence is low.
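The shape of that matching step can be sketched without an embedding model. The example below swaps in bag-of-words cosine similarity as a stand-in for real vector embeddings, and the 0.5% price tolerance and the `po_lines` record layout are assumptions for illustration.

```python
import math
import re
from collections import Counter
from decimal import Decimal


def tokens(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_po_lines(invoice_desc: str, invoice_price: Decimal, po_lines: list,
                  price_tolerance: Decimal = Decimal("0.005")) -> list:
    """Return (score, po_line) pairs whose price is within tolerance, best match first."""
    query = tokens(invoice_desc)
    candidates = []
    for po in po_lines:
        # Price gate first: a similar description at the wrong price is not a match.
        if abs(po["unit_price"] - invoice_price) > abs(po["unit_price"]) * price_tolerance:
            continue
        candidates.append((cosine(query, tokens(po["description"])), po))
    return sorted(candidates, key=lambda pair: pair[0], reverse=True)
```

The top candidate is the proposed match; the rest of the list is what gets surfaced as alternatives when the top score is low.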
Duplicate invoices
Same vendor, same invoice number, sent twice. Or sent once via email and once via portal.
How we handle it: hash the (vendor, invoice number, total) tuple. Duplicates get caught at ingestion and held with a "possible duplicate" flag for review.
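A minimal sketch of the tuple hash, assuming SHA-256 and some light normalization so that casing and whitespace differences between channels don't hide a duplicate. The in-memory `seen` set stands in for whatever store the queue actually uses.

```python
import hashlib
from decimal import Decimal


def duplicate_key(vendor: str, invoice_number: str, total: Decimal) -> str:
    """Stable hash of the (vendor, invoice number, total) tuple."""
    normalized = "|".join([
        vendor.strip().casefold(),
        invoice_number.strip().upper(),
        str(total.quantize(Decimal("0.01"))),  # 119 and 119.00 hash the same
    ])
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


seen = set()  # illustrative only; a real pipeline would persist this


def is_possible_duplicate(vendor: str, invoice_number: str, total: Decimal) -> bool:
    """True if this tuple was seen before; otherwise record it and return False."""
    key = duplicate_key(vendor, invoice_number, total)
    if key in seen:
        return True
    seen.add(key)
    return False
```

A hit only flags the invoice as a possible duplicate for review; a corrected invoice with a new total produces a different key and passes through.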
Unknown vendors
A new supplier sends their first invoice. No vendor record exists.
How we handle it: route to a "new vendor" review queue. Reviewer either creates the vendor record (then auto-post going forward) or rejects.
Currency and locale
Decimal separators differ (1,500.00 vs 1.500,00). Date formats vary (DD/MM/YYYY vs MM/DD/YYYY). Currency codes are often implicit.
How we handle it: explicit currency and locale fields in the schema. Vision LLMs are generally good at inferring from context, but we validate and ask for review on ambiguous cases.
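As an illustration of the "validate and ask for review on ambiguous cases" step, the sketch below parses amounts against an explicit decimal separator and flags the values that cannot be resolved from the string alone. The helper names and the ambiguity heuristics are assumptions, not an exhaustive rule set.

```python
from decimal import Decimal


def parse_amount(text: str, decimal_sep: str) -> Decimal:
    """Parse an amount given the explicit decimal separator from the schema ("." or ",")."""
    if decimal_sep not in (".", ","):
        raise ValueError("decimal_sep must be '.' or ','")
    group_sep = "," if decimal_sep == "." else "."
    cleaned = text.strip().replace(" ", "").replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


def is_ambiguous_amount(text: str) -> bool:
    """True for values like "1,500" or "1.500": one separator followed by exactly three digits."""
    seps = [ch for ch in text if ch in ".,"]
    if len(seps) != 1:
        return False
    tail = text.rsplit(seps[0], 1)[1].strip()
    return tail.isdigit() and len(tail) == 3


def is_ambiguous_date(text: str) -> bool:
    """True when a slash-separated date reads validly as both DD/MM and MM/DD."""
    first, second, _year = text.split("/")
    return int(first) <= 12 and int(second) <= 12 and first != second
```

Ambiguous values are the ones worth sending to review together with the vendor's country or earlier invoices as context, rather than guessing.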
Auth-gated supplier portals
Some vendors send invoices through their portal where you have to log in and download.
How we handle it: where supported, OAuth or API integration. Where not, a credential-managed scraper. Where impractical, manual upload by your AP team kicks off the rest of the pipeline.
Anatomy of a working pipeline
[Email inbox]   [Portal upload]   [SFTP]
       \               |            /
        [Firestore queue + dedup hash]
                       ↓
[Claude vision extraction → Zod-typed payload]
                       ↓
[Business rules: tax math, PO match, vendor whitelist, dup detect]
                       ↓
              [Confidence routing]
                ├─ high → auto-post to ERP
                ├─ med  → review queue (Next.js admin UI)
                └─ low  → reject with structured reason
                       ↓
            [Audit log + dashboard]
Every box observable. Every transition logged. Every decision reversible.
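The routing and logging boxes can be sketched as follows. The thresholds (0.95 and 0.70) are placeholders rather than recommended values, and the assumption that any business-rule discrepancy blocks auto-posting is ours; the in-memory `audit_log` list stands in for a real audit store.

```python
from datetime import datetime, timezone
from enum import Enum


class Route(Enum):
    AUTO_POST = "auto_post"
    REVIEW = "review"
    REJECT = "reject"


HIGH_CONFIDENCE = 0.95  # hypothetical threshold
LOW_CONFIDENCE = 0.70   # hypothetical threshold


def route(confidence: float, discrepancies: list) -> Route:
    """High confidence and clean rules auto-post; low confidence rejects; the rest is reviewed."""
    if confidence < LOW_CONFIDENCE:
        return Route.REJECT
    if confidence >= HIGH_CONFIDENCE and not discrepancies:
        return Route.AUTO_POST
    return Route.REVIEW


audit_log = []  # illustrative only


def decide(invoice_id: str, confidence: float, discrepancies: list) -> Route:
    """Route one invoice and record why, so the decision can be inspected or reversed later."""
    decision = route(confidence, discrepancies)
    audit_log.append({
        "invoice_id": invoice_id,
        "decision": decision.value,
        "confidence": confidence,
        "discrepancies": list(discrepancies),
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return decision
```

Logging the inputs alongside the decision is what makes a later "why did this auto-post?" question answerable.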
What you should measure
Don't measure "accuracy" without disaggregating. What we actually track:
| Metric | What it tells you |
|---|---|
| Auto-post rate | % of invoices that skip human review |
| Per-field recall | % of fields present on the document that were extracted correctly, by field |
| Per-field precision | % of extracted values that are correct, by field |
| Review queue latency | how fast humans clear the queue |
| Cost per invoice (LLM + reviewer time) | the real economic question |
| Cycle time (receipt → posted) | downstream business metric |
| Error escape rate | % of posted invoices later corrected |
Vanity numbers like "99% accuracy" without disaggregation are unfalsifiable.
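To show what disaggregation means in practice, here is a small sketch that computes precision and recall per field from a labelled eval set. It assumes predictions and ground truth are parallel lists of dicts, with `None` meaning "not extracted" or "not on the document"; exact-match comparison is a simplification.

```python
from collections import defaultdict


def per_field_metrics(predictions: list, ground_truth: list) -> dict:
    """Per-field precision (correct / extracted) and recall (correct / present in truth)."""
    counts = defaultdict(lambda: {"correct": 0, "extracted": 0, "present": 0})
    for pred, truth in zip(predictions, ground_truth):
        for name in pred.keys() | truth.keys():
            p, t = pred.get(name), truth.get(name)
            if p is not None:
                counts[name]["extracted"] += 1
            if t is not None:
                counts[name]["present"] += 1
            if p is not None and p == t:
                counts[name]["correct"] += 1
    return {
        name: {
            "precision": c["correct"] / c["extracted"] if c["extracted"] else None,
            "recall": c["correct"] / c["present"] if c["present"] else None,
        }
        for name, c in counts.items()
    }
```

A field like `po_number` can score perfect precision and poor recall at the same time, which a single blended accuracy number would hide.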
The reviewer UI matters more than the model
Reviewer ergonomics dictate whether the system actually saves time. A bad reviewer UI means a human spends 3 minutes per flagged invoice; a good one means 30 seconds.
What a good UI has:
- Document on the left, extracted fields on the right.
- Keyboard navigation between fields.
- Click-to-highlight: clicking a field highlights the source region on the document.
- Bulk approval for the queue.
- Per-vendor flagging ("always escalate this vendor's invoices").
- Quick reject with structured reason templates.
We spend meaningful engineering on the reviewer UI. It's the difference between automation that saves the team time and automation that adds work.
Real numbers from a real engagement
From one client (anonymised), processing ~400 supplier invoices per week:
| Metric | Before | After (week 8) |
|---|---|---|
| Human time per invoice | 4–6 min | ~30 sec (review only) |
| % keyed manually | 100% | ~13% |
| Cycle time | 2–4 days | ~4 hours |
| Visible error rate | ~3% estimated | <1% |
| Per-invoice cost (loaded) | €3.80 | €0.18 |
Payback hit at month 4. Full breakdown in the Document Intake Agent case study.
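For a rough sense of scale, the per-invoice figures in the table can be multiplied out. This is plain arithmetic on the numbers above, treating "~400 per week" as exactly 400, so the result is approximate and says nothing about build cost.

```python
from decimal import Decimal

invoices_per_week = Decimal("400")  # "~400" in the text, so results are approximate
cost_before = Decimal("3.80")       # loaded per-invoice cost before
cost_after = Decimal("0.18")        # loaded per-invoice cost at week 8

saving_per_invoice = cost_before - cost_after
weekly_saving = invoices_per_week * saving_per_invoice

print(f"Saving per invoice: €{saving_per_invoice}")
print(f"Approximate weekly saving: €{weekly_saving}")
```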
What we won't do
- Skip the schema sprint. Spending the first week defining "what is an invoice for your business" is the highest-leverage step. Skip it and the whole pipeline misses.
- Skip the reviewer UI. Without it, the eventual savings won't materialise.
- Skip the shadow-mode period. Running parallel with the human team for 1–2 weeks catches a class of bugs no eval finds.
- Promise specific accuracy numbers before the prototype phase. Anyone who quotes "99% accurate" pre-prototype is bluffing.
Where to go next
For the buyer's perspective on AP automation more broadly, see AI agents for accounts payable. For the full architecture of how document agents work under the hood, see How AI agents actually work.
If you have an AP volume problem, our Document Processing service page covers the engagement model. Or drop us a note — one paragraph is enough.