How We Cut Data Entry Time by 87%: Inside the AI Document Processing Pipeline We Built for a Logistics Firm
February 19, 2026
Artificial Intelligence
Knowledge workers spend an average of 1.8 hours every single day manually entering data from documents into business systems — nearly 25% of the workday, according to McKinsey Global Institute research. Across a typical 25-person operations team, that adds up fast: roughly 45 person-hours consumed daily by work that produces zero strategic value.
That was exactly the situation facing a logistics company that came to us in mid-2025. Their team of 32 operations staff was manually processing supplier invoices, freight manifests, and customs declarations — three distinct document types, each with its own format quirks, across dozens of vendors. They'd already tried an off-the-shelf OCR tool. It failed spectacularly. So they called KumoHQ.
What followed was a 14-week build that cut their data entry time from 40+ hours per week to under 5. Here's the full technical breakdown — architecture, stack, implementation phases, and the mistakes we made along the way.
Why the Client's Previous Solution Failed
Before we built anything, we spent two weeks understanding why the previous OCR tool had been shelved. The answer was instructive: static tools fail on dynamic documents.
The client received invoices from 60+ vendors. Some sent PDFs. Some sent scanned images. Some sent Excel sheets exported as PDFs. A handful still faxed (yes, in 2025). Field positions varied wildly — "Total Amount Due" might appear in the top right on one vendor's template and the bottom left on another's. Date formats ranged from DD/MM/YYYY to "Fifteenth of January" written out in full.
A rules-based OCR engine is built on the assumption that document structure is consistent. The moment you hand it 60 vendors with 60 slightly different templates, it starts misreading fields, mapping values into the wrong columns, or skipping rows entirely. The client's previous tool had an accuracy rate of about 71% — which meant a human still had to check every single document. They got none of the time savings and all of the licensing cost.
The fix wasn't a better OCR tool. It was a fundamentally different approach: build a system that understands what it's reading, not one that pattern-matches against fixed templates.
The Architecture: An AI Extraction Pipeline
We designed a three-layer pipeline that processes each document through increasingly intelligent stages:
Layer 1: Document Ingestion + Normalisation
Every incoming document — regardless of source — first passes through a normalisation layer. PDFs are rendered to high-resolution images using pdf2image. Excel and CSV files are converted to structured text. Scanned images are upscaled and denoised using OpenCV before any extraction happens.
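To make the routing concrete, here's a simplified sketch of how the ingestion layer decides which normalisation path a file takes. The function name and extension lists are illustrative, not the production code; in the real pipeline the "render" branch calls pdf2image's `convert_from_path` and the "denoise" branch runs OpenCV's `fastNlMeansDenoising` after upscaling.

```python
from pathlib import Path

def route_to_normaliser(path: str) -> str:
    """Pick a normalisation strategy from the file extension.

    "render"  -> rasterise to high-res images (pdf2image.convert_from_path)
    "tabular" -> parse Excel/CSV to structured text, skip the image pipeline
    "denoise" -> upscale + cv2.fastNlMeansDenoising before extraction
    """
    ext = Path(path).suffix.lower()
    if ext == ".pdf":
        return "render"
    if ext in {".xlsx", ".xls", ".csv"}:
        return "tabular"
    if ext in {".png", ".jpg", ".jpeg", ".tif", ".tiff"}:
        return "denoise"
    raise ValueError(f"unsupported document type: {ext}")
```

Unknown types fail loudly rather than silently passing garbage downstream — a deliberate choice, since a mis-routed document corrupts everything after it.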
This layer also handles routing: the system classifies each document as an invoice, freight manifest, or customs declaration using a lightweight classifier trained on 2,000 historical examples. That classification determines which extraction prompt and schema gets applied downstream.
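The classifier's label then selects everything downstream. A minimal sketch of that routing — the field lists here are illustrative placeholders, not the client's actual schemas:

```python
# Illustrative schema registry: the classifier's label picks the
# extraction schema (and prompt) applied in Layer 2.
SCHEMAS = {
    "invoice": ["vendor_code", "invoice_number", "invoice_date", "amount_due"],
    "freight_manifest": ["carrier", "shipment_id", "origin", "destination", "weight_kg"],
    "customs_declaration": ["declarant", "hs_code", "commodity_description", "declared_value"],
}

def select_schema(doc_type: str) -> list[str]:
    """Map the classifier's output to the extraction schema used downstream."""
    if doc_type not in SCHEMAS:
        raise ValueError(f"unclassified document type: {doc_type}")
    return SCHEMAS[doc_type]
```

Keeping this mapping in one place is what made adding a fourth document type later a small change rather than a rewrite.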
Layer 2: LLM-Powered Extraction
This is the core of the system. Each normalised document is passed to Claude 3.5 Sonnet with a carefully engineered prompt that instructs it to extract specific fields and return structured JSON. We chose Claude specifically because of its superior ability to reason about ambiguous field placement and handle edge cases — like a "Remarks" field that sometimes contains tax codes buried in free text.
The prompt includes:
A field schema (what we're looking for, data types, acceptable formats)
Explicit handling instructions for common ambiguities (e.g., "if you see two amounts, extract the one labelled 'Grand Total' or 'Amount Due', not subtotals")
A confidence scoring instruction — the model flags any field it's less than 85% confident about
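The model returns one JSON object per document, with a value and a confidence score per field. Here's a simplified sketch of how a reply in that shape gets split into auto-accepted values and review-queue flags — the reply format and field names are illustrative:

```python
import json

def parse_extraction(raw: str, threshold: float = 0.85) -> tuple[dict, list[str]]:
    """Split the model's JSON reply into accepted values and fields
    flagged for human review (confidence below the threshold)."""
    fields = json.loads(raw)
    values = {name: f["value"] for name, f in fields.items()}
    flagged = [name for name, f in fields.items() if f["confidence"] < threshold]
    return values, flagged

# Illustrative reply in the shape our prompt asks the model to return.
reply = json.dumps({
    "invoice_number": {"value": "INV-2041", "confidence": 0.99},
    "amount_due":     {"value": "1499.00", "confidence": 0.62},
})
values, flagged = parse_extraction(reply)
# "amount_due" lands in the review queue; "invoice_number" is auto-accepted
```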
For documents with very poor image quality (legacy faxes, primarily), we added a second pass using GPT-4o Vision as a backup, comparing outputs and escalating disagreements to human review.
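The cross-check itself is simple set logic: any field where the two models disagree, or where only one produced a value at all, escalates to a human rather than trusting either model. A sketch:

```python
def cross_check(primary: dict, backup: dict) -> tuple[dict, list[str]]:
    """Keep fields the two models agree on; escalate everything else."""
    agreed, disputed = {}, []
    for field in sorted(primary.keys() | backup.keys()):
        a, b = primary.get(field), backup.get(field)
        if a is not None and a == b:
            agreed[field] = a
        else:
            disputed.append(field)
    return agreed, disputed

# Example: the models agree on the date but read the amount differently.
agreed, disputed = cross_check(
    {"invoice_date": "2025-06-15", "amount_due": "1499.00"},
    {"invoice_date": "2025-06-15", "amount_due": "1499.80"},
)
```

The value of running two different models isn't that either is perfect — it's that they fail differently, so agreement is a strong signal.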
Layer 3: Validation + Human-in-the-Loop
Extracted JSON passes through a rules engine that validates against known constraints: amounts must be positive numbers, dates must be parseable, vendor codes must exist in the client's ERP master data. Anything that fails validation — or that the LLM flagged as low-confidence — routes to a lightweight review interface where a human can confirm or correct the value in under 30 seconds.
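Here's a trimmed-down sketch of what that rules layer looks like in Pydantic v2. The field names and vendor list are illustrative; in production the vendor set is loaded from the client's ERP master data.

```python
from datetime import date
from pydantic import BaseModel, PositiveFloat, field_validator

KNOWN_VENDORS = {"ACME01", "GLOB77"}  # loaded from ERP master data in production

class InvoiceRecord(BaseModel):
    vendor_code: str
    invoice_date: date          # must parse as a date, or validation fails
    amount_due: PositiveFloat   # rejects zero and negative amounts

    @field_validator("vendor_code")
    @classmethod
    def vendor_must_exist(cls, v: str) -> str:
        if v not in KNOWN_VENDORS:
            raise ValueError(f"vendor code not in master data: {v}")
        return v
```

A record that fails any of these checks never reaches the ERP — it goes straight to the review queue with a readable error message attached.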
Once approved (manually or automatically), the record is written directly to the client's ERP via API. No copy-pasting. No spreadsheet intermediaries.
The Technical Stack
Here's exactly what we used and why each choice was made:
| Layer | Tool / Technology | Why We Chose It |
|---|---|---|
| Document Ingestion | Python + pdf2image + OpenCV | Battle-tested, highly controllable image pre-processing |
| Document Classification | Fine-tuned DistilBERT | Fast, cheap, accurate on domain-specific doc types |
| Primary Extraction | Claude 3.5 Sonnet (via Anthropic API) | Best reasoning on ambiguous layouts; structured JSON output |
| Backup Extraction | GPT-4o Vision | Cross-check for low-quality scans; different failure modes |
| Validation Engine | Python + Pydantic | Type-safe schema validation, clean error messages |
| Review Interface | React + FastAPI | Lightweight, deployed internally; no external SaaS dependency |
| Workflow Orchestration | n8n (self-hosted) | Visual pipeline management, easy to modify triggers without code changes |
| ERP Integration | REST API + SAP connector | Client's existing ERP; direct write via authenticated API |
| Storage + Audit Trail | PostgreSQL + S3 | Documents stored in S3; extracted data + review history in Postgres |
One architectural decision worth highlighting: we deliberately avoided building this as a single monolithic service. Each layer runs as an independent service with a defined API contract between them. That meant when the client wanted to add a fourth document type (delivery receipts) eight weeks into the project, we could extend the classification model and add a new extraction prompt without touching the validation or ERP integration layers.
Implementation: 4 Phases Over 14 Weeks
Phase 1: Data Collection + Baseline (Weeks 1–2)
We collected 500 historical documents of each type from the client and manually labelled the correct extraction outputs. This became our ground-truth dataset for both training the classifier and evaluating extraction accuracy. We also benchmarked the client's existing manual process: how long did it take a trained ops staff member to process each document type? (Answer: 8 minutes for invoices, 12 minutes for manifests, 18 minutes for customs declarations.)
Phase 2: Extraction Pipeline MVP (Weeks 3–7)
We built the core extraction pipeline and tested it against our labelled dataset. Initial accuracy on invoices was 91%. Manifests came in at 88%. Customs declarations — the most complex — sat at 79%, mostly failing on commodity description fields that mixed product names with HS codes. We iterated on the prompt engineering over two weeks and got that figure up to 94%.
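Worth noting: those accuracy figures are field-level, not document-level — a document with nine correct fields and one wrong one scores 90%, not zero. A minimal version of the evaluation loop we ran against the labelled dataset (the function name is ours, for illustration):

```python
def field_accuracy(predicted: list[dict], labelled: list[dict]) -> float:
    """Fraction of labelled fields the pipeline extracted correctly,
    comparing each prediction against its ground-truth record."""
    correct = total = 0
    for pred, truth in zip(predicted, labelled):
        for field, expected in truth.items():
            total += 1
            correct += int(pred.get(field) == expected)
    return correct / total
```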
Phase 3: Validation + Review Interface (Weeks 8–11)
Building the review interface revealed something important: the human review experience matters enormously for adoption. We initially designed a table-based UI showing all extracted fields at once. The ops team found it overwhelming. We redesigned it as a side-by-side view — original document on the left, extracted fields on the right — with only flagged fields highlighted for review. Review time per document dropped from an average of 4 minutes to 47 seconds.
Phase 4: ERP Integration + Go-Live (Weeks 12–14)
The ERP integration was the most technically fiddly part of the project. The client's SAP instance required field values in specific formats that differed from how the LLM naturally returned them. We built a mapping layer — essentially a translation dictionary — that converts extracted values to ERP-ready formats before writing. After two weeks of parallel running (processing documents manually and through the system simultaneously, then comparing outputs), we went live.
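To give a flavour of that mapping layer: LIFNR, BUDAT, and WRBTR are standard SAP fields for vendor number, posting date, and amount, but the exact target formats shown here are illustrative, not the client's real mapping.

```python
from datetime import datetime

def to_erp_format(record: dict) -> dict:
    """Translate extracted values into SAP-ready formats:
    zero-padded vendor number, YYYYMMDD date, two-decimal amount."""
    return {
        "LIFNR": record["vendor_code"].upper().zfill(10),
        "BUDAT": datetime.strptime(record["invoice_date"], "%Y-%m-%d").strftime("%Y%m%d"),
        "WRBTR": f"{float(record['amount_due']):.2f}",
    }
```

Dull code, but it's exactly this kind of translation dictionary that absorbed most of the integration surprises.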
Results After 90 Days
Here's what the numbers looked like at the 90-day post-launch review:
Data entry time: 40+ hours/week → 4.8 hours/week (87% reduction)
Overall extraction accuracy: 96.2% (up from 71% with the previous OCR tool)
Human review rate: 18% of documents (down from 100%)
Average review time per flagged document: 51 seconds
ERP data errors: Down 94% compared to manual entry baseline
Payback period on development cost: 6.5 months (based on ops staff time saved)
The client was able to redeploy two full-time staff members from data entry to exception handling and vendor relationship management — roles that had previously been understaffed. The ops manager's words after 90 days: "The team doesn't talk about the system anymore. It just works, and they focus on actual work."
What We'd Do Differently
Three lessons we'll carry into the next build:
1. Build the review interface first, not last. We treated the review UI as an afterthought — something to build after the extraction pipeline was solid. That was a mistake. The UI shapes how humans interact with the system's outputs, which directly affects adoption. In future projects, we'll prototype the review experience in week one.
2. Don't underestimate the ERP integration timeline. We estimated two weeks for ERP integration and it took four. Legacy ERP connectors have underdocumented quirks. Buffer accordingly.
3. Invest in the confidence scoring mechanism early. Our LLM confidence flags were rudimentary in Phase 2 — essentially asking the model "are you sure?" We've since developed a more sophisticated calibration approach that compares LLM confidence scores to actual accuracy rates across document types. The result is a much better signal for which documents truly need human review versus which ones the model is unnecessarily flagging.
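The calibration check boils down to one question per confidence bucket: does observed accuracy match stated confidence? A simplified sketch, with illustrative bucket edges — if the ">=0.95" bucket is only 80% accurate, the model is overconfident and the review threshold needs to move up:

```python
from collections import defaultdict

def bucket(conf: float) -> str:
    """Coarse confidence bucket; edges here are illustrative."""
    if conf >= 0.95:
        return ">=0.95"
    if conf >= 0.85:
        return "0.85-0.95"
    if conf >= 0.70:
        return "0.70-0.85"
    return "<0.70"

def calibration(records: list[tuple[float, bool]]) -> dict[str, float]:
    """Observed accuracy per confidence bucket, from (confidence,
    was_correct) pairs collected during review."""
    hits = defaultdict(list)
    for conf, correct in records:
        hits[bucket(conf)].append(correct)
    return {b: round(sum(v) / len(v), 2) for b, v in hits.items()}
```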
Is an AI Document Processing System Right for Your Team?
This type of system makes strong economic sense when:
Your team processes 50+ documents per week of similar types
You're dealing with multiple vendor formats that break rules-based tools
Data entry errors are creating downstream operational problems
You have an existing system (ERP, CRM, database) that you want to feed data into automatically
It's probably not the right fit if your document volume is low (under 20/week) or if your documents are already highly standardised and a cheaper template-based tool would handle them reliably.
Not sure which camp you're in? Talk to our team — we'll give you an honest assessment, not a sales pitch.
Frequently Asked Questions
How accurate is AI document processing compared to manual data entry?
In our experience, a well-built AI extraction pipeline achieves 94–97% accuracy on semi-structured documents, versus roughly 98–99% for careful manual entry. The difference sounds small, but the AI processes documents in seconds rather than minutes, so the error rate per hour of work is dramatically lower. More importantly, AI errors are consistent and auditable — you know exactly which fields get flagged for review, so residual errors don't slip through undetected the way manual errors do.
What happens when the AI can't read a document correctly?
That's precisely what the human-in-the-loop layer handles. Any document where the AI has low confidence on one or more fields gets routed to a review queue. A human sees the original document alongside the extracted values and can confirm, correct, or flag for further investigation — typically in under a minute. Nothing enters your ERP until it's been validated, either automatically or by a human reviewer.
How long does it take to build and deploy a system like this?
For a three-document-type pipeline like the one described in this case study, expect 12–16 weeks from kickoff to production. Simpler single-document systems can be delivered in 6–8 weeks. Timeline is largely driven by data collection, prompt engineering, and the complexity of your target system integration (ERP, CRM, database). We always run a parallel period — running both old and new processes simultaneously — before fully switching over.
Do I need to host this on my own servers, or does it run in the cloud?
Both are viable. For clients with strict data residency requirements (common in logistics and healthcare), we deploy the entire pipeline on their own cloud infrastructure (AWS, GCP, or Azure). For clients without those constraints, we use a managed cloud setup that's faster to deploy and easier to maintain. The LLM API calls go to Anthropic/OpenAI either way, but document storage, validation, and the full audit trail stay inside your infrastructure.
What's the typical cost to build something like this?
Project costs vary with document complexity, number of document types, and the target system you're integrating with. As a rough benchmark, a single-document-type extraction pipeline with ERP integration typically runs between $18,000 and $35,000 for the build. Ongoing costs are primarily LLM API usage — for the client in this case study, that runs to approximately $180/month at their current volume. Want a more specific estimate for your situation? Get in touch and we'll put together a scoping proposal.
Build Your AI Document Pipeline with KumoHQ
We've built AI extraction systems for logistics companies, healthcare providers, financial services firms, and manufacturing operations. Every project teaches us something new — and every lesson improves the next build.
If your team is spending hours every week on data entry that should be automated, let's talk. We'll map out what a system for your specific documents and data sources would look like, and give you an honest timeline and cost estimate — no obligation, no upsell.
