Mastery
AI-powered learning platform on Google Generative AI
At a glance
| Field | Detail |
|---|---|
| Client | Mastery (EdTech) |
| Industry | Education technology |
| Engagement | 6 months: discovery, prototype, MVP, ongoing iteration |
| Stack | Next.js 14, Google Generative AI (Gemini), Postgres + pgvector, Vitest, Docker, TypeScript |
| Status | Live, in production |
The challenge
Off-the-shelf learning management systems can't keep up with the speed of real teaching. One-size-fits-all quizzes don't move the needle on retention. Teachers were spending hours each week writing custom assessment items for their learners, then more hours reviewing performance to figure out what to teach next.
The brief: a learning platform that generates per-learner content and quizzes that are measurably grounded in the syllabus, adapts to each learner's prior performance, and gives teachers a way to audit every generated item.
The hard part was not generating content. LLMs are great at that. The hard part was making the generated content trustworthy at scale — every item linked to source material, every difficulty decision explainable, every teacher able to verify what the system was doing for their learners.
What we built
A multi-tenant Next.js application backed by Google Generative AI (Gemini). Four key pieces:
1. Syllabus ingestion
Teachers upload syllabus documents, course materials, and reference texts. The ingestion pipeline parses, chunks, embeds, and indexes the content into a per-tenant vector store with strict metadata (course → unit → lesson → page).
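To make the chunk-and-tag step concrete, here is a minimal sketch of how one page of syllabus text might be split into overlapping chunks that each carry the course → unit → lesson → page metadata. The type names, the ID format, and the window sizes are assumptions for illustration; the production pipeline's schema and chunking strategy are not described in this case study.

```typescript
// Hypothetical types: the fields mirror the course → unit → lesson → page
// hierarchy described above, but the real schema is not shown in this case study.
interface ChunkMetadata {
  tenantId: string;
  course: string;
  unit: string;
  lesson: string;
  page: number;
}

interface SyllabusChunk {
  id: string;
  text: string;
  metadata: ChunkMetadata;
}

// Split one page of text into overlapping word windows.
// maxWords and overlapWords are illustrative defaults, not tuned values.
function chunkPage(
  pageText: string,
  metadata: ChunkMetadata,
  maxWords = 120,
  overlapWords = 20,
): SyllabusChunk[] {
  if (overlapWords >= maxWords) {
    throw new Error("overlapWords must be smaller than maxWords");
  }
  const words = pageText.split(/\s+/).filter((w) => w.length > 0);
  const chunks: SyllabusChunk[] = [];
  const step = maxWords - overlapWords;
  for (let start = 0; start < words.length; start += step) {
    const windowWords = words.slice(start, start + maxWords);
    const prefix = `${metadata.tenantId}:${metadata.course}:${metadata.unit}:${metadata.lesson}`;
    chunks.push({
      // Assumed ID format: hierarchy path, page number, chunk position on the page.
      id: `${prefix}:p${metadata.page}:${chunks.length}`,
      text: windowWords.join(" "),
      metadata,
    });
    // Stop once this window has reached the end of the page.
    if (start + maxWords >= words.length) break;
  }
  return chunks;
}
```

Because every chunk keeps its full metadata, a citation can later be resolved back to an exact page without consulting the original upload.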
2. Per-learner content generation
When a learner starts a session, the system pulls their performance history and current syllabus position. The generation engine builds a prompt that retrieves relevant chunks from the syllabus index, conditions on prior performance ("learner struggled with X, aced Y"), and produces structured content (explanations, examples, and quiz items) with inline citations to the source material.
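The prompt assembly described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production prompt: the interfaces, the 0.5 and 0.9 cut-offs for "struggled" and "aced", and the instruction wording are all hypothetical, and no model call is shown.

```typescript
// Hypothetical shapes: the real retrieval and history records are not shown in the case study.
interface RetrievedChunk {
  id: string; // stable chunk ID from the syllabus index
  text: string; // passage text
}

interface ConceptPerformance {
  concept: string;
  successRate: number; // fraction of correct answers, 0..1
}

// Turn raw history into a short conditioning line.
// The 0.5 and 0.9 cut-offs are illustrative assumptions.
function summarisePerformance(
  records: ConceptPerformance[],
  struggledBelow = 0.5,
  acedAbove = 0.9,
): string {
  const struggled = records.filter((r) => r.successRate < struggledBelow).map((r) => r.concept);
  const aced = records.filter((r) => r.successRate > acedAbove).map((r) => r.concept);
  const parts: string[] = [];
  if (struggled.length > 0) parts.push(`struggled with: ${struggled.join(", ")}`);
  if (aced.length > 0) parts.push(`aced: ${aced.join(", ")}`);
  return parts.length > 0 ? parts.join("; ") : "no prior performance recorded";
}

// Assemble the prompt: topic, learner profile, then the retrieved sources,
// each prefixed with its bracketed ID so the model can cite it inline.
function buildGenerationPrompt(
  topic: string,
  chunks: RetrievedChunk[],
  records: ConceptPerformance[],
): string {
  const sources = chunks.map((c) => `[${c.id}] ${c.text}`).join("\n");
  return [
    `Topic: ${topic}`,
    `Learner profile: ${summarisePerformance(records)}`,
    "Use ONLY the sources below. Cite the bracketed source ID after every claim.",
    "Sources:",
    sources,
  ].join("\n");
}
```

Tagging each source with its chunk ID inside the prompt is what lets the citations in the model's output be checked mechanically afterwards.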
3. Adaptive difficulty engine
Quiz items are generated at a difficulty level that targets a ~70% success rate per learner. The engine adjusts after each session: a success rate well below the target means easier items next time; one well above it means stretch items.
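One simple way to express that adjustment is a proportional update toward the target. The sketch below assumes difficulty is a number between 0 and 1 and uses a hypothetical gain of 0.5; the calibration loop actually shipped is not detailed in this case study and was likely more involved.

```typescript
const TARGET_SUCCESS_RATE = 0.7; // the ~70% target from the text

interface SessionResult {
  correct: number;
  attempted: number;
}

// Difficulty is modelled as a number in [0, 1]; this scale and the gain are assumptions.
function nextDifficulty(
  currentDifficulty: number,
  session: SessionResult,
  gain = 0.5,
): number {
  if (session.attempted === 0) return currentDifficulty; // nothing observed, nothing to adjust
  const observed = session.correct / session.attempted;
  // Success above target pushes difficulty up; success below target pulls it down.
  const adjusted = currentDifficulty + gain * (observed - TARGET_SUCCESS_RATE);
  return Math.min(1, Math.max(0, adjusted)); // clamp to the valid range
}
```

A learner who scores 9 out of 10 at difficulty 0.5 moves to roughly 0.6 for the next session, while one who scores 4 out of 10 drops to roughly 0.35.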
4. Teacher eval dashboard
The novel part. Every generated item — every quiz question, every explanation — is auditable. Teachers see:
- The source material it was generated from (link to the exact passage)
- The difficulty target and the actual success rate per learner cohort
- Flag-and-review controls for items the teacher believes are incorrect
- Trend lines showing item quality over time
This is what made the system trustworthy to teachers. Without it, "AI-generated content" felt like a coin flip. With it, teachers could see exactly what the system was doing and intervene.
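The "difficulty target versus actual success rate per cohort" view listed above comes down to a small aggregation over attempt records. The following is a sketch with hypothetical record shapes, not the dashboard's actual query.

```typescript
// Hypothetical attempt record; the real event schema is not shown in the case study.
interface Attempt {
  itemId: string;
  cohort: string;
  correct: boolean;
}

interface CohortOutcome {
  cohort: string;
  attempts: number;
  successRate: number;
  gapFromTarget: number; // successRate minus the item's difficulty target
}

// Aggregate one item's attempts per cohort and compare against its target.
function cohortOutcomes(
  itemId: string,
  attempts: Attempt[],
  target = 0.7,
): CohortOutcome[] {
  const tallies: Record<string, { correct: number; total: number }> = {};
  for (const a of attempts) {
    if (a.itemId !== itemId) continue;
    let tally = tallies[a.cohort];
    if (tally === undefined) {
      tally = { correct: 0, total: 0 };
      tallies[a.cohort] = tally;
    }
    tally.total += 1;
    if (a.correct) tally.correct += 1;
  }
  // Sort cohort names so the output order is stable.
  return Object.keys(tallies).sort().map((cohort) => {
    const tally = tallies[cohort] ?? { correct: 0, total: 0 };
    const successRate = tally.total > 0 ? tally.correct / tally.total : 0;
    return {
      cohort,
      attempts: tally.total,
      successRate,
      gapFromTarget: successRate - target,
    };
  });
}
```

A large negative gap in one cohort is the kind of signal that would prompt a teacher to open the item, check its source passage, and flag it if needed.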
Architecture
[Teacher] → [Course upload + ingestion]
↓
[Course index (Gemini embeddings + per-tenant vector store)]
↓
[Learner session start] → [Performance history fetch]
↓
[Content generation request]
├─ Retrieve syllabus chunks via vector search
├─ Condition on per-learner performance
├─ Call Gemini with structured output schema
└─ Validate, store, log, render
↓
[Learner takes session] → [Performance updates]
↓
[Teacher eval dashboard] ← all generated items, ratings, sources
Every layer is observable. Every generated item has a stable ID, a source citation list, a difficulty target, and a per-cohort outcome metric.
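As an illustration of the "validate" step in the flow above, the sketch below checks a generated item before it is stored: it must have an ID, at least one citation, and every citation must point at a chunk that was actually retrieved for the request. The item shape and the specific rules are assumptions; the real structured output schema is not reproduced in this case study.

```typescript
// Hypothetical item shape, loosely following the fields named in the text.
interface GeneratedItem {
  id: string;
  kind: "explanation" | "example" | "quiz";
  body: string;
  citations: string[]; // chunk IDs the item claims as sources
  difficultyTarget: number; // expected success rate, 0..1
}

// Return a list of problems; an empty list means the item may be stored.
function validateItem(item: GeneratedItem, retrievedChunkIds: string[]): string[] {
  const problems: string[] = [];
  if (item.id.trim() === "") problems.push("missing stable ID");
  if (item.body.trim() === "") problems.push("empty body");
  if (item.citations.length === 0) problems.push("no citations");
  for (const citation of item.citations) {
    // A citation to a chunk that was never retrieved cannot be trusted.
    if (retrievedChunkIds.indexOf(citation) === -1) {
      problems.push(`citation ${citation} was not in the retrieved set`);
    }
  }
  if (!(item.difficultyTarget > 0 && item.difficultyTarget < 1)) {
    problems.push("difficulty target must be between 0 and 1");
  }
  return problems;
}
```

Rejecting any item with an empty or unresolvable citation list at this point is one way a 100% citation coverage figure can be enforced rather than merely measured.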
Key features shipped
- Per-learner content generation grounded in the course syllabus, with inline citations.
- Adaptive quiz difficulty tuned per session to target ~70% success rate.
- Teacher-facing eval dashboard showing every generated item with provenance and outcomes.
- Source citation per item — every claim links back to the source passage in the syllabus.
- Progress retention modelling that predicts which concepts a learner is at risk of forgetting and schedules review.
- Multi-tenant isolation so each institution's syllabus and learner data is fully separated.
- Dockerised deployment with reproducible builds and CI/CD pipelines.
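The progress retention modelling bullet above does not say which model was used. As a hedged illustration only, the sketch below uses a textbook exponential forgetting curve, where predicted recall is exp(-t / stability), to pick concepts that are due for review. The `stabilityDays` parameter, the 0.6 threshold, and the record shape are all assumptions.

```typescript
// Hypothetical per-learner memory record; not the production model.
interface ConceptMemory {
  concept: string;
  daysSinceReview: number;
  stabilityDays: number; // assumed "memory strength": larger means slower forgetting
}

// Exponential forgetting curve: predicted recall = exp(-t / stability).
function predictedRecall(memory: ConceptMemory): number {
  return Math.exp(-memory.daysSinceReview / memory.stabilityDays);
}

// Concepts whose predicted recall has dropped below the threshold,
// most at-risk first. The 0.6 threshold is an illustrative assumption.
function dueForReview(memories: ConceptMemory[], recallThreshold = 0.6): string[] {
  return memories
    .filter((m) => predictedRecall(m) < recallThreshold)
    .sort((a, b) => predictedRecall(a) - predictedRecall(b))
    .map((m) => m.concept);
}
```

With a stability of 5 days, a concept reviewed today has a predicted recall of 1, while one last reviewed 10 days ago has fallen to about 0.14 and is scheduled first.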
Outcomes
| Metric | Result |
|---|---|
| Difficulty calibration | Per-learner per-session tuning, targeting ~70% success rate |
| Citation coverage | 100% — every generated item traces to source material |
| Deployment | Dockerised, reproducible builds, CI/CD |
| Teacher audit time | ~30 seconds per flagged item (vs hours of manual review) |
| Time-to-content for a new course | From hours-per-week of teacher writing to minutes of system generation |
What we learned
Citation coverage is the trust unlock. Without it, AI-generated learning content is unsellable to schools. With it, the conversation shifts from "is this trustworthy?" to "this is more trustworthy than the textbook because I can see the sources."
Adaptive difficulty is harder than it sounds. Targeting a 70% success rate per learner per session requires both per-learner state and good item difficulty estimation, because an observed success rate reflects the learner's ability and the item's difficulty at the same time. We spent a meaningful portion of the engagement on the calibration loop.
Teacher tooling is the product. The learner-facing experience matters, but the system rises or falls on the teacher tooling. Without the eval dashboard, teachers wouldn't trust the platform. Without the trust, the platform wouldn't get used.
Dockerised reproducibility matters more in EdTech than in most domains. Schools have weird IT constraints, regulatory environments differ by region, and the deployment surface needs to be predictable. Investing in clean Docker + CI/CD up front saved weeks at deployment time.
Stack rationale
We chose Gemini for its long-context handling — relevant for processing entire syllabi as input — and for its strong structured-output support, which we needed for the typed generation schema. Cost per generation was competitive with other providers for the volume we expected.
Next.js 14 App Router for the application. Vitest for testing. Docker for deployment isolation per tenant. Postgres + pgvector for the per-tenant indices. TypeScript end-to-end to keep the small team productive.
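For readers unfamiliar with pgvector, a tenant-scoped similarity lookup could be built along the following lines. This is a sketch under assumptions: the `syllabus_chunks` table, its column names, and the single-table-with-`tenant_id` layout are hypothetical (the per-tenant indices could equally be separate schemas or tables). The `<=>` operator is pgvector's cosine distance, and the `$1`-style placeholders follow the convention used by node-postgres.

```typescript
interface ParameterisedQuery {
  text: string;
  values: (string | number)[];
}

// Build a parameterised nearest-neighbour query restricted to one tenant.
// Table and column names are hypothetical.
function buildChunkSearchQuery(
  tenantId: string,
  queryEmbedding: number[],
  topK = 5,
): ParameterisedQuery {
  if (queryEmbedding.length === 0) {
    throw new Error("queryEmbedding must not be empty");
  }
  return {
    text:
      "SELECT id, content FROM syllabus_chunks " +
      "WHERE tenant_id = $1 " + // hard tenant filter before any ranking
      "ORDER BY embedding <=> $2::vector " + // pgvector cosine distance
      "LIMIT $3",
    // pgvector accepts a vector as a bracketed, comma-separated text literal.
    values: [tenantId, `[${queryEmbedding.join(",")}]`, topK],
  };
}
```

Passing the tenant ID as a bound parameter, rather than interpolating it into the SQL string, keeps the isolation filter out of reach of injection.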
Where to go next
If you want the deeper technical breakdown of how we did the retrieval and citation layer, see our RAG patterns post. If you're considering an AI-driven learning or assessment build, drop us a note and we'll come back within a business day.