Case study · EdTech

Mastery

AI-powered learning platform on Google Generative AI

Client: Mastery
Timeline: 6 months
Stack: Next.js 14, Google Generative AI (Gemini), Vitest, Docker, TypeScript

At a glance

Client: Mastery (EdTech)
Industry: Education technology
Engagement: 6 months (discovery, prototype, MVP, ongoing iteration)
Stack: Next.js 14, Google Generative AI (Gemini), Vitest, Docker, TypeScript
Status: Live, in production

The challenge

Off-the-shelf learning management systems can't keep up with the speed of real teaching. One-size-fits-all quizzes don't move the needle on retention. Teachers were spending hours each week writing custom assessment items for their learners, then more hours reviewing performance to figure out what to teach next.

The brief: a learning platform that generates per-learner content and quizzes that are measurably grounded in the syllabus, adapts to each learner's prior performance, and gives teachers a way to audit every generated item.

The hard part was not generating content. LLMs are great at that. The hard part was making the generated content trustworthy at scale — every item linked to source material, every difficulty decision explainable, every teacher able to verify what the system was doing for their learners.

What we built

A multi-tenant Next.js application backed by Google Generative AI (Gemini). Four key pieces:

1. Syllabus ingestion

Teachers upload syllabus documents, course materials, and reference texts. The ingestion pipeline parses, chunks, embeds, and indexes the content into a per-tenant vector store with strict metadata (course → unit → lesson → page).
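As a rough illustration of the chunking step, the sketch below splits one page of text into overlapping word windows and attaches the course → unit → lesson → page metadata to each chunk. The `ChunkMetadata` shape, the ID format, and the `maxWords` and `overlap` defaults are assumptions made for this example, not the production values.

```typescript
// Hypothetical metadata shape mirroring the course → unit → lesson → page hierarchy.
interface ChunkMetadata {
  tenantId: string;
  course: string;
  unit: string;
  lesson: string;
  page: number;
}

interface Chunk {
  id: string;
  text: string;
  metadata: ChunkMetadata;
}

// Split one page of text into overlapping windows of words.
// maxWords and overlap are illustrative defaults, not tuned values.
function chunkPage(
  pageText: string,
  metadata: ChunkMetadata,
  maxWords = 200,
  overlap = 40,
): Chunk[] {
  if (overlap >= maxWords) {
    throw new Error("overlap must be smaller than maxWords");
  }
  const words = pageText.split(/\s+/).filter((w) => w.length > 0);
  const chunks: Chunk[] = [];
  const step = maxWords - overlap;
  for (let start = 0; start < words.length; start += step) {
    const text = words.slice(start, start + maxWords).join(" ");
    // Assumed ID format: hierarchy path plus the chunk's position on the page.
    const id = `${metadata.tenantId}:${metadata.course}:${metadata.unit}:${metadata.lesson}:p${metadata.page}:${chunks.length}`;
    chunks.push({ id, text, metadata });
    if (start + maxWords >= words.length) break; // this window reached the end of the page
  }
  return chunks;
}
```

Carrying the full hierarchy on every chunk is what lets a later citation point back to an exact page rather than to the course as a whole.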

2. Per-learner content generation

When a learner starts a session, the system pulls their performance history and current syllabus position. The generation engine retrieves relevant chunks from the syllabus index, builds a prompt conditioned on prior performance ("learner struggled with X, aced Y"), and produces structured content (explanations, examples, and quiz items) with inline citations to the source material.
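A minimal sketch of how such a prompt might be assembled is shown below. The `RetrievedChunk` and `PerformanceSummary` shapes, the `[S#]` citation labels, and the instruction wording are all hypothetical; the actual prompt used in the project is not reproduced here.

```typescript
// Hypothetical shapes; the real schema is not specified in this case study.
interface RetrievedChunk {
  id: string; // stable chunk ID from the syllabus index
  text: string; // passage text
}

interface PerformanceSummary {
  struggledWith: string[];
  mastered: string[];
}

// Assemble a grounded prompt: numbered sources first, then learner context,
// then an instruction to cite sources by their [S#] label.
function buildPrompt(
  topic: string,
  chunks: RetrievedChunk[],
  performance: PerformanceSummary,
): string {
  const sources = chunks
    .map((c, i) => `[S${i + 1}] (chunk ${c.id}) ${c.text}`)
    .join("\n");
  const struggled = performance.struggledWith.join(", ") || "none recorded";
  const mastered = performance.mastered.join(", ") || "none recorded";
  return [
    `Topic: ${topic}`,
    "Sources:",
    sources,
    `Learner struggled with: ${struggled}`,
    `Learner has mastered: ${mastered}`,
    "Write an explanation, one worked example, and three quiz items.",
    "Use only the sources above and cite each claim with its [S#] label.",
  ].join("\n");
}
```

Numbering the sources in the prompt gives the model a fixed vocabulary of labels to cite, which makes the citations checkable after generation.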

3. Adaptive difficulty engine

Quiz items are generated at a difficulty level that targets a ~70% success rate per learner. The engine adjusts after each session: a success rate below the target means easier items next time; a rate above it means stretch items.
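One simple way to express that adjustment is a proportional step toward the target, sketched below. The 0 to 1 difficulty scale and the `gain` parameter are assumptions for illustration; the calibration loop actually used in the project is not described in detail here.

```typescript
const TARGET_SUCCESS_RATE = 0.7;

// Move difficulty up when the learner beats the target and down when they miss it.
// `current` is on an assumed 0..1 scale; `gain` (hypothetical) controls step size.
function nextDifficulty(
  current: number,
  correct: number,
  total: number,
  gain = 0.5,
): number {
  if (total <= 0) return current; // no evidence this session, keep the level
  const observed = correct / total;
  const adjusted = current + gain * (observed - TARGET_SUCCESS_RATE);
  return Math.min(1, Math.max(0, adjusted)); // clamp to the valid range
}
```

A session at exactly the target leaves the level unchanged, and the size of each move scales with how far the learner landed from 70%.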

4. Teacher eval dashboard

The novel part. Every generated item — every quiz question, every explanation — is auditable. Teachers see:

  • The source material it was generated from (link to the exact passage)
  • The difficulty target and the actual success rate per learner cohort
  • Flag-and-review controls for items the teacher believes are incorrect
  • Trend lines showing item quality over time

This is what made the system trustworthy to teachers. Without it, "AI-generated content" felt like a coin flip. With it, teachers could see exactly what the system was doing and intervene.
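The per-cohort success rate shown on the dashboard can be thought of as a simple aggregation over attempts, along the lines of the sketch below. The `Attempt` record and its field names are hypothetical.

```typescript
// Hypothetical attempt record; field names are illustrative.
interface Attempt {
  itemId: string;
  cohort: string;
  correct: boolean;
}

// Compute the observed success rate for one item, broken down by cohort.
function successRateByCohort(
  attempts: Attempt[],
  itemId: string,
): Record<string, number> {
  const tallies: Record<string, { correct: number; total: number }> = {};
  for (const a of attempts) {
    if (a.itemId !== itemId) continue;
    const tally = tallies[a.cohort] ?? { correct: 0, total: 0 };
    tally.total += 1;
    if (a.correct) tally.correct += 1;
    tallies[a.cohort] = tally;
  }
  const rates: Record<string, number> = {};
  for (const cohort of Object.keys(tallies)) {
    rates[cohort] = tallies[cohort].correct / tallies[cohort].total;
  }
  return rates;
}
```

Comparing these rates with the item's difficulty target is one way a teacher could spot an item that is much harder or easier than intended.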

Architecture

[Teacher] → [Course upload + ingestion]
                ↓
[Course index (Gemini embeddings + per-tenant vector store)]
                ↓
[Learner session start] → [Performance history fetch]
                ↓
[Content generation request]
   ├─ Retrieve syllabus chunks via vector search
   ├─ Condition on per-learner performance
   ├─ Call Gemini with structured output schema
   └─ Validate, store, log, render
                ↓
[Learner takes session] → [Performance updates]
                ↓
[Teacher eval dashboard] ← all generated items, ratings, sources

Every layer is observable. Every generated item has a stable ID, a source citation list, a difficulty target, and a per-cohort outcome metric.
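To make the "Validate" step concrete, here is a sketch of a check that could run on each parsed model response before it is stored. The `GeneratedItem` fields and the rule that every citation must match a retrieved chunk ID are assumptions for this example; the project's actual schema is not shown in this case study.

```typescript
// Hypothetical item shape; the production schema may differ.
interface GeneratedItem {
  id: string;
  question: string;
  answer: string;
  citations: string[]; // chunk IDs the item claims as sources
  difficultyTarget: number; // assumed 0..1 scale
}

type ValidationResult =
  | { ok: true; item: GeneratedItem }
  | { ok: false; errors: string[] };

// Reject items with missing fields or citations that were never retrieved.
function validateItem(raw: unknown, retrievedIds: string[]): ValidationResult {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, errors: ["item is not an object"] };
  }
  const record = raw as Record<string, unknown>;
  const errors: string[] = [];
  for (const key of ["id", "question", "answer"]) {
    const value = record[key];
    if (typeof value !== "string" || value.trim() === "") {
      errors.push(`${key} must be a non-empty string`);
    }
  }
  const difficulty = record["difficultyTarget"];
  if (typeof difficulty !== "number" || !(difficulty >= 0 && difficulty <= 1)) {
    errors.push("difficultyTarget must be a number between 0 and 1");
  }
  const citations = record["citations"];
  if (!Array.isArray(citations) || citations.length === 0) {
    errors.push("citations must be a non-empty array");
  } else {
    for (const c of citations) {
      if (typeof c !== "string" || retrievedIds.indexOf(c) === -1) {
        errors.push(`citation ${String(c)} does not match a retrieved chunk`);
      }
    }
  }
  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, item: raw as GeneratedItem };
}
```

Checking citations against the set of chunks that were actually retrieved is what turns "every item has a citation" into "every citation points at real source material".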

Key features shipped

  • Per-learner content generation grounded in the course syllabus, with inline citations.
  • Adaptive quiz difficulty tuned per session to target a ~70% success rate.
  • Teacher-facing eval dashboard showing every generated item with provenance and outcomes.
  • Source citation per item — every claim links back to the source passage in the syllabus.
  • Progress retention modelling that predicts which concepts a learner is at risk of forgetting and schedules review.
  • Multi-tenant isolation so each institution's syllabus and learner data is fully separated.
  • Dockerised deployment with reproducible builds and CI/CD pipelines.
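The retention modelling feature can be illustrated with a classic exponential forgetting curve, where predicted recall decays with time since the last review. This is a generic textbook model chosen for the sketch, not necessarily the model used in the project; `stabilityDays` and the review `threshold` are hypothetical parameters.

```typescript
// Hypothetical per-concept state for one learner.
interface ConceptState {
  concept: string;
  daysSinceReview: number;
  stabilityDays: number; // assumed: larger means slower forgetting
}

// Exponential forgetting curve: recall = exp(-t / stability).
function predictedRecall(state: ConceptState): number {
  return Math.exp(-state.daysSinceReview / state.stabilityDays);
}

// Return concepts whose predicted recall is below the threshold,
// most at-risk first. The 0.6 default is illustrative.
function conceptsDueForReview(
  states: ConceptState[],
  threshold = 0.6,
): string[] {
  return states
    .filter((s) => predictedRecall(s) < threshold)
    .sort((a, b) => predictedRecall(a) - predictedRecall(b))
    .map((s) => s.concept);
}
```

Sorting by predicted recall means the review queue starts with the concept the learner is most likely to have forgotten.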

Outcomes

Metric: Result
Difficulty calibration: Per-learner per-session tuning, targeting a ~70% success rate
Citation coverage: 100% (every generated item traces to source material)
Deployment: Dockerised, reproducible builds, CI/CD
Teacher audit time: ~30 seconds per flagged item (vs hours of manual review)
Time-to-content for a new course: From hours-per-week of teacher writing to minutes of system generation

What we learned

Citation coverage is the trust unlock. Without it, AI-generated learning content is unsellable to schools. With it, the conversation shifts from "is this trustworthy?" to "this is more trustworthy than the textbook because I can see the sources."

Adaptive difficulty is harder than it sounds. Targeting a 70% success rate per learner per session requires both per-learner state and good item difficulty estimation. We spent a meaningful portion of the engagement on the calibration loop.
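Part of what makes item difficulty estimation hard is that a new item has very few attempts, so its raw success rate is noisy. One common remedy, sketched below, is to blend the observed rate with a prior. The estimator, `priorRate`, and `priorWeight` are assumptions for illustration; the estimator actually used in the project is not specified here.

```typescript
// Smoothed success-rate estimate for one item.
// With few attempts the estimate stays near priorRate; with many attempts
// it converges to the observed rate. Both prior values are hypothetical.
function estimateItemSuccessRate(
  correct: number,
  attempts: number,
  priorRate = 0.7,
  priorWeight = 10,
): number {
  if (attempts < 0 || correct < 0 || correct > attempts) {
    throw new Error("expected 0 <= correct <= attempts");
  }
  return (correct + priorRate * priorWeight) / (attempts + priorWeight);
}
```

For instance, an item answered wrongly by its first two learners is estimated at 7/12 (about 0.58) rather than 0, so it is not immediately treated as impossibly hard.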

Teacher tooling is the product. The learner-facing experience matters, but the system rises or falls on the teacher tooling. Without the eval dashboard, teachers wouldn't trust the platform. Without the trust, the platform wouldn't get used.

Dockerised reproducibility matters more in EdTech than in most domains. Schools have weird IT constraints, regulatory environments differ by region, and the deployment surface needs to be predictable. Investing in clean Docker + CI/CD up front saved weeks at deployment time.

Stack rationale

We chose Gemini for its long-context handling — relevant for processing entire syllabi as input — and for its strong structured-output support, which we needed for the typed generation schema. Cost per generation was competitive with other providers for the volume we expected.

Next.js 14 App Router for the application. Vitest for testing. Docker for deployment isolation per tenant. Postgres + pgvector for the per-tenant indices. TypeScript end-to-end to keep the small team productive.
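For a sense of what retrieval against pgvector can look like, the sketch below builds a parameterised nearest-neighbour query using pgvector's `<=>` cosine-distance operator. The `syllabus_chunks` table, its column names, and the choice to scope by a `tenant_id` column (rather than, say, a separate database per tenant) are assumptions for this example.

```typescript
interface SqlQuery {
  text: string;
  values: (string | number)[];
}

// Build a tenant-scoped similarity search. Table and column names are hypothetical.
function buildChunkSearchQuery(
  tenantId: string,
  queryEmbedding: number[],
  limit = 5,
): SqlQuery {
  if (queryEmbedding.length === 0) {
    throw new Error("queryEmbedding must not be empty");
  }
  // pgvector accepts vectors in the text form "[x1,x2,...]".
  const vectorLiteral = `[${queryEmbedding.join(",")}]`;
  const text = [
    "SELECT id, content, course, unit, lesson, page",
    "FROM syllabus_chunks",
    "WHERE tenant_id = $2",
    "ORDER BY embedding <=> $1::vector", // <=> is cosine distance in pgvector
    "LIMIT $3",
  ].join("\n");
  return { text, values: [vectorLiteral, tenantId, limit] };
}
```

The returned `text` and `values` pair is the shape most Postgres clients accept for parameterised queries, which keeps the tenant ID and the embedding out of the SQL string itself.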

Where to go next

If you want the deeper technical breakdown of how we did the retrieval and citation layer, see our RAG patterns post. If you're considering an AI-driven learning or assessment build, drop us a note and we'll come back within a business day.
