Babelio Playbook · 20
AI Addendum
Engineering onboarding · The AI parts a SaaS playbook skips

The five AI-specific things a new engineer must understand in one read: what each translated minute costs us, how evals gate every deploy, why a provider can't lock us in, what telemetry compounds into a moat, and how prompts are versioned. Numbers are inherited from tech.md §8, the single source of truth, and are not re-derived here.

Dub COGS: $0.50/active-hr
Subtitle COGS: $0.31/active-hr
Eval bar: <700ms
Lock-in hedge: LiteLLM
Read this if you're new: Babelio is not a RAG or agent app. It is a fixed STT → MT → TTS streaming pipeline with a <700ms latency budget. There's no agent loop to reason about. The "AI" you operate is three swappable provider legs behind a proxy, gated by evals, instrumented for a telemetry flywheel. Everything below is operational: cost cells you can recompute, a procedure you can run, a CI gate you can read.

1 · AI cost-of-goods model

Purpose: every engineer can name what a translated minute costs and which leg dominates.
Bar: you can recompute the per-active-hour number from the cells below without asking the founder.
Refresh: quarterly + on any provider price change

The make-or-break insight: STT, not TTS, is ~60% of dub COGS. Anyone who tells you "TTS is 80% of the bill" is reading a stale doc. Per-leg unit prices (May 2026): Deepgram Nova-3 multilingual $0.0092/min, Cartesia Sonic $0.006/min, Gemini 2.5 Flash-Lite $0.10 input / $0.40 output per 1M tokens.

Leg | Provider / model | Full-speech hr | % of dub
STT | Deepgram Nova-3 multilingual | $0.552 | 60%
MT | Gemini 2.5 Flash-Lite | $0.006 | 1%
TTS | Cartesia Sonic | $0.360 | 39%
Dub total | | $0.918 | 100%
Subtitle (no TTS) | STT + MT | $0.558 | 61%
Full-speech = continuous talking, no pauses. Real calls are not full-speech (next card).
Per active-hour (~55% speech duty cycle)
Dub = $0.918/hr × 0.55 = ~$0.50 / active-hr
Subtitle = $0.558/hr × 0.55 = ~$0.31 / active-hr
Per heavy user (2 hr/day × 22 days = 44 active-hrs) = 44 × $0.50 = ~$22/mo dub COGS
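The cells above can be recomputed in a few lines. The sketch below is illustrative only: the per-minute prices, the $0.006 MT hourly figure, and the 55% duty cycle are copied from this section, while every identifier (`hourly_costs`, `DUTY_CYCLE`, and so on) is hypothetical and not taken from the codebase.

```python
# Recompute per-active-hour COGS from the unit prices in this section.
# Identifiers are illustrative; the numbers are copied from the table above.

STT_PER_MIN = 0.0092        # Deepgram Nova-3 multilingual, $/min
TTS_PER_MIN = 0.006         # Cartesia Sonic, $/min
MT_PER_HOUR = 0.006         # MT leg, $/full-speech hr (taken from the table, not re-derived from tokens)
DUTY_CYCLE = 0.55           # share of an active hour that is actual speech
HEAVY_USER_HOURS = 2 * 22   # 2 hr/day x 22 days


def hourly_costs() -> dict[str, float]:
    """Full-speech hourly cost per leg, in dollars."""
    return {
        "STT": STT_PER_MIN * 60,
        "MT": MT_PER_HOUR,
        "TTS": TTS_PER_MIN * 60,
    }


legs = hourly_costs()
dub_full = sum(legs.values())               # STT + MT + TTS
subtitle_full = legs["STT"] + legs["MT"]    # no TTS leg

dub_active = dub_full * DUTY_CYCLE
subtitle_active = subtitle_full * DUTY_CYCLE
heavy_user_month = HEAVY_USER_HOURS * round(dub_active, 2)

for leg, cost in legs.items():
    print(f"{leg}: ${cost:.3f}/hr ({cost / dub_full:.0%} of dub)")
print(f"Dub: ${dub_full:.3f}/hr full-speech, ~${dub_active:.2f}/active-hr")
print(f"Subtitle: ${subtitle_full:.3f}/hr full-speech, ~${subtitle_active:.2f}/active-hr")
print(f"Heavy user: ~${heavy_user_month:.0f}/mo dub COGS")
```

Keeping the prices as named constants means a provider price change is a one-line edit followed by a rerun.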

Sensitivity — what breaks the margin

If this moves… | Dub COGS/hr | Δ GM @ $12 plan | Action / owner
Baseline (55% duty, retail APIs) | $0.50 | ref |
Duty cycle 55% → 80% (chatty calls) | $0.73 | −46% | Meter dub-minutes; soft-throttle to subtitle at cap. Owner: backend
STT price 2× (Deepgram hike) | $0.80 | −60% | On-device Whisper-turbo kills the $0.55 leg; highest-leverage hedge. Owner: pipeline
MT tokens 2× (verbose model) | $0.51 | ~0 pts | MT is 1% of COGS; ignore. Spend quality budget freely here.
Switch TTS → ElevenLabs ($0.015/min) | $0.65 | −30% | Premium/clone tier only; never default. Owner: backend
Self-host TTS on GPU (Growth) | $0.43 | +13% | Growth-stage option, not a requirement; Cartesia is already profitable.
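A minimal sketch of how the duty-cycle and price-multiplier rows can be recomputed, assuming the baseline leg costs from the COGS table. The function and parameter names are hypothetical. The table's figures are rounded, so a recomputed value can differ from it by about a cent.

```python
# Recompute dub COGS per active hour under a few sensitivity scenarios.
# Baseline cells are copied from the COGS table; names are illustrative.

BASE_LEGS = {"STT": 0.552, "MT": 0.006, "TTS": 0.360}  # $/full-speech hr
BASE_DUTY = 0.55


def dub_cogs(duty: float = BASE_DUTY, stt_mult: float = 1.0, mt_mult: float = 1.0) -> float:
    """Dub COGS in $/active-hr for a duty cycle and per-leg price multipliers."""
    full_speech = (
        BASE_LEGS["STT"] * stt_mult
        + BASE_LEGS["MT"] * mt_mult
        + BASE_LEGS["TTS"]
    )
    return full_speech * duty


baseline = dub_cogs()
scenarios = {
    "duty 55% -> 80%": dub_cogs(duty=0.80),
    "STT price 2x": dub_cogs(stt_mult=2.0),
    "MT tokens 2x": dub_cogs(mt_mult=2.0),
}
for name, cost in scenarios.items():
    change = (cost - baseline) / baseline
    print(f"{name}: ${cost:.2f}/active-hr ({change:+.0%} COGS vs baseline)")
```

The MT multiplier barely moves the result, which is the point of the "ignore" row: the lever worth modelling is the STT leg.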
The one rule: If margin is under pressure, the first lever is on-device STT, not self-host TTS. STT dominates. Defaults are Cartesia + Deepgram retail because they already clear the $0.50 ceiling. We meter dub-minutes and degrade dub → subtitle (~40% cheaper) before we ever refuse a user.
→ See 10-financial-model for GM%/runway downstream · 05-pricing for tier COGS · source: tech.md §8.
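The meter-then-degrade rule can be sketched as a small state holder. This is an assumption-laden illustration: the class name, the cap value, and the idea of a per-user monthly cap in minutes are hypothetical, and only the behaviour (dub under the cap, subtitle at the cap, never a refusal) comes from the rule above.

```python
from dataclasses import dataclass


@dataclass
class DubMeter:
    """Tracks dub-minutes against a soft cap (cap value is a hypothetical plan setting)."""

    cap_minutes: float
    used_minutes: float = 0.0

    def record(self, minutes: float) -> None:
        """Add metered dub time for the current billing period."""
        self.used_minutes += minutes

    def mode(self) -> str:
        """Return 'dub' under the cap and 'subtitle' at or over it; there is no refuse state."""
        return "dub" if self.used_minutes < self.cap_minutes else "subtitle"


meter = DubMeter(cap_minutes=600.0)
meter.record(590.0)
print(meter.mode())   # still under the cap
meter.record(15.0)
print(meter.mode())   # over the cap: degrade, do not refuse
```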

2 · Eval harness spec

Purpose: no deploy ships without passing babelio-evals.
Bar: a new engineer can read the gate, run it locally, and know exactly which metric blocked a PR.
Refresh: continuous (runs every push)

For a real-time translator the eval that matters is not BLEU — it's end-to-end p95 latency on real clips. A perfect translation that lands 1.2s late is a failed dub. So we gate on two harnesses at once: Promptfoo for MT quality + glossary, and a custom latency-budget harness that replays labeled audio clips through the full STT→MT→TTS path. Either regresses → CI goes red → deploy blocked. Online Braintrust/Langfuse traces catch prod-vs-eval drift.

Metric | Definition | Gate (block if) | Tool
E2E latency p95 | speech-end → dubbed-audio-start, replayed clips | > 700ms | custom harness
WER | word error rate of STT vs reference transcript | > 12% | Promptfoo
Hallucination rate | invented content not in source (LLM-judge + spot-check) | > 2% | Promptfoo / Braintrust
MT quality (chrF / COMET) | semantic adequacy vs reference translation | drop > 2 pts | Promptfoo
Glossary-hit rate | % of term-locked names/jargon rendered correctly | < 95% | Promptfoo
Latency and quality are independent gates — a regression in either blocks the deploy.
Eval set
30+ labeled clips across 5 language pairs × 3 registers (conversational / lecture / stream-banter). Each clip carries src, ref, glossary terms. Grows from the flywheel (§4) — user corrections become new labeled cases.
CI gate
Every push → Rust + Python typecheck/lint/unit → babelio-evals (Promptfoo + latency harness) → red blocks merge. Desktop builds promote manually after notarization.
Online
Braintrust + Langfuse log every prod session's per-leg latency spans. If prod p95 drifts from eval p95, the clip set is stale → add the failing scenario.
# babelio-evals — CI gate (pseudo)
npx promptfoo eval -c evals/mt.yaml      # WER, hallucination, chrF, glossary
python -m evals.latency --clips data/   # replay STT→MT→TTS, assert p95

assert p95_ms      < 700
assert wer         < 0.12
assert halluc      < 0.02
assert chrf_drop   < 2.0
assert glossary    > 0.95   # any failure → exit 1 → merge blocked
→ Schema lives in evals + usage_ledger tables · source: tech.md §4 · feeds the flywheel (§4 below).
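The pseudo gate above could be made concrete roughly as follows. This is a sketch, not the real babelio-evals code: the thresholds follow the "block if" column of the metric table, the metric keys reuse the names from the pseudo block, and the nearest-rank p95 method plus all function names are assumptions.

```python
# Sketch of the merge gate: compute p95 from replayed clips, then list
# which metrics cross their bar. Function names are illustrative.

# (limit, direction that blocks), copied from the metric table.
GATES = {
    "p95_ms": (700.0, "above"),
    "wer": (0.12, "above"),
    "halluc": (0.02, "above"),
    "chrf_drop": (2.0, "above"),
    "glossary": (0.95, "below"),
}


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of per-clip latencies in milliseconds."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def blocking_metrics(metrics: dict[str, float]) -> list[str]:
    """Return a description of every metric that crosses its gate; empty means pass."""
    failures = []
    for name, (limit, blocks_when) in GATES.items():
        value = metrics[name]
        crossed = value > limit if blocks_when == "above" else value < limit
        if crossed:
            failures.append(f"{name}={value} (blocks when {blocks_when} {limit})")
    return failures


latencies_ms = [520, 540, 610, 480, 655, 590, 690, 630, 505, 575]  # made-up replay results
report = {
    "p95_ms": p95(latencies_ms),
    "wer": 0.09,
    "halluc": 0.01,
    "chrf_drop": 0.4,
    "glossary": 0.97,
}
failed = blocking_metrics(report)
print("PASS" if not failed else "BLOCKED: " + "; ".join(failed))
```

Returning the list of failing metrics, rather than a bare boolean, is what lets an engineer see exactly which metric blocked a PR; a CI wrapper would exit non-zero when the list is non-empty.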

3 · Model lock-in mitigation

Purpose: no single provider can hold our margin or uptime hostage.
Bar: any engineer can shadow-test an alternate provider on real traffic within one day, no code rewrite.
Refresh: quarterly provider-price review

Every provider call goes through a self-hosted LiteLLM proxy (on Fly). The client never holds provider keys — it authenticates to the proxy with a device-license JWT, and the proxy owns key custody, fallback, and spend caps. Swapping a provider is a config change, not a deploy. Pre-wired fallbacks per leg:

Leg | Primary | Fallback / trigger
STT | Deepgram Nova-3 | local Whisper-large-v3-turbo on 5xx / timeout >400ms
MT | Gemini 2.5 Flash-Lite | GPT-4o-mini fallback; escalate to Gemini Flash for clone/pro
TTS | Cartesia Sonic | ElevenLabs Flash v2.5 (also the premium/clone tier)
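To make the per-leg fallback behaviour concrete, here is a minimal sketch of "try the primary, fall back on error or timeout". It is not LiteLLM's API: in the real system the proxy owns this logic, and `call_with_fallback`, `ProviderError`, and the demo functions are hypothetical stand-ins. Only the 400ms STT timeout comes from the table above.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable


class ProviderError(Exception):
    """Stand-in for a provider 5xx response."""


_pool = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(
    primary: Callable[[bytes], str],
    fallback: Callable[[bytes], str],
    payload: bytes,
    timeout_s: float = 0.4,
) -> tuple[str, str]:
    """Try the primary leg; on ProviderError or timeout, serve the fallback.

    Returns (result, which_leg_served).
    """
    future = _pool.submit(primary, payload)
    try:
        return future.result(timeout=timeout_s), "primary"
    except (FutureTimeout, ProviderError):
        future.cancel()  # best effort; an already-running call is not interrupted
        return fallback(payload), "fallback"


def slow_stt(audio: bytes) -> str:
    """Simulates a hosted STT call that exceeds the timeout."""
    time.sleep(0.6)
    return "primary transcript"


def local_stt(audio: bytes) -> str:
    """Simulates the on-device fallback."""
    return "local transcript"


print(call_with_fallback(slow_stt, local_stt, b"\x00" * 320))
```

The caller always gets a result from one leg or the other, which is the property that keeps a mid-call provider outage from dropping the dub.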

Provider-swap procedure — shadow-test in one day

  1. Add the candidate model to litellm config.yaml under the leg's model_list with a shadow: tag. (~15 min)
  2. Mirror a sampled % of live traffic to the shadow model. It runs, but its output is not served; results land in Braintrust. (~30 min)
  3. Run babelio-evals against the shadow model on the labeled clip set (latency + WER + hallucination). (~1 hr)
  4. Compare shadow vs primary on the eval bar and live cost. If it passes the gate and is cheaper/faster, flip shadow: → primary in config. (~5 min)
  5. No client release needed: the proxy reloads config. Roll back instantly by reverting one line. (instant)
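Steps 2 and 4 of the procedure above can be sketched as two small functions: a deterministic sampler that decides which sessions are mirrored to the shadow model, and the promote decision. Both are illustrative assumptions; the hashing scheme, the `LegResult` fields, and the function names are hypothetical and do not describe how the proxy actually mirrors traffic.

```python
import hashlib
from dataclasses import dataclass


def mirrored_to_shadow(session_id: str, sample_pct: float) -> bool:
    """Deterministically pick roughly sample_pct percent of sessions for shadow traffic."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000   # 0..9999
    return bucket < sample_pct * 100


@dataclass
class LegResult:
    passes_gate: bool     # outcome of babelio-evals on the labeled clip set
    p95_ms: float         # latency on the same clip set
    cost_per_hr: float    # live cost, $/full-speech hr


def should_promote(primary: LegResult, shadow: LegResult) -> bool:
    """Step 4: promote only if the shadow passes the gate and is cheaper or faster."""
    if not shadow.passes_gate:
        return False
    return shadow.cost_per_hr < primary.cost_per_hr or shadow.p95_ms < primary.p95_ms


current = LegResult(passes_gate=True, p95_ms=650.0, cost_per_hr=0.552)
candidate = LegResult(passes_gate=True, p95_ms=640.0, cost_per_hr=0.400)  # made-up numbers
print(mirrored_to_shadow("session-123", sample_pct=5.0))
print(should_promote(current, candidate))
```

Hashing the session id, rather than drawing a random number per request, keeps a whole session on one side of the split, so shadow results stay comparable across a call.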
Why this is the hedge that matters: STT is both our dominant cost and our biggest single-vendor exposure. The on-device Whisper-turbo fallback isn't just an outage backstop. It is a price ceiling on Deepgram: if they hike, we shift load on-device. A mid-call provider outage must never drop the dub; LiteLLM fails over per-leg without the user noticing.

4 · Data flywheel

Purpose: the one defensible moat. Instrument it day-1 or every frontier release erodes us to parity with free.
Bar: an engineer knows exactly which signals to log on the first shipped build and why.
Refresh: continuous (logged every session)

Fantasy vs real (read first): "Lots of translated audio/transcripts" is NOT a moat. Base models don't learn from our traffic; any competitor with the same APIs regenerates the same transcripts. The audio tap is a head start, not a flywheel: it uses public CoreAudio/WASAPI APIs and is copyable in a quarter. The one real flywheel is per-app capture + dub-timing telemetry that GPT-N never sees. Skip the instrumentation and there is no company.
CAPTURE (day-1, every session)      USE                        COMPOUND
───────────────────────────────     ──────────────────         ────────────────
app_id · codec · packet jitter  ──► grow babelio-evals     ──► "best dub profile
VAD trigger accuracy                clip set per scenario      per app/codec/net"
mute/unmute timing              ──► tune VAD + buffering   ──► a new entrant can't
partial-transcript stability        thresholds per app         re-derive without the
user nudges: "dub too late",    ──► each correction = a    ──► same install base
manual re-sync, mode switches       labeled eval pair          → SDK primitive ($10B)
        │                                                             ▲
        └──────────────── usage_ledger + evals tables ────────────────┘

This is a workflow-graph + eval-correction flywheel (two of the four "real" types). The schema already exists in the architecture — usage_ledger (per-leg cost, p95 latency, prompt version per billable second) and evals (clip, src/ref/hyp, quality, latency). Logging it from the first shipped build is non-negotiable, not a Growth-stage add. When models commoditize, this capture+eval layer is where value accrues — it's the asset the B2B/SDK platform thesis sells.
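As a rough illustration of the two tables and the correction loop, the sketch below models one usage_ledger row, one evals row, and the step that turns a user correction into a labeled eval pair. The field lists follow the parenthetical descriptions above; the exact column names, types, and the `correction_to_eval_case` helper are assumptions, not the real schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UsageLedgerRow:
    """One billable second for one leg; field names are illustrative."""

    session_id: str
    leg: str                  # "STT", "MT", or "TTS"
    cost_usd: float
    p95_latency_ms: float
    prompt_version: str


@dataclass(frozen=True)
class EvalCase:
    """One labeled clip: source, reference, hypothesis, plus scores once evaluated."""

    clip_id: str
    src: str
    ref: str
    hyp: str
    quality: Optional[float] = None
    latency_ms: Optional[float] = None


def correction_to_eval_case(clip_id: str, src: str, served: str, corrected: str) -> EvalCase:
    """Turn a user correction into a labeled pair: the correction becomes the reference."""
    return EvalCase(clip_id=clip_id, src=src, ref=corrected, hyp=served)


case = correction_to_eval_case(
    clip_id="clip-0001",            # hypothetical id
    src="Nos vemos el martes.",
    served="See you on Thursday.",
    corrected="See you on Tuesday.",
)
print(case)
```

Storing the served output as `hyp` alongside the corrected `ref` means each correction doubles as a regression case for the next model or prompt change.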

When it compounds (12-month estimate)

Month | Labeled per-app sessions (estimate)
M0 (launch) | ~0
M3 | ~3k
M6 | ~15k
M9 | ~45k
M12 | ~120k

Labeled per-app sessions accumulate super-linearly with installs. The asset isn't volume — it's coverage of the (app × codec × network) matrix. Around M6–M9 we should hold enough per-app profiles that a tuned capture+VAD config measurably beats a cold competitor on the same app. That delta is the moat candidate — and the proof point for the SDK pitch.
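Since the asset is coverage rather than volume, one way to track it is the share of (app × codec × network) cells that hold enough labeled sessions. The sketch below is hypothetical throughout: the session keys, the example app and codec values, and the `min_sessions` threshold are placeholders, not figures from this playbook.

```python
from collections import Counter
from itertools import product


def matrix_coverage(sessions, apps, codecs, networks, min_sessions=20):
    """Fraction of (app, codec, network) cells with at least min_sessions labeled sessions."""
    counts = Counter((s["app_id"], s["codec"], s["network"]) for s in sessions)
    cells = list(product(apps, codecs, networks))
    covered = sum(1 for cell in cells if counts[cell] >= min_sessions)
    return covered / len(cells)


# Made-up data: one well-covered cell and one thin cell.
sessions = (
    [{"app_id": "app-a", "codec": "opus", "network": "wifi"}] * 25
    + [{"app_id": "app-a", "codec": "opus", "network": "lte"}] * 3
)
coverage = matrix_coverage(sessions, ["app-a"], ["opus"], ["wifi", "lte"])
print(f"{coverage:.0%} of cells covered")
```

A metric like this rises only when new cells fill in, so 10,000 more sessions on an already-covered app leave it unchanged.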

→ Moat reasoning: 16-risk-register (copyable-moat risk) · source: ai-thesis.md §2, §4.

5 · Prompt management

Purpose: prompts are code: versioned, reviewed, regression-gated.
Bar: an engineer can change the MT prompt safely, because a PR can't merge if it regresses the eval set.
Refresh: per prompt change (PR-gated)

No prompt CMS day one. The MT system prompt + per-user glossary live as versioned modules in git (prompts/mt-v*.ts), and every deployed version is stamped into mt_prompt_versions (version, system_prompt, glossary_hash, deployed_at). Each billable second in usage_ledger records which prompt version produced it — so we can attribute a quality or cost shift to an exact prompt change. The glossary term-lock injector (names, jargon as a prompt prefix) is the cheap quality lever raw-NMT competitors skip.
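A minimal sketch of the two mechanisms in this paragraph: the glossary term-lock prefix and the version stamp. The real prompt modules are TypeScript (prompts/mt-v*.ts); Python is used here only to stay consistent with the other sketches in this addendum. The row keys mirror the mt_prompt_versions columns named above, while the hashing scheme, the prefix wording, and all function names are assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone


def glossary_hash(glossary: dict[str, str]) -> str:
    """Stable SHA-256 over the glossary, independent of key order."""
    canonical = json.dumps(glossary, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def with_term_lock(system_prompt: str, glossary: dict[str, str]) -> str:
    """Prefix the system prompt with locked term renderings (wording is hypothetical)."""
    if not glossary:
        return system_prompt
    lines = [f"- {src} => {dst}" for src, dst in sorted(glossary.items())]
    return "Always render these terms exactly:\n" + "\n".join(lines) + "\n\n" + system_prompt


def version_row(version: str, system_prompt: str, glossary: dict[str, str]) -> dict[str, str]:
    """Row shaped like mt_prompt_versions (version, system_prompt, glossary_hash, deployed_at)."""
    return {
        "version": version,
        "system_prompt": system_prompt,
        "glossary_hash": glossary_hash(glossary),
        "deployed_at": datetime.now(timezone.utc).isoformat(),
    }


glossary = {"Babelio": "Babelio", "standup": "daily"}   # made-up entries
print(with_term_lock("Translate the user's speech faithfully.", glossary))
print(version_row("mt-v1", "Translate the user's speech faithfully.", glossary)["glossary_hash"][:12])
```

Hashing a canonical, key-sorted form means two glossaries with the same entries always stamp the same hash, so a hash change in the ledger signals a real glossary change.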

  1. Edit a prompts/mt-v*.ts module in a branch — never hot-patch a prompt in prod.
  2. Open a PR. CI runs babelio-evals against the new prompt on the full clip set — same gate as code.
  3. Regression blocks merge. Any regression past the §2 gates (a chrF/COMET drop, a lower glossary-hit rate, or a higher hallucination rate) → red → PR can't merge. A second engineer reviews the diff + the eval delta.
  4. Deploy stamps the version. New row in mt_prompt_versions; usage_ledger starts attributing sessions to it. Roll back = redeploy the prior version row.
→ Prompt versions + glossary schema: tech.md §3, §4 · regression gate = the same babelio-evals from §2.

6 · AI Act / GDPR / safety

Purpose: the compliance gates that bind a synthetic-audio product reaching EU users: what applies, the response, and who owns it.
Bar: an engineer knows which gates block a release vs which can wait.
Refresh: quarterly + on any regime change

Regime | Applies? | Engineering response | Owner
EU AI Act Art. 50 transparency (binds 2 Aug 2026) | YES (primary) | Persistent "AI dub active" UI indicator; outputs marked machine-readable as synthetic; deepfake disclosure for clone mode. Follow EU Code of Practice on AI-content marking. | Client eng
EU AI Act Art. 4 AI literacy (live) | YES (all) | 1-page written AI-literacy policy + staff training log. Free, mandatory, don't skip. | Founder
GDPR (speech = personal data) | YES | Transient streaming, no default audio retention; on-device VAD/STT where hardware allows; honor erasure across transcripts. Pin no-train business tiers w/ providers. | Backend
Wiretap / two-party consent | YES (sleeper) | Don't record/store call audio by default; AUP clause + in-app nudge that recording others may need consent. | Client + legal
Voice-clone consent (ELVIS Act, CA §3344, EU) | Gates clone roadmap | Hard block on clone release until a documented consent flow ships; mirror ElevenLabs mandatory consent confirmation; Art. 50 deepfake disclosure. | Founder + legal
Mistranslation liability | YES | Confidence threshold on STT+MT → fall back to subtitle (preserve original audio) when low; never auto-mute a low-confidence segment. ToS output disclaimer. | Pipeline
Annex III high-risk | NO (stay out) | Don't market "exam" or "medical interpreting" without re-checking; that can pull us into high-risk (live Dec 2027). | Founder
SOC 2 | Not yet | Skip until a real $50K+ B2B deal is blocked; then Type I in 8–12 wks via Vanta/Drata. | Founder
Posture: Art. 4 policy now (free); Art. 50 UI + content-marking before any EU-reachable launch; GDPR-minimal pipeline; hard consent gate on clone. Everything else waits.
× Don't ship clone mode without the consent flow
Voice-clone is a regulated feature, not a UX toggle. No consent capture + provider-ToS compliance + deepfake disclosure = no release.

× Don't persist call audio by default
Live private meeting audio is the most sensitive payload possible. Transient streaming only; transcripts are opt-in; on-device where hardware allows.

× Don't auto-mute on low confidence
If STT+MT confidence is low, never kill the original speaker: degrade to subtitle and keep the original audio. A confidently wrong dub in a live meeting carries real liability.
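The low-confidence rule can be sketched as a single decision function with no mute branch. The threshold value, the choice to combine the two confidences with `min`, and the mode names are all assumptions for illustration. What happens to the original audio during a confident dub is not specified in this addendum, so the sketch only encodes the low-confidence guarantee.

```python
def output_mode(stt_conf: float, mt_conf: float, threshold: float = 0.8) -> str:
    """Pick the output for one segment.

    Low confidence on either leg degrades to subtitles while the original
    audio keeps playing. There is deliberately no 'mute' outcome.
    The 0.8 threshold is a placeholder, not a tuned value.
    """
    if min(stt_conf, mt_conf) >= threshold:
        return "dub"
    return "subtitle_with_original_audio"


print(output_mode(0.95, 0.91))   # both legs confident
print(output_mode(0.95, 0.42))   # MT unsure: degrade, keep the speaker audible
```

Taking the minimum of the two scores means one weak leg is enough to degrade, which errs toward showing a subtitle over speaking a wrong translation.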

→ Full compliance detail: legal-ops.md §3, §4 · risk owners: 16-risk-register · ToS clauses: legal-ops.md §2.
Babelio · Playbook artifact 20 of 24 · AI addendum: COGS · evals · lock-in · flywheel · prompts