Engineering onboarding · The AI parts a SaaS playbook skips

AI addendum

The five AI-specific things a new engineer must understand in one read: what each translated minute costs us, how evals gate every deploy, why a provider can't lock us in, what telemetry compounds into a moat, and how prompts are versioned. All numbers are inherited from tech.md §8, the single source of truth, and are not re-derived here.

Dub COGS: $0.50/active-hr
Subtitle COGS: $0.31/active-hr
Eval bar: <700ms
Lock-in hedge: LiteLLM

Read this if you're new

Babelio is not a RAG or agent app: it's a fixed STT → MT → TTS streaming pipeline with a <700ms latency budget. There's no agent loop to reason about. The "AI" you operate is three swappable provider legs behind a proxy, gated by evals, instrumented for a telemetry flywheel. Everything below is operational: cost cells you can recompute, a procedure you can run, a CI gate you can read.

1 · AI cost-of-goods model

Purpose: every engineer can name what a translated minute costs and which leg dominates. Bar: you can recompute the per-active-hour number from the cells below without asking the founder. Refresh: quarterly + on any provider price change

The make-or-break insight: STT is ~60% of dub COGS, not TTS. Anyone who tells you "TTS is 80% of the bill" is reading a stale doc. Per-leg unit prices (May 2026): Deepgram Nova-3 multilingual $0.0092/min, Cartesia Sonic $0.006/min, Gemini 2.5 Flash-Lite $0.10 input / $0.40 output per 1M tokens.

Full-speech = continuous talking, no pauses. Real calls are not full-speech (next card).
Leg	Provider / model	Full-speech hr	% of dub
STT	Deepgram Nova-3 ml	$0.552	60%
MT	Gemini 2.5 Flash-Lite	$0.006	1%
TTS	Cartesia Sonic	$0.360	39%
Dub total	—	$0.918	100%
Subtitle (no TTS)	STT + MT	$0.558	61%

Per active-hour (~55% speech duty cycle)
Dub = $0.918/hr × 0.55 = ~$0.50 / active-hr
Subtitle = $0.558/hr × 0.55 = ~$0.31 / active-hr
Per heavy user (2 hr/day × 22 days = 44 active-hrs) = 44 × $0.50 = ~$22/mo dub COGS
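The arithmetic above can be recomputed in a few lines. This is a minimal sketch in Python, assuming the unit prices and the 55% duty cycle quoted in this section; the constant and function names are illustrative, and the MT figure is taken directly from the table row rather than derived from token counts.

```python
# Unit prices quoted in this section (May 2026). Names are illustrative.
STT_PER_MIN = 0.0092            # Deepgram Nova-3 multilingual, $/min
TTS_PER_MIN = 0.006             # Cartesia Sonic, $/min
MT_PER_FULL_SPEECH_HR = 0.006   # copied from the MT row of the table
DUTY_CYCLE = 0.55               # share of an active hour that is speech


def full_speech_hour(stt_per_min=STT_PER_MIN, tts_per_min=TTS_PER_MIN,
                     mt_per_hr=MT_PER_FULL_SPEECH_HR):
    """Return the per-leg cost of one hour of continuous speech."""
    return {"stt": stt_per_min * 60, "mt": mt_per_hr, "tts": tts_per_min * 60}


def per_active_hour(legs, duty_cycle=DUTY_CYCLE, include_tts=True):
    """Scale full-speech cost by the duty cycle; drop TTS for subtitle mode."""
    total = legs["stt"] + legs["mt"] + (legs["tts"] if include_tts else 0.0)
    return total * duty_cycle


legs = full_speech_hour()
dub = per_active_hour(legs)
subtitle = per_active_hour(legs, include_tts=False)
heavy_user_month = 2 * 22 * round(dub, 2)   # 2 hr/day x 22 days

print(f"dub ${dub:.2f}/active-hr, subtitle ${subtitle:.2f}/active-hr, "
      f"heavy user ${heavy_user_month:.0f}/mo")
```

Passing a different `duty_cycle` or unit price reproduces a sensitivity row; for example `per_active_hour(legs, duty_cycle=0.80)` rounds to $0.73.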

Sensitivity — what breaks the margin

If this moves…	Dub COGS/hr	Δ GM @ $12 plan	Action / owner
Baseline (55% duty, retail APIs)	$0.50	ref	—
Duty cycle 55% → 80% (chatty calls)	$0.73	−46%	Meter dub-minutes; soft-throttle to subtitle at cap. Owner: backend
STT price 2× (Deepgram hike)	$0.80	−60%	On-device Whisper-turbo kills the $0.55 leg — highest-leverage hedge. Owner: pipeline
MT tokens 2× (verbose model)	$0.51	~0 pts	MT is 1% of COGS — ignore. Spend quality budget freely here.
Switch TTS → ElevenLabs ($0.015/min)	$0.65	−30%	Premium/clone tier only — never default. Owner: backend
Self-host TTS on GPU (Growth)	$0.43	+13%	Growth-stage option, not a requirement — Cartesia is already profitable.

The one rule

If margin is under pressure, the first lever is on-device STT, not self-host TTS. STT dominates. Defaults are Cartesia + Deepgram retail because they already clear the $0.50 ceiling. We meter dub-minutes and degrade dub → subtitle (roughly 40% cheaper) before we ever refuse a user.
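The "meter, then degrade, never refuse" rule can be sketched as a small state holder. This is an illustration only: the `DubMeter` class and the cap value are hypothetical, since this section does not state a per-plan dub allowance.

```python
from dataclasses import dataclass


@dataclass
class DubMeter:
    """Tracks dub minutes for one user in the current billing period."""
    cap_minutes: float          # hypothetical per-plan dub allowance
    used_minutes: float = 0.0

    def mode_for_next_segment(self) -> str:
        # Below the cap we dub; at or above it we degrade to subtitles.
        # There is deliberately no "refuse" branch.
        return "dub" if self.used_minutes < self.cap_minutes else "subtitle"

    def record(self, speech_minutes: float) -> str:
        """Pick the mode for a segment and meter it if it was dubbed."""
        mode = self.mode_for_next_segment()
        if mode == "dub":
            self.used_minutes += speech_minutes
        return mode
```

Only dubbed segments count against the cap, so a user who has been degraded to subtitles stops accruing TTS cost.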

→ See 10-financial-model for GM%/runway downstream · 05-pricing for tier COGS · source: tech.md §8.

2 · Eval harness spec

Purpose: no deploy ships without passing babelio-evals. Bar: a new engineer can read the gate, run it locally, and know exactly which metric blocked a PR. Refresh: continuous (runs every push)

For a real-time translator the eval that matters is not BLEU — it's end-to-end p95 latency on real clips. A perfect translation that lands 1.2s late is a failed dub. So we gate on two harnesses at once: Promptfoo for MT quality + glossary, and a custom latency-budget harness that replays labeled audio clips through the full STT→MT→TTS path. Either regresses → CI goes red → deploy blocked. Online Braintrust/Langfuse traces catch prod-vs-eval drift.

Latency and quality are independent gates — a regression in either blocks the deploy.
Metric	Definition	Gate (block if)	Tool
E2E latency p95	speech-end → dubbed-audio-start, replayed clips	> 700ms	custom harness
WER	word error rate of STT vs reference transcript	> 12%	Promptfoo
Hallucination rate	invented content not in source (LLM-judge + spot-check)	> 2%	Promptfoo / Braintrust
MT quality (chrF / COMET)	semantic adequacy vs reference translation	drop > 2 pts	Promptfoo
Glossary-hit rate	% of term-locked names/jargon rendered correctly	< 95%	Promptfoo

Eval set

30+ labeled clips across 5 language pairs × 3 registers (conversational / lecture / stream-banter). Each clip carries src, ref, glossary terms. Grows from the flywheel (§4): user corrections become new labeled cases.

CI gate

Every push → Rust + Python typecheck/lint/unit → babelio-evals (Promptfoo + latency harness) → red blocks merge. Desktop builds promote manually after notarization.

Online

Braintrust + Langfuse log every prod session's per-leg latency spans. If prod p95 drifts from eval p95, the clip set is stale → add the failing scenario.
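The prod-vs-eval drift check reduces to comparing two percentiles. The sketch below uses the nearest-rank definition of p95 and an assumed 50 ms tolerance; neither the percentile method nor the tolerance is specified in this document, so treat both as placeholders.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer ceil(0.95 * n) avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def eval_set_is_stale(prod_ms, eval_ms, tolerance_ms=50.0):
    """Flag drift when prod p95 exceeds eval p95 by more than the tolerance.

    tolerance_ms is an assumed value, not one taken from tech.md.
    """
    return p95(prod_ms) - p95(eval_ms) > tolerance_ms
```

When the check fires, the remedy described above applies: add the failing prod scenario to the clip set rather than loosening the gate.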

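# babelio-evals: CI gate (pseudocode; thresholds mirror the gate table above)
npx promptfoo eval -c evals/mt.yaml      # WER, hallucination, chrF/COMET, glossary
python -m evals.latency --clips data/    # replay STT→MT→TTS, assert p95

# A metric passes only while it stays on the allowed side of its gate.
assert p95_ms      <= 700    # block if > 700 ms
assert wer         <= 0.12   # block if > 12%
assert halluc      <= 0.02   # block if > 2%
assert chrf_drop   <= 2.0    # block if drop > 2 pts
assert glossary    >= 0.95   # block if < 95%; any failure → exit 1 → merge blocked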
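A runnable version of the same gate might look like the sketch below. It assumes the two harnesses have already written their metrics into one dictionary; the key names and the `main` entry point are illustrative, and the thresholds are copied from the gate table.

```python
import sys

# (metric, threshold, direction): "max" blocks when value > threshold,
# "min" blocks when value < threshold.
GATES = [
    ("p95_ms", 700.0, "max"),
    ("wer", 0.12, "max"),
    ("halluc", 0.02, "max"),
    ("chrf_drop", 2.0, "max"),
    ("glossary", 0.95, "min"),
]


def blocked_metrics(results):
    """Return one human-readable reason per gate that the results violate."""
    reasons = []
    for name, threshold, direction in GATES:
        value = results[name]
        failed = value > threshold if direction == "max" else value < threshold
        if failed:
            reasons.append(f"{name}={value} violates {direction} {threshold}")
    return reasons


def main(results):
    """Print every blocking metric and return a CI exit code (0 = pass)."""
    reasons = blocked_metrics(results)
    for reason in reasons:
        print("BLOCKED:", reason, file=sys.stderr)
    return 1 if reasons else 0
```

Reporting every failing metric, instead of stopping at the first failed assertion, tells the PR author exactly which gates went red in a single run.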

→ Schema lives in evals + usage_ledger tables · source: tech.md §4 · feeds the flywheel (§4 below).

3 · Model lock-in mitigation

Purpose: no single provider can hold our margin or uptime hostage. Bar: any engineer can shadow-test an alternate provider on real traffic within one day, no code rewrite. Refresh: quarterly provider-price review

Every provider call goes through a self-hosted LiteLLM proxy (on Fly). The client never holds provider keys — it authenticates to the proxy with a device-license JWT, and the proxy owns key custody, fallback, and spend caps. Swapping a provider is a config change, not a deploy. Pre-wired fallbacks per leg:

Leg	Primary	Fallback / trigger
STT	Deepgram Nova-3	local Whisper-large-v3-turbo on 5xx / timeout >400ms
MT	Gemini 2.5 Flash-Lite	GPT-4o-mini fallback; escalate to Gemini Flash for clone/pro
TTS	Cartesia Sonic	ElevenLabs Flash v2.5 (also the premium/clone tier)
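The per-leg failover behaviour in the table can be illustrated with plain Python. This is not LiteLLM's implementation or API: `ProviderError`, `run_leg`, and the provider callables are hypothetical stand-ins, and each callable is assumed to enforce the timeout itself and raise `TimeoutError` when it is exceeded.

```python
from typing import Callable, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error a provider callable raises on a 5xx-style failure."""


# A provider is (name, call), where call(payload, timeout_s) returns the result
# or raises ProviderError / TimeoutError.
Provider = Tuple[str, Callable[[bytes, float], str]]


def run_leg(providers: Sequence[Provider], payload: bytes,
            timeout_s: float = 0.4) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, result) of the first success."""
    last_error = None
    for name, call in providers:
        try:
            return name, call(payload, timeout_s)
        except (ProviderError, TimeoutError) as err:
            last_error = err   # remember why this provider was skipped
    raise RuntimeError("all providers failed for this leg") from last_error
```

Returning the provider name alongside the result lets the caller log which provider actually served each segment.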

Provider-swap procedure — shadow-test in one day

1. Add the candidate model to litellm config.yaml under the leg's model_list with a shadow: tag. (~15 min)
2. Mirror a sampled % of live traffic to the shadow model; it runs but its output is not served, and results land in Braintrust. (~30 min)
3. Run babelio-evals against the shadow model on the labeled clip set (latency + WER + hallucination). (~1 hr)
4. Compare shadow vs primary on the eval bar and live cost. If it passes the gate and is cheaper/faster, flip shadow: → primary in config. (~5 min)
5. No client release needed: the proxy reloads config. Roll back instantly by reverting one line. (instant)
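Two pieces of the procedure above lend themselves to a short sketch: choosing which sessions to mirror, and the promote decision. Both functions are hypothetical illustrations; the hash-based sampling scheme and the metric keys (`cost_per_hr`, `p95_ms`) are assumptions, not part of the proxy configuration described here.

```python
import hashlib


def in_shadow_sample(session_id: str, percent: float) -> bool:
    """Deterministically pick about `percent`% of sessions to mirror.

    Hashing the session id keeps a session consistently in or out of the
    sample, so a whole call is mirrored rather than scattered segments.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000   # 0..9999
    return bucket < percent * 100


def should_promote(shadow: dict, primary: dict, gate_passed: bool) -> bool:
    """Promote only if the shadow passes the eval gate and is cheaper or faster."""
    cheaper = shadow["cost_per_hr"] < primary["cost_per_hr"]
    faster = shadow["p95_ms"] < primary["p95_ms"]
    return gate_passed and (cheaper or faster)
```

A shadow model that is cheaper but fails the gate is never promoted, which matches the order of steps 3 and 4.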

Why this is the hedge that matters

STT is both our dominant cost and our biggest single-vendor exposure. The on-device Whisper-turbo fallback isn't just an outage backstop; it's a price ceiling on Deepgram: if they hike, we shift load on-device. A mid-call provider outage must never drop the dub; LiteLLM fails over per-leg without the user noticing.

4 · Data flywheel

Purpose: the one defensible moat — instrument it day-1 or every frontier release erodes us to parity with free. Bar: an engineer knows exactly which signals to log on the first shipped build and why. Refresh: continuous (logged every session)

Fantasy vs real (read first)

"Lots of translated audio/transcripts" is NOT a moat. Base models don't learn from our traffic; any competitor with the same APIs regenerates the same transcripts. The audio tap is a head start, not a flywheel: it uses public CoreAudio/WASAPI APIs and is copyable in a quarter. The one real flywheel is per-app capture + dub-timing telemetry that GPT-N never sees. Skip the instrumentation and there is no company.

CAPTURE (day-1, every session)        USE                          COMPOUND
───────────────────────────────       ──────────────────           ────────────────
app_id · codec · packet jitter   ──►  grow babelio-evals      ──►  "best dub profile
VAD trigger accuracy                  clip set per scenario        per app/codec/net"
mute/unmute timing               ──►  tune VAD + buffering    ──►  a new entrant can't
partial-transcript stability          thresholds per app           re-derive without the
user nudges: "dub too late",     ──►  each correction = a     ──►  same install base
manual re-sync, mode switches         labeled eval pair            → SDK primitive ($10B)
        │                                                             ▲
        └──────────────── usage_ledger + evals tables ────────────────┘

This is a workflow-graph + eval-correction flywheel (two of the four "real" types). The schema already exists in the architecture — usage_ledger (per-leg cost, p95 latency, prompt version per billable second) and evals (clip, src/ref/hyp, quality, latency). Logging it from the first shipped build is non-negotiable, not a Growth-stage add. When models commoditize, this capture+eval layer is where value accrues — it's the asset the B2B/SDK platform thesis sells.
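As a concrete picture of what "logging it from the first shipped build" could mean, here is a sketch of one capture record and of turning a user correction into a labeled eval pair. The field names and units are illustrative guesses based on the signals named above; the authoritative schema is the usage_ledger and evals tables in tech.md.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SessionTelemetry:
    """Capture signals for one session (field names are illustrative)."""
    app_id: str
    codec: str
    packet_jitter_ms: float
    vad_trigger_accuracy: float           # assumed to be a 0..1 ratio
    mute_unmute_events: int
    partial_transcript_stability: float   # assumed to be a 0..1 ratio
    user_nudges: List[str] = field(default_factory=list)  # e.g. "dub too late"


@dataclass
class EvalPair:
    """One labeled case for the eval set, derived from a user correction."""
    clip_id: str
    src: str
    ref: str          # the user's corrected text becomes the reference
    hyp: str          # what the pipeline originally produced
    latency_ms: float


def correction_to_eval_pair(clip_id: str, src: str, hyp: str,
                            corrected: str, latency_ms: float) -> Optional[EvalPair]:
    """Return a new eval pair, or None when the correction changes nothing."""
    if corrected.strip() == hyp.strip():
        return None
    return EvalPair(clip_id=clip_id, src=src, ref=corrected.strip(),
                    hyp=hyp, latency_ms=latency_ms)
```

Dropping no-op corrections keeps the clip set from filling with cases the pipeline already handles.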

When it compounds (12-month estimate)

Month	Labeled per-app sessions (cumulative estimate)
M0 (launch)	~0
M3	~3k
M6	~15k
M9	~45k
M12	~120k

Labeled per-app sessions accumulate super-linearly with installs. The asset isn't volume — it's coverage of the (app × codec × network) matrix. Around M6–M9 we should hold enough per-app profiles that a tuned capture+VAD config measurably beats a cold competitor on the same app. That delta is the moat candidate — and the proof point for the SDK pitch.
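Since the asset is coverage rather than volume, a coverage ratio over the (app × codec × network) matrix is a natural number to track. The sketch below assumes each logged session carries `app_id`, `codec`, and `network` keys; those key names and the `min_sessions` threshold are illustrative.

```python
from collections import Counter
from itertools import product


def matrix_coverage(sessions, apps, codecs, networks, min_sessions=1):
    """Fraction of (app, codec, network) cells with at least min_sessions sessions."""
    counts = Counter((s["app_id"], s["codec"], s["network"]) for s in sessions)
    cells = list(product(apps, codecs, networks))
    if not cells:
        raise ValueError("the matrix has no cells")
    covered = sum(1 for cell in cells if counts[cell] >= min_sessions)
    return covered / len(cells)
```

A thousand sessions from one app on one codec still cover a single cell, which is why this ratio can stay flat while raw volume grows.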

→ Moat reasoning: 16-risk-register (copyable-moat risk) · source: ai-thesis.md §2, §4.

5 · Prompt management

Purpose: prompts are code — versioned, reviewed, regression-gated. Bar: an engineer can change the MT prompt safely, because a PR can't merge if it regresses the eval set. Refresh: per prompt change (PR-gated)

No prompt CMS day one. The MT system prompt + per-user glossary live as versioned modules in git (prompts/mt-v*.ts), and every deployed version is stamped into mt_prompt_versions (version, system_prompt, glossary_hash, deployed_at). Each billable second in usage_ledger records which prompt version produced it — so we can attribute a quality or cost shift to an exact prompt change. The glossary term-lock injector (names, jargon as a prompt prefix) is the cheap quality lever raw-NMT competitors skip.

1. Edit a prompts/mt-v*.ts module in a branch; never hot-patch a prompt in prod.
2. Open a PR. CI runs babelio-evals against the new prompt on the full clip set, the same gate as code.
3. Regression blocks merge. Any regression in chrF/COMET, glossary-hit, or hallucination rate → red → PR can't merge. A second engineer reviews the diff + the eval delta.
4. Deploy stamps the version. New row in mt_prompt_versions; usage_ledger starts attributing sessions to it. Roll back = redeploy the prior version row.
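The version stamp and the glossary term-lock prefix can be sketched as follows. The real prompt modules live in TypeScript under prompts/; this Python version is only an illustration, and the prefix wording, the hashing scheme, and the function names are assumptions. The row keys follow the mt_prompt_versions columns named above.

```python
import hashlib
import json
from datetime import datetime, timezone


def glossary_hash(glossary: dict) -> str:
    """Stable hash of a glossary, independent of insertion order."""
    canonical = json.dumps(glossary, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_system_prompt(base_prompt: str, glossary: dict) -> str:
    """Prefix the base prompt with term-locked names and jargon."""
    if not glossary:
        return base_prompt
    lines = [f"- {src} => {dst}" for src, dst in sorted(glossary.items())]
    header = "Always render these terms exactly:\n" + "\n".join(lines)
    return header + "\n\n" + base_prompt


def version_row(version: str, base_prompt: str, glossary: dict,
                deployed_at: datetime = None) -> dict:
    """Build the row a deploy would stamp into mt_prompt_versions."""
    stamp = deployed_at or datetime.now(timezone.utc)
    return {
        "version": version,
        "system_prompt": base_prompt,
        "glossary_hash": glossary_hash(glossary),
        "deployed_at": stamp.isoformat(),
    }
```

Hashing a canonical (sorted) serialization means two glossaries with the same terms always produce the same `glossary_hash`, so a changed hash reliably signals a changed glossary.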

→ Prompt versions + glossary schema: tech.md §3, §4 · regression gate = the same babelio-evals from §2.

6 · AI Act / GDPR / safety

Purpose: the compliance gates that bind a synthetic-audio product reaching EU users — what applies, the response, and who owns it. Bar: an engineer knows which gates block a release vs which can wait. Refresh: quarterly + on any regime change

Posture: Art. 4 policy now (free); Art. 50 UI + content-marking before any EU-reachable launch; GDPR-minimal pipeline; hard consent gate on clone. Everything else waits.
Regime	Applies?	Engineering response	Owner
EU AI Act Art. 50 transparency (binds 2 Aug 2026)	YES — primary	Persistent "AI dub active" UI indicator; outputs machine-readable as synthetic; deepfake disclosure for clone mode. Follow EU Code of Practice on AI-content marking.	Client eng
EU AI Act Art. 4 AI literacy (live)	YES — all	1-page written AI-literacy policy + staff training log. Free, mandatory, don't skip.	Founder
GDPR (speech = personal data)	YES	Transient streaming, no default audio retention; on-device VAD/STT where hardware allows; honor erasure across transcripts. Pin no-train business tiers w/ providers.	Backend
Wiretap / two-party consent	YES — sleeper	Don't record/store call audio by default; AUP clause + in-app nudge that recording others may need consent.	Client + legal
Voice-clone consent (ELVIS Act, CA §3344, EU)	Gates clone roadmap	Hard block on clone release until a documented consent flow ships; mirror ElevenLabs mandatory consent confirmation; Art. 50 deepfake disclosure.	Founder + legal
Mistranslation liability	YES	Confidence threshold on STT+MT → fall back to subtitle (preserve original audio) when low; never auto-mute a low-confidence segment. ToS output disclaimer.	Pipeline
Annex III high-risk	NO — stay out	Don't market "exam" or "medical interpreting" without re-checking — that can pull us into high-risk (live Dec 2027).	Founder
SOC 2	Not yet	Skip until a real $50K+ B2B deal is blocked; then Type I in 8–12 wks via Vanta/Drata.	Founder

Don't ship clone mode without the consent flow

Voice-clone is a regulated feature, not a UX toggle. No consent capture + provider-ToS compliance + deepfake disclosure = no release.

Don't persist call audio by default

Live private meeting audio is the most sensitive payload possible. Transient streaming only; transcripts are opt-in; on-device where hardware allows.

Don't auto-mute on low confidence

If STT+MT confidence is low, never kill the original speaker — degrade to subtitle and keep the original audio. A wrong confident dub in a live meeting carries real liability.
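The low-confidence rule can be written as a single decision function. This is a sketch under assumptions: the 0.8 threshold and the idea of taking the weaker of the two confidences are placeholders, since this document does not specify how STT and MT confidence are combined.

```python
def output_mode(stt_conf: float, mt_conf: float, threshold: float = 0.8) -> dict:
    """Decide how to render one segment.

    On low confidence the segment degrades to subtitles and the original
    audio keeps playing; there is no branch that mutes the speaker.
    """
    low_confidence = min(stt_conf, mt_conf) < threshold
    if low_confidence:
        return {"mode": "subtitle", "keep_original_audio": True}
    return {"mode": "dub", "keep_original_audio": False}
```

Here `keep_original_audio: False` in dub mode only means the dub is the primary audio; whether the original is ducked or removed is a product decision outside this sketch.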

→ Full compliance detail: legal-ops.md §3, §4 · risk owners: 16-risk-register · ToS clauses: legal-ops.md §2.
