Babelio — Investor Pitch

01 · Problem

Translation lives in silos.

A serious language learner watches native foreign content on desktop for hours a day. The live stream has no fan-sub. The lecture player is a native app, not a browser tab. So they pause every sentence, look up words, lose the flow — and give up.

Zoom captions stay in Zoom. YouTube subs stay in YouTube. Teams AI stays in Teams. Browser extensions cover only web tabs.
None of them reach a native installed desktop client — a desktop LMS player, Webex, VLC, a regional conferencing tool, a desktop game.
And for a learner, killing the original audio kills the job: they need to hear the source while reading the translation.

Pain frequency: 7–21/wk
Already spend on tools: $60–180/yr
Comprehension on native content: ~60%

02 · Solution

One OS-level layer. Any app.

A desktop app for Mac and Windows that captures audio from any running process and overlays a real-time translation — in under 700ms, the threshold where it reads as interpretation, not a delayed echo.

Dual-track (default): live captions + a quiet whisper-dub layered under the preserved original voice. The learner keeps the source for ear-training and gets the missing 40% filled in.
Auto-mute + dub (optional): for users who want hands-off comprehension over learning — original muted, AI voice on top.
One-click Anki / export: capture a line + its translation and mine it to flashcards — the learner's own study artifact.

Pick the app, pick the language once, toggle on. It feels like one button — that's the whole promise.

03 · Why Now

Two curves crossed in 2025.

One enables the experience. One opens a lane nobody else can take. We're honest about which is which.

Latency floor dropped under 700ms (shared). Streaming STT <300ms (Deepgram Nova-3), LLM-MT <200ms, streaming TTS <200ms TTFB (Cartesia Sonic). Real — but it crossed for everyone on the same APIs. This alone is not a moat.
OS-level audio capture unlocked (narrow, defensible). macOS 14 CoreAudio process taps and Windows 11 audio-session APIs let third-party apps capture per-process audio from native clients without kernel extensions — first stable across both OSes in 2024–25. A browser extension structurally cannot do this.
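The latency claim is plain addition, and it is worth seeing how tight it is. A minimal sketch, assuming the three vendor figures above are per-stage upper bounds that run back to back (the stage names and the `headroom_ms` helper are illustrative, not part of any vendor API):

```python
# Per-stage upper bounds in milliseconds, taken from the figures quoted above.
STAGE_BUDGET_MS = {
    "streaming_stt": 300,  # speech-to-text
    "llm_mt": 200,         # machine translation
    "streaming_tts": 200,  # time to first byte of synthesized audio
}
TARGET_MS = 700  # threshold where output reads as interpretation


def total_budget_ms(stages: dict[str, int]) -> int:
    """Sum per-stage upper bounds into an end-to-end upper bound."""
    return sum(stages.values())


def headroom_ms(stages: dict[str, int], target_ms: int = TARGET_MS) -> int:
    """Milliseconds left over for capture, network hops, and playback."""
    return target_ms - total_budget_ms(stages)


print(total_budget_ms(STAGE_BUDGET_MS))  # 700
print(headroom_ms(STAGE_BUDGET_MS))      # 0
```

At the quoted bounds the headroom is zero, so capture and network overhead have to come out of stages finishing faster than their ceilings. That is one reason per-app capture tuning matters.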

The honest read

The monetizable edge rests on inflection #2 alone. "Translate any app for everyone" rides a trend everyone rides. The defensible play is native-desktop capture — and what we build on top of it.

04 · Market

A $13M wedge inside a $72M TAM.

Bottom-up, not market-report inflation. We lead with the wedge we can actually win, not the ceiling.

TAM: $72M. 500k serious immersion learners × $144/yr
SAM: $13M. ~90k desktop, English-first, already-paying learners; the headline wedge
SOM Y3: $1.8M ARR. Bottom-up by channel, not % of SAM

The S2S translation segment is ~$481.6M (2025); AI-in-translation reaches $3.68B in 2026 at a 25.2% CAGR. A broad-consumer ceiling (5M heavy English-first users × $144) would be ~$720M, but that zone is already capped by Zoom's free feature and free browser extensions. We treat $720M as headroom, never as the base case.

Sources: Expert Market Research (S2S $481.6M), The Business Research Company (AI translation $3.68B / 25.2% CAGR).
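The bottom-up figures can be checked in a few lines. A minimal sketch, assuming the $144/yr price point is simply $12/mo × 12 (the function name is illustrative):

```python
ARPU_PER_YEAR_USD = 12 * 12  # $12/mo Pro plan, annualized to $144


def market_usd(users: int, arpu_per_year: int = ARPU_PER_YEAR_USD) -> int:
    """Bottom-up market size: paying users times annual revenue per user."""
    return users * arpu_per_year


tam = market_usd(500_000)        # serious immersion learners
sam = market_usd(90_000)         # desktop, English-first, already paying
ceiling = market_usd(5_000_000)  # broad-consumer headroom, not the base case

print(tam, sam, ceiling)  # 72000000 12960000 720000000
```

The SAM comes out at $12.96M, which the deck rounds to $13M.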

05 · Product

Pick the app. Toggle on.

The whole flow is one screen and one switch. Everything else runs in the background.

1 · Trigger. User opens a native app with foreign-language audio — a lecture player, Webex, a stream.
2 · Pick. Babelio tray → choose the app from the capture picker → confirm target language (remembered) → toggle ON.
3 · Translate. VAD detects speech → streaming STT→MT→TTS plays a quiet whisper under the original + a confidence-shaded caption overlay, all <700ms behind source.
4 · Review. Session transcript saved — aligned bilingual, jump-to-timestamp, export to Anki. Every correction feeds the eval set.
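The first stage of step 3 can be illustrated with a simple energy gate. This is a stand-in for explanation only, not the prototype's detector: the RMS threshold, the hangover length, and both function names are assumptions.

```python
import math


def rms(frame: list[float]) -> float:
    """Root-mean-square level of one audio frame (samples in [-1, 1])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_segments(frames, threshold=0.02, hangover=2):
    """Return (start, end) frame-index pairs, end exclusive, where speech is detected.

    A segment stays open through up to `hangover` quiet frames so that short
    pauses inside a sentence do not split it.
    """
    segments = []
    start = None  # index of the first frame of the open segment
    quiet = 0     # consecutive quiet frames seen inside the open segment
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover:
                # Close the segment at the last loud frame.
                segments.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:
        segments.append((start, len(frames) - quiet))
    return segments
```

Each detected segment would then be streamed to STT. Frames the gate wrongly opens on are the "VAD false-trigger rate" the telemetry section tracks.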

North Star: weekly native-desktop sessions translated end-to-end ≥ 4 per active user. Did they trust it enough to leave it running over a real lecture?

06 · Business Model

$12/mo, metered for margin.

Hybrid: flat $12/mo Pro base + metered translation-minutes via a reverse trial. Cheap subtitle minutes are the monetized core; dub minutes are metered to protect margin.

COGS subtitle / dub: $0.31 / $0.50 per active hr
Gross margin (Pro): 73%
LTV:CAC: ~3:1

COGS (source of truth, Cartesia + Deepgram, 55% duty cycle): ~$0.50/active dub-hr, ~$0.31/active subtitle-hr. STT is the dominant leg, not TTS — so falling model prices help us.
Unit economics: ARPU $12, blended COGS ~$3.26/Pro user → 73% GM. CAC $35, payback 4.0 months, LTV ~$120 at 7%/mo churn → LTV:CAC ~3:1 (capped — not the retracted 5.7:1).
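The unit economics reproduce from four inputs. A minimal sketch, assuming a margin-based LTV (monthly contribution divided by monthly churn); that formula is our assumption, and the deck's ~$120 is slightly more conservative than what it yields:

```python
def gross_margin(arpu: float, cogs: float) -> float:
    """Share of revenue left after per-user COGS."""
    return (arpu - cogs) / arpu


def payback_months(cac: float, arpu: float, cogs: float) -> float:
    """Months of contribution margin needed to recover acquisition cost."""
    return cac / (arpu - cogs)


def lifetime_value(arpu: float, cogs: float, monthly_churn: float) -> float:
    """Contribution margin over the expected lifetime (1 / churn months)."""
    return (arpu - cogs) / monthly_churn


ARPU, COGS, CAC, CHURN = 12.0, 3.26, 35.0, 0.07

gm = gross_margin(ARPU, COGS)           # about 0.73
payback = payback_months(CAC, ARPU, COGS)  # about 4.0 months
ltv = lifetime_value(ARPU, COGS, CHURN)    # about $125
ratio = ltv / CAC                          # about 3.6 before capping

print(f"GM {gm:.0%}, payback {payback:.1f} mo, LTV ${ltv:.0f}, LTV:CAC {ratio:.1f}")
```

The uncapped ratio lands a little above 3:1, consistent with quoting ~3:1 as the conservative figure.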

The #1 thing we're validating

The WTP ceiling ($6–12/mo, capped by free extensions) sits near the COGS floor for heavy dub usage. We resolve this by mode: subtitle is the cheap hero mode learners actually want. Van Westendorp pricing interviews plus a metered $5/hr concierge test on 30–50 buyers should confirm this before we lock tiers.
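The squeeze is easy to quantify: at a given price, how many active hours per month can a user consume before COGS eats the whole subscription? A minimal sketch using the per-hour costs above (the helper name is illustrative):

```python
COGS_PER_ACTIVE_HOUR_USD = {"subtitle": 0.31, "dub": 0.50}


def breakeven_hours(price_per_month: float, mode: str) -> float:
    """Active hours per month at which COGS equals the subscription price."""
    return price_per_month / COGS_PER_ACTIVE_HOUR_USD[mode]


for price in (6, 12):
    for mode in ("subtitle", "dub"):
        print(f"${price}/mo {mode}: {breakeven_hours(price, mode):.1f} active hr")
```

At the $6 floor a dub-heavy user reaches zero margin after 12 active hours a month, while subtitle mode stretches the same price to roughly 19. This is why dub minutes are the metered ones.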

07 · Traction

Prototype. Zero users — and we'll say so.

A working prototype exists. No users, no revenue, no LOIs yet. Everything past that is a validation plan, not a claim. We'd rather be honest than inflate.

Week 1 — Wedge & WTP. 15–20 Mom-Test interviews with immersion learners from r/LearnJapanese, Refold / MoeWay / Migaku. Kill if <30% recall a specific painful live-content incident in 30 days.
Week 2 — Concierge WTP. Hand-deliver subtitles to 5–10 learners, charge metered $5/hr. Output: 5 paying / LOI users, 3 testimonials. Kill if nobody pays a rate that clears COGS.
Week 3 — Thin product + loop. Ship narrowest native-capture flow on one OS, instrument the sentence-mining-card activation event + loop factor K. Kill if <40% activate even hand-held.
Week 4 — Channel test. Time the launch to a recurring immersion-challenge kickoff. Measure CAC, install-completion, K, GM/user. Go / no-go with real numbers.

PMF status: PENDING. We claim PMF only at ≥40% "very disappointed" (Sean Ellis) across ≥30 respondents. Not before.
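The PMF gate is mechanical enough to write down, so nobody can argue with it later. A minimal sketch; the function name and integer-percent comparison are our choices, while the thresholds are the ones stated above:

```python
def pmf_reached(
    very_disappointed: int,
    respondents: int,
    min_share_pct: int = 40,
    min_respondents: int = 30,
) -> bool:
    """Sean Ellis gate: enough respondents, and enough of them 'very disappointed'."""
    if respondents < min_respondents:
        return False  # sample too small to claim anything
    # Integer comparison avoids floating-point edge cases at exactly 40%.
    return very_disappointed * 100 >= min_share_pct * respondents


print(pmf_reached(12, 30))  # True: exactly 40% of 30
print(pmf_reached(11, 30))  # False: below 40%
print(pmf_reached(10, 20))  # False: 50%, but the sample is under 30
```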

08 · Competition

Everyone owns a box. Nobody owns ours.

Axes: app coverage (single-platform ↔ OS-wide) × output (captions ↔ voice dub). Our only defensible cell is native-installed-app voice dub with original audio preserved.

Player | Reach | Native apps | Live audio | Keeps original
Zoom AI Voice Translator | Zoom only, 5 langs | ✕ | ✓ | ✕
DubTab / Whisperr | Browser tabs, 60+ langs | ✕ | ✓ | ✕
Teams + DeepL Voice | Teams only | ✕ | ✓ | ✕
Free captions (do-nothing) | Per-platform | ✕ | ~ | ✓
Babelio | OS-wide, native | ✓ | ✓ | ✓

Big Tech will ship in-product translation — but an OS-wide cross-vendor layer cannibalizes nobody's core, so no single platform owner is incentivized to build it. The open lane survives precisely because it sits between the walled gardens.

09 · Moat

The capture is a head start. The telemetry is the moat.

We're blunt: the audio tap uses public CoreAudio / WASAPI APIs — a funded competitor replicates it in a quarter. It buys time to collect the one asset that compounds.

Per-app capture + eval telemetry flywheel — instrumented from session one. Which app, codec, packet jitter, VAD false-trigger rate, mute/unmute timing, every user correction. GPT-N never sees it. Over thousands of sessions you learn the optimal capture profile per application — a new entrant can't re-derive it without the same install base.
Switching costs, then network effects — per-app profiles + persistent voice/glossary settings lock the workflow; multi-party team dubbing makes the product more valuable as colleagues join. Buildable, not present at MVP.
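To make the flywheel concrete, here is one way a per-session record and a per-app rollup could look. This is a sketch under assumptions: the field names, units, and the choice of mean VAD false-trigger rate as the rollup are illustrative, not the prototype's schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class SessionTelemetry:
    """One translated session, reduced to the signals named above."""
    app: str                       # captured application, e.g. "Webex"
    codec: str                     # audio codec observed on the tap
    packet_jitter_ms: float
    vad_false_trigger_rate: float  # false triggers / total triggers
    user_corrections: int = 0


def mean_false_trigger_by_app(sessions: list[SessionTelemetry]) -> dict[str, float]:
    """Average VAD false-trigger rate per application."""
    rates_by_app: dict[str, list[float]] = defaultdict(list)
    for session in sessions:
        rates_by_app[session.app].append(session.vad_false_trigger_rate)
    return {app: mean(rates) for app, rates in rates_by_app.items()}


sessions = [
    SessionTelemetry("Webex", "opus", 12.0, 0.25, user_corrections=3),
    SessionTelemetry("Webex", "opus", 18.0, 0.75),
    SessionTelemetry("VLC", "aac", 2.0, 0.125),
]
print(mean_false_trigger_by_app(sessions))  # {'Webex': 0.5, 'VLC': 0.125}
```

A per-app capture profile would be tuned against rollups like this one, which is why the asset only accumulates with an install base.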

The $10B headline thesis

The $10B outcome is not the consumer app — it's becoming the cross-application real-time-dub primitive that other software embeds. The telemetry → an eval-proven "best dub per app / codec / network" capability → packaged as a virtual-mic / SDK that conferencing, accessibility, e-learning, and contact-center vendors license rather than rebuild.

When the model is free, value accrues below it — at the capture + eval layer. That's where we sell.

10 · Go-to-Market

Seed the communities that already mine sentences.

PLG self-serve + founder-led community seeding for the first 100. No outbound sales at $144/yr. The buyer is a single learner who already funds their own immersion stack.

Primary (80%) — community-led. r/LearnJapanese (~700k), r/Korean, Refold / TheMoeWay / Migaku Discords. Dense, named, calendar-timed (immersion-challenge kickoffs are recurring trigger events). CAC $15–35 [hypothesis].
Secondary (20%) — SEO / content. High-intent long-tail extensions can't serve: "watch Japanese stream with subtitles desktop", "sentence mining live VTuber stream."
Growth loop — legally-clean. The shareable artifact is the user's own bilingual sentence-mining card (their annotation, never the source media), posted where mining screenshots are already native behavior. Loop factor K ≥ 0.3 to prove; K < 0.15 → fall back to paid CAC.
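The loop-factor gate can be stated as a decision rule. A minimal sketch, assuming K is measured as new users attributed to shared cards divided by the users in the sharing cohort. The deck only names the two thresholds; the "inconclusive" label for the band between them is ours.

```python
def loop_factor(new_users_from_shares: int, cohort_size: int) -> float:
    """K: new users each existing user brings in through shared cards."""
    return new_users_from_shares / cohort_size


def growth_decision(k: float) -> str:
    """Map a measured K onto the thresholds stated above."""
    if k >= 0.3:
        return "loop proven"
    if k < 0.15:
        return "fall back to paid CAC"
    return "inconclusive: keep testing"


k = loop_factor(new_users_from_shares=30, cohort_size=100)
print(k, growth_decision(k))  # 0.3 loop proven
```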

Install friction (Gatekeeper / antivirus) is the #3 risk — mitigated by notarized signed builds, a first-run trust playbook, and seeding via a trusted community member, not a cold drop.

11 · Team

The honest open gap. One hire decides everything.

We won't pretend otherwise: the entire moat and the 10-week MVP hinge on one rare profile — a Rust + CoreAudio / WASAPI engineer who can ship per-process native audio capture across both OSes. That person is not yet confirmed.

What we're solving with this round. Confirm the OS-internals co-founder / first hire and record verifiable founder background in audio / OS-internals + distribution audience.
Why we surface it. A pitch that hides its weakest link wastes everyone's time. If the capture engineer doesn't commit, the timeline and the moat slip — so we treat the hire as the first milestone of the round, not an afterthought.

Founder decisions in flight

Funding path (raise vs bootstrap to ~$10K MRR) and the audio-engineer commitment are open. The economics support a conditional bootstrap — Cartesia retail COGS is viable from day one, breakeven ~460 paying users — so the raise is to compress time-to-moat, not to survive.

12 · The Ask

Raise to compress time-to-moat.

Pre-seed · 15–18 months runway

$1.2–1.5M pre-seed, to reach ~$400–500K ARR plus a defensible native-desktop-capture wedge with a live telemetry flywheel

Engineering (55%). The Rust + CoreAudio/WASAPI capture hire + founder. Ships the native pipeline, telemetry flywheel, eval harness.
Validation & GTM (25%). 4-week field plan, community seeding, immersion-YouTuber integrations, the loop-K test.
Infra & COGS buffer (20%). Cartesia / Deepgram inference, signing & notarization, billing, the on-device-STT margin lever.
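In dollars, the 55 / 25 / 20 split looks like this at both ends of the raise. A minimal sketch; the function name is illustrative and the percentages are the ones stated for the use of funds:

```python
ALLOCATION_PCT = {
    "Engineering": 55,
    "Validation & GTM": 25,
    "Infra & COGS buffer": 20,
}


def allocate(raise_usd: int) -> dict[str, int]:
    """Split a raise across buckets using whole-percent shares."""
    return {bucket: raise_usd * pct // 100 for bucket, pct in ALLOCATION_PCT.items()}


low, high = allocate(1_200_000), allocate(1_500_000)
print(low)   # {'Engineering': 660000, 'Validation & GTM': 300000, 'Infra & COGS buffer': 240000}
print(high)  # {'Engineering': 825000, 'Validation & GTM': 375000, 'Infra & COGS buffer': 300000}
```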

Milestones: capture engineer committed (week 0) → 4-week validation passed with real loop-K and CAC (month 1) → monetizable MVP shipped at ≤$0.50/active-hr (month 4) → ~$10K MRR (month 9–10) → the telemetry asset that underwrites the B2B/SDK Series A.

founders@babelio.app
