Turn a PDF into a narrated MP3 via Replicate Inworld TTS — chunk, cache, concat.

hermes 3e51aff1cb Initial MVP commit		2026-05-14 03:07:12 -04:00
src/pdf2narration	Initial MVP commit	2026-05-14 03:07:12 -04:00
tests	Initial MVP commit	2026-05-14 03:07:12 -04:00
.gitignore	Initial MVP commit	2026-05-14 03:07:12 -04:00
LICENSE	Initial MVP commit	2026-05-14 03:07:12 -04:00
pyproject.toml	Initial MVP commit	2026-05-14 03:07:12 -04:00
README.md	Initial MVP commit	2026-05-14 03:07:12 -04:00
setup.py	Initial MVP commit	2026-05-14 03:07:12 -04:00

README.md

pdf2narration

Turn any PDF into a narrated MP3 in one command — chunk → Replicate Inworld TTS → ffmpeg concat. Built for skim-by-ear research workflows: papers on a walk, reports on the bus.

pdf2narration paper.pdf --pages 1-3 --voice Ashley
# → paper.mp3

Why this project

A May 14 chat thread tried to turn /Users/lucataco/Documents/PDFs/1.pdf into a narrated Manim video via the manim-video skill, but it stalled at PDF discovery because the file had not synced from iCloud. That request surfaced a smaller, more reusable primitive: PDF → narrated audio. No animations, no Manim, no rendering pipeline: just clean text, Replicate Inworld TTS, and ffmpeg.

This is the audio-only narrator that the Manim pipeline already wraps internally — extracted, generalized, and shipped as a standalone CLI so the next time the user wants to "listen to this PDF" it's one command, not a video render.

This ties into the broader research/tooling-leverage thread. It reuses the Inworld Realtime TTS 1.5 Max pipeline (pinned version, ~/.env token lookup, MP3 output, on-disk cache) that the user already relies on for Manim explainer videos (BirdCLEF, Nemotron), now applied to arXiv preprints, papers, specs, and reports.

Features

PDF → cleaned text: pypdf extraction with dehyphenation, paragraph preservation, page-number stripping, common acronym expansion (LLM/RLHF/GPU/etc.) for clearer TTS.
Smart chunking: paragraph-first, sentence-fallback splitting around an 800-char budget so each TTS call stays in the model's sweet spot.
Replicate Inworld TTS: pinned to the same inworld/realtime-tts-1.5-max version used by the manim-video skill.
On-disk cache: identical chunks reuse cached MP3s — rerun cheaply.
ffmpeg concat: clean stream-copy join of per-chunk MP3s, no re-encode.
CLI flags: --pages, --voice, --rate, --temperature, --dry-run, --text-only.
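
The paragraph-first, sentence-fallback chunking described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in extract.py: the names `chunk_text`, `_pack`, and `MAX_CHARS` are hypothetical, and the sentence splitter is a simple regex that will mis-split abbreviations such as "e.g.".

```python
import re
from typing import List

# Hypothetical budget mirroring the ~800-character target named in the feature list.
MAX_CHARS = 800

# Naive sentence boundary: whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def _pack(pieces: List[str], budget: int, sep: str) -> List[str]:
    """Greedily join consecutive pieces while the joined text fits the budget."""
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= budget:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than the budget is kept whole, not cut mid-word.
            current = piece
    if current:
        chunks.append(current)
    return chunks


def chunk_text(text: str, budget: int = MAX_CHARS) -> List[str]:
    """Split on blank-line paragraphs first; split long paragraphs by sentence."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    pieces: List[str] = []
    for para in paragraphs:
        if len(para) <= budget:
            pieces.append(para)
        else:
            pieces.extend(_pack(_SENTENCE_END.split(para), budget, " "))
    # Merge short neighbours so each TTS call gets close to the budget.
    return _pack(pieces, budget, "\n\n")
```

Short paragraphs are merged up to the budget, so a document made of many one-line paragraphs does not turn into many tiny TTS calls. A sentence longer than the budget passes through unsplit in this sketch.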

Install

pip install -e .

Requires Python 3.9+ and ffmpeg on PATH.
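
Because ffmpeg is an external dependency, a concat helper would likely check for it before doing any work. The sketch below shows one way a stream-copy join could be driven from Python using ffmpeg's concat demuxer. The function names are hypothetical and audio.py may be organized differently.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import List, Sequence


def concat_list_text(parts: Sequence[Path]) -> str:
    """Build the concat demuxer input list: one `file '...'` line per part."""
    lines = []
    for part in parts:
        # Inside single quotes, a literal ' is written as '\'' in the list file.
        escaped = str(part).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_command(list_file: Path, output: Path) -> List[str]:
    """Return an ffmpeg argv that joins the listed files without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", str(list_file), "-c", "copy", str(output)]


def concat_mp3s(parts: Sequence[Path], output: Path) -> None:
    """Join per-chunk MP3s into one file; fails early if ffmpeg is missing."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        list_file = Path(tmp) / "parts.txt"
        # Absolute paths avoid resolution relative to the list file's directory.
        absolute = [p.resolve() for p in parts]
        list_file.write_text(concat_list_text(absolute), encoding="utf-8")
        subprocess.run(concat_command(list_file, output), check=True)
```

`-c copy` is what makes the join a stream copy, and `-safe 0` lets the list file reference absolute paths.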

Setup

Export your Replicate token, or put it in ~/.env:

export REPLICATE_API_TOKEN=r8_...
# or:
echo 'REPLICATE_API_TOKEN=r8_...' >> ~/.env
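
The lookup order implied above (environment variable first, then ~/.env) might look like the following sketch. `find_token` and `parse_env_text` are hypothetical names, and the parser assumes simple `KEY=value` lines with optional quotes and an optional `export` prefix. It is not a full dotenv implementation.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def parse_env_text(text: str, key: str) -> Optional[str]:
    """Return the value for `key` from dotenv-style text, or None if absent."""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            # Drop surrounding single or double quotes, if any.
            return value.strip().strip("'\"")
    return None


def find_token(environ: Optional[Mapping[str, str]] = None,
               env_file: Optional[Path] = None,
               key: str = "REPLICATE_API_TOKEN") -> Optional[str]:
    """Prefer the process environment, then fall back to the env file."""
    environ = os.environ if environ is None else environ
    env_file = Path.home() / ".env" if env_file is None else env_file
    token = environ.get(key)
    if token:
        return token
    if env_file.is_file():
        return parse_env_text(env_file.read_text(encoding="utf-8"), key)
    return None
```

Passing `environ` and `env_file` explicitly keeps the function easy to test without touching the real home directory.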

Usage

# Narrate the first 3 pages with the default Ashley voice
pdf2narration paper.pdf --pages 1-3

# Different voice + custom output path
pdf2narration paper.pdf --voice Dennis -o ~/Audiobooks/paper.mp3

# Preview chunks without spending TTS credits
pdf2narration paper.pdf --pages 1-3 --dry-run

# Just dump the cleaned text (pipe-friendly)
pdf2narration paper.pdf --text-only > paper.txt
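
For readers curious how a value like `--pages 1-3` could map onto page indices, here is a minimal sketch. It assumes 1-based inclusive ranges. Comma-separated lists and clamping to the document length are assumptions of this example rather than documented behaviour, and `parse_pages` is a hypothetical name.

```python
from typing import List


def parse_pages(spec: str, total_pages: int) -> List[int]:
    """Turn a 1-based spec such as "1-3" into sorted zero-based page indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, dash, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if dash else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Assumption: ranges past the end are clamped instead of rejected.
        for page in range(start, min(end, total_pages) + 1):
            indices.add(page - 1)
    return sorted(indices)
```

The zero-based result can be used directly to index a page sequence such as the `pages` list of a pypdf `PdfReader`.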

Suggested voices

Ashley (default), Dennis (good for tech narration), Alex. See the Replicate model page for the full list.

Project layout

src/pdf2narration/
  cli.py        argparse entrypoint
  extract.py    PDF text + cleanup + chunking
  tts.py        Replicate Inworld client + cache
  audio.py      ffmpeg concat helper
tests/
  test_extract.py
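
As an illustration of the cache half of tts.py, the sketch below derives a content-addressed filename for each chunk. It assumes the key covers the chunk text plus every setting that changes the audio (voice, rate, temperature, model version). The actual key layout in this project is not documented here, so the function names and fields are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def cache_key(text: str, voice: str, rate: float,
              temperature: float, model_version: str) -> str:
    """Hash the chunk text together with the settings that affect the audio."""
    payload = json.dumps(
        {
            "text": text,
            "voice": voice,
            "rate": rate,
            "temperature": temperature,
            "model_version": model_version,
        },
        sort_keys=True,       # stable field order gives a stable hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_path(cache_dir: Path, key: str) -> Path:
    """Location of the MP3 for a given key; callers check .exists() before TTS."""
    return cache_dir / f"{key}.mp3"
```

Including the voice and model version in the key means a rerun with a different voice cannot pick up stale audio, while an unchanged chunk still hits the cache.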

License

MIT