Turn PDFs and EPUBs into spoken audio from a single local web app. Pick a page range, choose a TTS engine, preview what is available, and export either one MP3 or a ZIP of chapter files.
This repo is optimized for practical use, not as a framework. It gives you one place to:
- convert books from `input/` or direct uploads
- switch between local and cloud speech engines
- clone voices with reference audio
- save outputs back into the repo for repeatable runs
- come back later and still remember how the whole thing works
Most TTS tools make you choose between flexibility and convenience. This app keeps both:
- a clean browser UI instead of one-off scripts
- multiple backends instead of one engine lock-in
- chapter-aware conversion instead of only full-book output
- local-first options when you want privacy or offline experimentation
- cloud backends when you want convenience or specific voices
| Capability | Details |
|---|---|
| Input formats | PDF and EPUB |
| Source options | Upload from your machine or pick files from input/ |
| Output formats | Single MP3 or chapter ZIP |
| Save mode | Download immediately or save into output/ |
| Voice cloning | Upload reference audio and synthesize with XTTS |
| Chapter mode | Works for both PDF and EPUB |
| Progress tracking | Browser polls server-side job progress during conversion |
| Language filtering | UI can filter backends by English / Romanian |
| Backend | Type | Languages | Best For | First Run / Dependency Notes |
|---|---|---|---|---|
| Kokoro | Local | English | Best default local quality-to-speed balance | Requires espeak-ng; downloads model on first use |
| Piper | Local | English | Lightweight local ONNX voices and tunable pacing | Needs a .onnx model plus sidecar JSON in models/ |
| Supertonic | Local | English | ONNX-based local synthesis with simple controls | Downloads about 305MB on first use |
| Hugging Face | Local | English, Romanian | Trying HF TTS models locally | Downloads model weights on first use |
| SpeechT5 | Local | English | Multi-speaker preset voices | Downloads model, vocoder, and speaker embeddings |
| XTTS-v2 | Local | English + multilingual | Voice cloning with a reference sample | Large model download, reference audio required |
| XTTS-v2 Romanian | Local | Romanian | Romanian voice cloning | Large Romanian fine-tune download, reference audio required |
| Amazon Polly | Cloud | English in current UI | Reliable cloud voices and billing visibility | Requires valid AWS credentials |
| HF Inference API | Cloud | English, Romanian | Quick cloud inference experiments | Requires HF_TOKEN; free tier can rate-limit |
The shortest clean setup is two commands:
```shell
./scripts/bootstrap.sh
uv run python app.py
```

Then open http://localhost:1234.
If you want every optional backend as well:
```shell
./scripts/bootstrap.sh --all
uv run python app.py
```

`./scripts/bootstrap.sh`:
- installs `uv` if missing
- installs `ffmpeg` and `espeak-ng`
- installs Python 3.12 via `uv`
- runs `uv sync --frozen --dev`
- creates `.env` from `.env.example` if needed
It currently supports:
- macOS with Homebrew
- Ubuntu / Debian-style systems with `apt-get`
If you prefer not to use the bootstrap script, the equivalent manual flow is below.
macOS or Ubuntu:
```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
```

If you already use Homebrew on macOS:

```shell
brew install uv
```

This repo is set up for Python 3.12 because several TTS backends are still fragile on newer interpreters.

```shell
uv python install 3.12
```

The repo includes `.python-version`, so uv will use Python 3.12 automatically.
This project needs a couple of tools that are outside Python dependency management.
| Tool | Required? | Why |
|---|---|---|
| `ffmpeg` | Yes | pydub uses it for MP3 import/export and audio format conversion |
| `espeak-ng` | Required for Kokoro | Kokoro depends on it for phonemization |
| `piper` CLI | Optional | Only needed if Piper falls back from the Python API to the CLI binary |
pydub relies on ffmpeg for MP3 import/export, and Kokoro relies on espeak-ng.
macOS:
```shell
brew install ffmpeg espeak-ng
```

If you want the optional Piper CLI available as a fallback:

```shell
brew install piper
```

Ubuntu / Debian:

```shell
sudo apt-get update
sudo apt-get install -y ffmpeg espeak-ng
```

If you want the optional Piper CLI available as a fallback:

```shell
sudo apt-get install -y piper
```

Quick sanity check:

```shell
ffmpeg -version
espeak-ng --version
```

For the default app experience, including the web app, Kokoro, Polly support, and the fast test suite:

```shell
uv sync --frozen --dev
```

If you want every optional backend available locally:
```shell
uv sync --frozen --dev --extra all
```

| Extra | Adds |
|---|---|
| `piper` | Piper Python backend support |
| `supertonic` | Supertonic local ONNX backend |
| `huggingface` | Local Transformers TTS backend |
| `xtts` | XTTS / XTTS-RO voice cloning backends |
| `speecht5` | SpeechT5 multi-speaker backend |
| `ocr` | EasyOCR fallback for scanned PDFs |
| `all` | All optional backends above |
Examples:
```shell
uv sync --frozen --dev --extra piper --extra supertonic
uv sync --frozen --dev --extra xtts --extra ocr
```

```shell
cp .env.example .env
```

You do not need to fill everything in up front. For a first successful run, the minimum usually is:
- leave the default `TTS_BACKEND=kokoro`
- keep Kokoro defaults as-is
- skip cloud credentials until you want Polly or HF cloud

```shell
uv run python app.py
```

Open:

http://localhost:1234
- Open the Convert page.
- Upload a PDF or EPUB, or place one inside `input/` and pick it from the dropdown.
- Keep Kokoro selected for the simplest local path.
- Leave page range as `all` or choose a range such as `1-10`.
- Click Convert to MP3.
```shell
./scripts/bootstrap.sh
./scripts/bootstrap.sh --all
uv sync --frozen --dev
uv sync --frozen --dev --extra all
uv lock
uv sync --frozen --dev
uv run python app.py
uv run pytest tests/test_smoke.py -q
```

Use `uv lock` only when you intentionally want to update the pinned dependency set.
The full configuration surface lives in `.env.example`.
Recommended workflow:
```shell
cp .env.example .env
```

Then edit only the values you care about.
| Setting | Why It Matters |
|---|---|
| `TTS_BACKEND` | Sets the default backend shown on page load |
| `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` / `AWS_REGION` | Required for Polly readiness and conversion |
| `PIPER_MODEL` | Points to the default Piper voice model |
| `HF_TOKEN` | Required for HF cloud inference |
| `KOKORO_VOICE` / `KOKORO_SPEED` | Sets the default local voice and pace |
| `SUPERTONIC_*` | Sets default Supertonic voice, language, speed, and silence |
- If a variable is not in `.env.example`, the app does not currently rely on it.
- `.env.example` uses safe placeholders and code defaults only.
- Keep real secrets in `.env` only.
- The app loads `.env` automatically through `python-dotenv`.
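For a first local run, a minimal `.env` might look like this. The voice name shown is a placeholder, not a documented default; only uncomment the cloud credentials when you actually want Polly or HF cloud:

```ini
# Default backend shown on page load
TTS_BACKEND=kokoro

# Local Kokoro defaults (voice name is a placeholder example)
KOKORO_VOICE=af_heart
KOKORO_SPEED=1.0

# Cloud credentials, only needed for Polly / HF cloud
# AWS_ACCESS_KEY_ID=...
# AWS_SECRET_ACCESS_KEY=...
# AWS_REGION=us-east-1
# HF_TOKEN=...
```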
The main workflow is a three-step wizard:
1. Source
   - Upload a `.pdf` or `.epub`
   - Or pick a file already present in `input/`
2. Pages
   - Use `all`, `5`, or `1-10`
   - The app fetches metadata and tries to detect chapters
   - If there are blank front/back pages, the UI suggests a likely text range
3. Engine & Voice
   - Pick a backend
   - The UI loads voice lists dynamically where needed
   - The status badge shows whether a backend is ready or still needs setup/downloads
If chapters are detected, you can enable Split into chapter files.
Important behavior:
- chapter mode converts the whole document by detected chapters
- the result is a ZIP archive of MP3 files
- this works for both PDF and EPUB
- the typed page range is not the controlling unit when chapter mode is on
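The final pack-into-ZIP step can be sketched roughly like this, assuming chapters arrive as (filename, MP3 bytes) pairs from a per-chapter synthesis loop; the helper name is hypothetical and the app's internals may differ:

```python
import io
import zipfile

def zip_chapters(chapters):
    """Pack per-chapter MP3 buffers into a single in-memory ZIP.

    `chapters` is a list of (filename, mp3_bytes) pairs, e.g. the
    output of a per-chapter synthesis loop.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, audio in chapters:
            zf.writestr(name, audio)
    buf.seek(0)
    return buf
```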
Enable Save to output/ folder if you want the results kept in the repo instead of downloaded directly by the browser.
Use the Voice Cloning page to upload short reference audio:
- WAV, MP3, M4A, FLAC, OGG
- max 10MB
- 3 to 10 seconds is ideal
Those files are stored in reference_audio/ and become selectable when using XTTS backends.
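A minimal sketch of the upload checks implied by these limits (hypothetical helper name; the app's actual validation may do more, e.g. checking audio duration):

```python
from pathlib import Path

# Limits from the documented upload rules above.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}
MAX_BYTES = 10 * 1024 * 1024  # 10MB cap

def is_valid_reference(filename, size_bytes):
    """Check an uploaded reference clip against the documented limits."""
    ext = Path(filename).suffix.lower()
    return ext in ALLOWED_EXTENSIONS and 0 < size_bytes <= MAX_BYTES
```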
Supported inputs:
- PDF with extractable text
- EPUB with a valid content spine
- reference audio for XTTS voice cloning

Outputs:
- single MP3
- ZIP of per-chapter MP3 files
- optional saved artifacts under `output/`
Generated filenames include:
- UTC timestamp
- source book name
- page or chapter label
- backend
- voice/model detail
- relevant prosody settings when applicable
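One way those components could be assembled; the ordering and separators here are illustrative, not the app's actual filename format:

```python
from datetime import datetime, timezone

def build_output_name(book, label, backend, voice, ext="mp3"):
    """Compose an output filename from the documented components.

    Joins a UTC timestamp, book name, page/chapter label, backend,
    and voice with underscores, skipping any empty parts.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    parts = [stamp, book, label, backend, voice]
    return "_".join(p for p in parts if p) + "." + ext
```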
```
.
├── app.py
├── pyproject.toml
├── .python-version
├── scripts/
│   └── bootstrap.sh
├── templates/
│   └── index.html
├── input/
├── output/
├── reference_audio/
├── tests/
├── models/
├── .env.example
└── uv.lock
```
What each piece does:
- `app.py`: the full Flask app, document extraction pipeline, backend dispatch, and routes
- `pyproject.toml`: dependency definitions, optional backend extras, and pytest config for `uv`
- `.python-version`: pins the interpreter version used by `uv`
- `scripts/bootstrap.sh`: one-command local setup for supported macOS and Ubuntu/Debian environments
- `templates/index.html`: the UI, styling, and browser-side JavaScript in one template
- `input/`: books you want selectable from the dropdown
- `output/`: saved MP3s and chapter ZIP directories
- `reference_audio/`: uploaded voice-cloning samples
- `models/`: Piper `.onnx` models and their `.json` sidecars
- `tests/`: fast regression coverage plus focused route and extraction tests
- `uv.lock`: fully resolved dependency lockfile generated by `uv`
The app is intentionally simple:
- Flask server
  - serves the UI
  - accepts uploads and convert requests
  - exposes helper APIs for backend status, chapter metadata, and saved reference voices
- Inline frontend
  - HTML, CSS, and JavaScript live in `templates/index.html`
  - dynamic UI behavior uses `fetch`
  - progress is polled from `/progress/<job_id>`
- Document processing
  - PDFs are read with PyMuPDF / `fitz`
  - EPUBs are parsed from the content spine
  - extracted text is cleaned before TTS
- Speech synthesis
  - backend-specific wrappers normalize output into MP3 buffers
  - heavy models are cached lazily in process
  - chapter mode writes one file per chapter, then zips when needed
- Local cache management
  - Hugging Face and Supertonic caches are redirected into a project-local `.cache/`
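The cache redirection can be sketched like this. `HF_HOME` is the standard Hugging Face cache variable; the redirect has to happen before the HF libraries are imported, and the Supertonic equivalent is omitted here since its variable name is backend-specific:

```python
import os
from pathlib import Path

# Project-local cache root so model downloads stay inside the repo.
CACHE_DIR = Path.cwd() / ".cache"

# Must be set before transformers / huggingface_hub are imported,
# otherwise the default ~/.cache/huggingface location is used.
os.environ["HF_HOME"] = str(CACHE_DIR / "huggingface")
```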
| Route | Purpose |
|---|---|
| `/` | Main app UI |
| `/convert` | Starts a conversion job and returns MP3, ZIP, or saved-result JSON |
| `/progress/<job_id>` | Current extraction / synthesis progress |
| `/api/backend-status` | Readiness notes for local and cloud backends |
| `/api/pdf-info` | Page and chapter metadata for PDFs and EPUBs |
| `/api/reference-voices` | Lists uploaded XTTS reference samples |
| `/api/upload-reference` | Uploads a new reference audio file |
Before synthesis, the app tries to make extracted book text sound better:
- joins hard-wrapped lines
- expands common abbreviations
- converts some numbers to words
- normalizes ligatures and punctuation
- strips URLs and citation-style brackets
- preserves normal bracketed prose such as `[aside]`
- detects and strips likely running headers in PDFs
This is a practical cleanup layer, not a perfect document normalizer.
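A minimal sketch of a few of these steps; the regexes here are illustrative, not the app's actual pipeline:

```python
import re

# Common PDF ligatures mapped back to plain letters.
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl", "\ufb00": "ff"}

def clean_text(text):
    """Apply a small subset of the cleanup steps described above."""
    for lig, plain in LIGATURES.items():
        text = text.replace(lig, plain)
    # Join hard-wrapped lines (single newlines inside a paragraph).
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Strip bare URLs and numeric citation brackets like [12],
    # while leaving prose brackets like [aside] alone.
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"\[\d+\]", "", text)
    return re.sub(r" {2,}", " ", text).strip()
```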
Run the fast regression suite:
```shell
uv run pytest tests/test_smoke.py -q
```

The current tests focus on:
- non-destructive text preprocessing
- EPUB chapter support
- Piper CLI parameter propagation
- Supertonic pause handling
- Polly readiness status behavior
- utility and route-level regressions
Check:
- `espeak-ng` is installed and available on `PATH`
- the backend status badge is not showing a missing dependency problem
Check:
- `ffmpeg` is installed
- your shell can run `ffmpeg -version`
Check:
- `uv sync --extra piper` has been run if you want the Python Piper backend
- if the app falls back to the CLI, `piper` is installed and available on `PATH`
- your shell can run `piper --help`
Check:
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_REGION`
- outbound access to AWS STS / Polly
The app validates Polly readiness with a lightweight AWS call, so auth or network problems show up before conversion starts.
Check:
- the configured `.onnx` file exists
- the sidecar `.json` exists next to it
- `PIPER_BINARY` points to a real binary if the Python API falls back to CLI
- your CLI supports `--noise-scale` and `--noise-w`
Check:
- the EPUB has a valid content spine
- the relevant spine items contain real text, not only images or decorative markup
That usually means the PDF contains scanned page images rather than embedded text. Without the optional `ocr` extra (EasyOCR), the app does not perform OCR.
Check:
- `HF_TOKEN` is set
- the model ID exists
- you are not hitting free-tier throttling
Check:
- the audio file was uploaded on the Voice Cloning page
- it appears in the saved reference voice list
- the selected filename still exists in `reference_audio/`
Several backends download large assets the first time you use them:
- Kokoro: model download
- Supertonic: about 305MB
- XTTS / XTTS-RO: very large downloads, on the order of 1.8GB
- SpeechT5: model + vocoder + embeddings
- Hugging Face local backends: model dependent
If a backend is slow the first time, that is expected.
Model caches are redirected into a repo-local `.cache/` directory so experiments stay self-contained.
For Polly conversions, the app tracks billed character counts and estimated cost from backend pricing constants. This is useful for quick budgeting, but it is not a billing statement.
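The estimate reduces to a simple multiplication over billed characters. The pricing constant below is illustrative only; check current Polly pricing rather than relying on it:

```python
# Illustrative pricing constant (assumption) — not the app's actual
# value; verify against current AWS Polly pricing.
PRICE_PER_MILLION_CHARS = 4.00  # USD per million billed characters

def estimate_polly_cost(billed_chars):
    """Rough cost estimate from a billed character count."""
    return billed_chars / 1_000_000 * PRICE_PER_MILLION_CHARS
```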
- No OCR for scanned PDFs
- Frontend and styling live in a single template file by design
- Backends are powerful but heavy; some require significant first-run downloads
- Chapter detection is heuristic for many PDFs
- Romanian support is narrower than English support
If you want a practical starting point:
- use Kokoro for local English
- use XTTS-RO or HF Romanian models for Romanian
- use chapter mode for long books where navigation matters
- save important runs into `output/`
- keep a few short, clean reference clips in `reference_audio/`
Do not commit your real `.env`.
Keep:
- AWS credentials
- Hugging Face tokens
- any other private values
only in local, untracked environment files.
This project is licensed under the GNU Affero General Public License v3.0.
That means if you modify it and make the modified version available to users over a network, you must also make the corresponding source available under the same license.
See LICENSE for the full text.

