Book to Audiobook Converter

Turn PDFs and EPUBs into spoken audio from a single local web app. Pick a page range, choose a TTS engine, preview what is available, and export either one MP3 or a ZIP of chapter files.

This repo is optimized for practical use, not as a framework. It gives you one place to:

convert books from input/ or direct uploads
switch between local and cloud speech engines
clone voices with reference audio
save outputs back into the repo for repeatable runs
come back later and still remember how the whole thing works

Why This Exists

Most TTS tools make you choose between flexibility and convenience. This app keeps both:

a clean browser UI instead of one-off scripts
multiple backends instead of one engine lock-in
chapter-aware conversion instead of only full-book output
local-first options when you want privacy or offline experimentation
cloud backends when you want convenience or specific voices

What You Can Do

Capability	Details
Input formats	PDF and EPUB
Source options	Upload from your machine or pick files from `input/`
Output formats	Single MP3 or chapter ZIP
Save mode	Download immediately or save into `output/`
Voice cloning	Upload reference audio and synthesize with XTTS
Chapter mode	Works for both PDF and EPUB
Progress tracking	Browser polls server-side job progress during conversion
Language filtering	UI can filter backends by English / Romanian

Backend Matrix

Backend	Type	Languages	Best For	First Run / Dependency Notes
Kokoro	Local	English	Best default local quality-to-speed balance	Requires `espeak-ng`; downloads model on first use
Piper	Local	English	Lightweight local ONNX voices and tunable pacing	Needs a `.onnx` model plus sidecar JSON in `models/`
Supertonic	Local	English	ONNX-based local synthesis with simple controls	Downloads about 305MB on first use
Hugging Face	Local	English, Romanian	Trying HF TTS models locally	Downloads model weights on first use
SpeechT5	Local	English	Multi-speaker preset voices	Downloads model, vocoder, and speaker embeddings
XTTS-v2	Local	English + multilingual	Voice cloning with a reference sample	Large model download, reference audio required
XTTS-v2 Romanian	Local	Romanian	Romanian voice cloning	Large Romanian fine-tune download, reference audio required
Amazon Polly	Cloud	English in current UI	Reliable cloud voices and billing visibility	Requires valid AWS credentials
HF Inference API	Cloud	English, Romanian	Quick cloud inference experiments	Requires `HF_TOKEN`; free tier can rate-limit

Quick Start

The shortest clean setup is two commands:

./scripts/bootstrap.sh
uv run python app.py

Then open http://localhost:1234.

If you want every optional backend as well:

./scripts/bootstrap.sh --all
uv run python app.py

What the bootstrap script does

./scripts/bootstrap.sh:

installs uv if missing
installs ffmpeg and espeak-ng
installs Python 3.12 via uv
runs uv sync --frozen --dev
creates .env from .env.example if needed

It currently supports:

macOS with Homebrew
Ubuntu / Debian style systems with apt-get

Manual setup

If you prefer not to use the bootstrap script, the equivalent manual flow is below.

1. Install `uv`

macOS or Ubuntu:

curl -LsSf https://astral.sh/uv/install.sh | sh

If you already use Homebrew on macOS:

brew install uv

2. Pin the project Python version

This repo is set up for Python 3.12 because several TTS backends are still fragile on newer interpreters.

uv python install 3.12

The repo includes .python-version, so uv will use Python 3.12 automatically.

3. Install non-Python system dependencies

This project needs a couple of tools that are outside Python dependency management.

Tool	Required?	Why
`ffmpeg`	Yes	`pydub` uses it for MP3 import/export and audio format conversion
`espeak-ng`	Required for Kokoro	Kokoro depends on it for phonemization
`piper` CLI	Optional	Only needed if Piper falls back from the Python API to the CLI binary

pydub relies on ffmpeg for MP3 import/export, and Kokoro relies on espeak-ng.

macOS:

brew install ffmpeg espeak-ng

If you want the optional Piper CLI available as a fallback:

brew install piper

Ubuntu / Debian:

sudo apt-get update
sudo apt-get install -y ffmpeg espeak-ng

If you want the optional Piper CLI available as a fallback:

sudo apt-get install -y piper

Quick sanity check:

ffmpeg -version
espeak-ng --version
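
The same check can be scripted. The sketch below is a minimal illustration, not part of the repo: `missing_tools` and the two tuples are hypothetical names, and the required/optional split simply mirrors the dependency table above.

```python
import shutil

# Tools this README lists as system dependencies (names taken from the table above).
REQUIRED_TOOLS = ("ffmpeg", "espeak-ng")
OPTIONAL_TOOLS = ("piper",)


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools(REQUIRED_TOOLS)` returns an empty list when both tools are on PATH. The `which` parameter exists only so the lookup can be swapped out in tests.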

4. Sync Python dependencies with `uv`

For the default app experience, including the web app, Kokoro, Polly support, and the fast test suite:

uv sync --frozen --dev

If you want every optional backend available locally:

uv sync --frozen --dev --extra all

Optional backend extras

Extra	Adds
`piper`	Piper Python backend support
`supertonic`	Supertonic local ONNX backend
`huggingface`	Local Transformers TTS backend
`xtts`	XTTS / XTTS-RO voice cloning backends
`speecht5`	SpeechT5 multi-speaker backend
`ocr`	EasyOCR fallback for scanned PDFs
`all`	All optional backends above

Examples:

uv sync --frozen --dev --extra piper --extra supertonic
uv sync --frozen --dev --extra xtts --extra ocr

5. Create your local environment file

cp .env.example .env

You do not need to fill everything in up front. For a first successful run, the minimum usually is:

leave the default TTS_BACKEND=kokoro
keep Kokoro defaults as-is
skip cloud credentials until you want Polly or HF cloud

6. Start the app

uv run python app.py

Open:

http://localhost:1234

7. First successful conversion

1. Open the Convert page.
2. Upload a PDF or EPUB, or place one inside input/ and pick it from the dropdown.
3. Keep Kokoro selected for the simplest local path.
4. Leave page range as all or choose a range such as 1-10.
5. Click Convert to MP3.

Common `uv` Commands

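./scripts/bootstrap.sh                     # default setup
./scripts/bootstrap.sh --all               # setup with every optional backend
uv sync --frozen --dev                     # install from the lockfile
uv sync --frozen --dev --extra all         # install all optional extras too
uv lock                                    # re-resolve and update uv.lock
uv sync --frozen --dev                     # apply the updated lockfile
uv run python app.py                       # start the app
uv run pytest tests/test_smoke.py -q       # run the fast regression suite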

Use uv lock only when you intentionally want to update the pinned dependency set.

Configuration

The full configuration surface lives in ./.env.example.

Recommended workflow:

cp .env.example .env

Then edit only the values you care about.

Most Important Settings

Setting	Why It Matters
`TTS_BACKEND`	Sets the default backend shown on page load
`AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` / `AWS_REGION`	Required for Polly readiness and conversion
`PIPER_MODEL`	Points to the default Piper voice model
`HF_TOKEN`	Required for HF cloud inference
`KOKORO_VOICE` / `KOKORO_SPEED`	Sets the default local voice and pace
`SUPERTONIC_*`	Sets default Supertonic voice, language, speed, and silence

Config Philosophy

If a variable is not in .env.example, the app does not currently rely on it.
.env.example uses safe placeholders and code defaults only.
Keep real secrets in .env only.
The app loads .env automatically through python-dotenv.
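
Once `.env` has been loaded into the process environment, reading settings might look roughly like the sketch below. It is an illustration under assumptions, not the code in app.py: `read_settings` is a hypothetical helper, and the `KOKORO_SPEED` fallback of `1.0` is a guess (only the `TTS_BACKEND=kokoro` default is stated in this README).

```python
import os

# Hypothetical fallbacks for illustration; the real defaults live in .env.example.
DEFAULTS = {"TTS_BACKEND": "kokoro", "KOKORO_SPEED": "1.0"}


def read_settings(env=None):
    """Read a few settings from an environment mapping, applying fallbacks."""
    env = os.environ if env is None else env
    backend = env.get("TTS_BACKEND", DEFAULTS["TTS_BACKEND"]).strip().lower()
    speed = float(env.get("KOKORO_SPEED", DEFAULTS["KOKORO_SPEED"]))
    # Treat Polly as configured only when all three AWS values are non-empty.
    aws_keys = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")
    polly_configured = all(env.get(key) for key in aws_keys)
    return {
        "backend": backend,
        "kokoro_speed": speed,
        "polly_configured": polly_configured,
    }
```

Passing a plain dict instead of `os.environ` makes it easy to try different configurations without touching the real environment.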

How To Use The App

Convert Page

The main workflow is a three-step wizard:

1. Source
   - Upload a .pdf or .epub
   - Or pick a file already present in input/
2. Pages
   - Use all, 5, or 1-10
   - The app fetches metadata and tries to detect chapters
   - If there are blank front/back pages, the UI suggests a likely text range
3. Engine & Voice
   - Pick a backend
   - The UI loads voice lists dynamically where needed
   - The status badge shows whether a backend is ready or still needs setup/downloads
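
The three page-range forms accepted in the Pages step (all, a single page, or a start-end range) can be parsed along these lines. This is a sketch with a hypothetical `parse_page_range` helper; treating an empty value as all and clamping an overshooting end page are assumptions, not confirmed app behavior.

```python
def parse_page_range(spec, total_pages):
    """Turn 'all', '5', or '1-10' into an inclusive (first, last) 1-based pair."""
    text = spec.strip().lower()
    if text in ("", "all"):  # empty input treated as "all" (assumption)
        return 1, total_pages
    if "-" in text:
        start_text, end_text = text.split("-", 1)
        first, last = int(start_text), int(end_text)
    else:
        first = last = int(text)
    if first < 1 or last < first or first > total_pages:
        raise ValueError(f"invalid page range: {spec!r}")
    # Clamp the end to the document length instead of failing on an overshoot.
    return first, min(last, total_pages)
```

For a 20-page document, `parse_page_range("15-30", 20)` yields `(15, 20)`, while a reversed range such as `10-5` raises `ValueError`.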

Chapter Mode

If chapters are detected, you can enable Split into chapter files.

Important behavior:

chapter mode converts the whole document by detected chapters
the result is a ZIP archive of MP3 files
this works for both PDF and EPUB
the typed page range does not control the output when chapter mode is on; the detected chapter boundaries do
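
Packaging per-chapter audio into one archive can be sketched with the standard library alone. `zip_chapters` and its filename pattern are hypothetical and chosen for illustration; the app's real naming follows the Output Naming rules.

```python
import io
import zipfile


def zip_chapters(chapters):
    """Bundle (title, mp3_bytes) pairs into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    # MP3 is already compressed, so ZIP_STORED avoids spending CPU on recompression.
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        for index, (title, mp3_bytes) in enumerate(chapters, start=1):
            safe_title = "".join(c if c.isalnum() else "_" for c in title).strip("_")
            archive.writestr(f"{index:02d}_{safe_title or 'chapter'}.mp3", mp3_bytes)
    return buffer.getvalue()
```

The numeric prefix keeps chapters in reading order in players that sort by filename.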

Save Mode

Enable Save to output/ folder if you want the results kept in the repo instead of downloaded directly by the browser.

Voice Cloning Page

Use the Voice Cloning page to upload short reference audio:

WAV, MP3, M4A, FLAC, OGG
max 10MB
3 to 10 seconds is ideal

Those files are stored in reference_audio/ and become selectable when using XTTS backends.
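
The upload limits above translate into a small validation step. The sketch below uses a hypothetical `check_reference_upload` helper and reads "10MB" as 10 MiB, which is an assumption; the duration guideline is not checked because that would require decoding the audio.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}
MAX_BYTES = 10 * 1024 * 1024  # "max 10MB", read here as 10 MiB (assumption)


def check_reference_upload(filename, size_bytes):
    """Return None when the upload is acceptable, else a short reason string."""
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        return "unsupported file type"
    if size_bytes > MAX_BYTES:
        return "file is larger than 10MB"
    return None
```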

Supported Inputs And Outputs

Inputs

PDF with extractable text
EPUB with a valid content spine
reference audio for XTTS voice cloning

Outputs

single MP3
ZIP of per-chapter MP3 files
optional saved artifacts under output/

Output Naming

Generated filenames include:

UTC timestamp
source book name
page or chapter label
backend
voice/model detail
relevant prosody settings when applicable
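
One way to assemble such a filename is sketched below. The component order, the underscore separator, the timestamp format, and the `build_output_name` helper are all assumptions made for illustration; the voice ID in the usage note is only an example value.

```python
import re
from datetime import datetime, timezone


def build_output_name(book, label, backend, voice, when=None, ext="mp3"):
    """Join timestamp, book, label, backend, and voice into a filename."""
    when = when or datetime.now(timezone.utc)
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    parts = [stamp, book, label, backend, voice]
    # Collapse anything that is not alphanumeric, dot, or hyphen into underscores.
    cleaned = [re.sub(r"[^A-Za-z0-9.-]+", "_", part).strip("_") for part in parts]
    return "_".join(part for part in cleaned if part) + f".{ext}"
```

With a fixed timestamp of 2024-01-02 03:04:05 UTC, `build_output_name("My Book", "p1-10", "kokoro", "af_heart", when)` produces `20240102T030405Z_My_Book_p1-10_kokoro_af_heart.mp3`.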

Project Layout

.
├── app.py
├── pyproject.toml
├── .python-version
├── scripts/
│   └── bootstrap.sh
├── templates/
│   └── index.html
├── input/
├── output/
├── reference_audio/
├── tests/
├── models/
├── .env.example
└── uv.lock

What each piece does:

app.py: the full Flask app, document extraction pipeline, backend dispatch, and routes
pyproject.toml: dependency definitions, optional backend extras, and pytest config for uv
.python-version: pins the interpreter version used by uv
scripts/bootstrap.sh: one-command local setup for supported macOS and Ubuntu/Debian environments
templates/index.html: the UI, styling, and browser-side JavaScript in one template
input/: books you want selectable from the dropdown
output/: saved MP3s and chapter ZIP directories
reference_audio/: uploaded voice-cloning samples
models/: Piper .onnx models and their .json sidecars
tests/: fast regression coverage plus focused route and extraction tests
uv.lock: fully resolved dependency lockfile generated by uv

Architecture At A Glance

The app is intentionally simple:

- Flask server
  - serves the UI
  - accepts uploads and convert requests
  - exposes helper APIs for backend status, chapter metadata, and saved reference voices
- Inline frontend
  - HTML, CSS, and JavaScript live in templates/index.html
  - dynamic UI behavior uses fetch
  - progress is polled from /progress/<job_id>
- Document processing
  - PDFs are read with PyMuPDF / fitz
  - EPUBs are parsed from the content spine
  - extracted text is cleaned before TTS
- Speech synthesis
  - backend-specific wrappers normalize output into MP3 buffers
  - heavy models are cached lazily in process
  - chapter mode writes one file per chapter, then zips when needed
- Local cache management
  - Hugging Face and Supertonic caches are redirected into a project-local .cache/
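
Lazy in-process caching of heavy models, as mentioned under speech synthesis, usually follows a pattern like the one below. This is a generic sketch rather than the code in app.py; `get_model` and the `loader` callable are hypothetical names.

```python
import threading

_models = {}
_lock = threading.Lock()


def get_model(name, loader):
    """Load a model on first request and reuse it for the life of the process."""
    with _lock:
        if name not in _models:
            _models[name] = loader()  # the expensive step runs once per name
        return _models[name]
```

The lock matters because a Flask server may handle requests on several threads, and two simultaneous first requests should not both trigger a multi-gigabyte load.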

API Routes Worth Knowing

Route	Purpose
`/`	Main app UI
`/convert`	Starts a conversion job and returns MP3, ZIP, or saved-result JSON
`/progress/<job_id>`	Current extraction / synthesis progress
`/api/backend-status`	Readiness notes for local and cloud backends
`/api/pdf-info`	Page and chapter metadata for PDFs and EPUBs
`/api/reference-voices`	Lists uploaded XTTS reference samples
`/api/upload-reference`	Uploads a new reference audio file
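
A script can follow a running job the same way the browser does, by polling the progress route. The sketch below assumes the route returns JSON and that completion is signalled by a `done` field; both are assumptions, so check app.py for the actual payload shape. `fetch_progress` and `wait_for_job` are hypothetical helpers.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:1234"


def fetch_progress(job_id):
    """GET /progress/<job_id> and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/progress/{job_id}") as response:
        return json.load(response)


def wait_for_job(job_id, fetch=fetch_progress, interval=1.0, max_polls=600):
    """Poll until the payload reports completion; return the final payload."""
    for _ in range(max_polls):
        payload = fetch(job_id)
        # "done" is a hypothetical field name, not confirmed by this README.
        if payload.get("done"):
            return payload
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

The `fetch` parameter keeps the polling loop separate from the HTTP call, so the loop can be exercised without a running server.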

Text Processing Notes

Before synthesis, the app tries to make extracted book text sound better:

joins hard-wrapped lines
expands common abbreviations
converts some numbers to words
normalizes ligatures and punctuation
strips URLs and citation-style brackets
preserves normal bracketed prose such as [aside]
detects and strips likely running headers in PDFs

This is a practical cleanup layer, not a perfect document normalizer.
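
A few of the steps listed above can be approximated with the standard library. The sketch below is a simplified stand-in for the app's own cleanup, with a hypothetical `clean_for_tts` helper and regular expressions chosen for illustration; it covers ligatures, hard-wrapped lines, URLs, and numeric citation brackets only.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
# Citation-style brackets hold only digits, commas, and ranges, e.g. [3] or [1, 4-6].
# Bracketed prose such as [aside] does not match and is left alone.
CITATION_RE = re.compile(r"[ \t]*\[\d+(?:\s*[,-]\s*\d+)*\]")


def clean_for_tts(text):
    """Apply a few cleanup steps to extracted book text."""
    text = unicodedata.normalize("NFKC", text)    # expands ligatures such as U+FB01
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)  # join hard-wrapped lines, keep paragraph breaks
    text = URL_RE.sub("", text)
    text = CITATION_RE.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", text).strip()
```

NFKC is a blunt tool: it also rewrites other compatibility characters, and the URL pattern swallows punctuation that directly follows a link, so a production cleaner would likely be more selective.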

Testing

Run the fast regression suite:

uv run pytest tests/test_smoke.py -q

The current tests focus on:

non-destructive text preprocessing
EPUB chapter support
Piper CLI parameter propagation
Supertonic pause handling
Polly readiness status behavior
utility and route-level regressions

Troubleshooting

Kokoro fails or shows as unavailable

Check:

espeak-ng is installed and available on PATH
the backend status badge is not showing a missing dependency problem

MP3 export or audio decode fails

Check:

ffmpeg is installed
your shell can run ffmpeg -version

Piper works inconsistently or fallback errors mention the CLI

Check:

uv sync --extra piper has been run if you want the Python Piper backend
if the app falls back to the CLI, piper is installed and available on PATH
your shell can run piper --help

Polly shows as not ready

Check:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_REGION
outbound access to AWS STS / Polly

The app validates Polly readiness with a lightweight AWS call, so auth or network problems show up before conversion starts.

Piper conversion fails

Check:

the configured .onnx file exists
the sidecar .json exists next to it
PIPER_BINARY points to a real binary if the Python API falls back to CLI
your CLI supports --noise-scale and --noise-w

EPUB conversion says no readable content

Check:

the EPUB has a valid content spine
the relevant spine items contain real text, not only images or decorative markup

PDF conversion says no extractable text

That usually means the PDF consists of scanned pages rather than embedded text. The default install does not perform OCR; the optional `ocr` extra is listed as adding an EasyOCR fallback for scanned PDFs.

HF cloud errors or rate limits

Check:

HF_TOKEN is set
the model ID exists
you are not hitting free-tier throttling

XTTS says reference audio is missing

Check:

the audio file was uploaded on the Voice Cloning page
it appears in the saved reference voice list
the selected filename still exists in reference_audio/

Operational Notes

First-Run Downloads

Several backends download large assets the first time you use them:

Kokoro: model download
Supertonic: about 305MB
XTTS / XTTS-RO: very large downloads, roughly 1.8GB
SpeechT5: model + vocoder + embeddings
Hugging Face local backends: model dependent

If a backend is slow the first time, that is expected.

Local-First Caches

Model caches are redirected into a repo-local .cache/ directory so experiments stay self-contained.
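
For Hugging Face libraries, this kind of redirection is typically done through the `HF_HOME` environment variable. The sketch below shows the idea only: `use_project_cache` and the `huggingface` subdirectory are hypothetical, and how the app redirects the Supertonic cache is not covered here.

```python
import os
from pathlib import Path


def use_project_cache(project_root, env=None):
    """Point the Hugging Face cache at <project_root>/.cache/huggingface."""
    env = os.environ if env is None else env
    cache_dir = Path(project_root) / ".cache" / "huggingface"
    # HF_HOME is read when huggingface_hub is imported, so set it before that import.
    # setdefault keeps any value the user has already exported.
    env.setdefault("HF_HOME", str(cache_dir))
    return cache_dir
```

Because the variable is read at import time, a call like this belongs at the very top of the entry point, before any model library is imported.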

Polly Cost Visibility

For Polly conversions, the app tracks billed character counts and estimated cost from backend pricing constants. This is useful for quick budgeting, but it is not a billing statement.
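
The estimate itself is simple arithmetic: billed characters times a per-million-character rate. In the sketch below, `estimate_polly_cost` is a hypothetical helper and the two prices are illustrative placeholders that may not match the constants in app.py or current AWS pricing.

```python
from decimal import Decimal

# Illustrative USD prices per one million characters (assumption; verify against AWS pricing).
PRICE_PER_MILLION = {"standard": Decimal("4.00"), "neural": Decimal("16.00")}


def estimate_polly_cost(billed_characters, engine="standard"):
    """Estimate the cost in USD for a number of billed characters."""
    rate = PRICE_PER_MILLION[engine]
    cost = rate * billed_characters / Decimal(1_000_000)
    return cost.quantize(Decimal("0.0001"))
```

Under these placeholder rates, a 500,000-character book on the neural engine comes to an estimated 8.00 USD. `Decimal` avoids the rounding drift that floats introduce when many small charges are summed.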

Limitations

No OCR for scanned PDFs unless the optional `ocr` extra is installed
Frontend and styling live in a single template file by design
Backends are powerful but heavy; some require significant first-run downloads
Chapter detection is heuristic for many PDFs
Romanian support is narrower than English support

Good Defaults

If you want a practical starting point:

use Kokoro for local English
use XTTS-RO or HF Romanian models for Romanian
use chapter mode for long books where navigation matters
save important runs into output/
keep a few short, clean reference clips in reference_audio/

Security Reminder

Do not commit your real .env.

Keep:

AWS credentials
Hugging Face tokens
any other private values

only in local, untracked environment files.

License

This project is licensed under the GNU Affero General Public License v3.0.

That means if you modify it and make the modified version available to users over a network, you must also make the corresponding source available under the same license.

See LICENSE for the full text.
