A comprehensive multilingual dubbing and audio processing toolkit
⚠️ Work in Progress - This project is currently under active development and may contain incomplete features or unstable functionality.
Dub-Suite is an advanced Python-based toolkit designed for automated video dubbing and audio processing. It provides a complete pipeline for extracting audio from videos, performing speech-to-text transcription, translating content, and generating multilingual voice dubbing with speaker diarization and voice cloning capabilities.
- Video Audio Extraction: Extract high-quality audio from video files using FFmpeg
- Speech-to-Text Transcription: Powered by Whisper for accurate transcription
- Language Detection: Automatic detection of source audio language
- Audio Source Separation: Separate vocals from background music/SFX using Demucs
- Speaker Diarization: Identify and separate different speakers in audio
- Translation: Translate transcriptions between languages using Argos Translate
- Voice Cloning: Extract reference audio samples for voice synthesis
- Audio Processing Utilities: Format conversion and optimization tools
- Text-to-Speech Generation: Multilingual TTS with voice cloning
- Audio Mixing and Merging: Combine generated dubbing with original SFX
- Batch Processing: Process multiple videos simultaneously
- Web Interface: User-friendly GUI for non-technical users
- Quality Enhancement: Advanced audio processing and noise reduction
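The pipeline described above can be sketched as a chain of stages that each add an artifact to a job. This is a minimal illustration only; the stage names (`extract_audio`, `transcribe`, `translate`) and the `DubJob` structure are hypothetical placeholders, not Dub-Suite's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class DubJob:
    """Tracks intermediate artifacts as a video moves through the pipeline."""
    video_path: str
    target_lang: str
    artifacts: dict = field(default_factory=dict)

# Placeholder stages: each records the artifact path it would produce.
def extract_audio(job):
    job.artifacts["audio"] = job.video_path.rsplit(".", 1)[0] + ".wav"
    return job

def transcribe(job):
    job.artifacts["transcript"] = job.artifacts["audio"] + ".json"
    return job

def translate(job):
    job.artifacts["translation"] = f"{job.artifacts['transcript']}.{job.target_lang}"
    return job

def run_pipeline(video_path, target_lang):
    job = DubJob(video_path, target_lang)
    for stage in (extract_audio, transcribe, translate):
        job = stage(job)
    return job

job = run_pipeline("movie.mp4", "es")
print(job.artifacts)  # each stage's output keyed by name
```

Modeling the pipeline as a list of stage functions makes it straightforward to skip or reorder steps (e.g., omit translation when only transcribing).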
Dub-Suite/
├── modules/
│   ├── audio/                  # Audio processing modules
│   │   ├── transcribe.py       # Whisper-based transcription
│   │   ├── diarize_speakers.py # Speaker diarization
│   │   └── generate_tts.py     # TTS generation (WIP)
│   ├── utils/                  # Utility modules
│   │   ├── video_audio_extractor.py            # Video to audio extraction
│   │   ├── language_detector.py                # Language detection
│   │   ├── separate_sfx_vocals.py              # Audio source separation
│   │   ├── translate.py                        # Translation services
│   │   ├── audio_utils.py                      # Audio format utilities
│   │   └── align_speaker_with_transcription.py # Speaker alignment
│   └── voice/                  # Voice cloning modules
│       └── extract_reference.py # Reference audio extraction
├── __main__.py                 # Main application entry point
└── requirements.txt            # Python dependencies
- Python 3.9-3.12 (Python 3.13+ not fully supported yet)
- FFmpeg installed and accessible in system PATH
- CUDA-compatible GPU (optional, for faster processing)
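Because FFmpeg must be reachable on the system PATH, a startup check along these lines can fail fast with a clear message. This is a sketch, not part of the toolkit:

```python
import shutil

def ffmpeg_available() -> bool:
    """True if an `ffmpeg` executable can be found on the system PATH."""
    return shutil.which("ffmpeg") is not None

# Report rather than raise, so the check is safe to run anywhere.
print("ffmpeg found:", ffmpeg_available())
```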
Installation instructions are coming soon.
python -m dub_suite --video path/to/video.mp4 --language es

- Transcriber: Whisper-based speech-to-text with word-level timestamps
- SpeakerDiarization: Identify and segment different speakers
- DemucsSFXSeparator: Separate vocals from background audio
- AudioUtils: Format conversion and audio optimization
- LanguageDetector: Automatic language detection for audio files
- Translator: Multi-language translation using Argos Translate
- SpeakerTranscriptionAligner: Align speaker segments with transcription
- VideoAudioExtractor: Extract audio from video files with multiple quality options
- ReferenceExtractor: Extract speaker reference samples for voice cloning
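Aligning diarization output with a transcription, as SpeakerTranscriptionAligner does, typically comes down to assigning each transcribed word to the speaker segment it overlaps most in time. A simplified, self-contained version is shown below; the dict shapes for `words` and `segments` are illustrative assumptions, not the toolkit's actual data structures:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, segments):
    """Label each word (start/end/text) with its best-overlapping speaker."""
    labeled = []
    for w in words:
        best = max(
            segments,
            key=lambda s: overlap(w["start"], w["end"], s["start"], s["end"]),
        )
        labeled.append({**w, "speaker": best["speaker"]})
    return labeled

segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5},
    {"speaker": "SPEAKER_01", "start": 2.5, "end": 5.0},
]
words = [
    {"text": "hello", "start": 0.2, "end": 0.6},
    {"text": "there", "start": 2.8, "end": 3.1},
]
print(assign_speakers(words, segments))
```

Maximum-overlap assignment handles the common case where a word straddles a segment boundary: it simply goes to whichever speaker held the floor longer during that word.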
# Core TTS and Audio Processing
faster-whisper>=1.0.0
torch>=2.0.0
librosa>=0.10.0
soundfile>=0.12.0
pydub>=0.25.0
# Video Processing
ffmpy>=0.3.0
# Language Processing
argostranslate>=1.9.0
langdetect>=1.0.9
# Speaker Processing
pyannote.audio>=3.0.0
demucs>=4.0.0
# Utilities
numpy>=1.24.0
scipy>=1.10.0
- CPU: Multi-core processor (Intel i7/AMD Ryzen 7 or higher)
- RAM: 16GB+ recommended for processing large video files
- GPU: NVIDIA GPU with 8GB+ VRAM for faster Whisper transcription
- Storage: SSD recommended for temporary file processing
- Video: MP4, AVI, MOV, MKV, WebM
- Audio: WAV, MP3, FLAC, M4A, OGG
- Audio: WAV (16kHz mono for Whisper, 48kHz for high-quality)
- Transcription: JSON with word-level timestamps
- Translation: Text with preserved timing information
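As an illustration of the word-level-timestamp output, a transcription segment might serialize like this. The exact field names here are an assumption for illustration, not Dub-Suite's actual schema:

```python
import json

# Hypothetical shape of one transcription segment with word timestamps.
segment = {
    "id": 0,
    "start": 0.0,
    "end": 2.5,
    "language": "en",
    "text": "Hello there, welcome back.",
    "words": [
        {"word": "Hello", "start": 0.00, "end": 0.42},
        {"word": "there,", "start": 0.42, "end": 0.80},
        {"word": "welcome", "start": 1.10, "end": 1.55},
        {"word": "back.", "start": 1.55, "end": 1.90},
    ],
}
print(json.dumps(segment, indent=2))
```

Keeping per-word timing alongside the segment text is what lets later stages (translation, TTS, mixing) preserve the original utterance timing.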
- Video audio extraction with multiple quality options
- Whisper-based transcription with word timestamps
- Language detection and translation
- Speaker diarization and separation
- Audio source separation (vocals/SFX)
- Reference audio extraction for voice cloning
- Text-to-speech generation with voice cloning
- Speaker alignment with transcription
- Support for M&E (Music & Effects) with separate vocal tracks in video files
- Audio mixing and merging with original SFX
- Batch processing capabilities
- Error handling and logging improvements
- Web-based user interface
- Cloud deployment options
- Performance optimizations
- Advanced audio processing features (noise reduction, EQ)
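The reference-extraction step that feeds voice cloning usually needs only the longest clean stretch of speech per speaker, since short clips make poor cloning references. A minimal version of that selection, operating on hypothetical diarization dicts (not the actual `extract_reference.py` logic), could be:

```python
def pick_reference_segments(segments, min_len=3.0):
    """Return the longest diarization segment per speaker, keeping only
    segments at least `min_len` seconds long."""
    best = {}
    for seg in segments:
        dur = seg["end"] - seg["start"]
        if dur >= min_len and dur > best.get(seg["speaker"], {}).get("dur", 0.0):
            best[seg["speaker"]] = {**seg, "dur": dur}
    return best

segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.2},
    {"speaker": "SPEAKER_00", "start": 10.0, "end": 18.5},
    {"speaker": "SPEAKER_01", "start": 4.2, "end": 6.0},  # too short to keep
]
print(pick_reference_segments(segments))
```

A real implementation would additionally want to prefer segments with little background noise, which is one reason source separation runs before reference extraction.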
Source-Available License - Limited Rights
Copyright © 2025 Nicolas St-Amour
This software is released under a custom source-available license. You are free to view, study, and use the source code for personal, non-commercial purposes only.
- ✅ You can view and study the source code
- ✅ You can use it for personal, non-commercial projects
- ✅ You can report bugs and contribute suggestions
- ✅ You can fork for personal learning (no redistribution)
- ❌ You cannot use it commercially without permission
- ❌ You cannot redistribute modified versions
- ❌ You cannot create competing products based on this code
- ❌ You cannot sublicense or sell the software
We welcome bug reports, feature suggestions, and code contributions through GitHub issues and pull requests. By contributing, you agree that your submissions will be licensed under the same terms as this project.
Full license terms are available in the LICENSE file.
- Faster-Whisper: For speech-to-text transcription
- Demucs: For audio source separation
- Pyannote: For speaker diarization
- FFmpeg: For video/audio processing
Note: This software is in active development. Features may change, and stability is not guaranteed. Use at your own discretion for production applications.
For commercial licensing, custom modifications, or permission to use this software in ways not permitted by this license, please contact: nstamour.dev@outlook.com