Whisper Streaming WebSocket

Real-time speech-to-text transcription and translation system built exclusively with faster-whisper for optimal performance. Features WebSocket support for live audio streaming and optional Google Translate integration.

Installation

Required Dependencies

pip install librosa soundfile websockets

Faster-Whisper Backend

This project uses faster-whisper exclusively for the best balance of speed and accuracy:

pip install faster-whisper

For GPU Support (Recommended):

Install NVIDIA libraries: CUDNN 8.5.0 and CUDA 11.7
Navigate to whisper_online.py and uncomment the GPU model line (around line 119)
GPU provides significantly faster real-time processing

Optional Features

Voice Activity Detection: pip install torch torchaudio
Translation: pip install requests (for Google Translate integration)

Usage

Quick Start - WebSocket Server

The easiest way to get started is using the enhanced launcher:

# Basic transcription only
python start_whisper.py

# With translation enabled
python start_whisper.py --translate --vac --vad

# Production setup
python start_whisper.py --host 0.0.0.0 --port 43007 --model large-v3 --translate --vac --vad

Then open index.html in your browser or navigate to http://localhost:43007 for the web interface.

Command Line Options

start_whisper.py

python start_whisper.py --help

Server Configuration:

--host HOST - Server host address (default: 0.0.0.0)
- Use 0.0.0.0 to accept connections from any IP
- Use localhost for local-only access
--port PORT - Server port number (default: 43007)
- Choose any available port for the WebSocket server

Model Settings:

--model MODEL - Whisper model size (default: large-v3)
- Options: tiny, base, small, medium, large-v1, large-v2, large-v3, large-v3-turbo
- Larger models = better accuracy but slower processing
- large-v3 recommended for production use
--language LANG - Source language code (default: en)
- Use auto for automatic language detection
- Use specific codes: en, es, fr, de, it, pt, ja, ko, zh, etc.

Audio Processing:

--chunk-size SIZE - Minimum audio chunk size in seconds (default: 0.3)
- Smaller values = lower latency but more processing overhead
- Larger values = higher latency but more efficient processing
--vac - Enable Voice Activity Controller (recommended)
- Automatically detects speech start/end for better processing
- Reduces unnecessary processing during silence
--vad - Enable Voice Activity Detection
- Filters out non-speech audio segments
- Improves accuracy by focusing on actual speech

Features:

--translate - Enable real-time translation using Google Translate
- Requires requests library: pip install requests
- Provides live translation alongside transcription

Advanced Options:

--warmup-file FILE - Audio file to warm up Whisper model
- Pre-loads model with sample audio for faster first chunk processing
- Use any WAV file for warming up
--log-level LEVEL - Logging verbosity (default: INFO)
- Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
- Use DEBUG for troubleshooting

whisper_online.py (File Processing)

Process pre-recorded audio files for testing and development:

python whisper_online.py demo.wav --language en --min-chunk-size 1

File Processing Options:

--model MODEL - Whisper model size (tiny to large-v3)
--language LANG - Source language or "auto" for automatic detection
--min-chunk-size SIZE - Minimum chunk size for processing (seconds)
--vac - Enable Voice Activity Controller for better segmentation
--vad - Enable Voice Activity Detection to filter non-speech
--task TASK - Processing task: "transcribe" or "translate" (to English)
--buffer-trimming - Buffer management strategy for long audio files

Web Interface Features

Language Selection: Choose from 25+ supported languages for source and target
Real-time Language Switching: Change languages during active recording without interruption
Voice Activity Indicator: Visual feedback showing when speech is detected
Dual Output Panels: Separate displays for original transcription and translation
WebSocket Connection: Low-latency real-time audio streaming
Auto-scroll: Automatically scrolls to show latest transcription results
Responsive Design: Works on desktop and mobile devices

Chrome Extension

The included Chrome extension captures audio from any browser tab and provides real-time transcription and translation:

Features:

Tab Audio Capture: Capture audio from YouTube, podcasts, videos, any web content
Side Panel Interface: Persistent panel that stays open while browsing
Real-time Processing: Live transcription and translation as audio plays
Language Controls: Change source and target languages during recording
Text Accumulation: Keeps all transcriptions for easy reading

Installation:

Load the extension from chrome_extension/ folder in Chrome Developer Mode
Start the Whisper server: python start_whisper.py --translate
Click extension icon to open side panel
Navigate to any tab with audio content and start translation

As a Python Module

from whisper_online import *

# Initialize faster-whisper ASR
asr = FasterWhisperASR("en", "large-v3")
asr.use_vad()  # Enable voice activity detection

# Create online processor
online = OnlineASRProcessor(asr)

# Process audio chunks in real-time
while audio_available:
    audio_chunk = get_audio_chunk()  # Your audio source (16kHz, float32)
    online.insert_audio_chunk(audio_chunk)
    result = online.process_iter()
    
    if result[2]:  # If there's transcribed text
        start_time, end_time, text = result
        print(f"[{start_time:.2f}s - {end_time:.2f}s]: {text}")

# Get final result when done
final_result = online.finish()
if final_result[2]:
    print(f"Final: {final_result[2]}")

Project Structure

├── whisper_online.py           # Core streaming processor with faster-whisper backend
├── whisper_websocket_server.py # WebSocket server with translation support
├── start_whisper.py            # Enhanced launcher script with all options
├── index.html                  # Web interface for real-time transcription
├── silero_vad_iterator.py      # Voice Activity Detection implementation
├── chrome_extension/           # Chrome extension for tab audio capture
│   ├── manifest.json          # Extension configuration
│   ├── sidepanel.html         # Extension UI
│   ├── sidepanel.js           # Extension logic
│   ├── service-worker.js      # Background script
│   └── offscreen.js           # Audio processing
├── frontend/                   # Documentation website
│   ├── index.html             # Project documentation
│   └── styles.css             # Documentation styling
└── demo.wav                    # Sample audio file for testing

Features

Real-time Streaming: Processes audio chunks with low latency using WebSocket protocol
Faster-Whisper Backend: Exclusively uses faster-whisper for optimal speed and accuracy
Voice Activity Detection: Smart processing with VAD/VAC to reduce computational overhead
Live Translation: Optional Google Translate integration for multilingual support
Web Interface: Browser-based real-time transcription with responsive design
Chrome Extension: Capture and translate audio from any browser tab
Language Switching: Change source and target languages during active recording
GPU Acceleration: Support for NVIDIA GPU acceleration for faster processing
Flexible Configuration: Extensive command-line options for customization

Supported Languages

Source Languages (Whisper): All 99 Whisper-supported languages including:

European: English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Turkish, Swedish, Danish, Norwegian, Finnish, Czech, Slovak, Hungarian, Romanian, Bulgarian, Croatian, Slovenian, Estonian, Latvian, Lithuanian, Ukrainian
Asian: Japanese, Korean, Chinese (Mandarin), Hindi, Arabic, Russian, Thai, Vietnamese, Indonesian, Malay
Others: Hebrew, Persian, Urdu, Bengali, Tamil, Telugu, Gujarati, Marathi, and many more

Translation Languages (Google Translate): 100+ languages supported for real-time translation

Language Detection: Automatic language detection available with --language auto option

Performance Notes

GPU vs CPU: GPU acceleration provides 4-5x faster processing than CPU-only
Model Size Impact:
- tiny: Fastest, lower accuracy
- base/small: Good balance for real-time use
- medium: Better accuracy, moderate speed
- large-v3: Best accuracy, requires more resources
Chunk Size: Smaller chunks (0.1-0.3s) = lower latency, larger chunks (0.5-1.0s) = better accuracy
VAD Benefits: Reduces processing by 30-50% by skipping silence periods

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Whisper Streaming WebSocket

Installation

Required Dependencies

Faster-Whisper Backend

Optional Features

Usage

Quick Start - WebSocket Server

Command Line Options

start_whisper.py

whisper_online.py (File Processing)

Web Interface Features

Chrome Extension

As a Python Module

Project Structure

Features

Supported Languages

Performance Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
chrome_extension		chrome_extension
docs		docs
whisper		whisper
.gitignore		.gitignore
README.md		README.md
demo.wav		demo.wav
index.html		index.html
silero_vad_iterator.py		silero_vad_iterator.py
start_whisper.py		start_whisper.py
whisper_online.py		whisper_online.py
whisper_websocket_server.py		whisper_websocket_server.py

Folders and files

Latest commit

History

Repository files navigation

Whisper Streaming WebSocket

Installation

Required Dependencies

Faster-Whisper Backend

Optional Features

Usage

Quick Start - WebSocket Server

Command Line Options

start_whisper.py

whisper_online.py (File Processing)

Web Interface Features

Chrome Extension

As a Python Module

Project Structure

Features

Supported Languages

Performance Notes

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages