
WeSpeaker finetuning


Roadmap | Docs | Paper | Runtime | Pretrained Models | Huggingface Demo | Modelscope Demo

TL;DR: Practical, config-driven fine-tuning for WeSpeaker models (ResNet, ECAPA-TDNN, CAMPPlus, …) with:

  • Flexible layer-freezing strategies (exclude/only/none/all)
  • Pattern-based layer selection (by layer name substrings)
  • Optional embedding-dimension change during fine-tuning (e.g., 256→128)
  • Clean VoxCeleb-style recipes & runnable commands
  • A full fine-tuning guide
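To make the freezing strategies above concrete, here is a minimal, hypothetical sketch of how a name-pattern selector could map the `exclude`/`only`/`none`/`all` strategies onto per-parameter trainability. The helper and its argument names are illustrative, not this repo's actual API; consult the fine-tuning guide for the real configuration keys.

```python
# Hypothetical sketch of pattern-based layer freezing. The strategy names
# mirror the exclude/only/none/all options listed above, but this helper
# is illustrative and not this repository's actual implementation.

def is_trainable(param_name, strategy, patterns):
    """Decide whether a parameter stays trainable under a freezing strategy.

    strategy:
      "all"     -> freeze everything (nothing trainable)
      "none"    -> freeze nothing (everything trainable)
      "only"    -> freeze only parameters matching a pattern
      "exclude" -> freeze everything except parameters matching a pattern
    """
    matched = any(p in param_name for p in patterns)
    if strategy == "all":
        return False
    if strategy == "none":
        return True
    if strategy == "only":
        return not matched   # matching params are frozen
    if strategy == "exclude":
        return matched       # matching params stay trainable
    raise ValueError(f"unknown strategy: {strategy}")

# With a PyTorch model, each named parameter would then be toggled via:
#   param.requires_grad = is_trainable(name, strategy, patterns)

names = ["layer4.conv1.weight", "pool.weight", "xvector.dense.weight"]
trainable = [n for n in names if is_trainable(n, "exclude", ["layer4", "pool"])]
# -> only the layer4.* and pool.* parameters remain trainable
```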

Why this repo?

WeSpeaker already provides strong training pipelines and docs. This fork adds a focused set of fine-tuning utilities & recipes to adapt pretrained models to your target domains (new speakers, languages, channels) without re-training from scratch. See the upstream docs & paper for the broader toolkit.

Features

  • 🔧 Layer freezing, your way — choose from multiple strategies and specify trainable/frozen parts via name patterns (e.g., layer4, pool, xvector.dense).
  • 🧩 Architecture-aware — examples for ResNet (18–293), ECAPA-TDNN, CAMPPlus, etc.
  • 🪄 Change embedding size during fine-tune (e.g., 256→128/64) to speed up deployment and reduce memory, while keeping upstream feature extractors intact.
  • 🧪 Batteries included — VoxCeleb-style data layout, quickstart configs, and simple run scripts. Details and YAML snippets are in the fine-tuning guide.
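Conceptually, shrinking `embed_dim` during fine-tuning means keeping the pretrained backbone and pooling weights and re-initializing only the final projection head with a new output size. The NumPy sketch below illustrates that idea with made-up dimensions; it is not this repo's code, and the real mechanism lives in the fine-tuning guide.

```python
import numpy as np

# Conceptual sketch (not this repository's code): changing embed_dim from
# 256 to 128 keeps the pretrained feature extractor intact and replaces
# only the final projection head with a freshly initialised, smaller one.

rng = np.random.default_rng(0)

pooled_dim = 512                        # pooled-statistics dim (illustrative)
pooled = rng.standard_normal(pooled_dim)  # stands in for the backbone output

old_head = rng.standard_normal((256, pooled_dim))         # pretrained 256-d head
new_head = rng.standard_normal((128, pooled_dim)) * 0.01  # new 128-d head

old_embedding = old_head @ pooled  # shape (256,)
new_embedding = new_head @ pooled  # shape (128,)

# During fine-tuning, only the new head (plus whatever layers were left
# unfrozen) is updated; the backbone that produced `pooled` stays intact.
```

The smaller embedding halves the memory and scoring cost downstream, at the price of briefly training the new head from scratch.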

Installation

Install for development & deployment

# Option A: use upstream pip install of wespeaker
pip install git+https://github.com/wenet-e2e/wespeaker.git

# Option B: dev env for this repo
# Create conda env: pytorch version >= 1.12.1 is recommended !!!
git clone https://github.com/wa3dbk/wespeaker-finetuning.git
cd wespeaker-finetuning
conda create -n wespeaker python=3.9
conda activate wespeaker
conda install pytorch=1.12.1 torchaudio=0.12.1 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
pre-commit install  # for clean and tidy code

For general WeSpeaker usage and environment notes, see upstream docs.

Usage

Command-line usage (use -h for parameters):

$ wespeaker --task embedding --audio_file audio.wav --output_file embedding.txt
$ wespeaker --task embedding_kaldi --wav_scp wav.scp --output_file /path/to/embedding
$ wespeaker --task similarity --audio_file audio.wav --audio_file2 audio2.wav
$ wespeaker --task diarization --audio_file audio.wav

Python programming usage:

import wespeaker

model = wespeaker.load_model('chinese')
embedding = model.extract_embedding('audio.wav')
utt_names, embeddings = model.extract_embedding_list('wav.scp')
similarity = model.compute_similarity('audio1.wav', 'audio2.wav')
diar_result = model.diarize('audio.wav')

See the python usage documentation for more command-line and Python programming examples.
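Assuming `extract_embedding` returns a 1-D vector (as in the upstream Python API above), you can also score a trial pair yourself with plain cosine similarity; note the built-in `compute_similarity` may apply additional normalization on top of a score like this.

```python
import numpy as np

# Plain cosine similarity between two speaker embeddings. This assumes
# extract_embedding() returns 1-D vectors; compute_similarity may wrap a
# score like this with extra normalisation.

def cosine(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# e1 = model.extract_embedding('audio1.wav')
# e2 = model.extract_embedding('audio2.wav')
# score = cosine(e1, e2)  # higher -> more likely the same speaker
```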

🔥 News

  • 2025.10.14: Add the possibility to change embed_dim during model fine-tuning, see fine-tuning guidelines for more details.
  • 2025.10.04: Add support for model fine-tuning, see fine-tuning guidelines for more details.
  • 2025.02.23: Add support for the Xi-vector, see #404.
  • 2024.09.03: Support the SimAM_ResNet and the model pretrained on VoxBlink2, check Pretrained Models for the pretrained model, VoxCeleb Recipe for the super performance, and python usage for the command line usage!
  • 2024.08.30: We support whisper_encoder based frontend and propose the Whisper-PMFA framework, check #356.
  • 2024.08.20: Update diarization recipe for VoxConverse dataset by leveraging umap dimensionality reduction and hdbscan clustering, see #347 and #352.
  • 2024.08.18: Support using ssl pre-trained models as the frontend. The WavLM recipe is also provided, see #344.
  • 2024.05.15: Add support for quality-aware score calibration, see #320.
  • 2024.04.25: Add support for the gemini-dfresnet model, see #291.
  • 2024.04.23: Support MNN inference engine in runtime, see #310.
  • 2024.04.02: Release Wespeaker document with detailed model-training tutorials, introduction of various runtime platforms, etc.
  • 2024.03.04: Support the eres2net-cn-common-200k and campplus-cn-common-200k of damo #281, check python usage for details.
  • 2024.02.05: Support the ERes2Net #272 and Res2Net #273 models.
  • 2023.11.13: Support CLI usage of wespeaker, check python usage for details.
  • 2023.07.18: Support the kaldi-compatible PLDA and unsupervised adaptation, see #186.
  • 2023.07.14: Support the NIST SRE16 recipe, see #177.

Recipes

  • VoxCeleb: Speaker Verification recipe on the VoxCeleb dataset
    • 🔥 UPDATE 2024.05.15: We support score calibration for Voxceleb and achieve better performance!
    • 🔥 UPDATE 2023.07.10: We support self-supervised learning recipe on Voxceleb! Achieving 2.627% (ECAPA_TDNN_GLOB_c1024) EER on vox1-O-clean test set without any labels.
    • 🔥 UPDATE 2022.10.31: We support deep r-vector up to the 293-layer version! Achieving 0.447%/0.043 EER/minDCF on vox1-O-clean test set
    • 🔥 UPDATE 2022.07.19: We apply the same setups as the CNCeleb recipe, and obtain SOTA performance considering the open-source systems
      • EER/minDCF on vox1-O-clean test set are 0.723%/0.069 (ResNet34) and 0.728%/0.099 (ECAPA_TDNN_GLOB_c1024), after LM fine-tuning and AS-Norm
  • CNCeleb: Speaker Verification recipe on the CnCeleb dataset
    • 🔥 UPDATE 2024.05.16: We support score calibration for Cnceleb and achieve better EER.
    • 🔥 UPDATE 2022.10.31: 221-layer ResNet achieves 5.655%/0.330 EER/minDCF
    • 🔥 UPDATE 2022.07.12: We migrate the winner system of CNSRC 2022 report slides
      • EER/minDCF reduction from 8.426%/0.487 to 6.492%/0.354 after large margin fine-tuning and AS-Norm
  • NIST SRE16: Speaker Verification recipe for the 2016 NIST Speaker Recognition Evaluation Plan. Similar recipe can be found in Kaldi.
    • 🔥 UPDATE 2023.07.14: We support NIST SRE16 recipe. After PLDA adaptation, we achieved 6.608%, 10.01%, and 2.974% EER on trial Pooled, Tagalog, and Cantonese, respectively.
  • VoxConverse: Diarization recipe on the VoxConverse dataset

Citations

If you find wespeaker useful, please cite it as

@article{wang2024advancing,
  title={Advancing speaker embedding learning: Wespeaker toolkit for research and production},
  author={Wang, Shuai and Chen, Zhengyang and Han, Bing and Wang, Hongji and Liang, Chengdong and Zhang, Binbin and Xiang, Xu and Ding, Wen and Rohdin, Johan and Silnova, Anna and others},
  journal={Speech Communication},
  volume={162},
  pages={103104},
  year={2024},
  publisher={Elsevier}
}

@inproceedings{wang2023wespeaker,
  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

Looking for contributors

If you are interested in contributing, feel free to contact @wa3dbk.
