A multimodal English ↔ Twi model trained with a LeJEPA-style objective (predictive MSE + SIGReg on encoder pools).
Learns a shared latent space across text and audio in both languages by training a predictor to anticipate target representations from context — no reconstruction loss, no cascaded pipeline.
Three training objectives:
- A — Audio → Text (both languages)
- B — Text → Text (translation)
- C — Text → Audio (both languages)
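All three objectives share the same predictive setup; a minimal sketch, assuming toy linear modules in place of the shared transformer trunk (shapes are illustrative only):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins for the shared encoder and predictor; the real model
# uses a shared transformer trunk, so these shapes are illustrative.
encoder = torch.nn.Linear(32, 16)    # maps a pooled input view into the shared latent space
predictor = torch.nn.Linear(16, 16)  # anticipates target latents from context latents

ctx = torch.randn(4, 32)  # context view, e.g. pooled text features
tgt = torch.randn(4, 32)  # target view, e.g. pooled audio features

pred = predictor(encoder(ctx))
with torch.no_grad():      # no reconstruction loss: the target is an encoded
    target = encoder(tgt)  # representation, never raw audio or text

# Predictive MSE on L2-normalised pooled vectors
loss = F.mse_loss(F.normalize(pred, dim=-1), F.normalize(target, dim=-1))
loss.backward()
```

The stop-gradient on the target side is what distinguishes this from an autoencoding objective: the predictor chases representations, not inputs.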
| File | Purpose |
|---|---|
| `model.py` | `MMT_JEPA` (shared encoder + predictor) |
| `sigreg.py` | SIGReg loss (LeJEPA) |
| `dataset.py` | `ObjA`, `ObjB`, `ObjC` dataset classes |
| `tokenizer.py` | Trains a joint BPE tokenizer on all text data |
| `train.py` | SSL pretraining (all objectives) |
| `train_decoder.py` | Decoder fine-tuning on a frozen or tunable JEPA |
Install dependencies:

```
pip install torch torchaudio soundfile sentencepiece datasets
```

1. Train the tokenizer:

```
python tokenizer.py
# outputs: tokenizer.model, tokenizer.vocab
```

2. Train the model:

```
python train.py
```

Checkpoints are saved to `checkpoints/epoch{N}.pt` after each epoch.
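A checkpoint under the `checkpoints/epoch{N}.pt` pattern can be written and inspected as below; note the payload keys (`"epoch"`, `"model"`) are an assumption for illustration, not the actual schema used by `train.py`:

```python
import os
import torch

os.makedirs("checkpoints", exist_ok=True)

# Simulate a saved checkpoint; the keys here are hypothetical.
torch.save({"epoch": 1, "model": {}}, "checkpoints/epoch1.pt")

# Load on CPU regardless of the device it was trained on
ckpt = torch.load("checkpoints/epoch1.pt", map_location="cpu")
print(ckpt["epoch"])
```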
| Objective | Dataset |
|---|---|
| A + C (English audio) | LibriSpeech train-clean-100 |
| A + C (Twi audio) | twi-speech-text-multispeaker-16k |
| B (translation) | twi-english-paragraph-dataset_news · english-twi-sentences-non-nouns · english-twi-nouns-v2 |
All datasets load automatically via HuggingFace on first run.
Edit ModelConfig in model.py to change capacity:
```python
d_model = 512         # embedding dimension
trunk_layers = 6      # shared transformer depth
vocab_size = 16_000
n_mels = 80
sample_rate = 16_000
sigreg_lambda = 0.02  # LeJEPA trade-off
```

Use `TinyMMT_JEPAConfig` in `config.py` for training runs.

Notes:
- Pooled predictions and targets are L2-normalised before the MSE; SIGReg runs on the raw pooled ctx/tgt stacks.
- Loss: `(1 - λ) · MSE + λ · SIGReg`, with `λ = sigreg_lambda`.
- Possible `COLLAPSE` log when the embedding `std` is tiny or cosine similarity is near 1.
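The loss combination and the collapse heuristics can be sketched as follows. This is a stand-alone sketch: the SIGReg term is stubbed with a simple variance penalty purely to show where it enters the loss (the actual LeJEPA statistic is different), and the collapse thresholds are illustrative:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
sigreg_lambda = 0.02

pred = torch.randn(8, 16)    # pooled predictions
target = torch.randn(8, 16)  # pooled targets

# MSE on L2-normalised pooled vectors
mse = F.mse_loss(F.normalize(pred, dim=-1), F.normalize(target, dim=-1))

# Stand-in for SIGReg on the raw pooled stacks: a variance penalty,
# NOT the real LeJEPA statistic, used only to place the term.
raw = torch.cat([pred, target], dim=0)
sigreg = (raw.var(dim=0) - 1.0).pow(2).mean()

loss = (1 - sigreg_lambda) * mse + sigreg_lambda * sigreg

# Collapse heuristics mirroring the COLLAPSE log: tiny per-dimension std,
# or mean pairwise cosine similarity near 1 (thresholds are hypothetical).
std = raw.std(dim=0).mean()
normed = F.normalize(raw, dim=-1)
cos = (normed @ normed.T).mean()
collapsed = bool(std < 1e-4 or cos > 0.99)
```

With random (non-collapsed) embeddings, `std` sits near 1 and the mean cosine similarity stays well below the threshold, so no collapse is flagged.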