MoE Expert Load Balance Metrics & Visualization #1266
hemildesai started this conversation in Show and tell
Automodel just shipped built-in MoE load-balance monitoring with wandb integration in #1232. Add a few lines to your wandb-enabled YAML config and get per-expert routing visibility during training — no code changes required.
Quick Start
Add the `moe_metrics` section to your YAML config. That's it. Works with `train_ft`, `benchmark`, and any custom MoE model we support (DeepSeek-V3, Moonlight, Qwen3-MoE, etc.).
What You Get
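A minimal sketch of what the config addition might look like. The post confirms a `moe_metrics.enabled` gate and a `detailed_every_steps` interval; the exact key placement and defaults shown here are assumptions, so check #1232 for the real schema:

```yaml
# Hypothetical sketch — key placement and defaults may differ; see #1232.
moe_metrics:
  enabled: true              # turn on load-balance tracking (logged on rank 0)
  detailed_every_steps: 100  # how often detailed mode logs per-layer scalars
```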
All metrics are flat scalars logged via `wandb.log()` — they render as native wandb line charts with no extra setup.
Brief Mode — Aggregated scalars + outlier experts
- CV (Coefficient of Variation) across all MoE layers: `moe/cv_mean`, `moe/cv_median`, `moe/cv_min`, `moe/cv_max`
- Expert utilization distribution — ratio of `actual_load / ideal_load` across all experts globally (1.0 = fair share, >1 = overloaded, 0 = dead expert): `moe/expert_utilization_min`, `moe/expert_utilization_p25`, `moe/expert_utilization_median`, `moe/expert_utilization_p75`, `moe/expert_utilization_max`
- Top-K / Bottom-K outlier experts — the 5 most overloaded and 5 most underloaded experts across the entire model, with their layer and expert ID in the key: `moe_expert_utilization/layer_3_expert_41`, `moe_expert_utilization/layer_22_expert_7`
- Aux loss: `moe/aux_loss_mean` — load-balancing auxiliary loss averaged across layers (when `aux_loss_coeff > 0`)
Detailed Mode — Per-layer breakdowns
Everything in brief mode, plus per-layer scalars logged every `detailed_every_steps`:
- `moe/layer_{i}/cv`
- `moe/layer_{i}/aux_loss`
- `moe/layer_{i}/utilization_mean`
This lets you identify which layers have routing problems — e.g. early layers balanced but deep layers collapsed.
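To make the metric definitions above concrete, here is a small self-contained sketch of how CV and the utilization ratio can be computed from per-expert token counts. This is illustrative only, not the shipped `load_balance_metrics.py`; the function name and return keys are mine:

```python
import numpy as np

def layer_balance_metrics(expert_load):
    """Compute CV and utilization stats for one MoE layer.

    expert_load: 1-D array of token counts routed to each expert.
    """
    load = np.asarray(expert_load, dtype=np.float64)
    ideal = load.sum() / len(load)     # fair share per expert
    util = load / ideal                # 1.0 = balanced, 0 = dead expert
    cv = load.std() / load.mean()      # coefficient of variation
    return {
        "cv": cv,
        "utilization_min": util.min(),
        "utilization_p25": np.percentile(util, 25),
        "utilization_median": np.percentile(util, 50),
        "utilization_p75": np.percentile(util, 75),
        "utilization_max": util.max(),
    }

# Perfectly balanced routing: CV is 0 and every utilization ratio is 1.0.
m = layer_balance_metrics([100, 100, 100, 100])
```

A dead expert would drag `utilization_min` to 0 while pushing the surviving experts' ratios above 1, which is exactly the signature the percentile metrics surface.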
How It Works
The implementation is minimal — the `Gate` (router) already computes per-expert token counts for aux loss and bias correction. We just surface that data:
- `Gate._track_load_balance` — a boolean flag on the router. When enabled, stores `expert_load.detach()` from each forward pass. Cost: one `.detach()` per layer — negligible vs expert compute.
- `load_balance_metrics.py` — stateless utility that traverses the model, collects loads from all Gate modules, and computes metrics. Supports optional DP all-reduce for global routing stats across ranks.
- `log_train_metrics()` — picks up the config and calls the utility. Gated behind `moe_metrics.enabled`, only runs on rank 0.
No new dependencies. No changes to the forward/backward computation graph.
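A rough sketch of the traverse-and-flatten pattern described above, assuming each gate keeps its last per-expert load on an attribute. The `_last_expert_load` attribute, the stand-in `Gate` class, and the collector function are all hypothetical names for illustration, not the actual Automodel API:

```python
import numpy as np

class Gate:
    """Stand-in for a router that recorded its per-expert load."""
    def __init__(self, expert_load):
        self._track_load_balance = True
        self._last_expert_load = np.asarray(expert_load, dtype=np.float64)

def collect_moe_metrics(named_gates):
    """Flatten per-layer loads into wandb-style scalar keys.

    named_gates: iterable of (layer_idx, gate) pairs, e.g. found by
    walking the model and keeping modules with _track_load_balance set.
    """
    metrics, cvs = {}, []
    for i, gate in named_gates:
        load = gate._last_expert_load
        cv = load.std() / load.mean()
        cvs.append(cv)
        metrics[f"moe/layer_{i}/cv"] = cv
    metrics["moe/cv_mean"] = float(np.mean(cvs))
    metrics["moe/cv_max"] = float(np.max(cvs))
    return metrics  # flat dict, ready to pass straight to wandb.log()

gates = [(0, Gate([50, 50])), (1, Gate([90, 10]))]
out = collect_moe_metrics(gates)
```

Because the result is a flat `{str: float}` dict, a single `wandb.log(metrics)` call is all the logging integration needed — which is why the charts render natively with no extra setup.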
Example: Hellaswag finetuning
Here's an actual wandb dashboard comparing three finetuning runs using these metrics:
Three runs finetuning on HellaSwag — Moonlight 16B with gate bias updates (orange), Moonlight 16B without bias updates (green), and Qwen3 MoE 30B with 128 experts, no bias (purple):
Both Moonlight runs start with the same routing distribution — but only the orange run (with bias correction) moves closer to perfectly balanced experts. Green (without bias) stays closer to the original distribution.
Qwen3 30B tells a different story. Purple's utilization stays non-uniform throughout — p75 elevated, p25 depressed.
Use Cases
- See how `e_score_correction_bias` steers load balance over time
Please share your thoughts and feedback, and let us know if you would like more MoE-specific metrics supported.