MoE Expert Load Balance Metrics & Visualization #1266
hemildesai started this conversation in Show and tell
Automodel just shipped built-in MoE load-balance monitoring with wandb integration in #1232. Add a few lines to your wandb-enabled YAML config and get per-expert routing visibility during training — no code changes required.
Quick Start
Add the `moe_metrics` section to your YAML config. That's it. Works with `train_ft`, `benchmark`, and any custom MoE model we support (DeepSeek-V3, Moonlight, Qwen3-MoE, etc.).
What You Get
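A minimal sketch of what the config addition might look like. The post confirms a `moe_metrics.enabled` gate and a `detailed_every_steps` interval; the exact key placement and defaults shown here are assumptions, so check #1232 for the real schema:

```yaml
# Hypothetical sketch — key placement and defaults may differ; see #1232.
moe_metrics:
  enabled: true              # turn on load-balance tracking (logged on rank 0)
  detailed_every_steps: 100  # how often detailed mode logs per-layer scalars
```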
All metrics are flat scalars logged via `wandb.log()` — they render as native wandb line charts with no extra setup.
Brief Mode — Aggregated scalars + outlier experts
- CV (Coefficient of Variation) across all MoE layers: `moe/cv_mean`, `moe/cv_median`, `moe/cv_min`, `moe/cv_max`
- Expert utilization distribution — ratio of `actual_load / ideal_load` across all experts globally (1.0 = fair share, >1 = overloaded, 0 = dead expert): `moe/expert_utilization_min`, `moe/expert_utilization_p25`, `moe/expert_utilization_median`, `moe/expert_utilization_p75`, `moe/expert_utilization_max`
- Top-K / Bottom-K outlier experts — the 5 most overloaded and 5 most underloaded experts across the entire model, with their layer and expert ID in the key: `moe_expert_utilization/layer_3_expert_41`, `moe_expert_utilization/layer_22_expert_7`
- Aux loss: `moe/aux_loss_mean` — load-balancing auxiliary loss averaged across layers (when `aux_loss_coeff > 0`)
Detailed Mode — Per-layer breakdowns
Everything in brief mode, plus per-layer scalars logged every `detailed_every_steps`:
- `moe/layer_{i}/cv`
- `moe/layer_{i}/aux_loss`
- `moe/layer_{i}/utilization_mean`
This lets you identify which layers have routing problems — e.g. early layers balanced but deep layers collapsed.
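To make the metric definitions above concrete, here is a small self-contained sketch of how CV and the utilization ratio can be computed from per-expert token counts. This is illustrative only, not the shipped `load_balance_metrics.py`; the function name and return keys are mine:

```python
import numpy as np

def layer_balance_metrics(expert_load):
    """Compute CV and utilization stats for one MoE layer.

    expert_load: 1-D array of token counts routed to each expert.
    """
    load = np.asarray(expert_load, dtype=np.float64)
    ideal = load.sum() / len(load)     # fair share per expert
    util = load / ideal                # 1.0 = balanced, 0 = dead expert
    cv = load.std() / load.mean()      # coefficient of variation
    return {
        "cv": cv,
        "utilization_min": util.min(),
        "utilization_p25": np.percentile(util, 25),
        "utilization_median": np.percentile(util, 50),
        "utilization_p75": np.percentile(util, 75),
        "utilization_max": util.max(),
    }

# Perfectly balanced routing: CV is 0 and every utilization ratio is 1.0.
m = layer_balance_metrics([100, 100, 100, 100])
```

A dead expert would drag `utilization_min` to 0 while pushing the surviving experts' ratios above 1, which is exactly the signature the percentile metrics surface.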
How It Works
The implementation is minimal — the `Gate` (router) already computes per-expert token counts for aux loss and bias correction. We just surface that data:
- `Gate._track_load_balance` — a boolean flag on the router. When enabled, stores `expert_load.detach()` from each forward pass. Cost: one `.detach()` per layer — negligible vs expert compute.
- `load_balance_metrics.py` — stateless utility that traverses the model, collects loads from all Gate modules, and computes metrics. Supports optional DP all-reduce for global routing stats across ranks.
- `log_train_metrics()` — picks up the config and calls the utility. Gated behind `moe_metrics.enabled`, only runs on rank 0.
No new dependencies. No changes to the forward/backward computation graph.
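A rough sketch of the traverse-and-flatten pattern described above, assuming each gate keeps its last per-expert load on an attribute. The `_last_expert_load` attribute, the stand-in `Gate` class, and the collector function are all hypothetical names for illustration, not the actual Automodel API:

```python
import numpy as np

class Gate:
    """Stand-in for a router that recorded its per-expert load."""
    def __init__(self, expert_load):
        self._track_load_balance = True
        self._last_expert_load = np.asarray(expert_load, dtype=np.float64)

def collect_moe_metrics(named_gates):
    """Flatten per-layer loads into wandb-style scalar keys.

    named_gates: iterable of (layer_idx, gate) pairs, e.g. found by
    walking the model and keeping modules with _track_load_balance set.
    """
    metrics, cvs = {}, []
    for i, gate in named_gates:
        load = gate._last_expert_load
        cv = load.std() / load.mean()
        cvs.append(cv)
        metrics[f"moe/layer_{i}/cv"] = cv
    metrics["moe/cv_mean"] = float(np.mean(cvs))
    metrics["moe/cv_max"] = float(np.max(cvs))
    return metrics  # flat dict, ready to pass straight to wandb.log()

gates = [(0, Gate([50, 50])), (1, Gate([90, 10]))]
out = collect_moe_metrics(gates)
```

Because the result is a flat `{str: float}` dict, a single `wandb.log(metrics)` call is all the logging integration needed — which is why the charts render natively with no extra setup.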
Example: Hellaswag finetuning
Here's an actual wandb dashboard comparing three finetuning runs using these metrics:
Three runs finetuning on HellaSwag — Moonlight 16B with gate bias updates (orange), Moonlight 16B without bias updates (green), and Qwen3 MoE 30B with 128 experts, no bias (purple):
Both Moonlight runs start with the same routing distribution — but only the orange run (with bias correction) moves closer to perfectly balanced experts. Green (without bias) stays closer to the original distribution.
Qwen3 30B tells a different story. Purple's utilization stays non-uniform throughout — p75 elevated, p25 depressed.
Use Cases
- See how `e_score_correction_bias` steers load balance over time
Please share your thoughts and feedback, and let us know if you would like more MoE-specific metrics supported.