Commit 1f53e89

Merge branch 'main' into shengliangx/early-quant-cfg-validation
2 parents 9182ebd + 3baa2da commit 1f53e89

19 files changed: +774, -186 lines

.claude/skills/common/slurm-setup.md

Lines changed: 66 additions & 0 deletions

@@ -74,6 +74,47 @@ include a multi-node-capable partition as the last fallback.

Only submit the full job after the smoke test exits cleanly.
### Docker (non-pyxis) variant

Some clusters don't have pyxis/enroot installed and instead use plain `docker run` on compute nodes. In this case, replace the `srun --container-image` pattern with `docker run` inside the job script:

```bash
#!/bin/bash
#SBATCH --job-name=<name>
#SBATCH --account=<account>
#SBATCH --partition=<partition>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<N>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=<log_dir>/<name>_%j.log

docker run --rm \
  --gpus all \
  --shm-size=32g \
  --ulimit memlock=-1 \
  --network host \
  -v <data_root>:<data_root> \
  -e CALIB_SIZE="${CALIB_SIZE:-512}" \
  <container_image> \
  bash <path/to/run_script.sh>
```

**Key differences from pyxis**:

- No `srun` wrapper needed — SLURM just allocates the node, Docker runs the container
- Mount paths with `-v` instead of `--container-mounts`
- Pass env vars with `-e` instead of relying on SLURM env propagation
- Use the two-script pattern: a SLURM wrapper (sbatch directives + `docker run`) and an inner runner (the actual work). The inner runner should unset SLURM env vars and set `HF_HOME`/`HF_DATASETS_OFFLINE` as needed
- **NFS root_squash**: see section 5
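The inner-runner half of the two-script pattern can be sketched as follows. This is a minimal, illustrative script; the cache mount path and the final command are assumptions, not part of any documented interface:

```shell
#!/bin/bash
# Inner runner: executes *inside* the container started by the SLURM wrapper.
set -euo pipefail

# Unset inherited SLURM variables so in-container launchers (torchrun,
# accelerate) don't mistake this for a multi-node SLURM launch.
for var in $(env | grep -o '^SLURM_[A-Za-z0-9_]*' || true); do
    unset "$var"
done

# Point HF caches at a mounted path; stay offline if data was pre-downloaded.
export HF_HOME="${HF_HOME:-/data/.hf_cache}"   # illustrative mount point
export HF_DATASETS_OFFLINE=1

echo "remaining SLURM vars: $(env | grep -c '^SLURM_' || true)"
# ... actual work goes here, e.g. the PTQ or calibration command ...
```

The wrapper script stays tiny (sbatch directives plus one `docker run`), which keeps the GPU-allocation concerns separate from the workload itself.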
**How to detect which pattern to use**: Ask the user how they normally run containers, or check:

```bash
which enroot 2>/dev/null && echo "pyxis/enroot available"
which docker 2>/dev/null && echo "docker available"
```
---
## 3. Monitor Until Completion
@@ -126,3 +167,28 @@ srun \
```

Adjust `--nodes`, `--gpus-per-node`, and the distributed launch command per your workload.

---

## 5. NFS root_squash and Docker Permissions

Docker containers typically run as root, but NFS filesystems with `root_squash` (the default) map root to `nobody`, blocking writes to directories owned by the user. This causes `PermissionError` when creating cache lock files, writing output, or saving logs.

This affects both pyxis/enroot (`srun --container-image`) and plain `docker run` workflows.

**Preferred fix** — run Docker with the host user's UID/GID to match NFS ownership:

```bash
docker run --user $(id -u):$(id -g) ...
```

> Note: `--user` may cause issues if the container expects root for package installation. In that case, fall back to the chmod approach below.

**Fallback fix** — open permissions before submitting the job:

```bash
chmod -R g+rwX /path/to/workspace/
chmod -R g+rwX /path/to/.hf_cache/
```

Scope `chmod` to only the directories the job needs — avoid world-writable paths on shared clusters.
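The advice in this section can be turned into a quick pre-flight probe run on a compute node before submitting. A minimal sketch; the function name and messages are illustrative:

```shell
# check_writable DIR: succeeds and prints "writable ..." if DIR accepts
# writes from the current UID/GID (i.e., survives NFS root_squash mapping),
# otherwise prints a remediation hint and fails.
check_writable() {
    local dir="$1"
    local probe="$dir/.write_probe.$$"
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "writable by $(id -u):$(id -g)"
        return 0
    fi
    echo "not writable: chmod -R g+rwX $dir or fix ownership first"
    return 1
}
```

Run it against the workspace and cache directories, e.g. `check_writable /path/to/workspace`, before `sbatch` so permission failures surface in seconds rather than mid-job.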

.claude/skills/ptq/SKILL.md

Lines changed: 9 additions & 3 deletions

@@ -118,10 +118,16 @@ Report the path and size to the user.
 - `mtq.register()` classes **must** define `_setup()` and call it from `__init__`
 - Call `mto.enable_huggingface_checkpointing()` **before** quantization
 - Wildcard `*gate*` matches too broadly — use `*mlp.gate*` or `*router*`
-- VLMs need `AutoModel`, not `AutoModelForCausalLM`
-- FP8 loading: `FineGrainedFP8Config(dequantize=True)`, not a dict
+- VLMs: `hf_ptq.py` auto-extracts the language model via `extract_and_prepare_language_model_from_vl()` — no manual VLM handling needed in most cases
+- FP8 checkpoints: prefer `_QuantFP8Linear` (lazy dequant) over `FineGrainedFP8Config(dequantize=True)`, which wastes ~2x memory. See `references/unsupported-models.md` for details
 - Custom quantizer names must end with `_input_quantizer` or `_weight_quantizer`

+## Common Pitfalls
+
+- **Transformers version**: Newer models (e.g., Devstral/ministral3) may require a transformers version not yet in the container. Check `config.json` for `transformers_version` and upgrade if needed. Install ModelOpt first, then upgrade transformers **with** deps (not `--no-deps`) to pull in a compatible `huggingface_hub`
+- **Gated datasets**: Some calibration datasets require HF authentication. Ensure `HF_TOKEN` is set in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative
+- **NFS root_squash + Docker**: Docker runs as root, but NFS squashes root to `nobody`. Use `docker run --user $(id -u):$(id -g)`, or `chmod -R g+rwX` on needed directories as a fallback. See `skills/common/slurm-setup.md` section 5
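The transformers-version pitfall can be checked mechanically before launching a job. A minimal sketch; the helper name is illustrative, and the naive parse ignores pre-release suffixes:

```python
import json
from pathlib import Path

def needs_transformers_upgrade(ckpt_dir: str, installed: str) -> bool:
    """Return True if config.json pins a newer transformers than `installed`."""
    cfg = json.loads((Path(ckpt_dir) / "config.json").read_text())
    required = cfg.get("transformers_version")
    if required is None:
        return False  # no version pin recorded in the checkpoint
    # Naive numeric compare on the first three dotted components.
    parse = lambda v: tuple(int(p) for p in v.split(".")[:3] if p.isdigit())
    return parse(installed) < parse(required)
```

Running this right after downloading a checkpoint turns a mid-calibration import error into a one-line pre-flight message.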

## References

| Reference | When to read |

@@ -133,7 +139,7 @@ Report the path and size to the user.
 | `references/unsupported-models.md` | Step 4C only (unlisted model) |
 | `skills/common/remote-execution.md` | Step 4A/4C only, if target is remote |
 | `skills/common/slurm-setup.md` | Step 4A/4C only, if using SLURM manually (not launcher) |
-| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, FSDP2) |
+| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
 | `examples/llm_ptq/README.md` | Step 3: support matrix, CLI flags, accuracy |
 | `modelopt/torch/quantization/config.py` | Step 3: format definitions |
 | `modelopt/torch/export/model_utils.py` | Step 4C: TRT-LLM export type mapping |

.claude/skills/ptq/references/slurm-setup-ptq.md

Lines changed: 7 additions & 0 deletions

@@ -68,3 +68,10 @@ This catches script errors cheaply before using GPU quota on a real run.
See `skills/common/slurm-setup.md` section 2 for the smoke test partition pattern.

Only submit the full calibration job after the smoke test exits cleanly.

---

## 4. PTQ-Specific Notes

- **Gated datasets**: Some calibration datasets (e.g., `nvidia/Nemotron-Post-Training-Dataset-v2`) require HF authentication. Set `HF_TOKEN` in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative.
- **NFS permissions**: Docker + NFS root_squash causes `PermissionError` on output/cache dirs. See `skills/common/slurm-setup.md` section 5 for fixes.
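A tiny pre-flight guard for the gated-dataset note, so the job fails fast before `sbatch` rather than mid-calibration. The function name is illustrative and assumes bash:

```shell
# require_env NAME: succeed only if the named variable is set and non-empty.
# Typical use before submitting: require_env HF_TOKEN || exit 1
require_env() {
    if [ -z "${!1-}" ]; then
        echo "missing required env var: $1 (or switch to --dataset cnn_dailymail)" >&2
        return 1
    fi
}
```

Because the check runs on the login node, it costs nothing from the GPU quota.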

.claude/skills/ptq/references/unsupported-models.md

Lines changed: 23 additions & 19 deletions

@@ -49,14 +49,16 @@ print(type(cfg).__name__)
grep -r "class <ArchName>" /tmp/transformers-main/src/transformers/models/
```

-- **Found** → install from that clone: `pip install /tmp/transformers-main --quiet`, then re-run `AutoConfig.from_pretrained()`.
+- **Found** → install with deps: `pip install /tmp/transformers-main`, then re-run `AutoConfig.from_pretrained()`. **Important**: if ModelOpt is already installed, its `[hf]` extras may have pinned an older transformers. Install ModelOpt first, then upgrade transformers **after** (with deps, not `--no-deps`) so compatible `huggingface_hub` and other transitive deps are pulled in.
 - **Not found** → ask the user: *"The checkpoint uses `<ArchName>` which isn't in released or main-branch transformers. Do you have a private fork or custom modeling code?"*

- **No `config.json`** → not a standard HF checkpoint. List the directory for README or `.py` files. If nothing useful, ask the user for the modeling code.

## Step B — Is the checkpoint already FP8-quantized?

-Check `config.json` for `"quantization_config"` or scan weight files for `*_scale_inv*` tensors. If found, the model must be dequantized before re-quantizing. HuggingFace's `WeightConverter` only handles standard `weight` / `weight_scale_inv` names and will silently miss non-standard parameter names (e.g., 3D expert tensors in MoE layers). See **Pattern 5** below.
+Check `config.json` for `"quantization_config"` with `"quant_method": "fp8"`, or scan weight files for `*_scale_inv*` tensors. If the model uses standard `FP8Linear` modules (2D weights with `weight` + `weight_scale_inv`), ModelOpt's `_QuantFP8Linear` plugin handles them automatically — no manual dequantization needed. The plugin keeps weights in FP8 and dequantizes lazily during calibration, which is memory-efficient.
+
+Manual dequantization is only needed for **non-standard parameter names** (e.g., 3D expert tensors in MoE layers) that the plugin doesn't cover. See **Pattern 5** below.
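Step B can be scripted. A minimal sketch against a local checkpoint directory; the helper name is illustrative, and only the `config.json` route is shown, not the weight-file scan:

```python
import json
from pathlib import Path

def is_fp8_checkpoint(ckpt_dir: str) -> bool:
    """True if config.json declares an FP8 quantization_config (Step B)."""
    cfg = json.loads((Path(ckpt_dir) / "config.json").read_text())
    qc = cfg.get("quantization_config") or {}
    return qc.get("quant_method") == "fp8"
```

Calling this before loading the model decides early whether Pattern 5 applies at all.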
## Step C — Determine what custom patches are needed

@@ -69,7 +71,7 @@ Custom patches are required when:
 - **Fused/batched expert weights** — experts stored as a single parameter (e.g., 3D `[num_experts, in, out]`) rather than separate `nn.Linear` modules → Pattern 1 + 3
 - **Self-defined weight parameters** (`nn.Parameter` used directly instead of `nn.Linear`) — common in non-HF or research models → Pattern 1 + 3
 - **VLM structure** (vision encoder that should be excluded) → Pattern 4
-- **FP8 checkpoint** that needs dequantization before re-quantizing → Pattern 5
+- **FP8 checkpoint with non-standard parameter names** (standard `FP8Linear` is handled automatically by the `_QuantFP8Linear` plugin) → Pattern 5

## Step D — Check weight names against ModelOpt's config patterns

@@ -187,7 +189,9 @@ Both methods replace all instances of `original_cls` with `quantized_cls` during

## Pattern 4: VLM Language Model Extraction

-For multimodal models, only quantize the language model backbone:
+**Note**: `hf_ptq.py` already handles VLMs automatically via `extract_and_prepare_language_model_from_vl()`. It detects multimodal models, extracts the language backbone, and disables quantization for vision/projector modules. This works for most VLMs (tested with Mistral3/Devstral, Nemotron VL, Llama VL, etc.) — try `hf_ptq.py` first before writing custom VLM handling.
+
+For custom scripts, or when `hf_ptq.py` doesn't handle the VLM correctly, only quantize the language model backbone:

```python
from modelopt.torch.export.model_utils import get_language_model_from_vl, is_multimodal_model
```

@@ -218,30 +222,32 @@ quant_cfg["quant_cfg"]["*multi_modal_projector*"] = {"enable": False}

**Known VLM export issue**: The export step (`requantize_resmooth_fused_llm_layers` in `unified_export_hf.py`) may try to run a dummy forward pass on the full VLM instead of the language model backbone. This currently only handles Nemotron VLMs. If hit, patch the export to use `is_multimodal_model()` for the VLM check instead of model-specific string matching.

-## Pattern 5: FP8 Checkpoint Dequantization
+## Pattern 5: FP8 Checkpoint Handling

-### Standard nn.Linear weights
+### Standard FP8Linear modules (preferred — no action needed)

-HuggingFace handles these automatically with `dequantize=True`:
+ModelOpt's `_QuantFP8Linear` plugin (`modelopt/torch/quantization/plugins/huggingface.py`) automatically handles HuggingFace `FP8Linear` modules. It:
+
+1. Keeps weights **compact in FP8** in GPU memory during calibration
+2. **Dequantizes lazily** on the fly during calibration forward passes via `weight_dequant()`
+3. Has `unpack_weight()` for full dequantization at export time
+
+This is registered automatically for `transformers.integrations.finegrained_fp8.FP8Linear` and requires **Triton** (used internally for the FP8 dequantization kernels). Just load the model normally — no `FineGrainedFP8Config(dequantize=True)` needed:

```python
-from transformers.utils.quantization_config import FineGrainedFP8Config
-
-model = AutoModel.from_pretrained(
-    model_path,
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-    quantization_config=FineGrainedFP8Config(dequantize=True),
-)
+model = AutoModel.from_pretrained(model_path, device_map="auto", torch_dtype="auto")
+# FP8Linear modules stay in FP8 → _QuantFP8Linear handles dequant during calibration
```

+**Do NOT use `FineGrainedFP8Config(dequantize=True)`** — it expands the entire model to BF16 upfront, wasting ~2x GPU memory. The plugin approach is both more memory-efficient and simpler.

### Non-standard parameter names (e.g., 3D expert weights)

-HF's `WeightConverter` uses source patterns `["weight$", "weight_scale_inv", "activation_scale"]`. Parameters with names like `gate_up_proj`, `down_proj`, `w1`, `w2`, `w3` won't match these patterns and will remain in FP8 after loading. Dequantize them manually:
+The `_QuantFP8Linear` plugin only handles standard 2D `FP8Linear` modules with `weight` + `weight_scale_inv`. Parameters with non-standard names (e.g., `gate_up_proj`, `down_proj`, `w1`/`w2`/`w3` in fused MoE experts) won't be covered. For these, dequantize manually after loading:
242248
```python
def dequantize_fp8_params(model, param_names=("gate_up_proj", "down_proj")):
-    """Dequantize remaining FP8 parameters that HF's WeightConverter missed."""
+    """Dequantize remaining FP8 parameters that the plugin doesn't cover."""
    count = 0
    for name, module in model.named_modules():
        for param_name in param_names:
@@ -252,10 +258,8 @@ def dequantize_fp8_params(model, param_names=("gate_up_proj", "down_proj")):
            if scale is None:
                param.data = param.data.to(torch.bfloat16)
            elif scale.dim() == 1:
-                # Per-tensor scale
                param.data = param.data.to(torch.bfloat16) * scale.data[:, None, None].to(torch.bfloat16)
            elif scale.dim() == 3:
-                # Per-block scale: reshape, broadcast, multiply
                w = param.data
                s = scale.data
                assert w.shape[-2] % s.shape[-2] == 0 and w.shape[-1] % s.shape[-1] == 0, (
```

.claude/skills/ptq/tests.json

Lines changed: 15 additions & 0 deletions

@@ -57,6 +57,21 @@
        "Runs hf_ptq.py (not a standalone custom script)",
        "Runs smoke test first, then full calibration"
      ]
+    },
+    {
+      "id": 5,
+      "prompt": "Quantize MiniMax-M2.5 to nvfp4",
+      "expected_output": "Agent detects FP8 pre-quantized checkpoint, relies on _QuantFP8Linear plugin for standard FP8Linear modules, dequantizes non-standard MoE expert weights manually, then runs PTQ",
+      "files": [],
+      "expectations": [
+        "Checks README — MiniMax-M2.5 is NOT listed",
+        "Reads unsupported-models.md (4C path)",
+        "Detects FP8 quantization_config in config.json (Step B)",
+        "Identifies _QuantFP8Linear plugin handles standard FP8Linear modules automatically",
+        "Identifies non-standard 3D MoE expert weights that need manual dequantization (Pattern 5)",
+        "Applies manual dequantize_fp8_params for fused expert tensors",
+        "Runs smoke test first, then full calibration"
+      ]
    }
  ]
}

CHANGELOG.rst

Lines changed: 1 addition & 1 deletion

@@ -27,7 +27,7 @@ Changelog

 - [Security] Changed the default of ``weights_only`` to ``True`` in ``torch.load`` for secure checkpoint loading. If you need to load a checkpoint that requires unpickling arbitrary objects, first register the class in ``torch.serialization.add_safe_globals([cls])`` before loading. Added :meth:`safe_save <modelopt.torch.utils.serialization.safe_save>` and :meth:`safe_load <modelopt.torch.utils.serialization.safe_load>` API to save and load checkpoints securely.
 - Bump minimum required PyTorch version to 2.8.
-- [Experimental] Add support for transformers>=5.0. Unified Hugging Face checkpoint export for quantized checkpoints may not work for MoE models with transformers>=5.0 yet.
+- [Experimental] Add support for transformers>=5.0, including generic PTQ and unified HF checkpoint export for fused MoE expert modules (Mixtral, Qwen2-MoE, Qwen3-MoE, Qwen3.5-MoE, DeepSeek-V3, Jamba, OLMoE, etc.).
 - Improve ``megatron_preprocess_data``: add ``--reasoning_content`` support for Nemotron v3 datasets, eliminate intermediate JSONL for HuggingFace datasets, return output file prefixes from the Python API, add gzip input support (``.jsonl.gz``), add ``--strip_newlines`` flag for plain-text pretraining data, add ``--hf_streaming`` for very large datasets (only consumed rows downloaded), and auto-shuffle when ``--hf_max_samples_per_split`` is set to avoid biased sampling.

0.43 (2026-04-09)

CLAUDE.md

Lines changed: 11 additions & 1 deletion

@@ -1,5 +1,7 @@
# CLAUDE.md

+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
NVIDIA Model Optimizer (ModelOpt): open-source library for model optimization techniques including
quantization, pruning, distillation, sparsity, and speculative decoding to accelerate inference.
Primarily Python codebase with optional C++/CUDA extensions supporting PyTorch, ONNX, and Hugging Face/Megatron models.

@@ -11,24 +13,27 @@ Primarily Python codebase with optional C++/CUDA extensions supporting PyTorch,

**CRITICAL (YOU MUST):**

-- NVIDIA Apache 2.0 license header on ALL new Python/C++/CUDA files (see `LICENSE_HEADER`)
+- NVIDIA Apache 2.0 license header on ALL new Python/C++/CUDA files — use the SPDX format from `LICENSE_HEADER` (auto-inserted by pre-commit for most files, but must be added manually for files copied from third-party sources, which are excluded from the hook)
 - `git commit -s -S` (DCO sign-off + cryptographic signing required). Never attribute AI tools in the sign-off line
 - `pre-commit` hooks run on commit — if files are modified by hooks, re-stage and commit again
 - PRs require CODEOWNERS review (auto-assigned based on `.github/CODEOWNERS`)
 - After rebasing, always re-run tests locally before pushing
 - All code must follow the security guidelines in `SECURITY.md` — violations are blocked as pre-merge errors
 - For contribution guidelines, commit conventions, and PR requirements, see `CONTRIBUTING.md`
+- New pip dependencies require license verification — non-permissive licenses need justification and approval from `@NVIDIA/modelopt-setup-codeowners`

## Common Commands

| Task | Command |
|------|---------|
| Install (editable + dev) | `pip install -e ".[dev]"` |
+| Enable pre-commit hooks | `pre-commit install` |
| CPU unit tests | `python -m pytest tests/unit` |
| GPU unit tests | `python -m pytest tests/gpu` |
| Megatron GPU tests | `python -m pytest tests/gpu_megatron` |
| TRT-LLM GPU tests | `python -m pytest tests/gpu_trtllm` |
+| Single test file | `python -m pytest tests/unit/torch/quantization/test_quant_config.py` |
| Pattern match | `pytest tests/unit -k "test_quantize"` |
| Lint + format (all files) | `pre-commit run --all-files` |
| Lint (diff only) | `pre-commit run --from-ref origin/main --to-ref HEAD` |

@@ -69,6 +74,11 @@ A **mode** is the unit of model optimization in ModelOpt. Each algorithm (quantization, pruning,
etc.) is implemented as one or more modes. Modes are recorded in the model's `modelopt_state` so
optimization workflows can be composed, saved, and restored.

+The main entry points are in `modelopt/torch/opt/conversion.py`:
+
+- `apply_mode(model, mode, ...)` — applies an optimization mode to a model
+- `restore(model, ...)` — restores a model to a previously saved optimization state
+- `save(model, ...)` / `modelopt_state(model)` — captures the current optimization state

### Core Abstraction: Recipes

A **recipe** is a declarative YAML specification of an optimization configuration. Recipes decouple optimization specs from code, enabling reuse, sharing, and version control.
Lines changed: 0 additions & 1 deletion

@@ -1,4 +1,3 @@
datasets>=2.14.5
-onnx
torch==2.9.0
transformers==4.57.3

examples/windows/onnx_ptq/whisper/requirements.txt

Lines changed: 0 additions & 1 deletion

@@ -4,7 +4,6 @@ datasets==2.19.0
evaluate
jiwer
librosa
-onnx
onnxruntime-gpu==1.23.2
optimum==1.23.3
soundfile

modelopt/onnx/__init__.py

Lines changed: 0 additions & 12 deletions

@@ -18,18 +18,6 @@
import sys
import warnings

-import onnx.helper
-
-if not hasattr(onnx.helper, "float32_to_bfloat16"):
-    import ml_dtypes
-    import numpy as np
-
-    def _float32_to_bfloat16(value):
-        arr = np.array(value, dtype=np.float32)
-        return int(arr.astype(ml_dtypes.bfloat16).view(np.uint16))
-
-    onnx.helper.float32_to_bfloat16 = _float32_to_bfloat16
-
MIN_PYTHON_VERSION = (3, 10)

try:
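For reference, the removed shim's conversion (float32 to the bfloat16 bit pattern) can be reproduced without `ml_dtypes` via round-to-nearest-even truncation. A sketch, not the library's implementation; NaN payloads get no special handling here:

```python
import numpy as np

def float32_to_bfloat16_bits(value: float) -> int:
    """Return the uint16 bfloat16 bit pattern for a float32 value."""
    bits = int(np.array(value, dtype=np.float32).view(np.uint32))
    # Round to nearest even: add 0x7FFF plus the LSB of the kept upper half.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF
```

bfloat16 shares float32's exponent, so the conversion is just a rounded drop of the low 16 mantissa bits, which is why the shim could be a few lines.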
