Commit 1f53e89

Merge branch 'main' into shengliangx/early-quant-cfg-validation
2 parents 9182ebd + 3baa2da commit 1f53e89

19 files changed: +774, -186 lines

.claude/skills/common/slurm-setup.md

Lines changed: 66 additions & 0 deletions

@@ -74,6 +74,47 @@ include a multi-node-capable partition as the last fallback.

Only submit the full job after the smoke test exits cleanly.
### Docker (non-pyxis) variant

Some clusters don't have pyxis/enroot installed and instead use plain `docker run` on compute nodes. In this case, replace the `srun --container-image` pattern with `docker run` inside the job script:

```bash
#!/bin/bash
#SBATCH --job-name=<name>
#SBATCH --account=<account>
#SBATCH --partition=<partition>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<N>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=<log_dir>/<name>_%j.log

docker run --rm \
  --gpus all \
  --shm-size=32g \
  --ulimit memlock=-1 \
  --network host \
  -v <data_root>:<data_root> \
  -e CALIB_SIZE="${CALIB_SIZE:-512}" \
  <container_image> \
  bash <path/to/run_script.sh>
```

**Key differences from pyxis**:

- No `srun` wrapper needed — SLURM just allocates the node, Docker runs the container
- Mount paths with `-v` instead of `--container-mounts`
- Pass env vars with `-e` instead of relying on SLURM env propagation
- Use the two-script pattern: a SLURM wrapper (sbatch directives + `docker run`) and an inner runner (the actual work). The inner runner should unset SLURM env vars and set `HF_HOME`/`HF_DATASETS_OFFLINE` as needed
- **NFS root_squash**: see section 5
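The inner-runner half of the two-script pattern can be sketched as follows. This is a minimal, illustrative script; the cache mount path and the final command are assumptions, not part of any documented interface:

```shell
#!/bin/bash
# Inner runner: executes *inside* the container started by the SLURM wrapper.
set -euo pipefail

# Unset inherited SLURM variables so in-container launchers (torchrun,
# accelerate) don't mistake this for a multi-node SLURM launch.
for var in $(env | grep -o '^SLURM_[A-Za-z0-9_]*' || true); do
    unset "$var"
done

# Point HF caches at a mounted path; stay offline if data was pre-downloaded.
export HF_HOME="${HF_HOME:-/data/.hf_cache}"   # illustrative mount point
export HF_DATASETS_OFFLINE=1

echo "remaining SLURM vars: $(env | grep -c '^SLURM_' || true)"
# ... actual work goes here, e.g. the PTQ or calibration command ...
```

The wrapper script stays tiny (sbatch directives plus one `docker run`), which keeps the GPU-allocation concerns separate from the workload itself.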
**How to detect which pattern to use**: Ask the user how they normally run containers, or check:

```bash
which enroot 2>/dev/null && echo "pyxis/enroot available"
which docker 2>/dev/null && echo "docker available"
```
---
## 3. Monitor Until Completion
@@ -126,3 +167,28 @@ srun \
```

Adjust `--nodes`, `--gpus-per-node`, and the distributed launch command per your workload.

---

## 5. NFS root_squash and Docker Permissions

Docker containers typically run as root, but NFS filesystems with `root_squash` (the default) map root to `nobody`, blocking writes to directories owned by the user. This causes `PermissionError` when creating cache lock files, writing output, or saving logs.

This affects both pyxis/enroot (`srun --container-image`) and plain `docker run` workflows.

**Preferred fix** — run Docker with the host user's UID/GID to match NFS ownership:

```bash
docker run --user $(id -u):$(id -g) ...
```

> Note: `--user` may cause issues if the container expects root for package installation. In that case, fall back to the chmod approach below.

**Fallback fix** — open permissions before submitting the job:

```bash
chmod -R g+rwX /path/to/workspace/
chmod -R g+rwX /path/to/.hf_cache/
```

Scope `chmod` to only the directories the job needs — avoid world-writable paths on shared clusters.
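The advice in this section can be turned into a quick pre-flight probe run on a compute node before submitting. A minimal sketch; the function name and messages are illustrative:

```shell
# check_writable DIR: succeeds and prints "writable ..." if DIR accepts
# writes from the current UID/GID (i.e., survives NFS root_squash mapping),
# otherwise prints a remediation hint and fails.
check_writable() {
    local dir="$1"
    local probe="$dir/.write_probe.$$"
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "writable by $(id -u):$(id -g)"
        return 0
    fi
    echo "not writable: chmod -R g+rwX $dir or fix ownership first"
    return 1
}
```

Run it against the workspace and cache directories, e.g. `check_writable /path/to/workspace`, before `sbatch` so permission failures surface in seconds rather than mid-job.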

.claude/skills/ptq/SKILL.md

Lines changed: 9 additions & 3 deletions

@@ -118,10 +118,16 @@ Report the path and size to the user.
 - `mtq.register()` classes **must** define `_setup()` and call it from `__init__`
 - Call `mto.enable_huggingface_checkpointing()` **before** quantization
 - Wildcard `*gate*` matches too broadly — use `*mlp.gate*` or `*router*`
-- VLMs need `AutoModel`, not `AutoModelForCausalLM`
-- FP8 loading: `FineGrainedFP8Config(dequantize=True)`, not a dict
+- VLMs: `hf_ptq.py` auto-extracts the language model via `extract_and_prepare_language_model_from_vl()` — no manual VLM handling needed in most cases
+- FP8 checkpoints: prefer `_QuantFP8Linear` (lazy dequant) over `FineGrainedFP8Config(dequantize=True)`, which wastes ~2x memory. See `references/unsupported-models.md` for details
 - Custom quantizer names must end with `_input_quantizer` or `_weight_quantizer`

+## Common Pitfalls
+
+- **Transformers version**: Newer models (e.g., Devstral/ministral3) may require a transformers version not yet in the container. Check `config.json` for `transformers_version` and upgrade if needed. Install ModelOpt first, then upgrade transformers **with** deps (not `--no-deps`) to pull in a compatible `huggingface_hub`
+- **Gated datasets**: Some calibration datasets require HF authentication. Ensure `HF_TOKEN` is set in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative
+- **NFS root_squash + Docker**: Docker runs as root, but NFS squashes root to `nobody`. Use `docker run --user $(id -u):$(id -g)`, or `chmod -R g+rwX` on needed directories as a fallback. See `skills/common/slurm-setup.md` section 5
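The transformers-version pitfall can be checked mechanically before launching a job. A minimal sketch; the helper name is illustrative, and the naive parse ignores pre-release suffixes:

```python
import json
from pathlib import Path

def needs_transformers_upgrade(ckpt_dir: str, installed: str) -> bool:
    """Return True if config.json pins a newer transformers than `installed`."""
    cfg = json.loads((Path(ckpt_dir) / "config.json").read_text())
    required = cfg.get("transformers_version")
    if required is None:
        return False  # no version pin recorded in the checkpoint
    # Naive numeric compare on the first three dotted components.
    parse = lambda v: tuple(int(p) for p in v.split(".")[:3] if p.isdigit())
    return parse(installed) < parse(required)
```

Running this right after downloading a checkpoint turns a mid-calibration import error into a one-line pre-flight message.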

## References

| Reference | When to read |

@@ -133,7 +139,7 @@ Report the path and size to the user.
 | `references/unsupported-models.md` | Step 4C only (unlisted model) |
 | `skills/common/remote-execution.md` | Step 4A/4C only, if target is remote |
 | `skills/common/slurm-setup.md` | Step 4A/4C only, if using SLURM manually (not launcher) |
-| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, FSDP2) |
+| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
 | `examples/llm_ptq/README.md` | Step 3: support matrix, CLI flags, accuracy |
 | `modelopt/torch/quantization/config.py` | Step 3: format definitions |
 | `modelopt/torch/export/model_utils.py` | Step 4C: TRT-LLM export type mapping |

.claude/skills/ptq/references/slurm-setup-ptq.md

Lines changed: 7 additions & 0 deletions

@@ -68,3 +68,10 @@ This catches script errors cheaply before using GPU quota on a real run.
See `skills/common/slurm-setup.md` section 2 for the smoke test partition pattern.

Only submit the full calibration job after the smoke test exits cleanly.

---

## 4. PTQ-Specific Notes

- **Gated datasets**: Some calibration datasets (e.g., `nvidia/Nemotron-Post-Training-Dataset-v2`) require HF authentication. Set `HF_TOKEN` in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative.
- **NFS permissions**: Docker + NFS root_squash causes `PermissionError` on output/cache dirs. See `skills/common/slurm-setup.md` section 5 for fixes.
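A tiny pre-flight guard for the gated-dataset note, so the job fails fast before `sbatch` rather than mid-calibration. The function name is illustrative and assumes bash:

```shell
# require_env NAME: succeed only if the named variable is set and non-empty.
# Typical use before submitting: require_env HF_TOKEN || exit 1
require_env() {
    if [ -z "${!1-}" ]; then
        echo "missing required env var: $1 (or switch to --dataset cnn_dailymail)" >&2
        return 1
    fi
}
```

Because the check runs on the login node, it costs nothing from the GPU quota.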

.claude/skills/ptq/references/unsupported-models.md

Lines changed: 23 additions & 19 deletions

@@ -49,14 +49,16 @@ print(type(cfg).__name__)
grep -r "class <ArchName>" /tmp/transformers-main/src/transformers/models/
```

-- **Found** → install from that clone: `pip install /tmp/transformers-main --quiet`, then re-run `AutoConfig.from_pretrained()`.
+- **Found** → install with deps: `pip install /tmp/transformers-main`, then re-run `AutoConfig.from_pretrained()`. **Important**: if ModelOpt is already installed, its `[hf]` extras may have pinned an older transformers. Install ModelOpt first, then upgrade transformers **after** (with deps, not `--no-deps`) so compatible `huggingface_hub` and other transitive deps are pulled in.
 - **Not found** → ask the user: *"The checkpoint uses `<ArchName>` which isn't in released or main-branch transformers. Do you have a private fork or custom modeling code?"*

- **No `config.json`** → not a standard HF checkpoint. List the directory for README or `.py` files. If nothing useful, ask the user for the modeling code.

## Step B — Is the checkpoint already FP8-quantized?

-Check `config.json` for `"quantization_config"` or scan weight files for `*_scale_inv*` tensors. If found, the model must be dequantized before re-quantizing. HuggingFace's `WeightConverter` only handles standard `weight` / `weight_scale_inv` names and will silently miss non-standard parameter names (e.g., 3D expert tensors in MoE layers). See **Pattern 5** below.
+Check `config.json` for `"quantization_config"` with `"quant_method": "fp8"`, or scan weight files for `*_scale_inv*` tensors. If the model uses standard `FP8Linear` modules (2D weights with `weight` + `weight_scale_inv`), ModelOpt's `_QuantFP8Linear` plugin handles them automatically — no manual dequantization needed. The plugin keeps weights in FP8 and dequantizes lazily during calibration, which is memory-efficient.
+
+Manual dequantization is only needed for **non-standard parameter names** (e.g., 3D expert tensors in MoE layers) that the plugin doesn't cover. See **Pattern 5** below.
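Step B can be scripted. A minimal sketch against a local checkpoint directory; the helper name is illustrative, and only the `config.json` route is shown, not the weight-file scan:

```python
import json
from pathlib import Path

def is_fp8_checkpoint(ckpt_dir: str) -> bool:
    """True if config.json declares an FP8 quantization_config (Step B)."""
    cfg = json.loads((Path(ckpt_dir) / "config.json").read_text())
    qc = cfg.get("quantization_config") or {}
    return qc.get("quant_method") == "fp8"
```

Calling this before loading the model decides early whether Pattern 5 applies at all.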
## Step C — Determine what custom patches are needed

@@ -69,7 +71,7 @@ Custom patches are required when:
 - **Fused/batched expert weights** — experts stored as a single parameter (e.g., 3D `[num_experts, in, out]`) rather than separate `nn.Linear` modules → Pattern 1 + 3
 - **Self-defined weight parameters** (`nn.Parameter` used directly instead of `nn.Linear`) — common in non-HF or research models → Pattern 1 + 3
 - **VLM structure** (vision encoder that should be excluded) → Pattern 4
-- **FP8 checkpoint** that needs dequantization before re-quantizing → Pattern 5
+- **FP8 checkpoint with non-standard parameter names** (standard `FP8Linear` is handled automatically by the `_QuantFP8Linear` plugin) → Pattern 5

## Step D — Check weight names against ModelOpt's config patterns

@@ -187,7 +189,9 @@ Both methods replace all instances of `original_cls` with `quantized_cls` during

## Pattern 4: VLM Language Model Extraction

-For multimodal models, only quantize the language model backbone:
+**Note**: `hf_ptq.py` already handles VLMs automatically via `extract_and_prepare_language_model_from_vl()`. It detects multimodal models, extracts the language backbone, and disables quantization for vision/projector modules. This works for most VLMs (tested with Mistral3/Devstral, Nemotron VL, Llama VL, etc.) — try `hf_ptq.py` first before writing custom VLM handling.
+
+For custom scripts, or when `hf_ptq.py` doesn't handle the VLM correctly, only quantize the language model backbone:

```python
from modelopt.torch.export.model_utils import get_language_model_from_vl, is_multimodal_model
```

@@ -218,30 +222,32 @@ quant_cfg["quant_cfg"]["*multi_modal_projector*"] = {"enable": False}

**Known VLM export issue**: The export step (`requantize_resmooth_fused_llm_layers` in `unified_export_hf.py`) may try to run a dummy forward pass on the full VLM instead of the language model backbone. This currently only handles Nemotron VLMs. If hit, patch the export to use `is_multimodal_model()` for the VLM check instead of model-specific string matching.

-## Pattern 5: FP8 Checkpoint Dequantization
+## Pattern 5: FP8 Checkpoint Handling

-### Standard nn.Linear weights
+### Standard FP8Linear modules (preferred — no action needed)

-HuggingFace handles these automatically with `dequantize=True`:
+ModelOpt's `_QuantFP8Linear` plugin (`modelopt/torch/quantization/plugins/huggingface.py`) automatically handles HuggingFace `FP8Linear` modules. It:
+
+1. Keeps weights **compact in FP8** in GPU memory during calibration
+2. **Dequantizes lazily** on the fly during calibration forward passes via `weight_dequant()`
+3. Has `unpack_weight()` for full dequantization at export time
+
+This is registered automatically for `transformers.integrations.finegrained_fp8.FP8Linear` and requires **Triton** (used internally for the FP8 dequantization kernels). Just load the model normally — no `FineGrainedFP8Config(dequantize=True)` needed:

```python
-from transformers.utils.quantization_config import FineGrainedFP8Config
-
-model = AutoModel.from_pretrained(
-    model_path,
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-    quantization_config=FineGrainedFP8Config(dequantize=True),
-)
+model = AutoModel.from_pretrained(model_path, device_map="auto", torch_dtype="auto")
+# FP8Linear modules stay in FP8 → _QuantFP8Linear handles dequant during calibration
```

+**Do NOT use `FineGrainedFP8Config(dequantize=True)`** — it expands the entire model to BF16 upfront, wasting ~2x GPU memory. The plugin approach is both more memory-efficient and simpler.

### Non-standard parameter names (e.g., 3D expert weights)

-HF's `WeightConverter` uses source patterns `["weight$", "weight_scale_inv", "activation_scale"]`. Parameters with names like `gate_up_proj`, `down_proj`, `w1`, `w2`, `w3` won't match these patterns and will remain in FP8 after loading. Dequantize them manually:
+The `_QuantFP8Linear` plugin only handles standard 2D `FP8Linear` modules with `weight` + `weight_scale_inv`. Parameters with non-standard names (e.g., `gate_up_proj`, `down_proj`, `w1`/`w2`/`w3` in fused MoE experts) won't be covered. For these, dequantize manually after loading:
242248
```python
def dequantize_fp8_params(model, param_names=("gate_up_proj", "down_proj")):
-    """Dequantize remaining FP8 parameters that HF's WeightConverter missed."""
+    """Dequantize remaining FP8 parameters that the plugin doesn't cover."""
    count = 0
    for name, module in model.named_modules():
        for param_name in param_names:
@@ -252,10 +258,8 @@ def dequantize_fp8_params(model, param_names=("gate_up_proj", "down_proj")):
            if scale is None:
                param.data = param.data.to(torch.bfloat16)
            elif scale.dim() == 1:
-                # Per-tensor scale
                param.data = param.data.to(torch.bfloat16) * scale.data[:, None, None].to(torch.bfloat16)
            elif scale.dim() == 3:
-                # Per-block scale: reshape, broadcast, multiply
                w = param.data
                s = scale.data
                assert w.shape[-2] % s.shape[-2] == 0 and w.shape[-1] % s.shape[-1] == 0, (
```

.claude/skills/ptq/tests.json

Lines changed: 15 additions & 0 deletions

@@ -57,6 +57,21 @@
        "Runs hf_ptq.py (not a standalone custom script)",
        "Runs smoke test first, then full calibration"
      ]
+    },
+    {
+      "id": 5,
+      "prompt": "Quantize MiniMax-M2.5 to nvfp4",
+      "expected_output": "Agent detects FP8 pre-quantized checkpoint, relies on _QuantFP8Linear plugin for standard FP8Linear modules, dequantizes non-standard MoE expert weights manually, then runs PTQ",
+      "files": [],
+      "expectations": [
+        "Checks README — MiniMax-M2.5 is NOT listed",
+        "Reads unsupported-models.md (4C path)",
+        "Detects FP8 quantization_config in config.json (Step B)",
+        "Identifies _QuantFP8Linear plugin handles standard FP8Linear modules automatically",
+        "Identifies non-standard 3D MoE expert weights that need manual dequantization (Pattern 5)",
+        "Applies manual dequantize_fp8_params for fused expert tensors",
+        "Runs smoke test first, then full calibration"
+      ]
    }
  ]
}

CHANGELOG.rst

Lines changed: 1 addition & 1 deletion

@@ -27,7 +27,7 @@ Changelog

 - [Security] Changed the default of ``weights_only`` to ``True`` in ``torch.load`` for secure checkpoint loading. If you need to load a checkpoint that requires unpickling arbitrary objects, first register the class in ``torch.serialization.add_safe_globals([cls])`` before loading. Added :meth:`safe_save <modelopt.torch.utils.serialization.safe_save>` and :meth:`safe_load <modelopt.torch.utils.serialization.safe_load>` API to save and load checkpoints securely.
 - Bump minimum required PyTorch version to 2.8.
-- [Experimental] Add support for transformers>=5.0. Unified Hugging Face checkpoint export for quantized checkpoints may not work for MoE models with transformers>=5.0 yet.
+- [Experimental] Add support for transformers>=5.0, including generic PTQ and unified HF checkpoint export for fused MoE expert modules (Mixtral, Qwen2-MoE, Qwen3-MoE, Qwen3.5-MoE, DeepSeek-V3, Jamba, OLMoE, etc.).
 - Improve ``megatron_preprocess_data``: add ``--reasoning_content`` support for Nemotron v3 datasets, eliminate intermediate JSONL for HuggingFace datasets, return output file prefixes from the Python API, add gzip input support (``.jsonl.gz``), add ``--strip_newlines`` flag for plain-text pretraining data, add ``--hf_streaming`` for very large datasets (only consumed rows downloaded), and auto-shuffle when ``--hf_max_samples_per_split`` is set to avoid biased sampling.

0.43 (2026-04-09)

CLAUDE.md

Lines changed: 11 additions & 1 deletion

@@ -1,5 +1,7 @@
# CLAUDE.md

+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
NVIDIA Model Optimizer (ModelOpt): open-source library for model optimization techniques including
quantization, pruning, distillation, sparsity, and speculative decoding to accelerate inference.
Primarily Python codebase with optional C++/CUDA extensions supporting PyTorch, ONNX, and Hugging Face/Megatron models.

@@ -11,24 +13,27 @@ Primarily Python codebase with optional C++/CUDA extensions supporting PyTorch,

**CRITICAL (YOU MUST):**

-- NVIDIA Apache 2.0 license header on ALL new Python/C++/CUDA files (see `LICENSE_HEADER`)
+- NVIDIA Apache 2.0 license header on ALL new Python/C++/CUDA files — use the SPDX format from `LICENSE_HEADER` (auto-inserted by pre-commit for most files, but must be added manually for files copied from third-party sources, which are excluded from the hook)
 - `git commit -s -S` (DCO sign-off + cryptographic signing required). Never attribute AI tools in the sign-off line
 - `pre-commit` hooks run on commit — if files are modified by hooks, re-stage and commit again
 - PRs require CODEOWNERS review (auto-assigned based on `.github/CODEOWNERS`)
 - After rebasing, always re-run tests locally before pushing
 - All code must follow the security guidelines in `SECURITY.md` — violations are blocked as pre-merge errors
 - For contribution guidelines, commit conventions, and PR requirements, see `CONTRIBUTING.md`
+- New pip dependencies require license verification — non-permissive licenses need justification and approval from `@NVIDIA/modelopt-setup-codeowners`

## Common Commands

| Task | Command |
|------|---------|
| Install (editable + dev) | `pip install -e ".[dev]"` |
+| Enable pre-commit hooks | `pre-commit install` |
| CPU unit tests | `python -m pytest tests/unit` |
| GPU unit tests | `python -m pytest tests/gpu` |
| Megatron GPU tests | `python -m pytest tests/gpu_megatron` |
| TRT-LLM GPU tests | `python -m pytest tests/gpu_trtllm` |
+| Single test file | `python -m pytest tests/unit/torch/quantization/test_quant_config.py` |
| Pattern match | `pytest tests/unit -k "test_quantize"` |
| Lint + format (all files) | `pre-commit run --all-files` |
| Lint (diff only) | `pre-commit run --from-ref origin/main --to-ref HEAD` |

@@ -69,6 +74,11 @@ A **mode** is the unit of model optimization in ModelOpt. Each algorithm (quantization, pruning,
etc.) is implemented as one or more modes. Modes are recorded in the model's `modelopt_state` so
optimization workflows can be composed, saved, and restored.

+The main entry points are in `modelopt/torch/opt/conversion.py`:
+
+- `apply_mode(model, mode, ...)` — applies an optimization mode to a model
+- `restore(model, ...)` — restores a model to a previously saved optimization state
+- `save(model, ...)` / `modelopt_state(model)` — captures the current optimization state

### Core Abstraction: Recipes

A **recipe** is a declarative YAML specification of an optimization configuration. Recipes decouple optimization specs from code, enabling reuse, sharing, and version control.
Lines changed: 0 additions & 1 deletion

@@ -1,4 +1,3 @@
datasets>=2.14.5
-onnx
torch==2.9.0
transformers==4.57.3

examples/windows/onnx_ptq/whisper/requirements.txt

Lines changed: 0 additions & 1 deletion

@@ -4,7 +4,6 @@ datasets==2.19.0
evaluate
jiwer
librosa
-onnx
onnxruntime-gpu==1.23.2
optimum==1.23.3
soundfile

modelopt/onnx/__init__.py

Lines changed: 0 additions & 12 deletions

@@ -18,18 +18,6 @@
import sys
import warnings

-import onnx.helper
-
-if not hasattr(onnx.helper, "float32_to_bfloat16"):
-    import ml_dtypes
-    import numpy as np
-
-    def _float32_to_bfloat16(value):
-        arr = np.array(value, dtype=np.float32)
-        return int(arr.astype(ml_dtypes.bfloat16).view(np.uint16))
-
-    onnx.helper.float32_to_bfloat16 = _float32_to_bfloat16
-
MIN_PYTHON_VERSION = (3, 10)

try:
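For reference, the removed shim's conversion (float32 to the bfloat16 bit pattern) can be reproduced without `ml_dtypes` via round-to-nearest-even truncation. A sketch, not the library's implementation; NaN payloads get no special handling here:

```python
import numpy as np

def float32_to_bfloat16_bits(value: float) -> int:
    """Return the uint16 bfloat16 bit pattern for a float32 value."""
    bits = int(np.array(value, dtype=np.float32).view(np.uint32))
    # Round to nearest even: add 0x7FFF plus the LSB of the kept upper half.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF
```

bfloat16 shares float32's exponent, so the conversion is just a rounded drop of the low 16 mantissa bits, which is why the shim could be a few lines.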
