**File: `.claude/skills/common/slurm-setup.md`** (+66 lines)

@@ -74,6 +74,47 @@ include a multi-node-capable partition as the last fallback.
Only submit the full job after the smoke test exits cleanly.

### Docker (non-pyxis) variant

Some clusters don't have pyxis/enroot installed and instead use plain `docker run` on compute nodes. In this case, replace the `srun --container-image` pattern with `docker run` inside the job script:

```bash
#!/bin/bash
#SBATCH --job-name=<name>
#SBATCH --account=<account>
#SBATCH --partition=<partition>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<N>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=<log_dir>/<name>_%j.log

docker run --rm \
    --gpus all \
    --shm-size=32g \
    --ulimit memlock=-1 \
    --network host \
    -v <data_root>:<data_root> \
    -e CALIB_SIZE="${CALIB_SIZE:-512}" \
    <container_image> \
    bash <path/to/run_script.sh>
```

**Key differences from pyxis**:

- No `srun` wrapper needed — SLURM just allocates the node, Docker runs the container
- Mount paths with `-v` instead of `--container-mounts`
- Pass env vars with `-e` instead of relying on SLURM env propagation
- Use the two-script pattern: SLURM wrapper (sbatch directives + `docker run`) and inner runner (the actual work). The inner runner should unset SLURM env vars and set `HF_HOME`/`HF_DATASETS_OFFLINE` as needed
- **NFS root_squash**: see section 5

**How to detect which pattern to use**: Ask the user how they normally run containers, or check:

```bash
which enroot 2>/dev/null && echo "pyxis/enroot available"
which docker 2>/dev/null && echo "docker available"
```

---
## 3. Monitor Until Completion
@@ -126,3 +167,28 @@ srun \
```

Adjust `--nodes`, `--gpus-per-node`, and the distributed launch command per your workload.

---

## 5. NFS root_squash and Docker Permissions

Docker containers typically run as root, but NFS filesystems with `root_squash` (the default) map root to `nobody`, blocking writes to directories owned by the user. This causes `PermissionError` when creating cache lock files, writing output, or saving logs.

This affects both pyxis/enroot (`srun --container-image`) and plain `docker run` workflows.

**Preferred fix** — run Docker with the host user's UID/GID to match NFS ownership:

```bash
docker run --user $(id -u):$(id -g) ...
```

> Note: `--user` may cause issues if the container expects root for package installation. In that case, fall back to the chmod approach below.

**Fallback fix** — open permissions before submitting the job:

```bash
chmod -R g+rwX /path/to/workspace/
chmod -R g+rwX /path/to/.hf_cache/
```

Scope `chmod` to only the directories the job needs — avoid world-writable paths on shared clusters.

- Wildcard `*gate*` matches too broadly — use `*mlp.gate*` or `*router*`
- VLMs: `hf_ptq.py` auto-extracts the language model via `extract_and_prepare_language_model_from_vl()` — no manual VLM handling needed in most cases
- FP8 checkpoints: prefer `_QuantFP8Linear` (lazy dequant) over `FineGrainedFP8Config(dequantize=True)`, which wastes ~2x memory. See `references/unsupported-models.md` for details
- Custom quantizer names must end with `_input_quantizer` or `_weight_quantizer`
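The wildcard pitfall in the first bullet can be illustrated with Python's `fnmatch` (an illustrative sketch only; this assumes the quantizer disable patterns behave like shell-style globs, and the layer names are made up):

```python
from fnmatch import fnmatch

# Hypothetical module names from a gated-MLP / MoE model:
layers = [
    "model.layers.0.mlp.gate_proj",    # gated-MLP projection (should stay quantized)
    "model.layers.0.mlp.router.gate",  # MoE router gate (often excluded)
]

# "*gate*" catches BOTH modules — too broad:
broad = [n for n in layers if fnmatch(n, "*gate*")]

# "*mlp.gate*" matches only the MLP projection, not the router:
narrow = [n for n in layers if fnmatch(n, "*mlp.gate*")]
```

Here `broad` contains both names while `narrow` keeps only `mlp.gate_proj`, which is why the tighter pattern is recommended.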
## Common Pitfalls

- **Transformers version**: Newer models (e.g., Devstral/ministral3) may require a transformers version not yet in the container. Check `config.json` for `transformers_version` and upgrade if needed. Install ModelOpt first, then upgrade transformers **with** deps (not `--no-deps`) to pull compatible `huggingface_hub`
- **Gated datasets**: Some calibration datasets require HF authentication. Ensure `HF_TOKEN` is set in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative
- **NFS root_squash + Docker**: Docker runs as root, but NFS squashes root to `nobody`. Use `docker run --user $(id -u):$(id -g)`, or `chmod -R a+rwX` on needed directories as a fallback. See `skills/common/slurm-setup.md` section 5
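The version check in the first pitfall can be sketched as a small helper (a minimal sketch; `needs_transformers_upgrade` is a hypothetical name, not a ModelOpt or transformers API):

```python
import json
from pathlib import Path


def parse_version(v: str) -> tuple:
    """'4.46.0.dev0' -> (4, 46, 0); stops at the first non-numeric part."""
    parts = []
    for p in v.split("."):
        if p.isdigit():
            parts.append(int(p))
        else:
            break
    return tuple(parts)


def needs_transformers_upgrade(checkpoint_dir: str, installed: str) -> bool:
    """Compare config.json's transformers_version against the installed one."""
    config = json.loads((Path(checkpoint_dir) / "config.json").read_text())
    required = config.get("transformers_version")
    if required is None:
        return False  # checkpoint doesn't declare a minimum version
    return parse_version(installed) < parse_version(required)
```

If this returns `True`, upgrade transformers in the container (with deps, per the bullet above) before running the job.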
## References
| Reference | When to read |
|---|---|
| `references/unsupported-models.md` | Step 4C only (unlisted model) |
| `skills/common/remote-execution.md` | Step 4A/4C only, if target is remote |
| `skills/common/slurm-setup.md` | Step 4A/4C only, if using SLURM manually (not launcher) |

**File: `.claude/skills/ptq/references/slurm-setup-ptq.md`** (+7 lines)

@@ -68,3 +68,10 @@ This catches script errors cheaply before using GPU quota on a real run.
See `skills/common/slurm-setup.md` section 2 for the smoke test partition pattern.

Only submit the full calibration job after the smoke test exits cleanly.

---

## 4. PTQ-Specific Notes

- **Gated datasets**: Some calibration datasets (e.g., `nvidia/Nemotron-Post-Training-Dataset-v2`) require HF authentication. Set `HF_TOKEN` in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative.
- **NFS permissions**: Docker + NFS root_squash causes `PermissionError` on output/cache dirs. See `skills/common/slurm-setup.md` section 5 for fixes.
- **Found** → install with deps: `pip install /tmp/transformers-main`, then re-run `AutoConfig.from_pretrained()`. **Important**: if ModelOpt is already installed, its `[hf]` extras may have pinned an older transformers. Install ModelOpt first, then upgrade transformers **after** (with deps, not `--no-deps`) so compatible `huggingface_hub` and other transitive deps are pulled in.
- **Not found** → ask the user: *"The checkpoint uses `<ArchName>` which isn't in released or main-branch transformers. Do you have a private fork or custom modeling code?"*
- **No `config.json`** → not a standard HF checkpoint. List the directory for README or `.py` files. If nothing useful, ask the user for the modeling code.
## Step B — Is the checkpoint already FP8-quantized?
Check `config.json` for `"quantization_config"` with `"quant_method": "fp8"`, or scan weight files for `*_scale_inv*` tensors. If the model uses standard `FP8Linear` modules (2D weights with `weight` + `weight_scale_inv`), ModelOpt's `_QuantFP8Linear` plugin handles them automatically — no manual dequantization needed. The plugin keeps weights in FP8 and dequantizes lazily during calibration, which is memory-efficient.
Manual dequantization is only needed for **non-standard parameter names** (e.g., 3D expert tensors in MoE layers) that the plugin doesn't cover. See **Pattern 5** below.
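The `config.json` check above can be sketched as a small helper (a minimal sketch; `is_fp8_checkpoint` is a hypothetical name, not a ModelOpt API, and it only covers the config-based signal, not the weight-file scan):

```python
import json
from pathlib import Path


def is_fp8_checkpoint(checkpoint_dir: str) -> bool:
    """Heuristic: does config.json declare an FP8 quantization_config?"""
    config_path = Path(checkpoint_dir) / "config.json"
    if not config_path.exists():
        return False  # not a standard HF checkpoint
    config = json.loads(config_path.read_text())
    quant = config.get("quantization_config") or {}
    return quant.get("quant_method") == "fp8"
```

A `False` here does not rule out FP8 — some checkpoints only reveal it via `*_scale_inv*` tensors in the weight files.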
## Step C — Determine what custom patches are needed
@@ -69,7 +71,7 @@ Custom patches are required when:
- **Fused/batched expert weights** — experts stored as a single parameter (e.g., 3D `[num_experts, in, out]`) rather than separate `nn.Linear` modules → Pattern 1 + 3
- **Self-defined weight parameters** (`nn.Parameter` used directly instead of `nn.Linear`) — common in non-HF or research models → Pattern 1 + 3
- **VLM structure** (vision encoder that should be excluded) → Pattern 4
- **FP8 checkpoint with non-standard parameter names** (standard `FP8Linear` is handled automatically by the `_QuantFP8Linear` plugin) → Pattern 5
## Step D — Check weight names against ModelOpt's config patterns
@@ -187,7 +189,9 @@ Both methods replace all instances of `original_cls` with `quantized_cls` during
## Pattern 4: VLM Language Model Extraction
**Note**: `hf_ptq.py` already handles VLMs automatically via `extract_and_prepare_language_model_from_vl()`. It detects multimodal models, extracts the language backbone, and disables quantization for vision/projector modules. This works for most VLMs (tested with Mistral3/Devstral, Nemotron VL, Llama VL, etc.) — try `hf_ptq.py` first before writing custom VLM handling.
For custom scripts or when `hf_ptq.py` doesn't handle the VLM correctly, only quantize the language model backbone:
```python
from modelopt.torch.export.model_utils import get_language_model_from_vl, is_multimodal_model
```

**Known VLM export issue**: The export step (`requantize_resmooth_fused_llm_layers` in `unified_export_hf.py`) may try to run a dummy forward pass on the full VLM instead of the language model backbone. This currently only handles Nemotron VLMs. If hit, patch the export to use `is_multimodal_model()` for the VLM check instead of model-specific string matching.
## Pattern 5: FP8 Checkpoint Handling
### Standard FP8Linear modules (preferred — no action needed)
ModelOpt's `_QuantFP8Linear` plugin handles these automatically:

1. Keeps weights **compact in FP8** in GPU memory during calibration
2. **Dequantizes lazily** on-the-fly during calibration forward passes via `weight_dequant()`
3. Has `unpack_weight()` for full dequantization at export time

This is registered automatically for `transformers.integrations.finegrained_fp8.FP8Linear`. It requires **Triton** to be installed (used internally for FP8 dequantization kernels). Just load the model normally — no `FineGrainedFP8Config(dequantize=True)` needed:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(model_path, device_map="auto", torch_dtype="auto")
# FP8Linear modules stay in FP8 → _QuantFP8Linear handles dequant during calibration
```
**Do NOT use `FineGrainedFP8Config(dequantize=True)`** — it expands the entire model to BF16 upfront, wasting ~2x GPU memory. The plugin approach is both more memory-efficient and simpler.
### Non-standard parameter names (e.g., 3D expert weights)
The `_QuantFP8Linear` plugin only handles standard 2D `FP8Linear` modules with `weight` + `weight_scale_inv`. Parameters with non-standard names (e.g., `gate_up_proj`, `down_proj`, `w1`/`w2`/`w3` in fused MoE experts) won't be covered. For these, dequantize manually after loading:
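The snippet that followed here was truncated in this capture. As an illustrative stand-in, blockwise dequantization amounts to expanding the per-block inverse scales to full resolution and multiplying (a hypothetical helper, shown with NumPy and float placeholders for clarity; real checkpoints hold torch tensors in an FP8 dtype, and the 128x128 block layout is an assumption that must be checked against the checkpoint):

```python
import numpy as np


def dequant_blockwise(weight_fp8: np.ndarray, scale_inv: np.ndarray, block: int = 128) -> np.ndarray:
    """Expand per-block inverse scales and multiply.

    weight_fp8: [out, in] quantized values (plain floats here for illustration)
    scale_inv:  [ceil(out/block), ceil(in/block)] per-block scales
    """
    # Repeat each scale entry over its block, then trim to the weight shape
    rows = np.repeat(scale_inv, block, axis=0)[: weight_fp8.shape[0]]
    full = np.repeat(rows, block, axis=1)[:, : weight_fp8.shape[1]]
    return weight_fp8.astype(np.float32) * full
```

In a real script you would iterate over the non-standard parameter names (e.g., the fused expert tensors), apply this per 2D slice, and replace the FP8 parameter with the dequantized higher-precision tensor.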

**File: `CHANGELOG.rst`** (1 addition, 1 deletion)

@@ -27,7 +27,7 @@ Changelog
- [Security] Changed the default of ``weights_only`` to ``True`` in ``torch.load`` for secure checkpoint loading. If you need to load a checkpoint that requires unpickling arbitrary objects, first register the class in ``torch.serialization.add_safe_globals([cls])`` before loading. Added :meth:`safe_save <modelopt.torch.utils.serialization.safe_save>` and :meth:`safe_load <modelopt.torch.utils.serialization.safe_load>` API to save and load checkpoints securely.
- Bump minimum required PyTorch version to 2.8.
- [Experimental] Add support for transformers>=5.0, including generic PTQ and unified HF checkpoint export for fused MoE expert modules (Mixtral, Qwen2-MoE, Qwen3-MoE, Qwen3.5-MoE, DeepSeek-V3, Jamba, OLMoE, etc.).
- Improve ``megatron_preprocess_data``: add ``--reasoning_content`` support for Nemotron v3 datasets, eliminate intermediate JSONL for HuggingFace datasets, return output file prefixes from the Python API, add gzip input support (``.jsonl.gz``), add ``--strip_newlines`` flag for plain-text pretraining data, add ``--hf_streaming`` for very large datasets (only consumed rows downloaded), and auto-shuffle when ``--hf_max_samples_per_split`` is set to avoid biased sampling.
- NVIDIA Apache 2.0 license header on ALL new Python/C++/CUDA files — use the SPDX format from `LICENSE_HEADER` (auto-inserted by pre-commit for most files, but must be added manually for files copied from third-party sources, which are excluded from the hook)
- `git commit -s -S` (DCO sign-off + cryptographic signing required). Never attribute AI tools in the sign-off line
- `pre-commit` hooks run on commit — if files are modified by hooks, re-stage and commit again
- PRs require CODEOWNERS review (auto-assigned based on `.github/CODEOWNERS`)
- After rebasing, always re-run tests locally before pushing
- All code must follow the security guidelines in `SECURITY.md` — violations are blocked as pre-merge errors
- For contribution guidelines, commit conventions, and PR requirements, see `CONTRIBUTING.md`
- New PIP dependencies require license verification — non-permissive licenses need justification and approval from `@NVIDIA/modelopt-setup-codeowners`

| Single test file | `python -m pytest tests/unit/torch/quantization/test_quant_config.py` |
| Pattern match | `pytest tests/unit -k "test_quantize"` |
| Lint + format (all files) | `pre-commit run --all-files` |
| Lint (diff only) | `pre-commit run --from-ref origin/main --to-ref HEAD` |
@@ -69,6 +74,11 @@ A **mode** is the unit of model optimization in ModelOpt. Each algorithm (quanti
etc.) is implemented as one or more modes. Modes are recorded in the model's `modelopt_state` so optimization workflows can be composed, saved, and restored.
The main entry points are in `modelopt/torch/opt/conversion.py`:
- `apply_mode(model, mode, ...)` — applies an optimization mode to a model
- `restore(model, ...)` — restores a model to a previously saved optimization state
- `save(model, ...)` / `modelopt_state(model)` — captures the current optimization state
### Core Abstraction: Recipes
A **recipe** is a declarative YAML specification of an optimization configuration. Recipes decouple optimization specs from code, enabling reuse, sharing, and version control.