Puzzletron README initial setup fails with number of issues #1637

@danielkorzekwa

Using: https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/puzzletron/README.md

Following the README for an initial Puzzletron setup, up to and including this sanity check, fails with a number of issues: python -m pytest tests/gpu/torch/puzzletron/test_puzzletron.py -k "Qwen3-8B"

  1. Why does the NeMo 26.02 container have nvidia-modelopt 0.43.0rc1 installed rather than 0.44?

To reproduce:

enroot import --output ./docker/nemo_26_02.sqsh docker://nvcr.io/nvidia/nemo:26.02

export EXPERIMENT_DIR=.../dkorzekwa/experiments/6_5_qwen_35_moments_lab

submit_job (srun wrapper) --partition interactive --time 4 --image  $EXPERIMENT_DIR/docker/nemo_26_02.sqsh --mounts $EXPERIMENT_DIR:/workspace --interactive --gpu 8

python -m pip list | grep modelopt

After calling python -m pip install -e ".[hf,puzzletron,dev-test]", the same command prints:

nvidia-modelopt                             0.45.0.dev164+g115cae258
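The same version check can be done from inside Python, which is handy when the question is which copy of modelopt a given interpreter actually sees. This is a minimal sketch using only the standard library; the helper name installed_version is made up for illustration, and "nvidia-modelopt" is the distribution name as shown by pip list above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution name taken from the `pip list` output in this issue.
    print("nvidia-modelopt:", installed_version("nvidia-modelopt"))
```

Running it before and after the editable install should show the version switching from the container's preinstalled one to the one built from the source checkout.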
  1. “...Once inside the container with the repo available, install dependencies from the repo root: …” - it is unclear what “repo root” refers to. I assume it is the ModelOpt source repo; can we clarify this in the README?

  2. Why is it required to install modelopt from source, given that it is already installed in the NeMo container? Is a similar approach needed for the other compression algorithms in modelopt?

python -m pip install -e ".[hf,puzzletron,dev-test]"
  1. python -m pytest tests/gpu/torch/puzzletron/test_puzzletron.py -k "Qwen3-8B" fails; adding -o addopts="" makes it work.

  2. Why are both of the following installs needed?

python -m pip install -e ".[hf,puzzletron,dev-test]"
python -m pip install -r examples/puzzletron/requirements.txt

Can we simplify this into a single step?
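Until the README is simplified, the steps discussed so far (the two installs, then the sanity test with the addopts override) can be wrapped in one small script. This is only a sketch: the commands are copied from this issue, it assumes the working directory is the ModelOpt source checkout, and the names SETUP_STEPS and run_steps are made up for illustration. In argument-list form, the shell's -o addopts="" becomes the two arguments "-o" and "addopts=".

```python
import shlex
import subprocess
import sys
from typing import List

# Commands copied from this issue; run them from the ModelOpt source checkout.
SETUP_STEPS: List[List[str]] = [
    [sys.executable, "-m", "pip", "install", "-e", ".[hf,puzzletron,dev-test]"],
    [sys.executable, "-m", "pip", "install", "-r", "examples/puzzletron/requirements.txt"],
    [
        sys.executable, "-m", "pytest",
        "tests/gpu/torch/puzzletron/test_puzzletron.py",
        "-o", "addopts=",  # overrides the addopts ini option with an empty value
        "-k", "Qwen3-8B",
    ],
]


def run_steps(steps: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command and, unless dry_run is set, execute it in order."""
    rendered = []
    for cmd in steps:
        line = shlex.join(cmd)
        rendered.append(line)
        print("$", line)
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(cmd, check=True)
    return rendered


if __name__ == "__main__":
    run_steps(SETUP_STEPS, dry_run=True)
```

With dry_run=True the script only prints the commands, which makes it easy to review them before running anything inside the container.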

  1. python -m pip install -e ".[hf,puzzletron,dev-test]" reports dependency conflicts with packages already present in the container:
Uninstalling nvidia-modelopt-0.43.0rc1:
      Successfully uninstalled nvidia-modelopt-0.43.0rc1
  Attempting uninstall: peft
    Found existing installation: peft 0.13.2
    Uninstalling peft-0.13.2:
      Successfully uninstalled peft-0.13.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
nemo-export-deploy 0.4.0rc0 requires peft<0.14.0, but you have peft 0.19.1 which is incompatible.
tensorrt-llm 1.1.0 requires fastapi<=0.121.3,>=0.120.1, but you have fastapi 0.135.1 which is incompatible.
tensorrt-llm 1.1.0 requires nvidia-cutlass-dsl==4.2.1; python_version >= "3.10", but you have nvidia-cutlass-dsl 4.4.2 which is incompatible.
tensorrt-llm 1.1.0 requires setuptools<80, but you have setuptools 81.0.0 which is incompatible.
tensorrt-llm 1.1.0 requires transformers==4.56.0, but you have transformers 4.57.6 which is incompatible.
tensorrt-llm 1.1.0 requires wheel<=0.45.1, but you have wheel 0.46.3 which is incompatible.
Successfully installed deepspeed-0.19.1 dependency-groups-1.3.1 fire-0.7.1 hjson-3.1.0 humanize-4.15.0 lru-dict-1.4.1 nox-2026.4.10 nvidia-modelopt-0.45.0.dev164+g115cae258 peft-0.19.1 pytest-cov-7.1.0 pytest-instafail-0.5.0 termcolor-3.3.0 torch-geometric-2.7.0 wonderwords-3.0.1
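To make the conflicts above easier to triage, the resolver's "requires ..., but you have ..." lines can be turned into structured records. The sketch below assumes the line format seen in this log; the names Conflict and parse_conflicts are made up for illustration and are not part of pip.

```python
import re
from typing import List, NamedTuple


class Conflict(NamedTuple):
    package: str            # package whose requirement is violated
    package_version: str
    requirement: str        # requirement string as printed by pip
    installed_name: str     # package that is actually installed
    installed_version: str


# Matches lines such as:
# "tensorrt-llm 1.1.0 requires setuptools<80, but you have setuptools 81.0.0 which is incompatible."
_CONFLICT_RE = re.compile(
    r"^(?P<pkg>\S+) (?P<ver>\S+) requires (?P<req>.+?), "
    r"but you have (?P<name>\S+) (?P<have>\S+) which is incompatible\.$"
)


def parse_conflicts(log: str) -> List[Conflict]:
    """Extract dependency conflicts from pip install output."""
    conflicts = []
    for line in log.splitlines():
        match = _CONFLICT_RE.match(line.strip())
        if match:
            conflicts.append(Conflict(*match.group("pkg", "ver", "req", "name", "have")))
    return conflicts


if __name__ == "__main__":
    sample = (
        "nemo-export-deploy 0.4.0rc0 requires peft<0.14.0, "
        "but you have peft 0.19.1 which is incompatible.\n"
        "tensorrt-llm 1.1.0 requires transformers==4.56.0, "
        "but you have transformers 4.57.6 which is incompatible.\n"
    )
    for conflict in parse_conflicts(sample):
        print(conflict)
```

Grouping the records by package shows at a glance that, in this log, all but one of the conflicts come from tensorrt-llm 1.1.0.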
  1. python -m pytest tests/gpu/torch/puzzletron/test_puzzletron.py -o addopts="" -k "Qwen3-8B" fails with:
 File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/Model-Optimizer/tests/_test_utils/torch/distributed/utils.py", line 53, in init_process
    job(rank, size)
  File "/workspace/Model-Optimizer/tests/gpu/torch/puzzletron/test_puzzletron.py", line 202, in _test_puzzletron_multiprocess_job
    pytest.fail(
  File "/opt/venv/lib/python3.12/site-packages/_pytest/outcomes.py", line 163, in __call__
    raise Failed(msg=reason, pytrace=pytrace)
Failed: 2 assertion(s) failed for Qwen/Qwen3-8B:
  - Teacher memory mismatch for Qwen/Qwen3-8B: expected 395.63, got 1582.13720703125
  - Teacher num_params mismatch for Qwen/Qwen3-8B: expected 6096640, got 24189184
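One detail that may help with triage: both observed values are close to four times the expected ones. The sketch below only does that arithmetic on the two failure lines above and does not attempt to diagnose the cause; the helper name mismatch_ratios is made up for illustration, and the regex assumes the message format shown in this traceback.

```python
import re
from typing import Dict

# Matches lines such as:
# "- Teacher memory mismatch for Qwen/Qwen3-8B: expected 395.63, got 1582.13720703125"
_MISMATCH_RE = re.compile(
    r"- (?P<what>.+?) mismatch for (?P<model>\S+): "
    r"expected (?P<expected>[\d.]+), got (?P<got>[\d.]+)"
)


def mismatch_ratios(report: str) -> Dict[str, float]:
    """Map each mismatched quantity to the ratio observed / expected."""
    ratios = {}
    for line in report.splitlines():
        match = _MISMATCH_RE.search(line)
        if match:
            ratios[match["what"]] = float(match["got"]) / float(match["expected"])
    return ratios


if __name__ == "__main__":
    report = """
      - Teacher memory mismatch for Qwen/Qwen3-8B: expected 395.63, got 1582.13720703125
      - Teacher num_params mismatch for Qwen/Qwen3-8B: expected 6096640, got 24189184
    """
    for name, ratio in mismatch_ratios(report).items():
        print(f"{name}: observed / expected = {ratio:.3f}")
```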
  1. To use Puzzletron on a Slurm-based cluster, I had to work out which enroot command to use to download the image and then learn how to use Slurm. Is there a wiki or guide in modelopt that shows how to run modelopt on different types of infrastructure, e.g. in my case an on-prem Slurm cluster?

Labels: bug, documentation, triaged