Add V-JEPA 2 (Meta FAIR) distributed training test case by paragao · Pull Request #1035 · awslabs/awsome-distributed-training

paragao · 2026-03-23T09:36:58Z

Summary

Add V-JEPA 2 (Meta FAIR) ViT-g/16 1B-parameter self-supervised video model as a new PyTorch distributed training test case
Includes Slurm (Pyxis/Enroot) and Kubernetes (PyTorchJob) deployment manifests
Benchmarked on 8x p5en.48xlarge (64x NVIDIA H200 GPUs)

What is V-JEPA 2?

V-JEPA 2 is Meta FAIR's self-supervised video model that learns visual representations by predicting masked video patches. It achieves state-of-the-art results on motion understanding and human action anticipation benchmarks. The ViT-g/16 variant has 1.03B encoder parameters.

Files Added

3.test_cases/pytorch/vjepa2/
├── vjepa2.Dockerfile                      # NVIDIA PyTorch 25.03 base (CUDA 12.8, Python 3.12)
├── README.md                              # Full walkthrough with benchmark results
├── slurm/
│   ├── benchmark_training.sbatch          # 200-iter benchmark (8 nodes)
│   ├── launch_training.sbatch             # Full 800-epoch pre-training
│   └── download_dataset.sbatch            # SSv2 dataset preparation
├── kubernetes/
│   └── vjepa2-benchmark.yaml              # PyTorchJob for EKS clusters
├── configs/
│   ├── benchmark-vitg-8nodes.yaml         # Quick benchmark config
│   └── pretrain-vitg-256px-16f.yaml       # Full pre-training config
└── scripts/
    ├── run_train.py                       # Thin srun-compatible launcher
    ├── generate_synthetic_dataset.py      # Synthetic video generator
    ├── prepare_ssv2.py                    # SSv2 CSV preparation
    ├── parse_benchmark.py                 # Log parser for throughput/MFU
    └── test_decord.py                     # Verify decord video loading

Key Technical Details

Launch pattern: V-JEPA 2 uses srun directly (not srun + torchrun). The run_train.py launcher calls app.vjepa.train.main() directly, which reads SLURM_LOCALID/SLURM_NTASKS/SLURM_PROCID for distributed setup. This avoids a bug in app/main.py where its subprocess launcher passes world_size=1 regardless of SLURM configuration.
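A minimal sketch of the environment-based setup described above, assuming each task is started by srun so that Slurm exports the three variables per task. The `DistEnv` type and `dist_env_from_slurm` helper are hypothetical names used for illustration; they are not taken from the vjepa2 repository.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    """Hypothetical container for the values a rank needs to join the job."""
    rank: int
    world_size: int
    local_rank: int


def dist_env_from_slurm(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the per-task variables that srun exports for every launched task."""
    try:
        return DistEnv(
            rank=int(env["SLURM_PROCID"]),        # global rank of this task
            world_size=int(env["SLURM_NTASKS"]),  # total number of tasks in the job step
            local_rank=int(env["SLURM_LOCALID"]), # task index on this node
        )
    except KeyError as missing:
        raise RuntimeError(f"{missing} is not set; was this started via srun?") from missing


# Illustrative values for the second GPU on the second node of an 8x8 job.
example = {"SLURM_PROCID": "9", "SLURM_NTASKS": "64", "SLURM_LOCALID": "1"}
print(dist_env_from_slurm(example))
```

Because every rank derives its identity from its own environment, no intermediate launcher has to pass a world size down, which is the failure mode the description attributes to app/main.py.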

Dataset: Supports both Something-Something v2 (SSv2) real data and synthetic generated videos for benchmarking.

Benchmark Results (8x p5en.48xlarge, 64x H200)

Metric	Value
Global batch size	1,536
Precision	BF16
Peak GPU memory	~32.9 GB / 143 GB

Testing

Validated on ParallelCluster with 8x p5en.48xlarge nodes running Slurm + Pyxis/Enroot with EFA networking. Job ran 200 iterations to completion with all 64 ranks correctly initialized via NCCL over EFA.

Add V-JEPA 2 (Meta FAIR) ViT-g/16 1B-param self-supervised video model as a new PyTorch test case with Slurm and Kubernetes support. Includes:
- Dockerfile based on nvcr.io/nvidia/pytorch:25.03-py3 (CUDA 12.8 + Python 3.12)
- Slurm sbatch scripts for benchmark (200 iters) and full pre-training (800 epochs)
- Kubernetes PyTorchJob manifest for EKS clusters
- Thin srun-compatible launcher (run_train.py) that calls app.vjepa.train.main() directly, avoiding the subprocess world_size=1 bug in app/main.py
- Synthetic dataset generator for benchmarking without SSv2 download
- SSv2 dataset preparation scripts and decord verification
- YAML configs for ViT-g/16 with DDP, BF16, and activation checkpointing

…ining
Add V-JEPA 2.1 (Meta FAIR) ViT-g/16 1B-param benchmark alongside the existing V-JEPA 2 test case. V-JEPA 2.1 introduces Dense Predictive Loss, Deep Self-Supervision (4 intermediate layers), doubled predictor depth (24 vs 12), and image+video co-training with 50/50 rank split. Includes:
- Dockerfile and Enroot container setup (shared base with V-JEPA 2)
- Slurm sbatch scripts with /workspace code overlay for latest vjepa2 repo
- Kubernetes PyTorchJob manifest for EKS clusters
- Synthetic image generator for co-training benchmarks
- run_train.py launcher using app.scaffold.main() for dynamic dispatch
- YAML configs with img_data, img_mask, and rank_ratio settings
Key discovery: the container must have the latest vjepa2 repo code (post March 2026) for app/vjepa_2_1/ to be available. The sbatch scripts mount updated code at /workspace to overlay the container's stale PYTHONPATH.

KeitaW

Review Batch 1/3 — Structure & Repository Hygiene

Thanks for this thorough contribution, Paulo! The utility scripts and READMEs are of excellent quality. I have some structural and reproducibility findings below.

Significant code duplication between `vjepa2/` and `vjepa2.1/`

These two directories share a large amount of identical code:

scripts/generate_synthetic_dataset.py — identical (same git blob 74f922445)
scripts/parse_benchmark.py — identical (same git blob 957b9efdf)
scripts/prepare_ssv2.py — identical (same git blob 633288d17)
scripts/test_decord.py — identical (same git blob 4881d1647)
scripts/run_train.py — nearly identical (V-JEPA 2.1 adds 4 lines)
Dockerfiles — nearly identical structure
Slurm sbatch scripts — same structure, differing only in paths/config references

The repo convention says to "extend the existing test case — add platform-specific subdirectories, parameterize scripts for additional models, or add configuration variants — rather than creating a parallel directory tree with duplicated Dockerfiles, training scripts, and utilities."

I'd suggest consolidating into a single vjepa2/ directory that supports both V-JEPA 2 and 2.1 via different configs. The run_train.py launcher already dispatches based on the app field in the config (vjepa vs vjepa_2_1), so both versions can share the same launcher, scripts, Dockerfile, and sbatch templates. The V-JEPA 2.1 additions (image co-training, synthetic image generator) would simply add to the existing directory.

Missing license headers on README and config files

Both README.md files and all 4 configs/*.yaml files are missing license headers. The Slurm scripts, Python files, K8s manifests, and Dockerfiles all have them, so this is just an oversight. I'd suggest adding the standard header as a YAML comment in configs and HTML comment in READMEs.

KeitaW

Review Batch 2/3 — Deployment Pipeline

KeitaW

Review Batch 3/3 — Documentation Consistency

Things That Look Great

Comprehensive utility scripts: The synthetic data generators (video and image), SSv2 CSV preparer, benchmark log parser, and decord test script form a complete toolkit that makes this test case truly self-contained.
Excellent README documentation: Both READMEs walk through every step from dataset prep to result parsing, with clear architecture notes explaining the srun direct launch pattern and why app/main.py doesn't work with SLURM.
Smart launch pattern: Using app.scaffold.main() to dispatch based on the config's app field is elegant and avoids the world_size=1 bug in app/main.py.
Proper license headers on most files: Scripts, Dockerfiles, sbatch files, and K8s manifests all have the standard copyright header.
HyperPod auto-resume detection: The if [ -d "/opt/sagemaker_cluster" ] pattern in sbatch scripts correctly detects HyperPod clusters and enables auto-resume.
Both Slurm and Kubernetes deployment paths: Providing PyTorchJob manifests alongside Slurm scripts makes this accessible to EKS-based clusters too.
Well-structured config separation: Benchmark configs (200 iterations, no checkpointing) vs. full pre-training configs (800+ epochs, regular checkpoints) give users clear starting points for different use cases.
V-JEPA 2.1 comparison table: The feature comparison table in the V-JEPA 2.1 README clearly explains what changed between versions.

KeitaW

Few comments

The Dockerfile-based container (pytorch:25.03-py3) ships NCCL 2.25 and an older aws-ofi-nccl plugin that are incompatible with B200 EFA networking. The B200 scripts use a NeMo container with NCCL 2.29+ and a matching OFI/EFA/libfabric stack instead, with V-JEPA dependencies installed to shared storage and added to PYTHONPATH at runtime.

Benchmarking on B200 revealed that 5,000 synthetic samples caused frequent data loader re-initialization between epochs, inflating V-JEPA 2.1 iteration times by up to 4x (15,300 ms vs 4,075 ms with 50K samples). V-JEPA 2 was less affected but still improved from 1,637 ms to 1,457 ms. Changes:
- Default synthetic video count: 5,000 -> 50,000 (both V-JEPA 2 and 2.1)
- Default synthetic image count: 5,000 -> 50,000 (V-JEPA 2.1)
- Add OpenCV (cv2) fallback for video generation in environments without ffmpeg
- Add dataset sizing guidance to benchmark configs and READMEs
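A rough worked example of why the dataset size matters for a short benchmark. It assumes the 1,536 global batch size from the H200 results table, a 200-iteration run, and a data loader that drops the last partial batch; the B200 runs may have used different values, so treat the numbers as illustrative only.

```python
import math


def iters_per_epoch(num_samples: int, global_batch: int) -> int:
    """Full batches available per epoch, assuming the last partial batch is dropped."""
    return num_samples // global_batch


def epoch_boundaries(total_iters: int, num_samples: int, global_batch: int) -> int:
    """How many times the data loader has to start a new epoch during the run."""
    per_epoch = iters_per_epoch(num_samples, global_batch)
    return math.ceil(total_iters / per_epoch) - 1


for samples in (5_000, 50_000):
    print(samples, iters_per_epoch(samples, 1536), epoch_boundaries(200, samples, 1536))
```

Under these assumptions the 5,000-sample dataset yields 3 iterations per epoch and 66 epoch boundaries in 200 iterations, while 50,000 samples yields 32 iterations per epoch and 6 boundaries.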

Add rank-selective nsys profiling infrastructure:
- nsys_wrapper.sh: profiles only rank 0 via SLURM_PROCID check
- nsys_profile_b200.sbatch: configurable via NSYS_PROFILE_DIR and CONFIG env vars to save each optimization phase to a separate folder
- Document profiling workflow in both READMEs
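The rank-selective idea can be sketched as follows. This is not the contents of nsys_wrapper.sh (which is a shell script and is not shown in this thread); `maybe_profile` and the report naming are hypothetical, and only the `nsys profile -o <report>` invocation is taken from the Nsight Systems CLI.

```python
from typing import List, Sequence


def maybe_profile(cmd: Sequence[str], rank: int, profile_dir: str,
                  profiled_rank: int = 0) -> List[str]:
    """Prefix the training command with nsys only on the rank being profiled."""
    if rank != profiled_rank:
        return list(cmd)  # every other rank runs the command unchanged
    # One report per profiled rank, written under the chosen directory.
    return ["nsys", "profile", "-o", f"{profile_dir}/rank{rank}", *cmd]


train_cmd = ["python", "run_train.py"]  # placeholder command
print(maybe_profile(train_cmd, rank=0, profile_dir="profiling/phase_a"))
print(maybe_profile(train_cmd, rank=5, profile_dir="profiling/phase_a"))
```

In a real wrapper the rank would come from `int(os.environ["SLURM_PROCID"])` and the directory from `NSYS_PROFILE_DIR`, so that each optimization phase lands in its own folder.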

…n checkpointing
Provide an optimized benchmark config for B200 GPUs:
- compile_model: true for fused kernels (~20% GPU speedup)
- use_activation_checkpointing: false (raises memory use from ~33 GB to ~95 GB)
- num_workers: 20 for higher data prefetch
Tested at 1,125 ms/iter vs 1,457 ms baseline (23% improvement).

BF16 has the same dynamic range as FP32, so GradScaler's loss scaling is pure overhead. Monkey-patch GradScaler to enabled=False in both run_train.py launchers when meta.dtype is bfloat16, eliminating the scale/unscale/step/update cycle per iteration.
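A sketch of the subclass pattern, kept free of PyTorch so it runs anywhere. The `GradScaler` class below is a simplified stand-in for `torch.cuda.amp.GradScaler` (which likewise accepts an `enabled` keyword and returns the loss unmodified when disabled); `disable_for_bf16` is a hypothetical helper, not the code in run_train.py.

```python
class GradScaler:
    """Simplified stand-in for torch.cuda.amp.GradScaler (assumption for this sketch)."""

    def __init__(self, init_scale: float = 65536.0, enabled: bool = True):
        self._scale = init_scale
        self._enabled = enabled

    def scale(self, loss: float) -> float:
        # A disabled scaler passes the loss through untouched.
        return loss * self._scale if self._enabled else loss


def disable_for_bf16(base_cls: type, dtype_name: str) -> type:
    """Return a subclass that always constructs with enabled=False when training in BF16."""
    if dtype_name != "bfloat16":
        return base_cls

    class NoOpScaler(base_cls):
        def __init__(self, *args, **kwargs):
            kwargs["enabled"] = False  # override whatever the caller asked for
            super().__init__(*args, **kwargs)

    return NoOpScaler


Patched = disable_for_bf16(GradScaler, "bfloat16")
print(Patched().scale(2.0))       # loss is returned unscaled
print(GradScaler().scale(2.0))    # FP16-style scaling still available on the base class
```

Returning a subclass rather than a plain function keeps `isinstance` and further subclassing working for any code that builds on the scaler type. The sketch assumes callers pass `enabled` by keyword if at all.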

…higher throughput
Replace DDP with FSDP (SHARD_GRAD_OP / ZeRO-2) for the encoder and target_encoder in V-JEPA 2.1, sharding gradients and optimizer states across ranks. This saves ~15 GB/GPU, enabling activation checkpointing to be disabled on B200 GPUs for higher throughput. The predictor remains DDP-wrapped (small model, needs find_unused_parameters).

…_train.py
- Fix compile_model placement: move from meta: to model: section where upstream train.py actually reads it (was silently never enabled)
- Add env-var-driven optimizations to run_train.py: fused AdamW, TF32, compile mode override, gradient_as_bucket_view, prefetch_factor
- Add B200 optimization sweep sbatch (Phase A-D with nsys profiling)
- Add nsys profiling sbatch scripts for H200 (vjepa2 and vjepa2.1)
- Fix container-workdir from /vjepa2 to /workspace in benchmark sbatch
- Add .gitignore to exclude benchmarks/ and profiling/ from repo

…paths

…u_type flag
The 6*N*D FLOP formula overestimates training FLOPs by ~2x for JEPA architectures because the context encoder only processes visible tokens (~15% of the sequence) while the target encoder runs forward-only (no backward pass). Replace with samples/sec as the primary throughput metric. Add --gpu_type flag (h200/b200) with correct BF16 peak specs (989.4 / 2250 TFLOPS). Fix V-JEPA 2.1 script title and update README parse examples to use new flag.
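Samples per second needs no model-specific FLOP estimate, only the global batch size and the time per iteration. The sketch below is not parse_benchmark.py; the function names are hypothetical, and the 1,500 ms iteration time in the first example is a made-up value chosen to keep the arithmetic easy to follow.

```python
def samples_per_second(global_batch: int, iter_ms: float) -> float:
    """Throughput in samples/sec from one iteration's wall-clock time."""
    if iter_ms <= 0:
        raise ValueError("iteration time must be positive")
    return global_batch * 1000.0 / iter_ms


def improvement_percent(baseline_ms: float, optimized_ms: float) -> float:
    """Relative reduction in iteration time, in percent."""
    return 100.0 * (baseline_ms - optimized_ms) / baseline_ms


# Global batch 1,536 with a hypothetical 1,500 ms iteration.
print(samples_per_second(1536, 1500.0))

# The iteration times quoted for the optimized B200 config in this thread.
print(round(improvement_percent(1457, 1125), 1))
```

The second call reproduces the roughly 23% improvement quoted for the optimized B200 config (1,125 ms vs 1,457 ms).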

…ctness, pinned versions
- Remove ,eth from NCCL_SOCKET_IFNAME exclusion list for correct TCP bootstrap
- Add missing MIT-0 license headers to .gitignore, README.md, and config YAMLs
- Change set -ex to set -euo pipefail in all sbatch and shell scripts
- Pin EFA_INSTALLER_VERSION to 1.47.0 in both Dockerfiles (was 'latest')
- Replace :latest image tags with :vjepa2 and :vjepa2.1 in K8s manifests
- Use yaml.SafeLoader instead of yaml.FullLoader in run_train.py and run_train_fsdp.py

…lone
- Pin all pip packages to tested versions from cluster container freeze
- Pin vjepa2 git clone to commit 204698b4 (latest as of Mar 23, 2026)
- Fix stale repository URL from aws-samples to awslabs in both READMEs

paragao · 2026-04-15T21:24:44Z

@KeitaW please check that everything has been properly addressed so we can merge this.

KeitaW

Re-Review Batch 1/3 — Acknowledgments + Structure & Repository Hygiene

Thanks Paulo for working through the previous round of feedback. Most of the smaller items from the first pass are now resolved — pinned EFA, pinned vjepa2 commit, pinned pip packages, license headers added, eth removed from NCCL_SOCKET_IFNAME, awslabs URL fix, :latest removed from K8s images, set -euo pipefail adopted, and yaml.SafeLoader swapped in. Nice cleanup pass.

The main item still outstanding is the structural one: vjepa2/ and vjepa2.1/ are still two parallel trees with mostly-identical Dockerfiles, scripts, and sbatch files. Several new files added in the recent commits (FSDP launcher, nsys wrappers, B200 sbatches, sweep configs) duplicate that surface again. I'd really like to land that consolidation before merge — it's the single biggest maintenance lever for this contribution.

What's been addressed since the last review

EFA pinned to 1.47.0 (meets CI minimum)
vjepa2 git clone pinned to commit 204698b4...
All pip packages pinned to tested versions
License headers added to READMEs, configs, and .gitignore
NCCL_SOCKET_IFNAME no longer excludes eth* (now ^docker,lo,veth everywhere)
set -euo pipefail adopted across sbatch scripts
K8s :latest replaced with :vjepa2 / :vjepa2.1 tags
Stale aws-samples URL replaced with awslabs in both READMEs
yaml.SafeLoader instead of yaml.FullLoader in launchers

Code duplication between `vjepa2/` and `vjepa2.1/` is still the headline issue

Looking at the current state, the duplication has actually grown since the last review. The two trees now share, in addition to the originally-flagged scripts, near-identical copies of:

vjepa2.Dockerfile and vjepa2_1.Dockerfile — the only meaningful difference is one extra pinned package (Pillow==12.0.0); the vjepa2_1.Dockerfile even comments at the top "V-JEPA 2.1 reuses the same container as V-JEPA 2, since both training apps live in the same facebookresearch/vjepa2 repository"
scripts/nsys_wrapper.sh — identical
slurm/benchmark_training_b200.sbatch, slurm/nsys_profile.sbatch, slurm/nsys_profile_b200.sbatch, slurm/optim_sweep_b200.sbatch — same skeleton, only paths and config refs differ
README sections on prerequisites, container build, parsing, and profiling

The vjepa2.1/README.md even reaches across the boundary several times — ## 2. Datasets says "See the V-JEPA 2 test case for SSv2 download" and the benchmark-results section points at ../vjepa2/benchmarks/vjepa2-benchmark-results.md. Cross-references like that are a strong tell that the directories want to be one.

run_train.py already dispatches via app.scaffold.main(params["app"], ...) based on the YAML's app field, so a single launcher (and Dockerfile, and sbatch templates) can serve both app: vjepa and app: vjepa_2_1 configs. I'd suggest restructuring as:

3.test_cases/pytorch/vjepa2/
├── README.md                    # covers both 2 and 2.1
├── vjepa2.Dockerfile            # single file, includes Pillow
├── configs/
│   ├── vjepa2/...               # app: vjepa configs
│   └── vjepa2.1/...             # app: vjepa_2_1 configs
├── scripts/
│   ├── run_train.py             # current scaffold-based launcher
│   ├── generate_synthetic_dataset.py
│   ├── generate_synthetic_images.py
│   ├── prepare_ssv2.py
│   ├── parse_benchmark.py
│   ├── nsys_wrapper.sh
│   └── test_decord.py
└── slurm/                       # one set of sbatches, parameterized via env vars

This is the same pattern called out in CONTRIBUTING.md ("extend the existing test case … rather than creating a parallel directory tree with duplicated Dockerfiles, training scripts, and utilities"). Happy to help if there's a structural reason this is harder than it looks.
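To illustrate why one launcher can serve both variants, here is a small registry-based dispatch keyed on the config's app field. It is a sketch only: the registry, decorator, and trainer functions are hypothetical, and the upstream app.scaffold.main() may resolve the app differently (for example by importing a module by name).

```python
from typing import Callable, Dict

# Hypothetical registry mapping the YAML "app" value to a training entry point.
TRAINERS: Dict[str, Callable[[dict], str]] = {}


def register(app_name: str):
    """Decorator that records a trainer under its app name."""
    def wrap(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        TRAINERS[app_name] = fn
        return fn
    return wrap


@register("vjepa")
def train_vjepa(params: dict) -> str:
    return f"V-JEPA 2 training with {params['config']}"


@register("vjepa_2_1")
def train_vjepa_2_1(params: dict) -> str:
    return f"V-JEPA 2.1 training with {params['config']}"


def dispatch(params: dict) -> str:
    """Pick the trainer from the config instead of from the directory layout."""
    app = params["app"]
    if app not in TRAINERS:
        raise ValueError(f"unknown app {app!r}; expected one of {sorted(TRAINERS)}")
    return TRAINERS[app](params)


print(dispatch({"app": "vjepa", "config": "configs/vjepa2/benchmark.yaml"}))
print(dispatch({"app": "vjepa_2_1", "config": "configs/vjepa2.1/benchmark.yaml"}))
```

With the choice driven by the config file, the two variants differ only in which YAML is passed to the shared launcher, which is what the proposed configs/vjepa2/ and configs/vjepa2.1/ split relies on. The config paths in the example are placeholders.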

Reconsider whether `run_train_fsdp.py` should ship at all

The FSDP path adds a 718-line bespoke training loop (vjepa2.1/scripts/run_train_fsdp.py), plus a config and sbatch script. The README itself says:

Benchmark finding: FSDP was found to be ~2x slower than baseline DDP in benchmarks (iter ~8,200 ms vs ~4,075 ms on video ranks). … The baseline DDP config is the recommended configuration for V-JEPA 2.1.

If the recommendation is "don't use this", I'd lean toward dropping the FSDP variant from the PR and reintroducing it later if/when it becomes useful. Carrying ~900+ lines of "for reference, but slower" code in awsome-distributed-training increases the maintenance surface (NCCL/PyTorch/EFA upgrades will need to keep it green) without users having a reason to run it. If you want to keep the finding documented, a short paragraph in the README is probably enough.

`optim_sweep_b200.sbatch` reads as developer experimentation, not a reference workflow

vjepa2.1/slurm/optim_sweep_b200.sbatch (+222 lines) and vjepa2/slurm/optim_sweep_b200.sbatch (+171 lines) iterate over phase configs and tweak env vars to sweep optimizations. They look like the harness you used to produce the benchmark numbers, not something a downstream user would run. The repo convention favors test cases that demonstrate a single, reproducible workflow; sweeps usually live in a personal branch or a README appendix.

I'd suggest either dropping these (along with the phase1/phase4 configs), or — if you want to keep the sweep methodology in-tree — moving it into a tools/ subdirectory with a README that frames it as an experimentation example, not a recommended config.

KeitaW · 2026-04-28T04:47:20Z

+# V-JEPA 2.1 reuses the same container as V-JEPA 2, since both training
+# apps live in the same facebookresearch/vjepa2 repository.


This comment is the strongest argument for consolidating the two directories

If 2.1 is acknowledged in-tree as reusing the V-JEPA 2 container, having a separate vjepa2_1.Dockerfile (whose only delta from vjepa2.Dockerfile is one extra pinned package) creates ongoing drift risk — a CUDA/EFA/NCCL bump applied to one will need to be repeated on the other. I'd suggest deleting this file in favor of a single Dockerfile that includes Pillow==12.0.0 (it's harmless on the V-JEPA 2 side too).

See the structural finding in the review body for the proposed unified layout.

KeitaW

Re-Review Batch 2/3 — Deployment Pipeline (K8s / Slurm)

Three concrete issues introduced or surfaced by the most recent commits:

The new set -euo pipefail interacts badly with bare ${PYTHONPATH} in the B200 sbatches: set -u will cause the script to exit before srun if PYTHONPATH is unset in the user's environment (the common case). The fix is the standard ${PYTHONPATH:-} guard, suggested inline on each affected file.
The apt install libnccl-dev … || true line in both Dockerfiles silently swallows failures and doesn't pin a version, while CI requires NCCL >= 2.28 and the base nvcr.io/nvidia/pytorch:25.03-py3 ships NCCL 2.25 (per your own commit message c0fca12). Worth either dropping the || true so failures surface, or making it explicit that NCCL is provided by the base image.
Minor: the K8s image tags vjepa2:vjepa2 / vjepa2:vjepa2.1 use the variant name as the tag, which is unusual. A version-flavored tag (e.g. mirroring the upstream commit pin or the base image date) sets a better template for users.

KeitaW · 2026-04-28T04:47:24Z

+export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+
+# V-JEPA 2.1 needs /workspace in PYTHONPATH for the code overlay
+export PYTHONPATH=/workspace:${VJEPA_DEPS}:${VJEPA2_CODE}:${PYTHONPATH}


Bare ${PYTHONPATH} will break under set -u

With the new set -euo pipefail at the top of the script, this line will fail with PYTHONPATH: unbound variable on any node where PYTHONPATH isn't already set in the user's environment — which is the common case under Slurm/Pyxis. The script will exit before srun even runs.

Use the standard guard:

Suggested change

-export PYTHONPATH=/workspace:${VJEPA_DEPS}:${VJEPA2_CODE}:${PYTHONPATH}
+export PYTHONPATH=/workspace:${VJEPA_DEPS}:${VJEPA2_CODE}:${PYTHONPATH:-}

A quick env -i bash -c 'set -u; . path/to/script.sbatch' smoke test should catch the rest of these.
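The failure and the fix can be reproduced from Python by running both forms of the export under `set -u` with PYTHONPATH removed from the environment. This is a sketch that assumes bash is installed; `run_under_set_u` is a hypothetical helper, and the path is shortened to /workspace for readability.

```python
import os
import shutil
import subprocess

BASH = shutil.which("bash") or "/bin/bash"


def run_under_set_u(assignment: str) -> subprocess.CompletedProcess:
    """Run one export statement under `set -u` with PYTHONPATH unset."""
    env = {k: v for k, v in os.environ.items() if k != "PYTHONPATH"}
    script = f'set -u; {assignment}; echo "$PYTHONPATH"'
    return subprocess.run([BASH, "-c", script], env=env,
                          capture_output=True, text=True)


bare = run_under_set_u("export PYTHONPATH=/workspace:${PYTHONPATH}")
guarded = run_under_set_u("export PYTHONPATH=/workspace:${PYTHONPATH:-}")

# The bare form aborts the shell; the guarded form expands to an empty suffix.
print("bare exit code:", bare.returncode, bare.stderr.strip())
print("guarded exit code:", guarded.returncode, guarded.stdout.strip())
```

The guarded form leaves a trailing colon when PYTHONPATH was unset. If that matters, `${PYTHONPATH:+:${PYTHONPATH}}` appends the separator only when the variable is set and non-empty.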

KeitaW · 2026-04-28T04:47:25Z

+export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+
+# V-JEPA 2.1 needs /workspace in PYTHONPATH for the code overlay
+export PYTHONPATH=/workspace:${VJEPA_DEPS}:${VJEPA2_CODE}:${PYTHONPATH}


Same set -u × bare ${PYTHONPATH} issue as benchmark_training_b200.sbatch

Suggested change

-export PYTHONPATH=/workspace:${VJEPA_DEPS}:${VJEPA2_CODE}:${PYTHONPATH}
+export PYTHONPATH=/workspace:${VJEPA_DEPS}:${VJEPA2_CODE}:${PYTHONPATH:-}

KeitaW · 2026-04-28T04:47:25Z

+export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+
+# Add V-JEPA 2 deps and source code to Python path
+export PYTHONPATH=${VJEPA_DEPS}:${VJEPA2_CODE}:${PYTHONPATH}


Same set -u × bare ${PYTHONPATH} issue

Suggested change

-export PYTHONPATH=${VJEPA_DEPS}:${VJEPA2_CODE}:${PYTHONPATH}
+export PYTHONPATH=${VJEPA_DEPS}:${VJEPA2_CODE}:${PYTHONPATH:-}

KeitaW · 2026-04-28T04:47:25Z

+export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+
+# Add V-JEPA 2 deps and source code to Python path
+export PYTHONPATH=${VJEPA_DEPS}:${VJEPA2_CODE}:${PYTHONPATH}


Same set -u × bare ${PYTHONPATH} issue

Suggested change

-export PYTHONPATH=${VJEPA_DEPS}:${VJEPA2_CODE}:${PYTHONPATH}
+export PYTHONPATH=${VJEPA_DEPS}:${VJEPA2_CODE}:${PYTHONPATH:-}

KeitaW · 2026-04-28T04:47:25Z

+    cd /tmp && rm -rf aws-efa-installer
+
+# Install NCCL OFI plugin for EFA
+RUN apt-get update && apt-get install -y libnccl-dev && rm -rf /var/lib/apt/lists/* || true


Silent-fail apt install libnccl-dev with no NCCL version pin

Two concerns:

The trailing || true masks any apt failure — if libnccl-dev is missing from the index or fails to install, the build still succeeds and the user only finds out at training time.

There's no explicit NCCL version pin, while the CI version check requires NCCL >= 2.28. The base nvcr.io/nvidia/pytorch:25.03-py3 ships NCCL 2.25 (per commit message c0fca12), and apt install libnccl-dev from the default Ubuntu repo typically pulls a separately-versioned package.

I'd suggest dropping the || true so failures surface, or adding an explicit comment that NCCL is provided by the base image and the apt step is best-effort. Either way, document which NCCL version the final image actually carries so users (and the CI version-check job) know the H200 reference path meets the minimum. The B200 path side-steps this by using a NeMo container, but the H200 reference path needs to demonstrably be >= 2.28.
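One way to make the NCCL requirement explicit is a small version gate that fails loudly. The sketch below is not the repository's CI check; `MIN_NCCL`, `parse_version`, and `meets_minimum` are hypothetical names, and the version strings are examples.

```python
from typing import Tuple

MIN_NCCL: Tuple[int, int] = (2, 28)  # minimum quoted for the CI version check


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.25.1' into (2, 25, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(version: str, minimum: Tuple[int, ...] = MIN_NCCL) -> bool:
    """Compare only as many components as the minimum specifies."""
    return parse_version(version)[:len(minimum)] >= minimum


for candidate in ("2.25.1", "2.28.3", "2.29.2"):
    print(candidate, meets_minimum(candidate))
```

Comparing integer tuples avoids the string-comparison trap where "2.9" sorts after "2.28". Inside a PyTorch container, `torch.cuda.nccl.version()` already returns a tuple that could be compared against `MIN_NCCL` directly.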

KeitaW · 2026-04-28T04:47:25Z

+    cd /tmp && rm -rf aws-efa-installer
+
+# Install NCCL OFI plugin for EFA
+RUN apt-get update && apt-get install -y libnccl-dev && rm -rf /var/lib/apt/lists/* || true


Same silent-fail libnccl-dev install + no NCCL version pin as vjepa2.Dockerfile

See the comment on vjepa2/vjepa2.Dockerfile:23 for the full reasoning. (Another data point that these two Dockerfiles want to be one file.)

KeitaW · 2026-04-28T04:47:25Z

+          containers:
+            - name: vjepa2
+              # Replace with your ECR image URI
+              image: <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/vjepa2:vjepa2


vjepa2:vjepa2 is an awkward tag convention

Using the variant name as the tag is unusual — most readers will expect the variant to be the repository name, with the tag carrying a version or commit. I'd suggest aligning with the upstream commit you've pinned in the Dockerfile, e.g.:

Suggested change

-image: <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/vjepa2:vjepa2
+image: <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/vjepa2:204698b4

…or mirroring the base image's date tag (vjepa2:25.03-py3). Minor, but it sets a better template for users copy-pasting this into their own deployments.

KeitaW · 2026-04-28T04:47:25Z

+          containers:
+            - name: vjepa2-1
+              # Replace with your ECR image URI
+              image: <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/vjepa2:vjepa2.1


Same awkward vjepa2:vjepa2.1 tag convention as vjepa2-benchmark.yaml

See the comment on vjepa2/kubernetes/vjepa2-benchmark.yaml:42. A version-flavored tag (commit SHA or date) reads more naturally than using the variant name as the tag.

KeitaW

Re-Review Batch 3/3 — Documentation Consistency + Things That Look Great

One cross-cutting documentation item, plus inline picks below:

B200 setup is documented only inside the sbatch comments

The B200 sbatches (vjepa2/slurm/benchmark_training_b200.sbatch, _optimized.sbatch, vjepa2.1/slurm/benchmark_training_b200.sbatch, _fsdp.sbatch) require a separate NeMo container, a cloned vjepa2_code/ directory on FSx, and a pip install --target /fsx/.../vjepa_deps step. None of this is mentioned in either README, so a user going directly through Build Container → Run Benchmark with B200 instances will hit unexplained CONTAINER_IMAGE paths. I'd suggest adding a short B200 setup subsection to the README, or at least pointing to the sbatch's header comment from the README.

Things That Look Great

Genuine engagement with the previous review — all the small/medium items are visibly fixed, with commit messages that explain the why (e.g. 9a946379 calls out the cluster container freeze as the source for pinned versions).
The B200 vs H200 split is well-motivated — keeping the H200 Dockerfile path simple while running B200 against a NeMo container with a known-compatible NCCL/EFA stack is a pragmatic choice and the commit message c0fca12 explains it clearly.
Optimization knobs in run_train.py are tasteful — env-var-gated monkey-patches (VJEPA_TF32, VJEPA_FUSED_OPTIMIZER, VJEPA_GRAD_BUCKET_VIEW, VJEPA_PREFETCH_FACTOR) keep the upstream code unmodified while making perf experiments easy to compose.
GradScaler-for-BF16 fix with the subclass approach (so Apex's GradScaler still inherits cleanly) is a nice touch and documented inline.
.gitignore for benchmarks/ and profiling/ keeps generated artifacts out of the repo while still letting you keep the directory referenced from the README.
Honest documentation of negative results — calling out that FSDP was 2× slower and torch.compile + activation checkpointing was 55% slower is exactly the kind of empirical guidance that makes this repo useful (even if I think the FSDP code itself probably doesn't need to ship — see Batch 1).
Synthetic-dataset sizing guidance in the README (50K minimum, with the rationale about dataloader re-init) is the kind of subtle benchmarking pitfall that's exactly worth documenting.

Suggested priority before merge

Consolidate vjepa2/ and vjepa2.1/ into a single test case (Batch 1) — the headline blocker, and the rationale for several other findings.
Fix the set -u × ${PYTHONPATH} bug in the four B200 sbatches (Batch 2) — small but currently breaks the scripts on common environments.
Decide what to do with run_train_fsdp.py and optim_sweep_b200.sbatch (Batch 1) — drop or relocate.
Address the smaller documentation/Dockerfile items at your discretion.

KeitaW · 2026-04-28T04:47:30Z

+tail -f logs/vjepa21_benchmark/<JOB_ID>.out
+```
+
+The benchmark runs 200 iterations across 8 nodes (64 GPUs). With `rank_ratio=0.5`, ranks 0-31 process images (batch size 72 per GPU) and ranks 32-63 process video (batch size 24 per GPU).


Per-GPU video batch size is reported two different ways in the same README

This section says ranks 32-63 use batch size 24 per GPU. The Image/Video co-training rank split section further down (line 199) instead says batch size 48/GPU = 1,536 videos/step for the same ranks.

I think the difference is that the first is the YAML-configured batch_size and the second is the effective size after the V-JEPA 2.1 code auto-doubles it to keep the global batch constant when rank_ratio=0.5. That's worth saying once, in one place — right now the two numbers look like a contradiction. Either fold the auto-adjust sentence (line 201) up here, or drop the (batch size 24 per GPU) parenthetical.
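The two figures can be reconciled with a few lines of arithmetic. This sketch assumes the reviewer's reading above, namely that the effective per-GPU video batch is double the configured value when rank_ratio=0.5; the helper names are hypothetical.

```python
from typing import Tuple


def split_ranks(world_size: int, rank_ratio: float) -> Tuple[int, int]:
    """Return (image_ranks, video_ranks) for a given image share of the ranks."""
    image_ranks = int(world_size * rank_ratio)
    return image_ranks, world_size - image_ranks


def videos_per_step(world_size: int, rank_ratio: float, per_gpu_batch: int) -> int:
    """Global video batch contributed by the video ranks in one step."""
    _, video_ranks = split_ranks(world_size, rank_ratio)
    return video_ranks * per_gpu_batch


print(split_ranks(64, 0.5))          # image ranks, video ranks
print(videos_per_step(64, 0.5, 24))  # using the configured batch size
print(videos_per_step(64, 0.5, 48))  # using the doubled, effective batch size
```

With 32 video ranks, a per-GPU batch of 24 gives 768 videos per step, and 48 gives the 1,536 figure quoted in the rank-split section, so both README statements are consistent once the doubling is stated.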

KeitaW · 2026-04-28T04:47:30Z

+
+## Benchmark Results
+
+_Benchmark results are maintained separately in `benchmarks/vjepa2-benchmark-results.md` (gitignored)._


Reference to a gitignored file is dead

benchmarks/vjepa2-benchmark-results.md is gitignored (per the .gitignore added in this PR), so it's never present in a fresh clone — readers following this line will hit nothing. If results are published elsewhere (a blog post, an internal doc, or the PR description's table), I'd link there instead. Otherwise it's cleaner to just drop the line.

Suggested change

-_Benchmark results are maintained separately in `benchmarks/vjepa2-benchmark-results.md` (gitignored)._

KeitaW · 2026-04-28T04:47:30Z

+
+## Benchmark Results
+
+_Benchmark results are maintained separately in `../vjepa2/benchmarks/vjepa2-benchmark-results.md` (gitignored)._


Same dead reference to a gitignored file as vjepa2/README.md:284

The ../vjepa2/benchmarks/ directory is gitignored, so the file isn't present in a fresh clone. Drop the line, or replace it with a link to wherever the published benchmark results actually live.

Suggested change

-_Benchmark results are maintained separately in `../vjepa2/benchmarks/vjepa2-benchmark-results.md` (gitignored)._

paragao added 2 commits March 23, 2026 12:24

paragao force-pushed the feat/vjepa2-distributed-training branch from 11b8971 to 92abb8c Compare March 23, 2026 12:27

KeitaW reviewed Mar 23, 2026

KeitaW requested changes Mar 23, 2026

paragao added 13 commits March 23, 2026 16:40

Parallelize synthetic dataset generation scripts with --workers flag

ebddd9f

Add V-JEPA 2.1 optimization sweep with phased configs and env-var hooks

77f105d

Fix V-JEPA 2.1 batch sizes for rank_ratio=0.5 scaling and sbatch log …

aa72398

…paths

KeitaW reviewed Apr 28, 2026

Update 3.test_cases/pytorch/vjepa2.1/slurm/optim_sweep_b200.sbatch

97b7c5a
