
feat: add DETR-ResNet50 object detection fine-tuning test case#1068

Open
aravneelaws wants to merge 4 commits into main from
finetune_detr_rn50

Conversation

@aravneelaws (Contributor)

Purpose

Adds a new test case for fine-tuning DETR-ResNet50 for object detection on Amazon SageMaker HyperPod EKS, demonstrating the Qualcomm AI Hub + AWS integration for cloud-to-edge ML workflows.

Changes

  • Add detr_main.py: PyTorch DDP training script for DETR-ResNet50 fine-tuning with torchmetrics mAP evaluation, Qualcomm AI Hub model wrapper with custom detection heads, checkpoint resume support, and distributed training via torchrun
  • Add Dockerfile: Container image with pinned dependencies (torch 2.5.1, qai-hub-models 0.30.2, transformers 4.51.3, torchmetrics 1.6.1) and pre-downloaded model weights for air-gapped operation
  • Add kubernetes/detr-resnet50-finetune.yaml-template: Parameterized PyTorchJob manifest with elastic policy, EFA networking, NCCL tuning, and automatic checkpoint resume
  • Add README.md: Overview, architecture, prerequisites, training arguments, and expected results
  • Add kubernetes/README.md: Step-by-step deployment guide with PVC naming guidance
  • Add data/README.md: Dataset preparation instructions for Supervisely Supermarket Shelves format
  • Add .gitignore

Test Plan

Environment:

  • AWS Service: Amazon SageMaker HyperPod EKS
  • Instance type: ml.g5.8xlarge (NVIDIA A10G)
  • Number of nodes: 2

Test commands:

# Build and push container image
docker build -t ${REGISTRY}detr-resnet50-finetune:v3 -f Dockerfile .
docker push ${REGISTRY}detr-resnet50-finetune:v3

# Generate and deploy training manifest
export IMAGE_URI=${REGISTRY}detr-resnet50-finetune:v3
export NUM_NODES=2
export INSTANCE_TYPE=ml.g5.8xlarge
envsubst '$IMAGE_URI $NUM_NODES $INSTANCE_TYPE' \
  < kubernetes/detr-resnet50-finetune.yaml-template \
  > detr-resnet50-finetune.yaml
kubectl apply -f detr-resnet50-finetune.yaml

Test Results

Distributed training verified with 2x ml.g5.8xlarge nodes over EFA:

  • DDP confirmed: DistributedSampler active (5 train batches/worker, split from 9 single-GPU), NCCL process group initialized over EFA with Libfabric
  • Loss: Decreased from ~1.45 (epoch 1) to ~1.25 (epoch 22) on the 36-image Supermarket Shelves dataset
  • Checkpoint resume: Verified — training correctly resumes from /fsx/checkpoint/checkpoint.pth.tar on restart
  • Air-gapped operation: Model weights baked into Docker image; no internet access required from training pods (TRANSFORMERS_OFFLINE=1)
  • Restart behavior: restartPolicy: ExitCode with maxRestarts: 3 prevents infinite restart loop on successful completion

Directory Structure

3.test_cases/
└── pytorch/
    └── detr-finetune/
        ├── Dockerfile
        ├── README.md
        ├── detr_main.py
        ├── .gitignore
        ├── data/
        │  └── README.md
        └── kubernetes/
           ├── README.md
           └── detr-resnet50-finetune.yaml-template

Checklist

  • I have read the contributing guidelines (https://github.com/awslabs/awsome-distributed-training/blob/main/CONTRIBUTING.md).
  • I am working against the latest main branch.
  • I have searched existing open and recently merged PRs to confirm this is not a duplicate.
  • The contribution is self-contained with documentation and scripts.
  • External dependencies are pinned to a specific version or tag (no latest).
  • A README is included or updated with prerequisites, instructions, and known issues.
  • New test cases follow the expected directory structure (#directory-structure).

Add a new PyTorch DDP test case for fine-tuning DETR-ResNet50 on a custom
object detection dataset (supermarket shelves) using Qualcomm AI Hub
pre-trained weights. Designed for distributed training on SageMaker
HyperPod with EKS orchestration.

Key components:
- Training script with torchmetrics mAP evaluation
- Dockerfile with HPC-optimized base image (EFA + NCCL)
- Kubeflow PyTorchJob manifest with envsubst parameterization
- Dataset preparation and Kubernetes deployment documentation

Tested on 2x ml.g5.8xlarge with elastic scaling (2-36 replicas).
- Bump qai-hub-models to 0.30.2 for torch 2.5.1 compatibility
- Bake DETR-ResNet50 weights into Docker image for air-gapped operation
- Add TRANSFORMERS_OFFLINE/HF_HUB_OFFLINE env vars to YAML template
- Remove model-cache emptyDir volume that overwrote baked-in weights
- Remove redundant runtime model pre-download step
- Clarify model weight provenance (HuggingFace Hub via QAI Hub wrapper)
- Add PVC naming guidance (fsx-pvc default, sed command for alternatives)
- Note that Docker build requires internet for weight download
- Mention SageMaker HyperPod EKS alongside vanilla EKS in prerequisites
- Note that Kubeflow Training Operator is pre-installed on HyperPod
- Remove visible <!-- omit in toc --> directive from README titles
- Correct expected validation loss from ~0.64 to ~1.24
- Remove unverified training time estimate
- Change restartPolicy from OnFailure to ExitCode to prevent infinite loop
- Reduce maxRestarts from 100 to 3
- Auto-resume from checkpoint on legitimate restarts
- Fix DDP: change --dist-url default to env:// and pass it in YAML template
- Fix LOCAL_RANK: read from environment when launched via torchrun
- Fix DDP unused params: add find_unused_parameters=True (custom heads
  replace original COCO heads, leaving original parameters unused)
- Fix RandomHorizontalFlip: remove it since it flips images but not
  bounding box coordinates, causing misaligned training data
- Remove unused import numpy, QAI_HUB_MODEL_ID env var, /workspace/src
  from PYTHONPATH
- Change maxReplicas from hardcoded 36 to $NUM_NODES
- Clarify batch-size default in README (code=8, YAML template=4)
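
For reference, the RandomHorizontalFlip fix above removes the transform rather than repairing it; a box-aware flip would also have to mirror the x-coordinates. A minimal sketch, assuming normalized cxcywh boxes as DETR uses (the function name is illustrative, not from the script):

```python
import random

def hflip_with_boxes(image, boxes, flip_image, p=0.5):
    """Horizontally flip `image` and keep `boxes` aligned.

    boxes: list of [cx, cy, w, h], all normalized to [0, 1].
    Mirroring the image maps each box center cx -> 1 - cx;
    cy, w, h are unchanged. `flip_image` is whatever image-level
    flip the pipeline already uses (e.g. torchvision's hflip).
    """
    if random.random() < p:
        image = flip_image(image)
        boxes = [[1.0 - cx, cy, w, h] for cx, cy, w, h in boxes]
    return image, boxes
```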

Tested on 2x ml.g5.8xlarge HyperPod EKS cluster: confirmed DDP active
via DistributedSampler (5 batches/worker vs 9 single-GPU), NCCL over
EFA, and loss decreasing from ~1.45 to ~1.25 over 22 epochs.
@KeitaW (Collaborator) left a comment

Review Batch 1/5 — Structure & Repository Hygiene

Thanks for the detailed PR description and the test plan. Posting in 5 themed batches; this first one covers repo-hygiene items that touch multiple files.

Directory placement under pytorch/ rather than under a library

Path: 3.test_cases/pytorch/detr-finetune/ (whole tree)

The convention documented in the project's CLAUDE.md is 3.test_cases/<framework>/<library>/[<model>/]. Looking at the existing siblings (pytorch/ddp/, pytorch/FSDP/, pytorch/deepspeed/, pytorch/distillation/, pytorch/torchtitan/, …) the second level is normally a parallelism strategy or a library, and the model lives one level deeper. Since this test case is plain DDP, I'd suggest moving it to 3.test_cases/pytorch/ddp/detr-finetune/ to match precedent — that keeps pytorch/<library>/<model> shape consistent and makes it discoverable alongside the other DDP examples. A few existing entries (nanoVLM, verl, picotron) do sit directly under pytorch/, so there is some prior inconsistency, but matching the dominant pattern would be my preference.

Missing copyright/license headers

Files: Dockerfile, detr_main.py, kubernetes/detr-resnet50-finetune.yaml-template, all the new READMEs.

Per CLAUDE.md and the contributing guidelines, all new contributions should carry the license header:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

(adjust comment marker per file type — <!-- ... --> for markdown). I noticed none of the new files in this PR include it — could you add the header to the Dockerfile, the training script, the YAML template, and the markdown files? It's a small change but the CI lint and downstream license tooling key on it.

@@ -0,0 +1,56 @@
# DETR-ResNet50 Object Detection Training Container
# Base image: HPC-optimized with pre-configured EFA and NCCL for distributed training
FROM public.ecr.aws/hpc-cloud/nccl-tests:latest

Base image pinned to :latest

Per the contributing guidelines and the CI version-check workflow, container image tags must be pinned to a specific version or commit — never latest. This matters because the nccl-tests base is rebuilt periodically with new EFA/NCCL/CUDA versions, and using :latest means a future rebuild of this Dockerfile could land on a stack that silently changes the EFA installer version, NCCL version, or CUDA version, all of which the repo's CI explicitly enforces minimums for (EFA >= 1.47.0, NCCL >= 2.28, CUDA >= 13.0).

Could you pin to a specific tag? You can list available tags with aws ecr-public describe-image-tags --repository-name nccl-tests --registry-id 098967265814, or pick the digest used during your verified test run. Other test cases in the repo (e.g., 3.test_cases/pytorch/FSDP/Dockerfile) use tagged base images — that's good precedent to follow.

@KeitaW (Collaborator) left a comment

Review Batch 2/5 — Deployment Pipeline (K8s)

Findings on the PyTorchJob template and Kubernetes deployment instructions.

- name: NCCL_DEBUG
value: "INFO"
- name: NCCL_SOCKET_IFNAME
value: "eth0"

NCCL_SOCKET_IFNAME=eth0 breaks on EFA-equipped instances

This is the most important finding for me. Positive interface selection like eth0 is fragile on AWS GPU instances: instances with EFA enroll additional interfaces (efa0, eth1, etc.) and on some AMIs the primary may not be named eth0 (e.g., predictable interface names like ens5). The repo convention (see the EFA cheatsheet and micro-benchmarks/nccl-tests/nccl-tests.Dockerfile) is exclusion-based.

The fact that your verified test run worked on ml.g5.8xlarge is consistent with that AMI happening to expose eth0, but it isn't portable. I'd suggest switching to ^lo so this template works on the broader instance family and on future AMIs.

Suggested change
value: "eth0"
value: "^lo"

Comment on lines +43 to +49
resources:
requests:
nvidia.com/gpu: 1
vpc.amazonaws.com/efa: 1
limits:
nvidia.com/gpu: 1
vpc.amazonaws.com/efa: 1

Worker manifest does not declare CPU/memory requests

I noticed CPU and memory aren't requested. For a single-GPU PyTorchJob on ml.g5.8xlarge (32 vCPU, 128 GiB) this usually works fine, but without a CPU request the kubelet won't reserve resources and a sidecar / DaemonSet competing for the node can cause OOM-killed dataloader workers. (Side note: the elasticPolicy.metrics block above references CPU utilisation; in practice Training Operator honours minReplicas/maxReplicas/rdzvBackend for elastic Worker replicas and largely ignores the metrics target, so this isn't load-bearing — just worth a CPU request for kubelet accounting.)

Suggested change
resources:
requests:
nvidia.com/gpu: 1
vpc.amazonaws.com/efa: 1
limits:
nvidia.com/gpu: 1
vpc.amazonaws.com/efa: 1
resources:
requests:
cpu: "8"
memory: "32Gi"
nvidia.com/gpu: 1
vpc.amazonaws.com/efa: 1
limits:
cpu: "8"
memory: "32Gi"
nvidia.com/gpu: 1
vpc.amazonaws.com/efa: 1

export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
docker build -t ${REGISTRY}detr-finetune:latest -f Dockerfile .

Image tag :latest in the deployment example

The PR body shows you used detr-resnet50-finetune:v3 for the actual verified run. I'd suggest aligning the docs with that — using a versioned tag (or a ${VERSION} variable) makes the example consistent with the no-latest rule and prevents readers from accidentally picking up a stale image when iterating. The same applies to lines 61 and 69 below (docker image push and IMAGE_URI).

Side note: the PR body uses detr-resnet50-finetune while this README uses detr-finetune for the ECR repo name — picking one and using it consistently across the README, the directory, and the YAML default would make the docs easier to follow.

Suggested change
docker build -t ${REGISTRY}detr-finetune:latest -f Dockerfile .
docker build -t ${REGISTRY}detr-finetune:v1 -f Dockerfile .


# Create repository if needed
REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"detr-finetune\" | wc -l)

aws ecr describe-repositories | grep ... is fragile

This relies on the literal quoting in the JSON output — it works today but it'll match any other repo whose name contains detr-finetune as a substring, and it'll silently fail if AWS ever changes JSON formatting. The cleaner pattern other test cases use is to query the specific repo by name and create on failure.

Suggested change
REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"detr-finetune\" | wc -l)
aws ecr describe-repositories --repository-names detr-finetune \
>/dev/null 2>&1 || aws ecr create-repository --repository-name detr-finetune

@KeitaW (Collaborator) left a comment

Review Batch 3/5 — Training Code Quality

Findings on detr_main.py. The most consequential here are the checkpoint-save gate (every rank writes the same file under torchrun) and the resume path's dependency on original_state_dict. The simplified positional DETR loss is intentional for the workshop demo, but worth calling out so the next reader doesn't carry it into a production setting unchanged.

Comment on lines +884 to +886
if not args.multiprocessing_distributed or (
args.multiprocessing_distributed and args.rank % ngpus_per_node == 0
):

Checkpoint save fires on every rank when launched via torchrun

The YAML template launches via python3 -m torch.distributed.run (i.e. torchrun) without setting --multiprocessing-distributed, so args.multiprocessing_distributed is False (action="store_true", default False). With False, the gate short-circuits to True on every rank, so every worker pod calls torch.save(...) against /fsx/checkpoint/checkpoint.pth.tar simultaneously. After DDP all-reduce the parameters are identical across ranks, so the harm isn't divergent content — it's a write race on the same file (and a second race on shutil.copyfile to model_best.pth.tar). Lustre is mostly forgiving but the race can yield truncated reads on the next resume, and there's no dist.barrier() between save and the start of the next epoch.


Suggested change
if not args.multiprocessing_distributed or (
args.multiprocessing_distributed and args.rank % ngpus_per_node == 0
):
is_rank_zero = (not args.distributed) or args.rank == 0
if is_rank_zero:

scheduler.step()

# Track best validation loss
is_best = final_val_loss < best_loss

Validation loss is rank-local; "best" decision diverges across ranks

Each rank's DistributedSampler(val_dataset, shuffle=False) (line 831) shards validation across workers, so validate() returns the loss over only that rank's shard. There's no dist.all_reduce on losses.avg before this is_best comparison. With the checkpoint-save bug above unfixed, two ranks may both decide they are best on different epochs and both copy to model_best.pth.tar. Even after fixing the gate, the saved checkpoint reflects only rank 0's view of the data.

Could you all-reduce losses.sum and losses.count (or losses.avg) across the process group before deciding is_best? Something like:

if args.distributed:
    t = torch.tensor([losses.sum, losses.count], device=f"cuda:{args.gpu}")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    losses.avg = (t[0] / t[1]).item()

The same applies to evaluate_map(): MeanAveragePrecision.compute() is called on a rank-local shard, so the printed mAP is partial. For a small (9-image) val set this might be tolerable for a demo, but it's worth noting in the docs that the reported mAP is shard-local rather than global.

Comment on lines +199 to +205
img_name = ann_file.stem # e.g. "001.jpg" from "001.jpg.json"
img_path = self.images_dir / img_name
for ext in [".jpg", ".jpeg", ".png"]:
candidate = self.images_dir / f"{img_name}{ext}"
if candidate.exists():
img_path = candidate
break

Image lookup loop is dead code

Because Path("001.jpg.json").stem is "001.jpg", img_name already includes the extension; the fallback loop builds 001.jpg.jpg, 001.jpg.jpeg, and 001.jpg.png, none of which exist. The initial assignment img_path = self.images_dir / img_name is what actually finds the image (and would for .png annotations too), so the loop never contributes anything.

I'd suggest replacing with a single with_suffix-aware lookup:

Suggested change
img_name = ann_file.stem # e.g. "001.jpg" from "001.jpg.json"
img_path = self.images_dir / img_name
for ext in [".jpg", ".jpeg", ".png"]:
candidate = self.images_dir / f"{img_name}{ext}"
if candidate.exists():
img_path = candidate
break
img_stem = ann_file.name.removesuffix(".json") # e.g. "001.jpg"
img_path = self.images_dir / img_stem
if not img_path.exists():
raise FileNotFoundError(img_path)

It's not a runtime bug today (the Supervisely format names annotations <image_basename>.<ext>.json), but the loop is misleading code that suggests a behavior it doesn't actually implement.

Comment on lines +624 to +634
original_state_dict = state["state_dict"]
cleaned_state_dict = {}

for key, value in original_state_dict.items():
new_key = key.replace("module.", "") # Remove DDP prefix
new_key = new_key.replace("qai_model.", "") # Remove wrapper prefix
cleaned_state_dict[new_key] = value

state_to_save = state.copy()
state_to_save["state_dict"] = cleaned_state_dict
state_to_save["original_state_dict"] = original_state_dict

Resume path depends on original_state_dict and breaks silently if absent

save_checkpoint overwrites state["state_dict"] with a cleaned dict that strips module. (DDP) and qai_model. (wrapper) prefixes, then stores the original under state["original_state_dict"]. On resume (line 794):

state_dict_to_load = checkpoint.get("original_state_dict", checkpoint["state_dict"])
model.load_state_dict(state_dict_to_load)

If original_state_dict is ever missing — e.g., a checkpoint produced by an earlier or future variant of this script, or someone hand-edits the file — load_state_dict will be called with the cleaned dict against a DDP+wrapper model and fail with one missing-key error per parameter. The verified test plan shows resume works because both keys are present, but the design is fragile.

I'd suggest keeping state["state_dict"] as the trainable form and storing the cleaned version under a separate key like inference_state_dict (or saving two files, one for resume and one for export). The current shape inverts the convention every other PyTorch example follows and will surprise readers.
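
A sketch of the suggested shape (the inference_state_dict key name is a proposal, not the script's current key):

```python
def build_checkpoint_state(state):
    """Return the dict handed to torch.save().

    state["state_dict"] keeps the trainable DDP+wrapper keys, so resume
    is always model.load_state_dict(ckpt["state_dict"]); the cleaned,
    prefix-stripped copy for export/inference lives under its own key
    instead of replacing the trainable one.
    """
    cleaned = {
        k.replace("module.", "").replace("qai_model.", ""): v
        for k, v in state["state_dict"].items()
    }
    return dict(state, inference_state_dict=cleaned)
```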

}


def create_model(num_classes: int, pretrained: bool = True):

--pretrained flag is documented but ignored

pretrained is accepted as a parameter but never read — DETRModel.from_pretrained() is called unconditionally on line 370. The README and CLI table both advertise --pretrained as a meaningful toggle ("use pre-trained model"). I'd suggest either honouring the flag (e.g., construct an unpretrained model when --pretrained is unset) or removing it from the argparser and the docs to avoid drift between code and docs.

args.batch_size = int(args.batch_size / ngpus_per_node)
args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
model = torch.nn.parallel.DistributedDataParallel(
model, device_ids=[args.gpu], find_unused_parameters=True)

find_unused_parameters=True may be masking a real graph

find_unused_parameters=True adds a per-iteration traversal cost and is generally a band-aid. Looking at QAIHubDETRWrapper.forward, all parameters of class_embed and bbox_embed are used on every step, and the DETR backbone is also fully used. If the unused-params flag is needed, it likely points at a frozen submodule somewhere in the QAI Hub wrapper. Could you confirm which parameters are unused? In most cases you can either (a) drop the flag entirely or (b) freeze the suspect submodule explicitly with requires_grad_(False) and exclude it from the optimizer; either is faster than relying on DDP's per-step discovery.
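
Option (b) sketched, with `backbone` standing in for whichever submodule turns out to hold the unused parameters (an assumption to verify first):

```python
import torch
import torch.nn as nn

def freeze_and_build_optimizer(model: nn.Module, lr: float = 1e-4):
    """Freeze the backbone explicitly and hand the optimizer only the
    trainable parameters, so DDP can run without
    find_unused_parameters=True traversing the graph every iteration."""
    model.backbone.requires_grad_(False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```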

Comment on lines +528 to +530
loss = total_loss / valid_samples
else:
loss = torch.tensor(0.0, requires_grad=True).cuda(args.gpu)

Empty-batch loss path is incompatible with removing find_unused_parameters

When every target in a batch is empty, this constructs a leaf tensor with no graph and calls .backward() on it. With find_unused_parameters=True (current setting) DDP tolerates the resulting "no parameters received gradient" state. If the previous finding (drop find_unused_parameters) is acted on, this branch will fire DDP's "some parameters did not receive grad" assertion and crash the worker. I'd suggest skipping the optimizer step entirely on empty batches:

Suggested change
loss = total_loss / valid_samples
else:
loss = torch.tensor(0.0, requires_grad=True).cuda(args.gpu)
if valid_samples == 0:
continue
loss = total_loss / valid_samples

Comment on lines +515 to +525
class_loss = criterion(sample_logits, sample_labels)

# Bounding box L1 loss
sample_boxes = pred_boxes[batch_idx, :num_objects, :]
target_boxes = target["boxes"][:num_objects, :].cuda(args.gpu)
bbox_loss = nn.functional.l1_loss(sample_boxes, target_boxes)

# Combined loss (bbox weighted higher for detection)
combined_loss = class_loss + 5.0 * bbox_loss
total_loss += combined_loss
valid_samples += 1

Per-batch loss is not the DETR matching loss

This pairs the first num_objects queries with the first num_objects ground-truth boxes positionally, which doesn't match what DETR actually trains against (Hungarian matching between all 100 queries and the GT set, with a "no-object" class for unmatched queries). The PR description says this is a workshop/demo training, and the loss decreases on the small dataset, so I understand the simplification — but I'd strongly suggest adding a docstring or README note that the loss is intentionally simplified and not the standard DETR loss, so a reader doesn't copy this into a production setting and find that mAP plateaus.

Alternatively, since transformers is already a dependency, you can use DetrForObjectDetection's built-in loss directly by passing labels=, the typical HF training pattern; it implements the matcher for you.
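
To make the difference concrete, here is the matching step the positional loss skips, reduced to an L1-only cost (DETR's real matcher also folds in classification and GIoU terms; this is an illustration, not the script's code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes, tgt_boxes):
    """Pair predicted boxes with ground-truth boxes by minimal total
    L1 cost, instead of pairing query i with GT i positionally.

    pred_boxes: [num_queries, 4], tgt_boxes: [num_gt, 4] (cxcywh).
    Returns (query_idx, gt_idx) pairs.
    """
    cost = np.abs(pred_boxes[:, None, :] - tgt_boxes[None, :, :]).sum(-1)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))
```

With two predictions listed in the opposite order of the targets, the positional loss pulls both boxes toward the wrong GT, while the matcher pairs them correctly.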

f.write(f"Completed: {datetime.now().strftime('%Y-%m-%d')}\n")
print(f"[OK] Training statistics saved to: {stats_file}")
except Exception as e:
print(f"[WARN] Could not save statistics file: {e}")

No dist.destroy_process_group() at exit

After training completes, main_worker returns without tearing down the NCCL process group. PyTorch will emit warnings on exit and, depending on the NCCL version, can leak GPU memory across pod restarts (relevant because maxRestarts: 3 means the same node may rerun this script). I'd suggest adding at the end of main_worker (or wrapping the body in try/finally):

if args.distributed:
    dist.destroy_process_group()

metric.update(preds, tgts)

results = metric.compute()
print("[INFO] mAP Evaluation Results:")

evaluate_map prints from every rank

The mAP results are printed unconditionally, so every rank dumps its own (partial, shard-local) numbers to kubectl logs. Combined with the rank-local metric issue noted above, readers see N inconsistent mAP tables per epoch. Gating the print block on args.rank == 0 (or a small log() helper) makes the output unambiguous.
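
A minimal sketch of such a helper, using the RANK env var torchrun sets rather than assuming a dist handle is in scope:

```python
import os

def log0(msg: str) -> None:
    """Print only from global rank 0 so `kubectl logs` shows a single,
    unambiguous mAP table per epoch instead of N shard-local ones."""
    if int(os.environ.get("RANK", "0")) == 0:
        print(msg)
```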

@KeitaW (Collaborator) left a comment

Review Batch 4/5 — Infrastructure & NCCL Configuration

One inline finding here on the Dockerfile's air-gapped claim. The NCCL_SOCKET_IFNAME issue from Batch 2 is the other infrastructure-side item, kept there because it's anchored to the YAML.

Comment on lines +42 to +49
# Pre-download DETR-ResNet50 weights so the container works in air-gapped environments.
# The weights originate from facebook/detr-resnet-50 on HuggingFace Hub; the QAI Hub
# Model.from_pretrained() wrapper downloads them via the transformers library.
# NOTE: This step requires internet access during docker build.
ENV TRANSFORMERS_CACHE=/workspace/cache/transformers
ENV HF_HOME=/workspace/cache/huggingface
RUN mkdir -p /workspace/cache/transformers /workspace/cache/huggingface && \
python3 -c "from qai_hub_models.models.detr_resnet50 import Model; m = Model.from_pretrained(); print('DETR-ResNet50 weights cached')"

"Air-gapped" runtime claim has a qai_hub telemetry caveat

HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 cover the HuggingFace path that DETRModel.from_pretrained() uses today, so the runtime-no-internet claim holds for the current code path. However, qai-hub-models 0.30.x transitively installs the qai_hub client which can read ~/.qai_hub/client.ini and attempt API calls if any helper beyond Model.from_pretrained() is invoked (e.g., compile, target_runtime).

I'd suggest either:

  • adding a one-line note in the README that "air-gapped" is verified for Model.from_pretrained() only, plus adding QAIHM_DISABLE_TELEMETRY=1 to the YAML env block defensively, or
  • adding a runtime assertion in detr_main.py that no outbound HTTPS is needed, by setting HTTPS_PROXY to an unreachable address during the smoke test.
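
The second option can be sketched as a small pre-flight helper in detr_main.py (the QAIHM_DISABLE_TELEMETRY variable and the unreachable-proxy trick are the review's suggestions, not verified against qai-hub internals):

```python
import os

def enforce_offline() -> None:
    """Belt-and-braces offline mode for the training pod.

    The HF flags keep transformers/hub from phoning home; routing HTTPS
    through an unreachable proxy makes any accidental download attempt
    fail fast instead of hanging the job.
    """
    os.environ.setdefault("HF_HUB_OFFLINE", "1")
    os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")
    os.environ.setdefault("QAIHM_DISABLE_TELEMETRY", "1")  # assumption from the review
    os.environ.setdefault("HTTPS_PROXY", "http://127.0.0.1:9")  # port 9: discard service
```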

@KeitaW (Collaborator) left a comment

Review Batch 5/5 — Documentation Consistency + Wrap-up

Final batch — small documentation drift items, the things that look great, and a suggested priority for addressing the findings across all 5 batches.


Things That Look Great

  • Clear, complete test plan: The PR body documents the exact verified configuration (2x ml.g5.8xlarge, NCCL/EFA confirmed, loss curve, checkpoint resume tested, air-gapped behaviour) — that's exactly the level of evidence I want to see for a new test case.
  • Self-contained contribution: Dockerfile, training script, Kubernetes manifest, dataset prep doc, and main README all live together and cross-link cleanly. New readers can go from zero to running.
  • Pinned Python dependencies: Every pip dependency in the Dockerfile is pinned to a specific version (torch==2.5.1, qai-hub-models==0.30.2, transformers==4.51.3, …), and the inline comment explaining why qai-hub-models is installed without the [detr-resnet50] extra is the kind of context I love seeing in Dockerfiles.
  • envsubst whitelist: The deployment instructions correctly use an explicit variable whitelist (envsubst '$IMAGE_URI $NUM_NODES $INSTANCE_TYPE'), avoiding accidental substitution of YAML's ${...} patterns.
  • PyTorchJob with elastic policy and restartPolicy: ExitCode: Using Kubeflow's PyTorchJob with maxRestarts: 3 is the right pattern for distributed training — it stays well under the "burning GPU hours on repeated failures" threshold and is consistent with other K8s test cases in the repo.
  • Air-gapped runtime: Pre-baking model weights into the image with TRANSFORMERS_OFFLINE=1/HF_HUB_OFFLINE=1 set at runtime is a thoughtful pattern for clusters without NAT, and it's clearly documented in the README.
  • Checkpoint resume from FSx: The script's RESUME_FLAG pattern in the YAML command is a clean way to wire up automatic resume on pod restarts, and it's verified end-to-end in the test plan.
  • .gitignore: Sensibly scoped — model artifacts, caches, and wandb/ are excluded so accidental commits of large outputs are unlikely.

Suggested priority

Items I'd want addressed before merge (correctness / portability):

  1. NCCL_SOCKET_IFNAME=eth0 → ^lo (Batch 2) — portability across AMIs
  2. Checkpoint save fires on every rank under torchrun (Batch 3) — file-write race on FSx
  3. Resume path depends on original_state_dict (Batch 3) — fragile checkpoint format
  4. Base image :latest (Batch 1) — pin to a specific tag
  5. Missing copyright headers (Batch 1) — required by repo conventions

Everything else is merge-friendly follow-up work.

# The weights originate from facebook/detr-resnet-50 on HuggingFace Hub; the QAI Hub
# Model.from_pretrained() wrapper downloads them via the transformers library.
# NOTE: This step requires internet access during docker build.
ENV TRANSFORMERS_CACHE=/workspace/cache/transformers

TRANSFORMERS_CACHE is deprecated in favor of HF_HOME

Per the HuggingFace docs TRANSFORMERS_CACHE has been deprecated for several major versions in favor of HF_HOME. With transformers==4.51.3 it still works but emits a deprecation warning. Could you drop TRANSFORMERS_CACHE here (and in the YAML template at lines 56–57) and rely on HF_HOME alone? That keeps the test case forward-compatible with future transformers versions.

Suggested change
ENV TRANSFORMERS_CACHE=/workspace/cache/transformers

Comment on lines +56 to +57
- name: TRANSFORMERS_CACHE
value: "/workspace/cache/transformers"

Drop TRANSFORMERS_CACHE (deprecated) — HF_HOME alone is sufficient

Same finding as in the Dockerfile: per the HuggingFace docs, TRANSFORMERS_CACHE is deprecated. The next two lines (HF_HOME=/workspace/cache/huggingface) cover the cache path correctly.

Suggested change
- name: TRANSFORMERS_CACHE
value: "/workspace/cache/transformers"


### Different Classes

Edit `meta.json` to define your own classes:

meta.json location is unclear in this section

The Customization section says "Edit meta.json to define your own classes" without saying where the file lives. A reader has to jump to data/README.md to learn it goes at the dataset root next to Supermarket shelves/. A one-line note here would save the trip:

Suggested change
Edit `meta.json` to define your own classes:
Edit `meta.json` (place at `<data-dir>/meta.json` — see [data/README.md](data/README.md)) to define your own classes:

Comment on lines +163 to +164
Checkpoints are saved to the directory specified by `CHECKPOINT_DIR` environment
variable (default: `/tmp/checkpoints`):

CHECKPOINT_DIR default in README differs from K8s default

The README says checkpoints land in CHECKPOINT_DIR with default /tmp/checkpoints. The K8s template script (yaml-template lines 102–106) sets CHECKPOINT_DIR to /fsx/checkpoint when writable — which is the actually-used path during the verified test run. Could you clarify in the README that the script-default differs from the deployment-default, or just point to /fsx/checkpoint since that's what reviewers checking the test plan will see in kubectl logs?

Suggested change
Checkpoints are saved to the directory specified by `CHECKPOINT_DIR` environment
variable (default: `/tmp/checkpoints`):
Checkpoints are saved to the directory specified by `CHECKPOINT_DIR` environment
variable (script default: `/tmp/checkpoints`; the Kubernetes deployment sets this to `/fsx/checkpoint`):

@KeitaW (Collaborator) left a comment

Few comments
