Notable Non-Record: Switched Deep Supervision (first DS submission)#1629
Open

channyzf6 wants to merge 1 commit into openai:main
Conversation
First Deep Supervision submission in the competition. Key contributions:
- Introduces Switched DS: random single-layer auxiliary CE supervision through the shared LM head (alpha=0.01)
- Fraction-based DS alpha decay (active 0-70%, decay 70-85%, off 85-100%) to decouple DS-induced weight oscillation from EMA averaging
- Per-layer adaptive GPTQ + int7 embeddings for a valid 16 MB artifact

Single-seed (42) result on 8×H100:
- Pre-quant post-EMA BPB: 1.08933
- Quantized BPB: 1.10110
- Sliding window BPB: 1.08449
- TTT BPB: 1.08288
- Artifact: 15,997,104 bytes (under 16 MB)
- Steps: 4316

Comparison to merged SOTA (PR openai#1493): +0.0019 BPB worse. Not record-eligible. Posted as a scientifically interesting non-record submission with documented negative results on PC and MTP variants.

Built on the April 2026 SOTA stack: SP8192, depth recurrence, parallel residuals, MuonEq-R, XSA-all, score-first TTT.
Summary
First Deep Supervision (DS) submission in the competition. Introduces Switched Deep Supervision — a novel auxiliary loss scheduling strategy in which a single randomly chosen intermediate layer receives auxiliary CE supervision through the shared LM head at each step.
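A minimal sketch of the switched auxiliary loss, assuming a per-layer list of hidden states and a tied LM head; function and argument names are illustrative, not the PR's actual code:

```python
import torch
import torch.nn.functional as F

def switched_ds_loss(hidden_states, lm_head, targets, ds_layers=(6, 7, 9), alpha=0.01):
    """Auxiliary CE loss from ONE randomly chosen DS layer per step.

    hidden_states: list of per-layer activations, each (batch, seq, d_model).
    lm_head: the shared (tied-embedding) output projection.
    Shapes and names here are assumptions for illustration.
    """
    # Switched DS: sample a single layer instead of supervising all of them,
    # cutting the auxiliary compute by roughly len(ds_layers)x per step.
    layer = ds_layers[torch.randint(len(ds_layers), (1,)).item()]
    aux_logits = lm_head(hidden_states[layer])  # reuse tied LM head: zero new params
    aux_ce = F.cross_entropy(aux_logits.flatten(0, 1), targets.flatten())
    return alpha * aux_ce
```

The returned term would simply be added to the main LM loss each training step.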
Single-seed result (seed 42, 8×H100, 588s):
- Pre-quant post-EMA BPB: 1.08933
- Quantized BPB: 1.10110
- Sliding window BPB: 1.08449
- TTT BPB: 1.08288
- Artifact: 15,997,104 bytes
- Steps: 4316
Status: Non-record. Does not beat current SOTA (PR #1493 at 1.0810). Submitted as scientifically interesting research with documented negative results.
Novel Contributions
Deep Supervision via shared LM head — auxiliary CE loss at intermediate layers (default 6, 7, 9), reusing the existing tied embedding. Zero new parameters, zero artifact cost.
Switched DS — randomly pick ONE DS layer per step instead of supervising all selected layers. Reduces per-step compute by ~3× while preserving most of the per-step convergence benefit.
Fraction-based DS alpha decay — alpha decays linearly from 70% → 85% of training, then off, decoupling DS-induced weight oscillation from final EMA averaging.
Per-layer adaptive GPTQ + int7 embeddings — adopted from the PR #1586 lineage ("Record: Per-Layer Adaptive GPTQ Clip + int7 Embeddings + MLR 0.026 — val_bpb 1.07493 (3-seed mean)") to fit a valid 16 MB artifact.
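The fraction-based alpha decay in the list above can be sketched as a small schedule function; the name and signature are illustrative assumptions, only the breakpoints (70% / 85%) and alpha=0.01 come from the PR:

```python
def ds_alpha(step, total_steps, alpha0=0.01, active_until=0.70, off_from=0.85):
    """Fraction-based DS alpha schedule (sketch): full strength for the first
    70% of training, linear decay to zero between 70% and 85%, then off so the
    final EMA window averages weights free of DS-induced oscillation."""
    frac = step / total_steps
    if frac < active_until:
        return alpha0
    if frac >= off_from:
        return 0.0
    # linear decay across the 70%-85% window
    return alpha0 * (off_from - frac) / (off_from - active_until)
```

Keying the schedule to training fraction rather than absolute step keeps it stable if the step budget changes between runs.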
Documented Negative Results
Included for community benefit: negative results on the PC and MTP variants (see commit message).
Architecture
Built on the April 2026 SOTA stack (PR #1493 by bigbag): SP8192, depth recurrence (loop layers 3-5 three times, activate at 35%), parallel residuals (layers 7-10), MuonEq-R optimizer, QK-Gain 5.25, XSA on all 11 layers, FlashAttention 3, EMA decay 0.9965, warmdown fraction 0.72, legal score-first TTT.
Plus our novel additions: Switched DS at layers 6,7,9 with alpha=0.01.
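For reference, the stack described above can be summarized as a config fragment; the key names are assumptions, but every value is taken from this PR's description:

```python
# Hypothetical config mirroring the April 2026 SOTA stack plus this PR's additions.
SOTA_STACK = dict(
    seq_len=8192,                 # SP8192
    recur_layers=(3, 4, 5),       # depth recurrence: loop these layers 3x
    recur_loops=3,
    recur_activate_frac=0.35,     # recurrence activates at 35% of training
    parallel_residual_layers=(7, 8, 9, 10),
    optimizer="MuonEq-R",
    qk_gain=5.25,
    xsa_layers=tuple(range(11)),  # XSA on all 11 layers
    attention="FlashAttention 3",
    ema_decay=0.9965,
    warmdown_frac=0.72,
    ttt="score-first",            # legal score-first TTT
)
DS_CONFIG = dict(ds_layers=(6, 7, 9), ds_alpha=0.01, switched=True)
```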
Compliance (Track B)
All four conditions per Issue #1017 satisfied:
- Score-first TTT is legal (`torch.no_grad()` before SGD)
- DS heads are training-only (not in the artifact)
- All artifacts < 16,000,000 bytes
- Training < 600s on 8×H100
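The two mechanical budget checks above could be verified with a few lines; the function name and arguments are illustrative, not part of the competition harness:

```python
import os

def check_compliance(artifact_path, train_seconds,
                     artifact_limit=16_000_000, time_limit=600):
    """Sketch: assert the artifact-size and wall-clock budgets from Issue #1017."""
    size = os.path.getsize(artifact_path)
    assert size < artifact_limit, f"artifact {size} bytes >= {artifact_limit}"
    assert train_seconds < time_limit, f"training {train_seconds}s >= {time_limit}s"
    return size
```

With this run's numbers (15,997,104 bytes, 588s), both assertions pass.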
Reproduction
See README.md for full reproduction commands.
Test plan