
Notable Non-Record: Switched Deep Supervision (first DS submission)#1629

Open
channyzf6 wants to merge 1 commit into openai:main from channyzf6:ds-submission

Conversation

@channyzf6

Summary

First Deep Supervision (DS) submission in the competition. Introduces Switched Deep Supervision — a novel auxiliary loss scheduling strategy where a single randomly-chosen intermediate layer receives auxiliary CE supervision through the shared LM head each step.

Single-seed result (seed 42, 8×H100, 588s):

  • Quantized + TTT BPB: 1.08288
  • Quantized + sliding window BPB: 1.08449
  • Pre-quant post-EMA BPB: 1.08933
  • Total artifact: 15,997,104 bytes (under 16 MB) ✓

Status: Non-record. Does not beat current SOTA (PR #1493 at 1.0810). Submitted as scientifically interesting research with documented negative results.

Novel Contributions

  1. Deep Supervision via shared LM head — auxiliary CE loss at intermediate layers (default 6, 7, 9), reusing the existing tied embedding. Zero new parameters, zero artifact cost.

  2. Switched DS — randomly pick ONE DS layer per step instead of supervising all selected layers. Cuts the DS overhead per step by ~3× (one auxiliary head pass instead of three) while preserving most of the per-step convergence benefit.

  3. Fraction-based DS alpha decay — alpha decays linearly from 70% → 85% of training, then off, decoupling DS-induced weight oscillation from final EMA averaging.

  4. Per-layer adaptive GPTQ + int7 embeddings — adopted from the PR #1586 lineage ("Record: Per-Layer Adaptive GPTQ Clip + int7 Embeddings + MLR 0.026 — val_bpb 1.07493 (3-seed mean)") to fit a valid 16 MB artifact.
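The per-layer adaptive clip idea in item 4 can be sketched in plain Python. This is a simplified round-to-nearest quantizer with a per-layer clip-ratio search, not full GPTQ; the function names, bit width, and grid values here are illustrative, not from the PR's code.

```python
def quantize(ws, clip, bits=4):
    """Symmetric round-to-nearest quantization, clipping weights to [-clip, clip]."""
    levels = 2 ** (bits - 1) - 1
    scale = clip / levels
    return [round(max(-clip, min(clip, w)) / scale) * scale for w in ws]

def best_clip(ws, bits=4, grid=(0.7, 0.8, 0.9, 1.0)):
    """Per-layer adaptive clip: choose the clip ratio (as a fraction of this
    layer's max |weight|) that minimizes squared reconstruction error."""
    wmax = max(abs(w) for w in ws)

    def err(ratio):
        deq = quantize(ws, ratio * wmax, bits)
        return sum((w - d) ** 2 for w, d in zip(ws, deq))

    return min(grid, key=err)
```

The point of searching the clip per layer is that a layer with heavy-tailed weights often reconstructs better when outliers are clipped, while a well-behaved layer prefers the full range; a single global clip ratio cannot serve both.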

Documented Negative Results

Included for community benefit:

  • Predictive Coding (cosine similarity inter-layer prediction): net negative across all alpha values tested (0.005, 0.01, 0.1)
  • Multi-Token Prediction with full transformer block heads: 0.005 to 0.010 BPB worse than pure DS
  • Medusa-style MTP (RMSNorm + Linear heads): 0.0016 BPB worse than pure DS even with the init fix
  • See README for full ablation table

Architecture

Built on the April 2026 SOTA stack (PR #1493 by bigbag): SP8192, depth recurrence (loop layers 3-5 three times, activate at 35%), parallel residuals (layers 7-10), MuonEq-R optimizer, QK-Gain 5.25, XSA on all 11 layers, FlashAttention 3, EMA decay 0.9965, warmdown fraction 0.72, legal score-first TTT.

Plus our novel additions: Switched DS at layers 6,7,9 with alpha=0.01.
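The switching and alpha schedule can be sketched in plain Python. Helper names are ours; the layer set {6, 7, 9}, alpha = 0.01, and the 70%/85% decay fractions are the values used in this PR.

```python
import random

DS_LAYERS = [6, 7, 9]                 # intermediate layers eligible for aux supervision
DS_ALPHA = 0.01                       # base auxiliary-loss weight
DECAY_START, DECAY_END = 0.70, 0.85   # fractions of total training steps

def pick_ds_layer(rng=random):
    """Switched DS: supervise exactly ONE randomly chosen layer this step."""
    return rng.choice(DS_LAYERS)

def ds_alpha(step, total_steps):
    """Fraction-based alpha decay: full weight until 70% of training,
    linear decay to zero by 85%, then off, so DS-induced weight
    oscillation cannot contaminate the final EMA average."""
    frac = step / total_steps
    if frac < DECAY_START:
        return DS_ALPHA
    if frac < DECAY_END:
        return DS_ALPHA * (DECAY_END - frac) / (DECAY_END - DECAY_START)
    return 0.0
```

Per step the training loss would then be `main_ce + ds_alpha(step, total) * aux_ce`, where the auxiliary CE is computed by projecting the chosen layer's hidden states through the shared (tied) LM head, which is why the scheme adds zero parameters and zero artifact cost.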

Compliance (Track B)

All four conditions per Issue #1017 satisfied:

  • Causality (sliding window with prefix-only)
  • Normalized softmax (no n-gram/logit bias)
  • Score-before-update (TTT chunks scored under torch.no_grad() before SGD)
  • Single pass (each token scored once)

DS heads are training-only (not in artifact). All artifacts < 16,000,000 bytes. Training < 600s on 8×H100.
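The score-before-update condition can be illustrated with a framework-agnostic sketch (names are hypothetical; in the actual run the scoring pass sits under `torch.no_grad()` before the SGD step):

```python
def legal_ttt(chunks, params, score, sgd_step):
    """Legal score-first TTT: every chunk is scored exactly once, using the
    parameters as they were BEFORE any update derived from that chunk."""
    total_loss = 0.0
    for chunk in chunks:
        total_loss += score(params, chunk)   # scoring pass: no update has seen this chunk
        params = sgd_step(params, chunk)     # adaptation happens only after scoring
    return total_loss, params
```

The ordering is the whole point: swapping the two lines inside the loop would let each chunk be scored with parameters already adapted to it, which violates the score-before-update condition.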

Reproduction

See README.md for full reproduction commands.

Test plan

  • Single-seed (42) validated on 8×H100, fits in 16 MB
  • All four compliance conditions (Issue #1017, "A Field Guide to Valid Submissions") verified
  • Documented architecture, hyperparameters, ablation
  • 3-seed validation (in progress, pending compute credits)
  • Top-K sampled softmax for DS auxiliary losses (separate branch, pending H100 validation)

First Deep Supervision submission in the competition.

Key contributions:
- Introduces Switched DS: random single-layer auxiliary CE supervision
  through the shared LM head (alpha=0.01)
- Fraction-based DS alpha decay (active 0-70%, decay 70-85%, off 85-100%)
  to decouple DS-induced weight oscillation from EMA averaging
- Per-layer adaptive GPTQ + int7 embeddings for valid 16MB artifact

Single-seed (42) result on 8xH100:
- Pre-quant post-EMA BPB: 1.08933
- Quantized BPB: 1.10110
- Sliding window BPB: 1.08449
- TTT BPB: 1.08288
- Artifact: 15,997,104 bytes (under 16MB)
- Steps: 4316

Comparison to merged SOTA (PR openai#1493): +0.0019 BPB worse.
Not record-eligible. Posted as scientifically interesting non-record
submission with documented negative results on PC and MTP variants.

Built on April 2026 SOTA stack: SP8192, depth recurrence, parallel
residuals, MuonEq-R, XSA-all, score-first TTT.
