Notable Non-Record: Switched Deep Supervision (first DS submission) by channyzf6 · Pull Request #1629 · openai/parameter-golf

channyzf6 · 2026-04-15T00:52:13Z

Summary

First Deep Supervision (DS) submission in the competition. Introduces Switched Deep Supervision — a novel auxiliary loss scheduling strategy where a single randomly-chosen intermediate layer receives auxiliary CE supervision through the shared LM head each step.

Single-seed result (seed 42, 8×H100, 588s):

Quantized + TTT BPB: 1.08288
Quantized + sliding window BPB: 1.08449
Pre-quant post-EMA BPB: 1.08933
Total artifact: 15,997,104 bytes (under 16 MB) ✓

Status: Non-record. Does not beat current SOTA (PR #1493 at 1.0810). Submitted as scientifically interesting research with documented negative results.

Novel Contributions

Deep Supervision via shared LM head — auxiliary CE loss at intermediate layers (default 6, 7, 9), reusing the existing tied embedding. Zero new parameters, zero artifact cost.
Switched DS — randomly pick ONE DS layer per step instead of supervising all selected layers. Reduces per-step compute by ~3× while preserving most of the per-step convergence benefit.
Fraction-based DS alpha decay — alpha decays linearly from 70% → 85% of training, then off, decoupling DS-induced weight oscillation from final EMA averaging.
Per-layer adaptive GPTQ + int7 embeddings — adopted from PR Record: Per-Layer Adaptive GPTQ Clip + int7 Embeddings + MLR 0.026 — val_bpb 1.07493 (3-seed mean) #1586 lineage to fit valid 16 MB artifact.

Documented Negative Results

Included for community benefit:

Predictive Coding (cosine similarity inter-layer prediction): net negative across all alpha values tested (0.005, 0.01, 0.1)
Multi-Token Prediction with full transformer block heads: -0.005 to -0.010 BPB worse than pure DS
Medusa-style MTP (RMSNorm + Linear heads): -0.0016 BPB worse than pure DS even with init fix
See README for full ablation table

Architecture

Built on the April 2026 SOTA stack (PR #1493 by bigbag): SP8192, depth recurrence (loop layers 3-5 three times, activate at 35%), parallel residuals (layers 7-10), MuonEq-R optimizer, QK-Gain 5.25, XSA on all 11 layers, FlashAttention 3, EMA decay 0.9965, warmdown fraction 0.72, legal score-first TTT.

Plus our novel additions: Switched DS at layers 6,7,9 with alpha=0.01.

Compliance (Track B)

All four conditions per Issue #1017 satisfied:

Causality (sliding window with prefix-only)
Normalized softmax (no n-gram/logit bias)
Score-before-update (TTT chunks scored under torch.no_grad() before SGD)
Single pass (each token scored once)

DS heads are training-only (not in artifact). All artifacts < 16,000,000 bytes. Training < 600s on 8×H100.

Reproduction

See README.md for full reproduction commands.

Test plan

Single-seed (42) validated on 8×H100, fits in 16 MB
All four compliance conditions (Issue A Field Guide to Valid Submissions #1017) verified
Documented architecture, hyperparameters, ablation
3-seed validation (in progress, pending compute credits)
Top-K sampled softmax for DS auxiliary losses (separate branch, pending H100 validation)

First Deep Supervision submission in the competition. Key contributions: - Introduces Switched DS: random single-layer auxiliary CE supervision through the shared LM head (alpha=0.01) - Fraction-based DS alpha decay (active 0-70%, decay 70-85%, off 85-100%) to decouple DS-induced weight oscillation from EMA averaging - Per-layer adaptive GPTQ + int7 embeddings for valid 16MB artifact Single-seed (42) result on 8xH100: - Pre-quant post-EMA BPB: 1.08933 - Quantized BPB: 1.10110 - Sliding window BPB: 1.08449 - TTT BPB: 1.08288 - Artifact: 15,997,104 bytes (under 16MB) - Steps: 4316 Comparison to merged SOTA (PR openai#1493): +0.0019 BPB worse. Not record-eligible. Posted as scientifically interesting non-record submission with documented negative results on PC and MTP variants. Built on April 2026 SOTA stack: SP8192, depth recurrence, parallel residuals, MuonEq-R, XSA-all, score-first TTT.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Notable Non-Record: Switched Deep Supervision (first DS submission)#1629

Notable Non-Record: Switched Deep Supervision (first DS submission)#1629
channyzf6 wants to merge 1 commit intoopenai:mainfrom
channyzf6:ds-submission

channyzf6 commented Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

channyzf6 commented Apr 15, 2026

Summary

Novel Contributions

Documented Negative Results

Architecture

Compliance (Track B)

Reproduction

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant