KD Capacity Gap · Interactive Explorer

Student Capacity Moderates Knowledge Distillation Effectiveness
Yaşar (2026) · arXiv:2605.31191 · CIFAR-10 · ResNet teacher-student pairs


Three Main Findings

① Student capacity moderates KD gain: ResNet-34 (R34) students benefit substantially more from distillation than ResNet-18 (R18) students, even when the teacher-student accuracy gaps are nearly identical (0.57 pp vs 0.56 pp).

② Implementation correctness matters: a gradient-clipping bug that excluded the projection layers from clip_grad_norm_ suppressed Feature-KD by ~0.18 pp and produced misleading comparisons. After the fix, Feature-KD matches or outperforms Logit-KD in 2 of 3 pairs.

③ Architecture dominates KD: a CIFAR-specific stem fix (conv1: 7×7 stride-2 → 3×3 stride-1, MaxPool removed) yields +5 pp, an order of magnitude larger than any KD gain (+0.30 pp max).
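
The clipping issue in finding ② can be illustrated with a small, dependency-free sketch of global-norm clipping, the operation that PyTorch's torch.nn.utils.clip_grad_norm_ performs over whatever parameters it is given. The parameter names, gradient values, and MAX_NORM below are hypothetical and chosen only for illustration; this is not the paper's training code. The point is that parameters left out of the call keep their full, unclipped gradient.

```python
import math


def l2(vec):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in vec))


def clip_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients in `grads` in place by one shared factor so that
    their joint L2 norm is at most `max_norm`. Returns the pre-clip norm."""
    total = math.sqrt(sum(g * g for vec in grads.values() for g in vec))
    scale = min(1.0, max_norm / (total + eps))
    for vec in grads.values():
        for i in range(len(vec)):
            vec[i] *= scale
    return total


def make_grads():
    # Hypothetical toy gradients: one student layer, one projection layer.
    return {"student.fc": [3.0, 4.0], "projector.conv": [6.0, 8.0]}


MAX_NORM = 1.0  # assumed threshold, for illustration only

# Buggy variant: only the student's gradients are passed to the clip call,
# so the projection layer's gradient is never rescaled.
buggy = make_grads()
student_only = {k: v for k, v in buggy.items() if k.startswith("student.")}
clip_global_norm(student_only, MAX_NORM)
buggy_projector_norm = l2(buggy["projector.conv"])

# Corrected variant: student and projector gradients are clipped jointly.
fixed = make_grads()
clip_global_norm(fixed, MAX_NORM)
fixed_projector_norm = l2(fixed["projector.conv"])
fixed_joint_norm = math.sqrt(sum(l2(v) ** 2 for v in fixed.values()))

print(f"buggy projector grad norm: {buggy_projector_norm:.3f}")
print(f"fixed projector grad norm: {fixed_projector_norm:.3f}")
print(f"fixed joint grad norm:     {fixed_joint_norm:.3f}")
```

In a PyTorch training loop, the analogous correction would presumably be to pass both the student's and the projector's parameters to a single clip_grad_norm_ call, so that both are scaled by the same factor.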
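
One plausible way to see why the stem change in finding ③ matters is to track the spatial resolution that reaches the first residual stage on 32×32 CIFAR-10 inputs. The sketch below applies the standard convolution/pooling output-size formula. The padding values are assumptions (3 for the 7×7 conv and 1 for the 3×3 max-pool, as in the usual ImageNet-style ResNet stem, and 1 for the 3×3 conv), since the summary above does not state them.

```python
def out_size(size, kernel, stride, padding):
    """Spatial output size of a conv or pooling layer (floor mode, dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_resolution(size, layers):
    """Apply a sequence of (kernel, stride, padding) layers to an input size."""
    for kernel, stride, padding in layers:
        size = out_size(size, kernel, stride, padding)
    return size


CIFAR_INPUT = 32  # CIFAR-10 images are 32x32

# ImageNet-style stem: 7x7 stride-2 conv, then 3x3 stride-2 max-pool.
# Padding values (3 and 1) are assumed, not taken from the source.
IMAGENET_STEM = [(7, 2, 3), (3, 2, 1)]

# CIFAR-specific stem: 3x3 stride-1 conv, max-pool removed.
# Padding of 1 is assumed so that the resolution is preserved.
CIFAR_STEM = [(3, 1, 1)]

imagenet_res = stem_resolution(CIFAR_INPUT, IMAGENET_STEM)
cifar_res = stem_resolution(CIFAR_INPUT, CIFAR_STEM)

print(f"ImageNet-style stem output: {imagenet_res}x{imagenet_res}")
print(f"CIFAR stem output:          {cifar_res}x{cifar_res}")
```

Under these assumptions the ImageNet-style stem reduces each spatial dimension by a factor of four before any residual block runs, while the CIFAR stem keeps the full input resolution.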