KD Capacity Gap · Interactive Explorer
Student Capacity Moderates Knowledge Distillation Effectiveness
Yaşar (2026) · arXiv:2605.31191 · CIFAR-10 · ResNet teacher-student pairs
Three Main Findings
① Student capacity moderates KD gain: R34 students benefit substantially more from distillation than R18 students, even though the teacher-student accuracy gaps are nearly identical (0.57 pp for R34→R18 vs 0.56 pp for R50→R34).
② Implementation correctness matters: a gradient clipping bug that excluded the projection layers from clip_grad_norm_ suppressed Feature-KD accuracy by ~0.18 pp and produced misleading comparisons. After the fix, Feature-KD matches or outperforms Logit-KD in 2 of 3 pairs.
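The clipping issue in finding ② can be illustrated with a minimal, library-free sketch. It assumes the usual global-norm rule (measure the joint L2 norm of the gradients passed in, then scale them all by the same factor); the helper names, gradient values, and `MAX_NORM` are hypothetical and are not taken from the paper's code.

```python
import math

def l2(grads):
    """Joint L2 norm over a list of gradient vectors."""
    return math.sqrt(sum(g * g for grad in grads for g in grad))

def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale the given gradients in place so their joint L2 norm is at most max_norm.

    Only the gradients passed in are measured and scaled; anything left out
    of `grads` is untouched. Returns the norm measured before scaling.
    """
    total_norm = l2(grads)
    scale = min(1.0, max_norm / (total_norm + eps))
    for grad in grads:
        for i in range(len(grad)):
            grad[i] *= scale
    return total_norm

# Hypothetical values, chosen only to make the arithmetic easy to follow.
MAX_NORM = 1.0

# Buggy variant: the projection gradients are never passed to the clipper.
student, projection = [[3.0, 4.0]], [[6.0, 8.0]]
clip_grad_norm(student, MAX_NORM)
buggy_projection_norm = l2(projection)  # unchanged by clipping

# Corrected variant: student and projection gradients are clipped jointly.
student, projection = [[3.0, 4.0]], [[6.0, 8.0]]
clip_grad_norm(student + projection, MAX_NORM)
fixed_joint_norm = l2(student + projection)  # bounded by MAX_NORM

print(buggy_projection_norm, fixed_joint_norm)
```

In the buggy variant the projection layers keep their full gradient magnitude, so the 1×1 projections used by Feature-KD can take much larger update steps than the clipped student. Passing both parameter groups to a single clipping call bounds their joint norm.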
③ Architecture dominates KD: the CIFAR-specific stem fix (conv1: 7×7 stride-2 → 3×3 stride-1, MaxPool removed) yields +5 pp, more than an order of magnitude larger than any KD gain (+0.30 pp max).
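One way to see why the stem matters on 32×32 inputs is to track the feature-map size it produces. The sketch below uses the standard convolution output-size formula. The padding values (3 for the 7×7 conv, 1 for the 3×3 conv and the pool) and the 3×3 stride-2 pooling window are assumptions based on the common ImageNet-style ResNet stem; the page itself states only the kernel sizes and strides of conv1.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a conv or pooling layer (floor mode, dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

IMAGE = 32  # CIFAR-10 side length in pixels

# ImageNet-style stem: 7x7 stride-2 conv, then 3x3 stride-2 max pool (assumed paddings 3 and 1).
imagenet_stem = conv_out(conv_out(IMAGE, kernel=7, stride=2, padding=3),
                         kernel=3, stride=2, padding=1)

# CIFAR-specific stem: 3x3 stride-1 conv (assumed padding 1), no max pool.
cifar_stem = conv_out(IMAGE, kernel=3, stride=1, padding=1)

print(imagenet_stem, cifar_stem)  # 8 32
```

Under these assumptions the original stem hands the residual stages an 8×8 map, while the CIFAR stem preserves the full 32×32 resolution.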
Interactive Configuration Explorer
Select a teacher, student, and distillation method to explore results for that specific configuration. Studied pairs: R50→R18, R34→R18, R50→R34.
About This Demo
This Space accompanies the paper:
Student Capacity Moderates Knowledge Distillation Effectiveness: A Systematic Study Across ResNet Teacher-Student Pairs on CIFAR-10
Umut Onur Yaşar · arXiv:2605.31191 · May 2026
Experimental Setup
- Dataset: CIFAR-10 (50k train / 10k test, 32×32)
- Architecture: ResNet with CIFAR-specific stem (3×3 conv stride-1, no MaxPool)
- Teachers: ResNet-50 (95.81%), ResNet-34 (95.70%)
- Students: ResNet-18 (baseline 95.13%), ResNet-34 (baseline 95.25%)
- Seeds: 3 seeds per configuration · mean ± std reported
- Hardware: NVIDIA A100-SXM4-40GB
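The teacher-student accuracy gaps for the three studied pairs follow directly from the accuracies listed above. This short calculation assumes the gap is defined as teacher accuracy minus student baseline accuracy; the dictionary and variable names are illustrative.

```python
# Accuracies (%) as listed in the experimental setup.
teachers = {"R50": 95.81, "R34": 95.70}
students = {"R18": 95.13, "R34": 95.25}
pairs = [("R50", "R18"), ("R34", "R18"), ("R50", "R34")]

# Gap in percentage points: teacher accuracy minus student baseline accuracy.
gaps = {f"{t}->{s}": round(teachers[t] - students[s], 2) for t, s in pairs}

for pair, gap in gaps.items():
    print(f"{pair}: {gap:.2f} pp")
# R50->R18: 0.68 pp
# R34->R18: 0.57 pp
# R50->R34: 0.56 pp
```

The R34→R18 and R50→R34 pairs differ by only 0.01 pp in gap, which is why they isolate the effect of student capacity.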
KD Methods
- Logit-KD: KL divergence on temperature-scaled logits · α ∈ {0.3, 0.5, 0.7}, T ∈ {2, 3, 4}
- Feature-KD: MSE + cosine similarity on 4-layer features via 1×1 Conv projections · α ∈ {0.3, 0.5, 0.7}, β=0.5
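The two objectives above can be sketched in plain Python for a single example. The exact weighting is not spelled out on this page, so the forms used here are assumptions: Logit-KD as α·T²·KL(teacher ‖ student) + (1 − α)·CE, and the per-layer Feature-KD term as MSE + β·(1 − cosine similarity) on student features passed through a bias-free 1×1 projection. All function names are hypothetical.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_kd_loss(student_logits, teacher_logits, label, alpha=0.5, T=4.0):
    """Assumed form: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * T * T * kl + (1 - alpha) * ce

def project_1x1(feature, weight):
    """Bias-free 1x1 conv: the same linear map applied at every spatial location.

    feature: list of per-location channel vectors; weight: [out_ch][in_ch].
    """
    return [[sum(w * x for w, x in zip(row, pixel)) for row in weight]
            for pixel in feature]

def feature_kd_term(student_feat, teacher_feat, weight, beta=0.5, eps=1e-12):
    """Assumed per-layer term: MSE + beta * (1 - cosine) on flattened features."""
    s = [v for pixel in project_1x1(student_feat, weight) for v in pixel]
    t = [v for pixel in teacher_feat for v in pixel]
    mse = sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
    dot = sum(a * b for a, b in zip(s, t))
    norms = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
    cosine = dot / max(norms, eps)
    return mse + beta * (1.0 - cosine)

# Toy inputs, for illustration only.
kd = logit_kd_loss([2.0, 0.5, -1.0], [3.0, 0.0, -2.0], label=0, alpha=0.5, T=4.0)
identity = [[1.0, 0.0], [0.0, 1.0]]
feat = [[1.0, 2.0], [0.5, -1.0]]  # two spatial locations, two channels
matched = feature_kd_term(feat, feat, identity)  # ~0 when features already agree
print(kd, matched)
```

With four feature layers, one such term would be computed per layer and the results combined, for example by summing; how the paper aggregates them and mixes them with cross-entropy via α is not stated here.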