KD Capacity Gap Explorer

Three Main Findings

① Student capacity moderates KD gain: R34 students benefit substantially more than R18 students, even when the teacher-student accuracy gaps are nearly identical (0.57 pp for R34→R18 vs 0.56 pp for R50→R34).

② Implementation correctness matters: a gradient-clipping bug that excluded the projection layers from clip_grad_norm_ suppressed Feature-KD by ~0.18 pp and produced misleading comparisons. After the fix, Feature-KD matches or outperforms Logit-KD in 2 of the 3 pairs.
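
To illustrate the failure mode in finding ②, here is a minimal pure-Python sketch of global-norm gradient clipping. It is not the paper's code: the helpers `global_norm` and `clip_by_global_norm` and the toy gradient values are hypothetical, and the scaling rule only mimics the usual clip-by-global-norm behavior. The point is that any gradients left out of the clipped set keep their full magnitude.

```python
import math

def global_norm(grad_groups):
    """Joint L2 norm over every gradient value in every group."""
    return math.sqrt(sum(g * g for group in grad_groups for g in group))

def clip_by_global_norm(grad_groups, max_norm, eps=1e-6):
    """Rescale the given groups in place so their joint norm is at most max_norm."""
    total = global_norm(grad_groups)
    scale = min(1.0, max_norm / (total + eps))
    for group in grad_groups:
        for i, g in enumerate(group):
            group[i] = g * scale
    return total

def make_grads():
    # Hypothetical toy gradients: student backbone and 1x1 projection layers.
    return [3.0, 4.0], [6.0, 8.0]

# Buggy variant: only the student's gradients are passed to the clipper.
student_bug, proj_bug = make_grads()
clip_by_global_norm([student_bug], max_norm=1.0)

# Corrected variant: student and projection gradients are clipped together.
student_ok, proj_ok = make_grads()
clip_by_global_norm([student_ok, proj_ok], max_norm=1.0)

print(global_norm([proj_bug]))             # projection gradients left unclipped
print(global_norm([student_ok, proj_ok]))  # joint norm bounded by max_norm
```

In a PyTorch training loop, the analogous fix would presumably be to pass the parameters of both the student and the projection layers to clip_grad_norm_ in a single call.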

③ Architecture dominates KD: the CIFAR-specific stem fix (conv1: 7×7 stride-2 → 3×3 stride-1, MaxPool removed) yields +5 pp, an order of magnitude larger than any KD gain (at most +0.30 pp).
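
The effect of the stem change on feature-map resolution can be worked out with the standard output-size formula for convolution and pooling layers. The sketch below assumes paddings of 3 (7×7 conv) and 1 (3×3 conv and MaxPool), plus a 3×3 stride-2 MaxPool, which are the usual ResNet values but are not stated in the source; `conv_out` is an illustrative helper.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a conv or pooling layer (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

INPUT = 32  # CIFAR-10 images are 32x32

# ImageNet-style stem: 7x7 stride-2 conv, then 3x3 stride-2 MaxPool.
# Paddings 3 and 1 are assumed (standard ResNet values), not taken from the source.
after_conv = conv_out(INPUT, kernel=7, stride=2, padding=3)
imagenet_stem = conv_out(after_conv, kernel=3, stride=2, padding=1)

# CIFAR-specific stem: 3x3 stride-1 conv (padding 1 assumed), no MaxPool.
cifar_stem = conv_out(INPUT, kernel=3, stride=1, padding=1)

print(imagenet_stem, cifar_stem)  # 8 32
```

Under these assumptions the original stem hands the first residual stage an 8×8 map, a 16-fold reduction in spatial positions, while the CIFAR stem preserves the full 32×32 resolution.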

Interactive plots: KD Gains · Accuracy Landscape · Gap vs. Gain · Bug Analysis

Interactive Configuration Explorer

Select a teacher, student, and distillation method to explore results for that specific configuration. Studied pairs: R50→R18, R34→R18, R50→R34.

About This Demo

This Space accompanies the paper:

Student Capacity Moderates Knowledge Distillation Effectiveness: A Systematic Study Across ResNet Teacher-Student Pairs on CIFAR-10
Umut Onur Yaşar · arXiv:2605.31191 · May 2026

Experimental Setup

Dataset: CIFAR-10 (50k train / 10k test, 32×32)
Architecture: ResNet with CIFAR-specific stem (3×3 conv stride-1, no MaxPool)
Teachers: ResNet-50 (95.81%), ResNet-34 (95.70%)
Students: ResNet-18 (baseline 95.13%), ResNet-34 (baseline 95.25%)
Seeds: 3 seeds per configuration · mean ± std reported
Hardware: NVIDIA A100-SXM4-40GB
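
The teacher-student accuracy gaps quoted in the findings follow directly from the accuracies listed above. A small sketch of that arithmetic (the dictionary and function names are illustrative, not from the paper's code):

```python
# Test accuracies (%) as listed in the experimental setup.
TEACHER_ACC = {"R50": 95.81, "R34": 95.70}
STUDENT_BASELINE = {"R18": 95.13, "R34": 95.25}

# The three studied teacher -> student pairs.
PAIRS = [("R50", "R18"), ("R34", "R18"), ("R50", "R34")]

def accuracy_gap(teacher, student):
    """Teacher accuracy minus student baseline, in percentage points."""
    return round(TEACHER_ACC[teacher] - STUDENT_BASELINE[student], 2)

gaps = {f"{t}->{s}": accuracy_gap(t, s) for t, s in PAIRS}
print(gaps)
```

This gives 0.68 pp for R50→R18, 0.57 pp for R34→R18, and 0.56 pp for R50→R34, so the last two pairs have nearly identical gaps but students of different capacity.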

KD Methods

Logit-KD: KL divergence on temperature-scaled logits · α ∈ {0.3, 0.5, 0.7}, T ∈ {2, 3, 4}
Feature-KD: MSE + cosine similarity on 4-layer features via 1×1 Conv projections · α ∈ {0.3, 0.5, 0.7}, β=0.5
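
As a rough illustration of the Logit-KD objective, the sketch below computes a temperature-scaled KL term and enumerates the α × T grid listed above. It assumes the common weighting (1 − α)·CE + α·T²·KL(teacher ‖ student); the source does not state how α enters the loss or whether the T² factor is used, so treat that form, the function names, and the example logits as assumptions.

```python
import itertools
import math

def log_softmax(logits, T=1.0):
    """Numerically stable log-softmax of logits divided by temperature T."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - lse for z in scaled]

def logit_kd_loss(student_logits, teacher_logits, label, alpha, T):
    """Assumed form: (1 - alpha) * CE + alpha * T^2 * KL(teacher || student)."""
    log_p_t = log_softmax(teacher_logits, T)
    log_p_s = log_softmax(student_logits, T)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    ce = -log_softmax(student_logits)[label]
    return (1 - alpha) * ce + alpha * T * T * kl

# The Logit-KD search grid listed above: 3 alphas x 3 temperatures.
GRID = list(itertools.product([0.3, 0.5, 0.7], [2, 3, 4]))

# Hypothetical logits for one 3-class example (not real model outputs).
student = [1.0, 0.5, -0.5]
teacher = [2.0, 0.0, -1.0]
for alpha, T in GRID:
    print(alpha, T, round(logit_kd_loss(student, teacher, 0, alpha, T), 4))
```

With identical student and teacher logits the KL term vanishes, and with α = 0 the loss reduces to plain cross-entropy, which gives two quick sanity checks for any implementation of this form.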
