Deep Dive: The Mechanics of Knowledge Distillation
Estimated Reading Time: 12 minutes
Preface
I first encountered Hinton’s Knowledge Distillation paper back in 2015 during the final two weeks of my internship at IBM Watson Research. At the time, the concept of a student model learning from “dark knowledge” via hard and soft targets was intellectually fascinating. I remember asking my supervisor if we should try to implement it for our project. His response was classic pragmatism: he felt that playing with the temperature would require too much tuning and wouldn’t be a good use of time with only two weeks left.
Looking back now—especially with the rise of massive Large Language Models (LLMs) and specialized techniques like Multi-Token Prediction Distillation (MTP-D)—I realize how wrong that was. Temperature isn’t just a hyperparameter to “play with”; it is arguably the most important concept in the KD paper. It is the dial that controls the flow of structural information between models. This article is a deep dive into the math I wish I’d fully unpacked at Watson a decade ago.
TL;DR
Knowledge Distillation (KD) compresses large models by training the student's output distribution to match a "softened" teacher distribution. This deep dive derives the gradients of Cross-Entropy and KL Divergence to explain why we must scale the soft loss by $T^2$ to keep the distillation signal from vanishing, and why we deliberately leave the residual $1/N$ factor uncompensated to maintain training stability.
1. The Distillation Framework & Hyperparameters
The core philosophy of Knowledge Distillation is that the “dark knowledge” of a teacher model lies in the relative probabilities assigned to incorrect classes.
The Loss Function
The total objective function is a weighted linear combination:
\[L_{total} = (1-\alpha) L_{hard} + \alpha T^2 L_{soft}\]
Choosing $\alpha$ (Loss Weighting)
The parameter $\alpha$ determines how much the student trusts the teacher vs. the ground truth labels.
- The Recommendation: Surprisingly, $\alpha$ is typically set very high, often 0.9 to 0.99 [1].
- The Rationale: If the teacher is high-quality, its soft targets provide a much denser gradient signal than a simple one-hot “hard” label. In many modern LLM distillation setups, researchers use $\alpha = 1.0$ (only soft targets) during the initial stages to maximize the mimicry of the teacher’s logic.
Choosing $T$ (Temperature)
Temperature controls the entropy of the soft targets.
- Hinton’s Original Recommendation: In the seminal 2015 paper, Hinton et al. found that intermediate temperatures worked best, specifically noting that when the student is much smaller than the teacher, moderate temperatures (roughly 2.5 to 4) outperformed higher ones [1].
- Small $T$ ($1$ to $2$): Keeps the distribution “peaky.” This is recommended when the student has enough capacity to match the teacher’s exact confidence levels [2].
- Large $T$ ($3$ to $10$): Softens the distribution. This is essential when there is a significant capacity gap between teacher and student [1]. It prevents the student from being “overwhelmed” by the teacher’s extreme certainty, focusing instead on the relative similarities between classes.
- Dynamic $T$: Newer research suggests starting with a high $T$ to provide a smooth signal and decaying it as training progresses, allowing the student to learn more precise boundaries [3].
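Putting the two hyperparameters together, here is a minimal pure-Python sketch of the combined objective. The function names and example logits are illustrative, not from any particular library:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits (max-subtracted for stability)."""
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, hard_label, alpha=0.9, T=4.0):
    """L_total = (1 - alpha) * L_hard + alpha * T^2 * L_soft."""
    # Hard loss: standard cross-entropy at T = 1 against the ground-truth label.
    q_hard = softmax(student_logits)
    l_hard = -math.log(q_hard[hard_label])
    # Soft loss: KL(p || q) between softened teacher and student distributions.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    l_soft = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    # The T^2 factor compensates for the 1/T^2 gradient shrinkage derived below.
    return (1 - alpha) * l_hard + alpha * T * T * l_soft
```

With identical student and teacher logits and $\alpha = 1$, the loss is zero, as expected for a perfectly mimicking student.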
2. Gradient Derivation: From Softmax to Cross-Entropy
To understand the scaling, we first derive the gradient of the standard Cross-Entropy loss with respect to the logits $z$.
Let $q_i$ be the Softmax output: \(q_i = \frac{e^{z_i}}{\sum_k e^{z_k}}\)
The Cross-Entropy loss for a target distribution $p$ is: \(L = -\sum_j p_j \ln q_j\)
Using the chain rule for the $i$-th logit $z_i$: \(\frac{\partial L}{\partial z_i} = \sum_j \frac{\partial L}{\partial q_j} \frac{\partial q_j}{\partial z_i}\)
- Term 1: $\frac{\partial L}{\partial q_j} = -\frac{p_j}{q_j}$
- Term 2 (Softmax Jacobian):
  - If $i = j$: $\frac{\partial q_j}{\partial z_i} = q_i(1-q_i)$
  - If $i \neq j$: $\frac{\partial q_j}{\partial z_i} = -q_i q_j$
Combining these: \(\frac{\partial L}{\partial z_i} = -\frac{p_i}{q_i} [q_i(1-q_i)] + \sum_{j \neq i} \frac{p_j}{q_j} [q_i q_j] = q_i - p_i\)
The gradient is simply the difference between the prediction and the target.
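This identity is easy to verify numerically. The sketch below (pure Python, with made-up logits and a made-up target distribution) checks the analytic gradient $q_i - p_i$ against a central finite difference:

```python
import math

def softmax(z):
    """Standard softmax with max subtraction for numerical stability."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [x / s for x in e]

def cross_entropy(p, z):
    """L = -sum_j p_j ln q_j, with q = softmax(z)."""
    q = softmax(z)
    return -sum(pj * math.log(qj) for pj, qj in zip(p, q))

# Check dL/dz_i == q_i - p_i with a central finite difference.
z = [1.0, -0.5, 2.0]        # arbitrary logits
p = [0.2, 0.3, 0.5]         # arbitrary target distribution
q = softmax(z)
eps = 1e-6
for i in range(len(z)):
    z_hi = z[:]; z_hi[i] += eps
    z_lo = z[:]; z_lo[i] -= eps
    numeric = (cross_entropy(p, z_hi) - cross_entropy(p, z_lo)) / (2 * eps)
    assert abs(numeric - (q[i] - p[i])) < 1e-6
```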
3. The Temperature Effect and $1/T^2$ Scaling
When we introduce $T$, the student probability becomes $q_i = \text{softmax}(z_i/T)$. The chain rule then contributes an extra $1/T$ factor: \(\frac{\partial L_{soft}}{\partial z_i} = \frac{1}{T}(q_i - p_i)\), where $p_i$ is the teacher’s softened probability.
The Vanishing Gradient Problem
As $T$ increases, all probabilities approach $1/N$. Using the Taylor expansion $e^x \approx 1+x$ and assuming zero-mean logits, the student probability (with logits $z_{si}$) becomes: \(q_i \approx \frac{1}{N} + \frac{z_{si}}{NT}\), and analogously for the teacher with logits $z_{ti}$.
Substituting this back into the derivative of the KL loss: \(\frac{\partial L_{soft}}{\partial z_i} \approx \frac{1}{T} \left( \left[ \frac{1}{N} + \frac{z_{si}}{NT} \right] - \left[ \frac{1}{N} + \frac{z_{ti}}{NT} \right] \right) = \frac{z_{si} - z_{ti}}{NT^2}\)
Why multiply by $T^2$?
Without scaling, the gradient signal vanishes at a rate of $1/T^2$. Multiplying the loss by $T^2$ recovers a gradient magnitude independent of $T$: \(\text{Scaled Gradient} \approx \frac{z_{si} - z_{ti}}{N}\)
Note on $1/N$: We do not additionally multiply the loss by $N$ to cancel this factor, because raw logit differences $(z_{si} - z_{ti})$ are already large. Multiplying by $N$ (e.g., $128{,}000$ for an LLM vocabulary) would cause immediate gradient explosion.
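A small experiment (synthetic logits, pure Python) illustrates both effects: the unscaled gradient collapses roughly like $1/T^2$ as $T$ grows, while the $T^2$ multiplier keeps magnitudes comparable across temperatures:

```python
import math

def softmax(z, T):
    """Temperature-scaled softmax with max subtraction for stability."""
    m = max(v / T for v in z)
    e = [math.exp(v / T - m) for v in z]
    s = sum(e)
    return [x / s for x in e]

def soft_grad_norm(zs, zt, T, scale=False):
    """||dL_soft/dz|| where dL_soft/dz_i = (q_i - p_i)/T, optionally scaled by T^2."""
    q = softmax(zs, T)                       # student probabilities
    p = softmax(zt, T)                       # teacher (target) probabilities
    g = [(qi - pi) / T for qi, pi in zip(q, p)]
    if scale:
        g = [T * T * gi for gi in g]
    return math.sqrt(sum(gi * gi for gi in g))

zs = [0.5, -1.0, 0.5]    # made-up student logits
zt = [2.0, -1.0, -1.0]   # made-up teacher logits

# Without scaling, raising T from 1 to 10 shrinks the gradient by far more than 10x.
assert soft_grad_norm(zs, zt, 1.0) / soft_grad_norm(zs, zt, 10.0) > 10
# With the T^2 multiplier, the magnitudes stay within an order of magnitude.
ratio = soft_grad_norm(zs, zt, 10.0, scale=True) / soft_grad_norm(zs, zt, 1.0, scale=True)
assert 0.1 < ratio < 10
```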
4. Modern Applications of KD
Knowledge Distillation has moved far beyond MNIST digits. Today, it is the primary engine behind making “Frontier AI” accessible on consumer hardware.
- LLM Compression: Small open models (e.g., in the Llama family) use distillation to inherit reasoning capabilities from $70\text{B}+$ teachers while maintaining a footprint small enough for edge devices [4].
- Multi-Token Prediction (MTP-D): Modern transformers use KD to predict multiple future tokens simultaneously, which is critical for the next generation of efficient inference [5].
- Knowledge Distillation in RLHF: Used to regularize a “Policy” model against a “Reference” model via KL-divergence, preventing “reward hacking” while preserving the base model’s knowledge [6].
- Speculative Decoding: Small “draft” models are distilled from larger “verifier” models to improve the token acceptance rate during lookahead decoding [7].
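As a toy illustration of the RLHF point, a common per-token KL-style penalty can be sketched as follows (the function name, estimator, and values are hypothetical, not from any specific framework):

```python
def kl_penalty(policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token penalty keeping the policy close to the reference model.

    A common single-sample estimator of the KL term is
    beta * (log pi(a|s) - log pi_ref(a|s)) evaluated on the policy's own samples.
    """
    return [beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]

# If the policy assigns higher probability to its own sample than the reference
# does, the penalty is positive and is subtracted from the reward.
pen = kl_penalty([-0.5, -1.0], [-1.5, -1.0], beta=0.1)
assert pen[0] > 0          # policy drifted: penalized
assert abs(pen[1]) < 1e-12 # policy agrees with reference: no penalty
```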
5. Numerical Example: The Signal Gap
Consider a 1000-class problem ($N=1000$) where a student is learning from a teacher. The numbers below are illustrative, chosen to show orders of magnitude rather than measurements from a specific run.
| Metric | $T=1$ (Unscaled) | $T=5$ (Unscaled) | $T=5$ (Scaled by $T^2$) |
|---|---|---|---|
| Softmax Peakiness | Very High | Low (Soft) | Low (Soft) |
| KL Loss | 0.850 | 0.020 | 0.500 |
| Grad Norm | 0.420 | 0.012 | 0.300 |
By multiplying by $25$ ($T^2$), we bring the teacher’s influence back to a magnitude where the student can actually “hear” the signal.
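A quick synthetic experiment (random Gaussian logits, $N=1000$; the distributions and seed are assumptions for illustration) shows the same qualitative gap:

```python
import math
import random

def softmax(z, T=1.0):
    """Temperature-scaled softmax (max-subtracted for numerical stability)."""
    m = max(v / T for v in z)
    e = [math.exp(v / T - m) for v in z]
    s = sum(e)
    return [x / s for x in e]

random.seed(0)
N = 1000
zt = [random.gauss(0.0, 2.0) for _ in range(N)]  # synthetic teacher logits
zs = [random.gauss(0.0, 2.0) for _ in range(N)]  # synthetic (untrained) student logits

results = {}
for T in (1.0, 5.0):
    p = softmax(zt, T)                            # softened teacher targets
    q = softmax(zs, T)                            # softened student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    grad_norm = math.sqrt(sum(((qi - pi) / T) ** 2 for pi, qi in zip(p, q)))
    results[T] = (kl, grad_norm, T * T * grad_norm)
    print(f"T={T}: KL={kl:.4f}  grad_norm={grad_norm:.5f}  scaled={T*T*grad_norm:.5f}")
```

At $T=5$ both the unscaled KL and the gradient norm collapse relative to $T=1$; multiplying by $T^2 = 25$ restores a usable signal magnitude.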
References
[1] Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531.
[2] Mirzadeh, S. I., et al. (2020). Improved Knowledge Distillation via Teacher Assistant. arXiv:1902.03393.
[3] Li, Z., et al. (2023). Curriculum Temperature for Knowledge Distillation.
[4] Touvron, H., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288.
[5] DeepSeek-V3 Technical Report (2024). Multi-Token Prediction Distillation (MTP-D) applications.
[6] Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
[7] Leviathan, Y., et al. (2023). Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192.