ToDi: Token-wise Distillation via Fine-Grained Divergence Control
ToDi is a token-wise knowledge distillation method that dynamically balances forward and reverse KL divergence at each token, weighted by the discrepancy between the teacher's and the student's predictions. This enables fine-grained distribution alignment between teacher and student models, and ToDi outperforms existing distillation approaches on instruction-following benchmarks.
May 22, 2025
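
To make the idea concrete, here is a minimal PyTorch sketch of a token-wise FKL/RKL blend. It is an illustration under stated assumptions, not the paper's exact formulation: the weighting function shown (a sigmoid of the teacher/student log-probability gap on the ground-truth token) and the function name `todi_style_loss` are choices made for this sketch.

```python
import torch
import torch.nn.functional as F

def todi_style_loss(teacher_logits, student_logits, targets):
    """Token-wise blend of forward and reverse KL (illustrative sketch).

    teacher_logits, student_logits: (batch, seq_len, vocab)
    targets: (batch, seq_len) ground-truth token ids
    """
    log_p = F.log_softmax(teacher_logits.detach(), dim=-1)  # teacher (no grad)
    log_q = F.log_softmax(student_logits, dim=-1)           # student
    p, q = log_p.exp(), log_q.exp()

    # Per-token discrepancy signal: teacher vs. student log-probability of the
    # ground-truth token (an assumed weighting; see lead-in above).
    p_t = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    q_t = log_q.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    alpha = torch.sigmoid(p_t - q_t)  # (batch, seq_len), in (0, 1)

    # Forward KL: KL(p || q); reverse KL: KL(q || p), both computed per token.
    fkl = (p * (log_p - log_q)).sum(-1)
    rkl = (q * (log_q - log_p)).sum(-1)

    # Where the teacher is more confident than the student (alpha -> 1),
    # emphasize forward KL (mass-covering); where the student overshoots
    # the teacher (alpha -> 0), emphasize reverse KL (mode-seeking).
    return (alpha * fkl + (1.0 - alpha) * rkl).mean()
```

In practice one would also mask padding positions before averaging; the sketch omits that for brevity.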