ToDi: Token-wise Distillation via Fine-Grained Divergence Control
ToDi is a token-wise knowledge distillation method that dynamically balances forward and reverse KL divergence at each token, weighted by the discrepancy between the teacher's and the student's predictions. This enables fine-grained distribution alignment between teacher and student models, and ToDi outperforms existing distillation approaches on instruction-following benchmarks.
May 22, 2025
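
To make the idea concrete, here is a minimal PyTorch sketch of a token-wise FKL/RKL blend. It is an illustration under stated assumptions, not the paper's exact formulation: the weighting function shown (a sigmoid of the teacher/student log-probability gap on the ground-truth token) and the function name `todi_style_loss` are choices made for this sketch.

```python
import torch
import torch.nn.functional as F

def todi_style_loss(teacher_logits, student_logits, targets):
    """Token-wise blend of forward and reverse KL (illustrative sketch).

    teacher_logits, student_logits: (batch, seq_len, vocab)
    targets: (batch, seq_len) ground-truth token ids
    """
    log_p = F.log_softmax(teacher_logits.detach(), dim=-1)  # teacher (no grad)
    log_q = F.log_softmax(student_logits, dim=-1)           # student
    p, q = log_p.exp(), log_q.exp()

    # Per-token discrepancy signal: teacher vs. student log-probability of the
    # ground-truth token (an assumed weighting; see lead-in above).
    p_t = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    q_t = log_q.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    alpha = torch.sigmoid(p_t - q_t)  # (batch, seq_len), in (0, 1)

    # Forward KL: KL(p || q); reverse KL: KL(q || p), both computed per token.
    fkl = (p * (log_p - log_q)).sum(-1)
    rkl = (q * (log_q - log_p)).sum(-1)

    # Where the teacher is more confident than the student (alpha -> 1),
    # emphasize forward KL (mass-covering); where the student overshoots
    # the teacher (alpha -> 0), emphasize reverse KL (mode-seeking).
    return (alpha * fkl + (1.0 - alpha) * rkl).mean()
```

In practice one would also mask padding positions before averaging; the sketch omits that for brevity.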