The authors propose ARKD, a reinforcement-learning-based adaptive KL-weighted distillation framework that addresses the limitations of methods relying on a single KL objective when compressing large language models. A policy network dynamically assigns weights to the forward and reverse KL divergences based on the distributional characteristics of the teacher and student, so that the student aligns with the teacher on both the principal modes and the long-tail modes.
- Utilizes a policy network guided by immediate reward signals to adaptively weight forward and reverse KL divergence.
- Balances primary distribution fitting with long-tail probability modeling for improved generation quality.
- Surpasses greedy heuristics by 0.4-0.6 points on the ROUGE-L and BERTScore metrics.
- Demonstrates consistent improvements over other baseline methods across diverse benchmarks.
This approach enhances both the generation quality and generalization of compressed models by effectively addressing the trade-offs inherent in traditional knowledge distillation techniques.
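
To make the weighted objective concrete, the following is a minimal sketch of a per-token loss that mixes forward and reverse KL. It is an illustration under stated assumptions, not the authors' implementation: the convex combination with a single scalar `alpha`, the function names, and the convention that forward KL is KL(teacher || student) are all assumptions, and the policy network that would produce `alpha` is replaced here by a plain float.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
    )


def weighted_kl_loss(teacher_logits, student_logits, alpha):
    """Hypothetical mixed objective for one token position.

    alpha weights forward KL, (1 - alpha) weights reverse KL.
    In the summarized method a policy network would choose the weight;
    here alpha is simply passed in (assumption).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    p = softmax(teacher_logits)  # teacher distribution
    q = softmax(student_logits)  # student distribution
    forward = kl(p, q)  # penalizes the student for missing teacher mass
    reverse = kl(q, p)  # penalizes the student for mass the teacher lacks
    return alpha * forward + (1.0 - alpha) * reverse


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    student = [0.5, 0.5, 3.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, weighted_kl_loss(teacher, student, alpha))
```

Setting `alpha` to 1 recovers pure forward KL and setting it to 0 recovers pure reverse KL, so a learned weight can move between the two behaviors at each step.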