CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield

The article presents CHERRY-1.8B, a Korean foundation model that integrates three techniques for training compute-efficient language models: selective supervision, depth compression with recurrent recovery, and fusion of compressed experts.

Selective Ground Truth Token Training (SGT) concentrates supervision on roughly 15% of output tokens. The authors report a 4.5x efficiency gain per supervised token, and predictions on the unsupervised tokens also improve through gradient coupling. Depth compression reduces a 48-layer, 1B-parameter transformer to 6 layers (227M parameters); learned recurrent unrolling then recovers the lost depth, reaching a loss of 2.934, comparable to a 566M-parameter dense model. Assembling the compressed models into a Mixture of Efficient Experts (MoEE) with multi-token prediction lowers the loss further, to 2.789.
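To make the selective-supervision idea concrete, the following is a minimal sketch of a loss computed only over a chosen subset of token positions. It is an illustration under stated assumptions, not the authors' implementation: the summary does not say how the supervised tokens are selected, so the mask is simply passed in, and the names `log_softmax`, `selective_cross_entropy`, and `supervised` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def selective_cross_entropy(logits_per_token, targets, supervised):
    """Mean cross-entropy over supervised positions only.

    `supervised` is a list of booleans, one per token position.
    How positions are chosen is NOT specified in the summary; the
    mask is an input here (assumption for illustration).
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_token, targets, supervised):
        if not keep:
            continue  # unsupervised positions contribute no loss term
        total += -log_softmax(logits)[target]
        count += 1
    if count == 0:
        raise ValueError("no supervised positions")
    return total / count


# Toy example: 20 positions, 3 supervised (15%), vocabulary of 4.
vocab = 4
logits = [[0.0] * vocab for _ in range(20)]
targets = [i % vocab for i in range(20)]
supervised = [i in (2, 9, 17) for i in range(20)]

# Uniform logits give a per-token loss of log(vocab).
loss = selective_cross_entropy(logits, targets, supervised)
```

In this sketch the unsupervised positions receive no direct loss term. Any improvement on them, as the summary describes, would have to come indirectly through parameters shared with the supervised positions.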

The authors validate these techniques on CHERRY-1.8B. They note that every trainable parameter derives from their own training runs, and they explicitly limit the scope of their evidence to one model family and Korean data.