HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

HERMES is a data-derived labeling substrate that combines a Learned Semantic Transform with 3-stage residual vector quantization to assign each document a coarse-to-fine code spanning up to approximately 130k cells.

Granularity is controlled by the length of the code prefix, which avoids the limitation of existing label sets that commit to a single semantic axis.
At coarse granularity, HERMES performs comparably to KMeans-family methods on standard clustering metrics.
In 1B-parameter, 25B-token pre-training, combining Stage-2 rule contrast with equal-subbucket coverage raised a 16-task capability macro-average by 0.0253.
This gain disappeared at finer levels, where candidate pools shrank by approximately 5x.
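
To make the coarse-to-fine coding and prefix-based granularity control concrete, the following is a minimal sketch of 3-stage residual vector quantization, not the HERMES implementation. Several things are assumptions for illustration: the input vectors stand in for embeddings that have already passed through the Learned Semantic Transform (here they are random), each stage's codebook is fit with plain k-means, the codebook sizes are toy values (4 per stage, so 64 cells rather than approximately 130k), and the function names `kmeans`, `rvq_fit_encode`, and `prefix_labels` are hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid.
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d.argmin(axis=1)

def rvq_fit_encode(x, codebook_sizes=(4, 4, 4), seed=0):
    """Fit one codebook per stage on the residual left by earlier stages.

    Returns an (n_documents, n_stages) integer code array and the codebooks.
    Codebook sizes are toy values chosen for this sketch (assumption).
    """
    residual = x.astype(np.float64).copy()
    codes, codebooks = [], []
    for stage, k in enumerate(codebook_sizes):
        centroids, assign = kmeans(residual, k, seed=seed + stage)
        codebooks.append(centroids)
        codes.append(assign)
        # The next stage quantizes whatever this stage failed to explain.
        residual = residual - centroids[assign]
    return np.stack(codes, axis=1), codebooks

def prefix_labels(codes, depth):
    """Label each document by the first `depth` stage indices of its code."""
    return [tuple(int(c) for c in row[:depth]) for row in codes]

# Random vectors stand in for transformed document embeddings (assumption).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 8))

codes, codebooks = rvq_fit_encode(embeddings)
coarse = prefix_labels(codes, 1)  # at most 4 cells
medium = prefix_labels(codes, 2)  # at most 4 * 4 = 16 cells
fine = prefix_labels(codes, 3)    # at most 4 * 4 * 4 = 64 cells
```

Because every finer label extends a coarser one, documents that share a fine cell always share the corresponding coarse cell, so a mixture recipe can move between granularities by truncating or extending the prefix without relabeling the corpus.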

HERMES reframes data mixture design from choosing among fixed label sets to navigating a reusable, data-derived granularity hierarchy.