HERMES is a data-derived labeling substrate that uses a Learned Semantic Transform and 3-stage residual vector quantization to annotate documents into a coarse-to-fine code with up to approximately 130k cells.
- It allows granularity control via prefix length, overcoming the limitations of existing labels that commit to a single semantic axis.
- At coarse granularity, HERMES performs comparably to KMeans-family methods on standard clustering metrics.
- In 1B-parameter, 25B-token pre-training, combining Stage-2 rule contrast with equal-subbucket coverage lifted a 16-task capability macro-average by +0.0253.
- The performance gain disappeared at finer levels where candidate pools contracted approximately 5x.
HERMES reframes data mixture design from choosing among fixed label sets to navigating a reusable, data-derived granularity hierarchy.