The paper introduces MultiHashFormer, a framework that enables hash-based autoregression in causal language models by representing each token as a unique signature of discrete hash IDs. The model compresses each signature into a latent vector for the Transformer and maps predicted signatures back to text, which addresses the many-to-one collisions that previously prevented hashing in generative settings.
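To make the collision issue concrete, here is a minimal sketch, not the paper's implementation. It assumes salted BLAKE2b digests as the independent hash functions, a synthetic 5,000-token vocabulary, and a plain dictionary for mapping signatures back to tokens; the names `hash_id`, `signature`, `decode`, `NUM_HASHES`, and `NUM_BUCKETS` are hypothetical and chosen only for illustration.

```python
import hashlib

NUM_HASHES = 4       # hypothetical: number of independent hash functions
NUM_BUCKETS = 1024   # hypothetical: number of IDs each hash function can emit


def hash_id(token, seed, num_buckets=NUM_BUCKETS):
    """Map a token to one bucket ID; the seed selects the hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little") % num_buckets


def signature(token, num_hashes=NUM_HASHES, num_buckets=NUM_BUCKETS):
    """Return the token's signature: one hash ID per hash function."""
    return tuple(hash_id(token, seed, num_buckets) for seed in range(num_hashes))


# Synthetic vocabulary, larger than the number of buckets on purpose.
vocab = [f"tok{i}" for i in range(5000)]

# With a single hash function, 5000 tokens share at most 1024 IDs,
# so many tokens map to the same ID (many-to-one).
single_ids = {hash_id(tok, seed=0) for tok in vocab}

# With several hash functions, the signature space has
# NUM_BUCKETS ** NUM_HASHES entries, so collisions become very unlikely.
signatures = {signature(tok) for tok in vocab}

# Assumed reverse mapping from signature to token (a plain lookup table).
sig_to_token = {signature(tok): tok for tok in vocab}


def decode(sig):
    """Map a signature back to a token, or None if it is unknown."""
    return sig_to_token.get(sig)


print("vocabulary size:      ", len(vocab))
print("distinct single IDs:  ", len(single_ids))
print("distinct signatures:  ", len(signatures))
```

A single hash function cannot be inverted once two tokens share an ID, which is what blocks generation. A signature built from several IDs can only be at least as discriminating as any one of its components, and in this toy setting it has roughly 10^12 possible values for 5,000 tokens.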
- Each token is represented as a short sequence of discrete hash IDs generated by multiple independent hash functions.
- A Hash Encoder compresses these signatures into single latent vectors for the Transformer decoder.
- A Hash Decoder generates the next token's hash signature, which is then mapped back to text.
- The approach was evaluated at 100M, 1B, and 3B parameter scales.
- MultiHashFormer consistently outperforms standard Transformer LMs across multiple benchmarks.
- The model supports multilingual vocabulary expansion without modification, keeping its parameter footprint constant.
Overall, the method is parameter-efficient: the vocabulary can grow to cover additional languages without increasing the model's size.
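The constant-footprint claim can be illustrated with simple arithmetic. The sketch below is an assumption-laden comparison rather than the paper's accounting: it supposes one embedding table per hash function, each with `num_buckets` rows of width `d_model`, and compares that against a conventional embedding table with one row per token. All function names and numbers are hypothetical.

```python
def standard_embedding_params(vocab_size, d_model):
    """Conventional table: one d_model-wide row per vocabulary entry."""
    return vocab_size * d_model


def hashed_embedding_params(num_hashes, num_buckets, d_model):
    """Assumed layout: one table per hash function, one row per bucket."""
    return num_hashes * num_buckets * d_model


def signature_capacity(num_hashes, num_buckets):
    """Number of distinct signatures the hash functions can express."""
    return num_buckets ** num_hashes


d_model = 512        # hypothetical hidden size
num_hashes = 4       # hypothetical
num_buckets = 1024   # hypothetical

for vocab_size in (50_000, 250_000, 1_000_000):
    standard = standard_embedding_params(vocab_size, d_model)
    hashed = hashed_embedding_params(num_hashes, num_buckets, d_model)
    fits = vocab_size <= signature_capacity(num_hashes, num_buckets)
    print(f"vocab={vocab_size:>9,}  standard={standard:>12,}  "
          f"hashed={hashed:>10,}  fits_in_signature_space={fits}")
```

Under these assumptions the hashed count does not depend on `vocab_size` at all, while the conventional table grows linearly with it. The only constraint is that the signature space stays large enough to give every token its own signature.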