Study reveals decoder-only language models' hidden states allow input recovery

This work studies the inversion of decoder-only language models: recovering the original input token sequence from the model's last-layer hidden states by optimizing in continuous embedding space.

The method optimizes a soft proxy in continuous embedding space and commits to discrete tokens only at the end of the inner loop, which exposes internal signals such as rank trajectories and loss curves.
The analysis shows a sharp categorical asymmetry: space-prefixed function words cause failures, while content-bearing tokens are recovered almost perfectly.
On 10-token C4 prompts, exact-match rates rise from 66.9% to 97.5% as the candidate window widens, indicating most errors are recoverable near-misses.
The continuous formulation makes the optimization observable and its failures detectable, in contrast to faster methods such as SIPIT that apply a hard projection at every step.
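
To make the idea concrete, here is a minimal sketch in pure Python. It does not use GPT-2 or the authors' code: the causal "model" is a hypothetical linear recurrence, the vocabulary is six random 3-dimensional embeddings, and the names forward, loss_and_grad, rank_of, invert and exact_match_at are invented for illustration. The sketch keeps the embeddings continuous throughout the loop, logs the loss and the rank of each true token at every step, and commits to discrete tokens only at the end. The window check is one plausible reading of "candidate window", assumed here to mean that a sequence counts as recovered when every true token ranks among the k nearest vocabulary entries.

```python
import random

DECAY = 0.5  # toy causal mixing weight (assumption, not from the source)


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def forward(embs):
    """Toy causal stand-in for a decoder: h_t = e_t + DECAY * h_{t-1}."""
    hidden, prev = [], [0.0] * len(embs[0])
    for e in embs:
        prev = [x + DECAY * p for x, p in zip(e, prev)]
        hidden.append(prev)
    return hidden


def loss_and_grad(embs, target):
    """Squared error against the target hidden states, with its exact gradient."""
    resid = [[h - t for h, t in zip(hr, tr)]
             for hr, tr in zip(forward(embs), target)]
    loss = sum(r * r for row in resid for r in row)
    grads, nxt = [], [0.0] * len(embs[0])
    for row in reversed(resid):  # backward pass through the recurrence
        nxt = [2.0 * r + DECAY * n for r, n in zip(row, nxt)]
        grads.append(nxt)
    grads.reverse()
    return loss, grads


def rank_of(token_id, emb, vocab):
    """0-based rank of token_id when the vocabulary is sorted by distance to emb."""
    d_true = sq_dist(emb, vocab[token_id])
    return sum(1 for v in vocab if sq_dist(emb, v) < d_true)


def invert(target, vocab, true_ids, lr=0.1, steps=300):
    """Gradient descent on soft embeddings; true_ids is used for diagnostics only."""
    soft = [[0.0] * len(vocab[0]) for _ in target]
    losses, rank_traj = [], []
    for _ in range(steps):
        loss, grads = loss_and_grad(soft, target)
        losses.append(loss)
        rank_traj.append([rank_of(t, e, vocab) for t, e in zip(true_ids, soft)])
        soft = [[x - lr * g for x, g in zip(e, gr)]
                for e, gr in zip(soft, grads)]
    final_ranks = [rank_of(t, e, vocab) for t, e in zip(true_ids, soft)]
    # Commit to discrete tokens only here, after the continuous loop has finished.
    tokens = [min(range(len(vocab)), key=lambda i: sq_dist(e, vocab[i]))
              for e in soft]
    return tokens, final_ranks, losses, rank_traj


def exact_match_at(final_ranks, k):
    """True if every true token lies within the k nearest candidates."""
    return all(r < k for r in final_ranks)


rng = random.Random(0)
vocab = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(6)]
true_ids = [4, 1, 1, 3]
target = forward([vocab[i] for i in true_ids])

tokens, final_ranks, losses, rank_traj = invert(target, vocab, true_ids)
print("recovered:", tokens, "final ranks:", final_ranks)
print("first loss:", losses[0], "last loss:", losses[-1])
```

Because this toy map is linear and invertible, gradient descent recovers the sequence exactly. A real transformer gives a non-convex objective, which is where the logged loss curve and rank trajectories become useful for spotting runs that stall with the true token a few ranks away from the top.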

The results show that GPT-2's last-layer hidden states are highly sensitive to the input text, which allows the original sequence to be recovered effectively.