A user post-trained a language model to roll a fair die, so that each face comes up about once in every six rolls. The post notes that mainstream LLMs tend to answer '4' when asked to roll a die, and uses this to illustrate a broader problem in reinforcement learning: models often fail to explore effectively and instead fall back on familiar patterns.
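One way to make "approximately once every six rolls" concrete is a Pearson chi-square check of observed face counts against a fair die. The sketch below is illustrative and is not the post's own evaluation: the function names `chi_square_uniform` and `looks_fair` are hypothetical, the two roll sequences are made-up stand-ins for model output, and the 5% significance threshold is an assumption.

```python
from collections import Counter

# Chi-square critical value for 5 degrees of freedom at the 5% level
# (assumed threshold; a six-sided die has 6 - 1 = 5 degrees of freedom).
CRITICAL_5DF_95 = 11.070


def chi_square_uniform(rolls, sides=6):
    """Pearson chi-square statistic of `rolls` against a fair die."""
    counts = Counter(rolls)
    expected = len(rolls) / sides  # each face should appear this often
    return sum(
        (counts.get(face, 0) - expected) ** 2 / expected
        for face in range(1, sides + 1)
    )


def looks_fair(rolls):
    """True if the statistic stays below the assumed 5% critical value."""
    return chi_square_uniform(rolls) < CRITICAL_5DF_95


# Hypothetical stand-ins for sampled model answers, not real data.
always_four = [4] * 600               # a model that always answers 4
balanced = [1, 2, 3, 4, 5, 6] * 100   # every face exactly 100 times

print(chi_square_uniform(always_four))  # 3000.0
print(chi_square_uniform(balanced))     # 0.0
print(looks_fair(always_four), looks_fair(balanced))  # False True
```

A low statistic only says the face counts are balanced; it does not detect order effects such as a model cycling through 1 to 6 in sequence, which would need a separate check.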
I post-trained a model to reliably roll a die