Small-scale debug comparison of OLMo-core with Engram graft
A 200-step training comparison between a base OLMo3 600M model and a version with a DeepSeek-style Engram graft shows lower training and evaluation loss, faster grad-norm stabilization, and improved early learning behavior. The Engram graft, injected into layers 1 and 5, increases trainable parameters to ~1.7B but maintains only a 40k increase in active parameters per token, indicating efficient memory usage.