The author presents the 800M version of a model that converts images into controllable characters, designed to run comfortably on consumer GPUs. This iteration increases context to 12 latent frames and improves stability while maintaining high performance, achieving over 60 fps on an RTX 5090.

  • The architecture retains the previous design but features a fattened MLP and a denoiser trained from scratch with diffusion forcing.
  • The model uses causal diffusion: much as an LLM samples one token per forward pass and keeps earlier context in the KV cache, it generates frames sequentially against cached context.
  • A sliding window evicts intermediate frames to manage the KV cache, as training was limited to approximately 20-30 latent frames.
  • While consistency remains poor, the author aims to address this in future iterations.
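
The sliding-window cache described in the bullets can be sketched roughly as follows. This is a toy illustration, not the author's code: `SlidingFrameCache`, `generate`, `denoise_frame`, and the `keep_first` option are hypothetical names, and the window size of 12 is taken from the context length quoted above. The `keep_first` option shows one possible reading of "evicts intermediate frames" (pin the earliest frames, drop the ones between them and the most recent); with the default of 0 it is a plain oldest-first window.

```python
from collections import deque


class SlidingFrameCache:
    """Toy per-frame KV cache with a bounded window (hypothetical helper)."""

    def __init__(self, max_frames, keep_first=0):
        if not 0 <= keep_first < max_frames:
            raise ValueError("keep_first must be in [0, max_frames)")
        self.keep_first = keep_first
        self.anchors = []  # earliest frames, never evicted (assumption)
        # A deque with maxlen silently drops its oldest entry when full.
        self.recent = deque(maxlen=max_frames - keep_first)

    def append(self, frame_id, kv):
        entry = (frame_id, kv)
        if len(self.anchors) < self.keep_first:
            self.anchors.append(entry)
        else:
            self.recent.append(entry)

    def frame_ids(self):
        """Frame indices currently visible to attention, oldest first."""
        return [fid for fid, _ in self.anchors] + [fid for fid, _ in self.recent]


def generate(num_frames, cache, denoise_frame):
    """Autoregressive loop: each new frame sees only the cached frames."""
    for t in range(num_frames):
        context = cache.frame_ids()
        kv = denoise_frame(t, context)  # stand-in for the real denoiser
        cache.append(t, kv)
    return cache


# Usage: generate 30 frames with a 12-frame window.
plain = generate(30, SlidingFrameCache(max_frames=12), lambda t, ctx: f"kv{t}")
pinned = generate(30, SlidingFrameCache(max_frames=12, keep_first=1),
                  lambda t, ctx: f"kv{t}")
```

In both variants the cache never holds more than 12 frames, so the model is never asked to attend over more context than the window allows, regardless of how long generation runs.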

The work demonstrates a method for locally generating and controlling character animations on accessible hardware, with further updates shared via the lucidmlx subreddit.