Distill-on-idle pipeline for on-device memory assistant using 4B models

The article describes an engineering approach to building a local AI assistant that turns raw screen captures and meeting transcripts into queryable data, using only models small enough to run efficiently on a laptop. The system combines Apple's Vision framework for OCR, idle-time distillation of captures by a 4B-class Gemma model, and hybrid retrieval to avoid performance bottlenecks.

On-device OCR via Apple's Vision framework means the LLM receives extracted text rather than raw pixels, improving speed and accuracy.
A 4B-class Gemma model summarizes captures into per-project notes during idle periods, keeping foreground applications responsive.
Retrieval combines SQLite FTS for lexical search with LanceDB for semantic search to capture both exact identifiers and paraphrased content.
The solution relies on retrieving tight, relevant context rather than on a larger model, which addresses common failures in local AI assistants.
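
The hybrid retrieval point can be illustrated with a minimal sketch. It is not the article's implementation: the lexical side uses SQLite FTS5 from Python's standard library (assuming the local SQLite build includes FTS5), while the semantic side is a plain cosine-similarity stand-in for LanceDB over hand-written toy vectors. The sample notes, the vectors, the function names, and the use of reciprocal rank fusion to merge the two rankings are all assumptions made for illustration.

```python
import math
import sqlite3

# Hypothetical distilled notes; ids and text are invented for this sketch.
NOTES = {
    1: "Fixed crash in parse_invoice_v2 when the currency field is empty",
    2: "Meeting: agreed to postpone the billing launch until next quarter",
    3: "Sketched onboarding screens for the mobile app",
}

# Toy 3-dimensional "embeddings" (axes: bug fixing, scheduling, design).
# A real system would get these from an embedding model and a vector store.
VECTORS = {1: [0.9, 0.1, 0.0], 2: [0.1, 0.9, 0.1], 3: [0.0, 0.0, 1.0]}


def build_index():
    """Create an in-memory FTS5 table holding the notes."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE notes USING fts5(body)")
    conn.executemany("INSERT INTO notes(rowid, body) VALUES (?, ?)", NOTES.items())
    return conn


def lexical_search(conn, text, k):
    """Exact-phrase match, which suits identifiers such as function names."""
    phrase = '"' + text.replace('"', '""') + '"'  # quote to avoid FTS syntax errors
    rows = conn.execute(
        "SELECT rowid FROM notes WHERE notes MATCH ? ORDER BY rank LIMIT ?",
        (phrase, k),
    ).fetchall()
    return [row[0] for row in rows]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def semantic_search(query_vec, k):
    """Nearest neighbours by cosine similarity (stand-in for a vector store)."""
    ranked = sorted(VECTORS, key=lambda i: cosine(query_vec, VECTORS[i]), reverse=True)
    return ranked[:k]


def hybrid_search(conn, text, query_vec, k=3, rrf_k=60):
    """Merge both rankings with reciprocal rank fusion."""
    scores = {}
    for ranking in (lexical_search(conn, text, k), semantic_search(query_vec, k)):
        for rank, note_id in enumerate(ranking, start=1):
            scores[note_id] = scores.get(note_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]


conn = build_index()
by_identifier = hybrid_search(conn, "parse_invoice_v2", [0.9, 0.1, 0.0])
by_paraphrase = hybrid_search(conn, "delay the release", [0.1, 0.9, 0.1])
```

In this sketch the first query is carried by the lexical side, which finds the exact identifier, while the second has no lexical hit at all and is answered by the vector side alone. That is the division of labour the hybrid design is meant to provide.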

This architecture allows users to maintain a personal "memory" assistant on macOS + Apple Silicon without draining the battery or stealing GPU resources from active tasks.
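
The idle-time behaviour can be sketched in a similar spirit. The example below is a hedged illustration, not the article's code: `IdleDistiller`, `is_idle`, `summarize`, and `max_batch` are hypothetical names, the idle check is injected by the caller (on macOS it might be derived from user-input inactivity or power state), and `summarize` is a placeholder for the call to the local 4B model.

```python
from collections import deque


class IdleDistiller:
    """Queue captures as they arrive and summarize them only while idle."""

    def __init__(self, is_idle, summarize, max_batch=4):
        self.is_idle = is_idle        # callable returning True when the machine is idle
        self.summarize = summarize    # placeholder for the local model call
        self.max_batch = max_batch    # cap on work per idle tick so a tick stays short
        self.pending = deque()

    def enqueue(self, capture_text):
        """Cheap foreground step: store the OCR text, do no model work."""
        self.pending.append(capture_text)

    def run_once(self):
        """Process at most one batch, and only if the idle check passes."""
        if not self.pending or not self.is_idle():
            return []
        count = min(self.max_batch, len(self.pending))
        batch = [self.pending.popleft() for _ in range(count)]
        return [self.summarize(text) for text in batch]


# Usage with stand-ins: a flag for idleness and truncation instead of a model.
state = {"idle": False}
distiller = IdleDistiller(lambda: state["idle"], lambda text: text[:20])
distiller.enqueue("Standup notes: the billing launch moves to next quarter")
busy_result = distiller.run_once()   # user is active, so nothing runs
state["idle"] = True
idle_result = distiller.run_once()   # idle now, so the queued capture is summarized
```

Capping the batch size keeps each tick short, so the loop can recheck the idle condition between batches and yield quickly when the user returns.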