The author shares a practical setup for using local large language models on modest hardware, specifically a laptop with 32GB of RAM and an NVIDIA RTX 4070 with 8GB VRAM. The core strategy involves running the Qwen3.6-35B-A3B model locally as a 'small coding agent' while offloading complex planning to a cloud-based GLM 5.2 instance.
- The local Qwen3.6-35B-A3B model runs reliably at approximately 15 tokens per second on battery power, serving as a scoped coding agent for specific tasks.
- A hybrid architecture is used with a 90% local and 10% cloud split, costing under $1 to have GLM 5.2 generate detailed task plans for the local model to execute.
- The user employs pi-coding-agent and llama-server (from llama.cpp) to run the local inference, reviewing all code changes produced by the agent.
- Knowledge gaps are addressed through post-mortems with the model, adding tips to a README file that the agent utilizes in subsequent sessions to improve code quality.
This approach allows for useful coding assistance on ordinary hardware by combining the cost-efficiency of local inference with the reasoning capabilities of a cheaper cloud model for high-level planning.