Inference efficiency
media r/LocalLLaMA · 2d ago

Idea for running GLM2 at decent quant with GPU and DDR3 setup

The user proposes using four 5060 Ti GPUs with 64GB VRAM total, running at PCIe Gen 3, to run GLM2 at a reasonable quantization level. They suggest adding 512GB of DDR3 RAM in a server with 16 PCIe lanes and 4x4 bifurcation to offload KV cache storage, aiming for efficient inference without relying on unified memory clusters. The setup is estimated to cost around $1700 total, with potential viability for GLM2 at a decent quant level.

media r/LocalLLaMA · 3d ago

Qwen3.6-35B-A3B APEX on RTX 3090: Speed and Quality Benchmarks

A benchmark compares llama.cpp forks (ik_llama and spiritbuun) running Qwen3.6-35B-A3B APEX with I-Compact and I-Quality models. ik_llama with I-Compact achieves highest speed (~146 TPS), while spiritbuun with I-Quality and turbo8/turbo4 cache matches this speed and offers slightly better HellaSwag performance. turbo8/turbo4 KV caches outperform q8_0/q5_0, especially at longer contexts, with up to 15% speed gain and lower KLD, making them superior for quality and context length.

media Hugging Face Forums · 3d ago

I built a novel triple-hybrid LLM under 1B parameters for ~$50

Mateusz has developed a full pre-trained language model, Project Inkblot's Titan v1, combining Mamba SSM, Multi-Head Attention, and 32-expert MoE in a single decoder-only architecture under 1B parameters. The model, trained on a single NVIDIA L4 GPU for ~$50, achieves 27.5 validation perplexity and demonstrates efficient scaling via a single-line config update, with all components implemented from scratch in PyTorch. Titan v2's first training cycle is now complete, and dataset expansion is underway.