Inference efficiency
media r/LocalLLaMA · 6d ago

Help Running Local Hermes Agent with llama-cpp

A user reports issues running a local Hermes AI agent on a high-end rig using self-compiled llama-cpp. The setup experiences frequent KV cache reprocessing every 5 messages and slow reasoning, with the agent repeatedly pausing to report progress instead of continuing autonomously. The user seeks guidance on whether their llama-cpp parameters are incorrect or what adjustments can improve agent performance and sustained reasoning without interruptions.

media r/LocalLLaMA · 6d ago

Fixing Long-Context Decode Cliff on Radeon R9700 with vLLM 0.22.1

A long-context decode performance cliff on AMD Radeon AI PRO R9700 (RDNA4) was resolved by enabling AITER Unified Attention in vLLM 0.22.1. The fix involves relaxing a CDNA gate to include RDNA4, disabling other attention backends, and using bf16 KV cache, resulting in significant speedups across all context lengths. FP8 KV is ineffective on this hardware, and the model's native 262K context is fully achievable with bf16, offering ~2.9× concurrency without needing FP8.