All articles
media r/LocalLLaMA · 11d ago

Attention Algebra — a grammar that translates natural language into spectrograms

Attention Algebra is a prototype that translates natural language into algebraic expressions, maps them to mathematical dynamics, and visualizes the result as a spectrogram. It treats language as a lossy projection of high-dimensional states, proposing that raw attention patterns grouped into functions serve as the 'DNA' of text, enabling efficient reasoning chains by reducing token usage from 20k to 4k.

media r/LocalLLaMA · 11d ago

Help Running Local Hermes Agent with llama-cpp

A user reports issues running a local Hermes AI agent on a high-end rig using self-compiled llama-cpp. The setup experiences frequent KV cache reprocessing every 5 messages and slow reasoning, with the agent repeatedly pausing to report progress instead of continuing autonomously. The user seeks guidance on whether their llama-cpp parameters are incorrect or what adjustments can improve agent performance and sustained reasoning without interruptions.

media r/LocalLLaMA · 11d ago

Fixing Long-Context Decode Cliff on Radeon R9700 with vLLM 0.22.1

A long-context decode performance cliff on AMD Radeon AI PRO R9700 (RDNA4) was resolved by enabling AITER Unified Attention in vLLM 0.22.1. The fix involves relaxing a CDNA gate to include RDNA4, disabling other attention backends, and using bf16 KV cache, resulting in significant speedups across all context lengths. FP8 KV is ineffective on this hardware, and the model's native 262K context is fully achievable with bf16, offering ~2.9× concurrency without needing FP8.