Inference efficiency
media Latent Space · 6d ago

Why AI Scaling Is a Systems Problem, Not Just a GPU Race

The AI scaling debate overlooks that raising model FLOPs utilization (MFU) matters more than buying more GPUs. Frontier labs such as xAI operate at sub-10% MFU, whereas earlier models achieved 21% to 70% MFU, which points to systemic inefficiencies in scheduling, networking, and cluster management. Anjney Midha argues that AI infrastructure must evolve into efficient, aligned, and responsible systems, with 'output maxing' emerging as a new discipline for frontier AI.
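For readers unfamiliar with the metric, here is a minimal sketch of how MFU is commonly estimated for training a dense Transformer. It assumes the widely used approximation of about 6 FLOPs per parameter per token and ignores attention FLOPs. The function name and every number in the example are hypothetical and do not come from the article.

```python
def model_flops_utilization(tokens_per_second: float,
                            n_params: float,
                            n_gpus: int,
                            peak_flops_per_gpu: float) -> float:
    """Estimate MFU as achieved model FLOPs/s divided by peak hardware FLOPs/s.

    Assumption: training a dense Transformer costs roughly 6 * n_params FLOPs
    per token (forward plus backward pass), with attention FLOPs ignored.
    """
    flops_per_token = 6 * n_params
    achieved_flops_per_second = tokens_per_second * flops_per_token
    peak_flops_per_second = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical cluster: 1,024 GPUs at 1e15 peak FLOPs/s each,
# training a 70e9-parameter model at 1e6 tokens per second.
mfu = model_flops_utilization(
    tokens_per_second=1e6,
    n_params=70e9,
    n_gpus=1024,
    peak_flops_per_gpu=1e15,
)
print(f"Estimated MFU: {mfu:.1%}")
```

Under this definition, doubling throughput on the same hardware doubles MFU, whereas adding GPUs without a matching throughput gain lowers it.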

media r/LocalLLaMA · 6d ago

LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M Released

LFM2.5-Embedding-350M is a dense bi-encoder for fast multilingual retrieval that stores one vector per document, with best-in-class accuracy for its size and inference speed comparable to smaller models. LFM2.5-ColBERT-350M is a late-interaction retriever that stores one vector per token, with best-in-class multilingual accuracy and high-precision cross-lingual retrieval. Both models are designed as drop-in replacements in existing RAG pipelines.
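The difference between the two retrieval styles can be sketched with toy vectors. This is a generic illustration of dense cosine scoring versus ColBERT-style MaxSim late interaction, not the LFM2.5 models' actual code. The function names and vectors are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def dense_score(query_vec, doc_vec):
    """Bi-encoder scoring: one vector per query and per document, cosine similarity."""
    return dot(normalize(query_vec), normalize(doc_vec))


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction scoring: one vector per token.

    For each query token, take its best cosine match among the document
    tokens, then sum those maxima over all query tokens.
    """
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


# Toy 2-dimensional "embeddings" (hypothetical values).
print(dense_score([1.0, 0.0], [2.0, 0.0]))
print(maxsim_score([[1.0, 0.0], [0.0, 1.0]],
                   [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]))
```

The trade-off is storage and compute: the dense index holds one vector per document, while the late-interaction index holds one per token but can match individual query terms more precisely.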

media r/LocalLLaMA · 6d ago

Real-world token cost savings from rtk, headroom, and caveman

An analysis of a real workload shows that headroom, rtk, and caveman reduce token costs by 2.8%, 0.5%, and 0.4% respectively, for a combined 3.7% of baseline spending. Savings are limited by payload diversity: most traffic is plain text or source code, while the tools compress only structured outputs. Most of the reduction lands on the cheapest token stream, cache reads, and the tools do not affect prompt caching or output costs. Coverage gaps also exist, especially for rtk.

arxiv arXiv cs.LG · 7d ago

TransitNet Achieves 95.2% Accuracy in Low-SNR Transit Searches

TransitNet, a compact attention-augmented deep learning framework, achieves 95.2% accuracy in low-SNR transit blind searches, outperforming TLS and BLS in ROC-AUC and PR-AP values. It recovers 93.0% of injected Earth- and sub-Earth-size transits, with 97.4% of injected transits fully covered by estimated transit windows, and successfully recovers all 34 confirmed Kepler planets with a mean midpoint error of 1.24 hours.

arxiv arXiv cs.LG · 7d ago

CAHP: Complementary Attention Head Pruning for Efficient Transformers

CAHP introduces a post-hoc framework that uses graph-theoretical clustering and information-theoretic measures to select complementary attention heads in Transformers. It determines how many heads to retain automatically, without a predefined sparsity target, by identifying a performance-degradation threshold that keeps model loss minimal. It outperforms baselines in high-compression scenarios by preserving functionally critical heads in intermediate layers.
