Code generation
arxiv arXiv cs.CL · 2d ago

Benchmark Evaluation of Small Language Models for Arabic NLP

A benchmark of 240 Arabic test items spanning eight domains and ten skills evaluates twelve small language models in a zero-shot setting. Gemma 3 (12B) achieved the highest overall score (4.548/5), followed by Aya and C4AI Command Arabic; performance was linked more to Arabic alignment and instruction-following than to model size. Common failure modes include prompt leakage, hallucination, and weak task adherence.
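As a rough illustration of how an overall score such as 4.548/5 could be produced, the sketch below averages per-item ratings overall and per domain. The summary does not describe the benchmark's actual scoring procedure, so the 1-5 rating scale, the record layout, the domain and skill names, and the helper names `overall_score` and `score_by` are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

# Illustrative records only: (domain, skill, score).
# The domain names, skill names, and scores are made-up placeholders,
# not data from the benchmark. A 1-5 rating scale is assumed.
ratings = [
    ("medical", "summarization", 5),
    ("medical", "question answering", 4),
    ("legal", "summarization", 3),
    ("legal", "question answering", 4),
]

def overall_score(records):
    """Mean score across all test items."""
    return mean(score for _, _, score in records)

def score_by(records, key_index):
    """Mean score grouped by domain (key_index=0) or skill (key_index=1)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[2])
    return {key: mean(scores) for key, scores in groups.items()}

print(overall_score(ratings))
print(score_by(ratings, 0))
print(score_by(ratings, 1))
```

Breaking the mean down by domain or skill is one way to see whether a model's ranking comes from broad competence or from a few strong areas.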

media r/LocalLLaMA · 3d ago

Same model, same prompt, 4 different agents produce varied code quality

A self-hosted Qwen3.6-27B model, given an identical prompt on identical hardware, generated four different HTML/JavaScript solar system simulations when driven by four different agents. The agent scaffolding significantly influenced the output: opencode produced clean, stable code with accurate physics; pi showed robustness and coordinate consistency; hermes offered visually appealing but physically flawed results; qwen code generated minimal, crude code. The results show how agent design shapes code quality, correctness, and stability even when the model and prompt are the same.

media r/LocalLLaMA · 3d ago

Qwen3.6-35B-A3B APEX on RTX 3090: Speed and Quality Benchmarks

A benchmark compares two llama.cpp forks (ik_llama and spiritbuun) running Qwen3.6-35B-A3B APEX with the I-Compact and I-Quality models. ik_llama with I-Compact achieves the highest speed (~146 tokens per second), while spiritbuun with I-Quality and a turbo8/turbo4 cache matches this speed and scores slightly better on HellaSwag. The turbo8/turbo4 KV caches outperform q8_0/q5_0, especially at longer contexts, with up to a 15% speed gain and lower KL divergence (KLD), making them the better choice for both quality and long-context use.
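For readers unfamiliar with the KLD metric, here is a minimal sketch of the idea: compare the next-token probability distribution of a reference run against that of a run using a quantized KV cache, and average the KL divergence over token positions. The post does not state how its KLD figures were computed, so the function names and the toy logits below are assumptions for illustration, not the benchmark's procedure.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; p is the reference distribution."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def mean_kld(ref_logits, test_logits):
    """Average KL divergence over token positions."""
    klds = [
        kl_divergence(softmax(r), softmax(t))
        for r, t in zip(ref_logits, test_logits)
    ]
    return sum(klds) / len(klds)

# Toy logits over a 3-token vocabulary at two positions (made-up values).
reference = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
quantized = [[1.9, 1.1, 0.1], [0.6, 2.4, 0.4]]
print(mean_kld(reference, quantized))
```

A lower value means the quantized-cache run stays closer to the reference distribution; identical distributions give zero.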

media Hugging Face Forums · 3d ago

I built a novel triple-hybrid LLM under 1B parameters for ~$50

Mateusz has built and pre-trained a complete language model, Project Inkblot's Titan v1, that combines Mamba SSM, Multi-Head Attention, and a 32-expert MoE in a single decoder-only architecture under 1B parameters. Trained on a single NVIDIA L4 GPU for ~$50, the model reaches a validation perplexity of 27.5, and the author reports that it can be scaled with a single-line config update. All components are implemented from scratch in PyTorch. Titan v2's first training cycle is now complete, and dataset expansion is underway.
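To put the perplexity figure in context, the sketch below uses the common convention that perplexity is the exponential of the mean per-token negative log-likelihood in nats. The post does not say how its validation perplexity was computed, so that convention and the helper names `perplexity` and `loss_from_perplexity` are assumptions.

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def loss_from_perplexity(ppl):
    """Mean cross-entropy loss (nats per token) implied by a perplexity."""
    return math.log(ppl)

# Under this convention, a perplexity of 27.5 corresponds to a mean
# validation loss of roughly 3.31 nats per token.
print(round(loss_from_perplexity(27.5), 3))
```

Read this way, a perplexity of 27.5 means the model is on average about as uncertain as if it were choosing uniformly among 27.5 tokens at each step.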